For loops & DataFrames

pprod · Feb-17-2021, 10:54 AM

I'm running regexes on a 13-page PDF file in a Jupyter notebook and I want to display the results in a DataFrame. However, when I execute the code below, the DataFrame shows only the results for the last page of the PDF.

Is it possible to make the DataFrame show the regex results for all 13 pages while keeping the code in separate cells as below? (Sorry, I can't share the PDF as it's confidential.)

import PyPDF2
import re
import pandas as pd

#new cell

file = open(r'C:\file.pdf', 'rb')
doc = PyPDF2.PdfFileReader(file)

#new cell

for i in range(0,13):
    text = doc.getPage(i).extractText()
    #print(text)                            
    
    loc_re = re.compile(r'S\d+_\d+_DOG', re.IGNORECASE)
    loc = loc_re.findall(text)
    #print(loc)
       
    easting_re = re.compile(r'E[ ]*\d{6}')
    easting = easting_re.findall(text)
    #print(easting)
    
    northing_re = re.compile(r'N[ ]*\d{7}')
    northing = northing_re.findall(text)
    #print(northing)

#new cell

df = {'LOC': loc, 'Easting':easting, 'Northing': northing}
df = pd.DataFrame(df)
df.head()

maurom82 · Feb-18-2021, 02:51 PM

I think you should move the DataFrame creation inside the for loop, so that a df is built for each page and appended to a combined one.
The code above shows only the last page because loc, easting and northing are overwritten on every iteration and the df is built once, after the loop has finished.
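
To illustrate the point about overwriting, here is a minimal sketch that uses made-up page strings in place of the confidential PDF. The `pages` list and its sample values are hypothetical; only the three patterns come from the original post. It contrasts rebinding a name on each pass with extending a list that lives outside the loop:

```python
import re

# Hypothetical stand-ins for the extracted text of each PDF page.
pages = [
    "S1_01_DOG E 123456 N 1234567",
    "S2_02_DOG E 234567 N 2345678",
]

# The patterns from the original post, compiled once before the loop.
loc_re = re.compile(r'S\d+_\d+_DOG', re.IGNORECASE)
easting_re = re.compile(r'E[ ]*\d{6}')
northing_re = re.compile(r'N[ ]*\d{7}')

last_only = []
all_locs, all_eastings, all_northings = [], [], []

for text in pages:
    # Rebinding: whatever the previous page produced is discarded.
    last_only = loc_re.findall(text)

    # Extending: matches from every page are kept.
    all_locs.extend(loc_re.findall(text))
    all_eastings.extend(easting_re.findall(text))
    all_northings.extend(northing_re.findall(text))

print(last_only)   # matches from the final page only
print(all_locs)    # matches from all pages
```

With the accumulated lists in hand, a single DataFrame can be built after the loop, which also keeps the notebook's cell layout unchanged.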

pprod · (This post was last modified: Feb-22-2021, 05:03 PM by pprod.)

Thanks, maurom82. I generated the df inside the for loop and it worked. However, I read that df.append() copies all the data on every call, which makes the process inefficient when looping through files with many pages. My file has 130 pages and it took 46 s to append the data frames generated in each iteration. That is fine, but is there a more efficient way of doing this? Any suggestions? Thanks!

Here's my code after moving the df generation inside the for loop:

df_all = pd.DataFrame()

for i in range(0,13):
    text = doc.getPage(i).extractText()
    #print(text)                            
     
    loc_re = re.compile(r'S\d+_\d+_DOG', re.IGNORECASE)
    loc = loc_re.findall(text)
    #print(loc)
        
    easting_re = re.compile(r'E[ ]*\d{6}')
    easting = easting_re.findall(text)
    #print(easting)
     
    northing_re = re.compile(r'N[ ]*\d{7}')
    northing = northing_re.findall(text)
    #print(northing)

    df = pd.DataFrame({'LOC': loc, 'Easting':easting, 'Northing': northing})
    df_all = df_all.append(df, ignore_index=True)
print(df_all)
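
A note for readers on newer pandas: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the loop above will fail there. One common replacement is to collect the per-page frames in a list and call pd.concat once at the end, which also avoids re-copying the accumulated data on every iteration. The sketch below assumes the regex step has already produced per-page results; `page_results` and its values are hypothetical sample data, not taken from the PDF:

```python
import pandas as pd

# Hypothetical per-page regex results (one dict of equal-length lists per page).
page_results = [
    {'LOC': ['S1_01_DOG'],
     'Easting': ['E 123456'],
     'Northing': ['N 1234567']},
    {'LOC': ['S2_01_DOG', 'S2_02_DOG'],
     'Easting': ['E 234567', 'E 234568'],
     'Northing': ['N 2345678', 'N 2345679']},
]

# Build one small frame per page and keep them in a plain list.
frames = [pd.DataFrame(result) for result in page_results]

# Concatenate once, renumbering the index from 0.
df_all = pd.concat(frames, ignore_index=True)
print(df_all)
```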

nilamo · Feb-24-2021, 07:12 PM

Are you actually using the dataframes for anything? Or are you creating them purely to print them out in the notebook?

If it's just visual, then what if you put the print() inside the for loop, and remove the append() part?

#df_all = pd.DataFrame()
 
for i in range(0,13):
    text = doc.getPage(i).extractText()
    #print(text)                            
      
    loc_re = re.compile(r'S\d+_\d+_DOG', re.IGNORECASE)
    loc = loc_re.findall(text)
    #print(loc)
         
    easting_re = re.compile(r'E[ ]*\d{6}')
    easting = easting_re.findall(text)
    #print(easting)
      
    northing_re = re.compile(r'N[ ]*\d{7}')
    northing = northing_re.findall(text)
    #print(northing)
 
    df = pd.DataFrame({'LOC': loc, 'Easting':easting, 'Northing': northing})
    print(df)
    #df_all = df_all.append(df, ignore_index=True)
#print(df_all)

nilamo · Feb-24-2021, 07:16 PM

OR!

If you do want the big dataframe, but you don't need one for each page, then maybe a list will help? That way you only construct one dataframe, one time, instead of a new one per page and appending it along the way.

data = []
 
for i in range(0,13):
    text = doc.getPage(i).extractText()
    #print(text)                            
      
    loc_re = re.compile(r'S\d+_\d+_DOG', re.IGNORECASE)
    loc = loc_re.findall(text)
    #print(loc)
         
    easting_re = re.compile(r'E[ ]*\d{6}')
    easting = easting_re.findall(text)
    #print(easting)
      
    northing_re = re.compile(r'N[ ]*\d{7}')
    northing = northing_re.findall(text)
    #print(northing)
 
    #df = pd.DataFrame({'LOC': loc, 'Easting':easting, 'Northing': northing})
    data.append({'LOC': loc, 'Easting':easting, 'Northing': northing})
    #df_all = df_all.append(df, ignore_index=True)
df_all = pd.DataFrame(data)
print(df_all)
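
One thing worth checking with the list version: loc, easting and northing are each lists, so appending one dict per page gives one row per page with list-valued cells, which is not the same shape as the append-based result. If one row per match is wanted, a possible variation is to append one dict per match instead. This is only a sketch: the `pages` strings are hypothetical sample data, and it assumes the i-th LOC, Easting and Northing on a page belong together.

```python
import re
import pandas as pd

# Hypothetical stand-ins for the extracted text of each PDF page.
pages = [
    "S1_01_DOG E 123456 N 1234567 S1_02_DOG E 123457 N 1234568",
    "S2_01_DOG E 234567 N 2345678",
]

loc_re = re.compile(r'S\d+_\d+_DOG', re.IGNORECASE)
easting_re = re.compile(r'E[ ]*\d{6}')
northing_re = re.compile(r'N[ ]*\d{7}')

rows = []
for text in pages:
    locs = loc_re.findall(text)
    eastings = easting_re.findall(text)
    northings = northing_re.findall(text)

    # zip pairs up the i-th match of each pattern. It stops at the
    # shortest list, so a page with a missing value loses rows silently.
    for loc, easting, northing in zip(locs, eastings, northings):
        rows.append({'LOC': loc, 'Easting': easting, 'Northing': northing})

# The DataFrame is still constructed exactly once, after the loop.
df_all = pd.DataFrame(rows, columns=['LOC', 'Easting', 'Northing'])
print(df_all)
```

If the three patterns can return different numbers of matches on a page, comparing the list lengths before zipping would make that visible instead of dropping data.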

pprod · Feb-25-2021, 08:23 AM

Thanks, nilamo. You're right: since I only need the big dataframe, your suggestion is a more elegant way of doing it. Cheers.
