Python Forum
For loops & DataFrames
#1
I'm running regexes on a 13-page PDF file in a Jupyter notebook and I want to display the results in a DataFrame. However, when I execute the code below, the DataFrame shows only the results for the last page of the PDF.

Is it possible to make the DataFrame show the regex results for all 13 pages while keeping the code in separate cells, as below? (Sorry, I can't share the PDF as it's confidential.)


import PyPDF2
import re
import pandas as pd
#new cell
file = open(r'C:\file.pdf', 'rb')
doc = PyPDF2.PdfFileReader(file)
#new cell
for i in range(0,13):
    text = doc.getPage(i).extractText()
    #print(text)                            
    
    loc_re = re.compile(r'S\d+_\d+_DOG', re.IGNORECASE)
    loc = loc_re.findall(text)
    #print(loc)
       
    easting_re = re.compile(r'E[ ]*\d{6}')
    easting = easting_re.findall(text)
    #print(easting)
    
    northing_re = re.compile(r'N[ ]*\d{7}')
    northing = northing_re.findall(text)
    #print(northing)
#new cell
df = {'LOC': loc, 'Easting':easting, 'Northing': northing}
df = pd.DataFrame(df)
df.head()
Reply
#2
I think you should create the DataFrame inside the for loop, so that a df is generated for each page and appended to a combined one.
It is normal that the code above shows only the last page: loc, easting and northing are overwritten on every iteration, and the df is built only once, after the loop has finished.
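
To make the overwriting visible without the PDF, here is a minimal sketch. The page texts are made-up stand-ins for what extractText() might return, so treat the strings and the variable names `pages`, `last_only` and `all_locs` as assumptions.

```python
import re

# Hypothetical page texts standing in for the extracted PDF pages.
pages = ["S1_2_DOG E 123456", "S3_4_DOG E 654321"]

loc_re = re.compile(r'S\d+_\d+_DOG', re.IGNORECASE)

# Same pattern as the original loop: loc is rebound on every pass,
# so after the loop it holds only the matches from the final page.
for text in pages:
    loc = loc_re.findall(text)
last_only = loc

# Collecting into a list that lives outside the loop keeps every page.
all_locs = []
for text in pages:
    all_locs.extend(loc_re.findall(text))

print(last_only)
print(all_locs)
```

The first print shows only the second page's match, the second shows both.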
Reply
#3
Thanks, maurom82. I generated the df inside the for loop and it worked. However, I read that df.append() copies all the data on every call, which makes it inefficient when looping over files with many pages. My file has 130 pages, and appending the DataFrames generated in each iteration took 46 s. That's fine, but is there a better, more efficient way of doing this? Any suggestions? Thanks!

Here's my code after moving the df generation inside the for loop:

df_all = pd.DataFrame()

for i in range(0,13):
    text = doc.getPage(i).extractText()
    #print(text)                            
     
    loc_re = re.compile(r'S\d+_\d+_DOG', re.IGNORECASE)
    loc = loc_re.findall(text)
    #print(loc)
        
    easting_re = re.compile(r'E[ ]*\d{6}')
    easting = easting_re.findall(text)
    #print(easting)
     
    northing_re = re.compile(r'N[ ]*\d{7}')
    northing = northing_re.findall(text)
    #print(northing)

    df = pd.DataFrame({'LOC': loc, 'Easting':easting, 'Northing': northing})
    df_all = df_all.append(df, ignore_index=True)
print(df_all)
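
One possible alternative, sketched under assumptions: keep the per-page DataFrames in a plain list and join them once with pd.concat, which avoids re-copying the accumulated data on every page. This also matters on newer pandas, where DataFrame.append was deprecated in 1.4 and removed in 2.0. The `page_texts` strings below are invented sample data in place of doc.getPage(i).extractText(), and the regexes are compiled once, outside the loop.

```python
import re
import pandas as pd

# Hypothetical stand-ins for the text extracted from each PDF page.
page_texts = [
    "S1_10_DOG E 123456 N 1234567",
    "S2_20_DOG E 234567 N 2345678 s3_30_dog E 345678 N 3456789",
]

# Compile the patterns once instead of on every iteration.
LOC_RE = re.compile(r'S\d+_\d+_DOG', re.IGNORECASE)
EASTING_RE = re.compile(r'E[ ]*\d{6}')
NORTHING_RE = re.compile(r'N[ ]*\d{7}')

frames = []
for text in page_texts:
    # One small DataFrame per page, as in the loop above.
    frames.append(pd.DataFrame({
        'LOC': LOC_RE.findall(text),
        'Easting': EASTING_RE.findall(text),
        'Northing': NORTHING_RE.findall(text),
    }))

# A single concatenation at the end replaces the repeated append calls.
df_all = pd.concat(frames, ignore_index=True)
print(df_all)
```

As with the original loop, pd.DataFrame raises a ValueError if the three lists for a page have different lengths, so a page with a missing easting or northing would still need handling.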
Reply
#4
Are you actually using the dataframes for anything? Or are you creating them purely to print them out in the notebook?

If it's just visual, then what if you put the print() inside the for loop, and remove the append() part?

#df_all = pd.DataFrame()
 
for i in range(0,13):
    text = doc.getPage(i).extractText()
    #print(text)                            
      
    loc_re = re.compile(r'S\d+_\d+_DOG', re.IGNORECASE)
    loc = loc_re.findall(text)
    #print(loc)
         
    easting_re = re.compile(r'E[ ]*\d{6}')
    easting = easting_re.findall(text)
    #print(easting)
      
    northing_re = re.compile(r'N[ ]*\d{7}')
    northing = northing_re.findall(text)
    #print(northing)
 
    df = pd.DataFrame({'LOC': loc, 'Easting':easting, 'Northing': northing})
    print(df)
    #df_all = df_all.append(df, ignore_index=True)
#print(df_all)
Reply
#5
OR!

If you do want the big dataframe, but you don't need one for each page, then maybe a list will help? That way you only construct one dataframe, one time, instead of a new one per page and appending it along the way.

data = []
 
for i in range(0,13):
    text = doc.getPage(i).extractText()
    #print(text)                            
      
    loc_re = re.compile(r'S\d+_\d+_DOG', re.IGNORECASE)
    loc = loc_re.findall(text)
    #print(loc)
         
    easting_re = re.compile(r'E[ ]*\d{6}')
    easting = easting_re.findall(text)
    #print(easting)
      
    northing_re = re.compile(r'N[ ]*\d{7}')
    northing = northing_re.findall(text)
    #print(northing)
 
    #df = pd.DataFrame({'LOC': loc, 'Easting':easting, 'Northing': northing})
    data.append({'LOC': loc, 'Easting':easting, 'Northing': northing})
    #df_all = df_all.append(df, ignore_index=True)
df_all = pd.DataFrame(data)
print(df_all)
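
One thing to be aware of: because loc, easting and northing are lists, the code above yields one row per page, with a list in each cell. If one row per match is wanted instead (as in the earlier per-page DataFrames), a possible variant is to append one dict per match. This is only a sketch: `page_texts` is invented sample data, the `Page` column is an extra added for illustration, and zip() is an assumption about how to pair the matches.

```python
import re
import pandas as pd

# Hypothetical stand-ins for the text extracted from each PDF page.
page_texts = [
    "S1_10_DOG E 123456 N 1234567",
    "S2_20_DOG E 234567 N 2345678 s3_30_dog E 345678 N 3456789",
]

LOC_RE = re.compile(r'S\d+_\d+_DOG', re.IGNORECASE)
EASTING_RE = re.compile(r'E[ ]*\d{6}')
NORTHING_RE = re.compile(r'N[ ]*\d{7}')

rows = []
for page_number, text in enumerate(page_texts, start=1):
    locs = LOC_RE.findall(text)
    eastings = EASTING_RE.findall(text)
    northings = NORTHING_RE.findall(text)
    # Pair the matches by position; zip() stops at the shortest list,
    # so a page with a missing value silently drops the unpaired items.
    for loc, easting, northing in zip(locs, eastings, northings):
        rows.append({'Page': page_number, 'LOC': loc,
                     'Easting': easting, 'Northing': northing})

# Still only one DataFrame construction, after the loop.
df_all = pd.DataFrame(rows)
print(df_all)
```

If silently dropping unpaired matches is not acceptable, comparing the three list lengths before the inner loop would be the place to catch it.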
Reply
#6
Thanks, nilamo. You're right: since I only need the big DataFrame, your suggestion is a more elegant way of doing it. Cheers.
Reply

