Feb-17-2021, 10:54 AM
I'm running RegExs on a 13-page PDF file in a Jupyter notebook and I want to display the results in a DataFrame. However, when I execute the code below, the DataFrame shows only the results for the last page of the PDF.
Is it possible to make the DataFrame show the RegExs results for all 13 pages keeping the code in different cells as below? (sorry, I can't share the PDF as it's confidential).
import PyPDF2
import re
import pandas as pd
#new cell
file = open(r'C:\file.pdf', 'rb')
doc = PyPDF2.PdfFileReader(file)
#new cell
for i in range(0,13):
    text = doc.getPage(i).extractText()
    #print(text)
    loc_re = re.compile(r'S\d+_\d+_DOG', re.IGNORECASE)
    loc = loc_re.findall(text)
    #print(cpt)
    easting_re = re.compile(r'E[ ]*\d{6}')
    easting = easting_re.findall(text)
    #print(easting)
    northing_re = re.compile(r'N[ ]*\d{7}')
    northing = northing_re.findall(text)
    #print(northing)
#new cell
df = {'LOC': loc, 'Easting':easting, 'Northing': northing}
df = pd.DataFrame(df)
df.head()
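A likely reason only the last page shows up: loc, easting and northing are reassigned on every pass through the loop, so by the time the DataFrame cell runs they hold only the matches from page 13. One possible fix is to accumulate the matches from every page in a list created before the loop. The sketch below uses only the standard library and made-up sample strings in place of the confidential PDF. The helper name collect_rows, the Page column, and the idea that the n-th LOC, Easting and Northing on a page belong together are assumptions, not something confirmed by the post.

```python
import re
from itertools import zip_longest

# Same patterns as in the question, compiled once outside the loop.
LOC_RE = re.compile(r'S\d+_\d+_DOG', re.IGNORECASE)
EASTING_RE = re.compile(r'E[ ]*\d{6}')
NORTHING_RE = re.compile(r'N[ ]*\d{7}')


def collect_rows(page_texts):
    """Hypothetical helper: return one dict per match position, across all pages."""
    rows = []  # created once, so results from earlier pages are kept
    for page_no, text in enumerate(page_texts, start=1):
        locs = LOC_RE.findall(text)
        eastings = EASTING_RE.findall(text)
        northings = NORTHING_RE.findall(text)
        # Assumption: the n-th LOC, Easting and Northing on a page go together.
        # zip_longest pads with None when a page has unequal match counts.
        for loc, easting, northing in zip_longest(locs, eastings, northings):
            rows.append({'Page': page_no, 'LOC': loc,
                         'Easting': easting, 'Northing': northing})
    return rows


# Made-up sample text standing in for the text extracted from each PDF page.
sample_pages = [
    'S1_01_DOG E 123456 N 1234567',
    'no matches here',
    's2_02_dog E654321 N7654321 S3_03_DOG E111111',
]
rows = collect_rows(sample_pages)
```

In the notebook, the loop cell could build the input as [doc.getPage(i).extractText() for i in range(13)] and the last cell could stay separate as df = pd.DataFrame(rows), since pandas accepts a list of dicts. Padding with None also avoids the error pd.DataFrame raises when the three lists have different lengths.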