Mar-05-2019, 03:13 PM
Hi!
I have the following code. I don't know how PDFMiner works, so the first part is somebody else's code, which I modified a bit. It seems to work, but it does everything 8 or more times instead of just once, and it gets slower with every line I add. The idea is that there are a lot of PDFs in a folder; I want to open the ones that have specific words in their titles and extract some words into an Excel file. The concept seems to work. The problem is what I described above. I think the indentation is wrong somewhere. Or is it something else? I am still learning. Thank you for your help.
I also had to change the variable names to fruits; I hope it doesn't look too silly. It is far from finished, so most fruits are not added at the end yet.
import glob, os
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO #first was cStringIO
import re

filename = "Data_Aquisition.csv"
f = open(filename, "a", encoding="utf-8")
headers = "company_name, apple_date, apple_designation, banana_date, banana_Designation, orange_date, orange_designation, pear, cherry_date, cheery_company\n"
f.write(headers)

pdflist = glob.glob("*.pdf")
for file in pdflist:
    if "apple" or "banana" or "orange" or "Orange" or "pear" or "cherry" or "Cherry" or "Apple" or "Banana" or "Pear" in str(file):
        def convert(fname, pages=None):
            if not pages:
                pagenums = set()
            else:
                pagenums = set(pages)
            output = StringIO()
            manager = PDFResourceManager()
            converter = TextConverter(manager, output, laparams=LAParams())
            interpreter = PDFPageInterpreter(manager, converter)
            infile = open(fname, 'rb')
            for page in PDFPage.get_pages(infile, pagenums):
                interpreter.process_page(page)
            infile.close()
            converter.close()
            text = output.getvalue()
            output.close
            return text

        def convertMultiple(pdfDir, txtDir):
            if pdfDir == "": pdfDir = os.getcwd() + "\\" #if no pdfDir passed in
            for pdf in os.listdir(pdfDir): #iterate through pdfs in pdf directory
                fileExtension = pdf.split(".")[-1]
                if fileExtension == "pdf":
                    pdfFilename = pdfDir + pdf
                    text = convert(pdfFilename) #get string of text content of pdf
                    text1line = str(text).replace("\n", " ")
                    if "apple" or "Apple" in str(file):
                        appledate = re.search('enquiry dated (.*), we can confirm', text)
                        applecompanysource = re.search(' shares of (.*) registered in', text)
                        applecompany = str(applecompanysource).split('PLC')[0] + " PLC"
                        appledesignation = re.search('registered in the name of (.*)Voting', text1line)
                        print(statestreetdate)
                        print(statestreetcompany)
                        print(statestreetdesignation)
                        f.write(str(statestreetdate) + ",")
                    elif "banana" or "Banana" in str(file):
                        bananadate = re.search('- As at (.*)', text)
                        print(bananadate)
                        f.write(str(bananadate) + ",")
                    elif "cherry" or "Cherry" in str(file):
                        cherrydate = re.search('DATE : (.*)', text)
                        print(cherrydate)
                        f.write(str(cherrydate) + ",")
                    else:
                        continue
                else:
                    continue

        pdfDir = "C:/Users/thisisme/Desktop/DataAquisition/"
        txtDir = "C:/Users/thisisme/Desktop/DataAquisition/"
        convertMultiple(pdfDir, txtDir)
    else:
        continue
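
A note on what probably causes the repetition, with a small sketch. A condition such as `"apple" or "banana" in str(file)` is parsed by Python as `"apple" or ("banana" in str(file))`, and a non-empty string is always truthy, so the check passes for every file. Since `convertMultiple` scans the whole directory and is called once per entry in `pdflist`, every PDF would be processed once for each PDF found by `glob`, which fits the "8 or more times" symptom if there are about 8 PDFs. The sketch below only illustrates the filename test and the regex extraction, without PDFMiner; `KEYWORDS`, `matching_keyword`, `extract_first_group` and the sample file names are hypothetical names made up for this example, not part of the original code.

```python
import re

# Hypothetical keyword list, mirroring the fruit names used in the post.
KEYWORDS = ("apple", "banana", "orange", "pear", "cherry")


def matching_keyword(filename, keywords=KEYWORDS):
    """Return the first keyword found in filename (case-insensitive), or None."""
    lowered = filename.lower()
    for keyword in keywords:
        if keyword in lowered:
            return keyword
    return None


def extract_first_group(pattern, text):
    """Return the first captured group of pattern in text, or '' if no match."""
    match = re.search(pattern, text)
    # re.search returns a Match object (or None); group(1) is the captured text.
    return match.group(1) if match else ""


# `"apple" or "banana" in name` is `"apple" or ("banana" in name)`;
# the non-empty string "apple" is truthy, so this is truthy for any name.
always_true = bool("apple" or "banana" in "report.pdf")

# Each file name is examined exactly once, with no nested directory scan.
names = ["Apple_2019.pdf", "notes.pdf", "cherry-q1.pdf"]
selected = {name: matching_keyword(name) for name in names}
```

With this shape, the PDF conversion would be called once per selected file inside a single loop, and the value written to the CSV would be the captured text rather than the string form of the `Match` object.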