Jan-25-2023, 09:03 PM
(Feb-02-2018, 05:03 PM)alias5000 Wrote: Hi,
thank you for your reply. There is a data interface that provides most of the data in a much more readable format, and I am using it. However, some information is only available through PDFs (or, in some stupid cases, scans of printouts of those PDFs; I'm not looking at those yet. That would be a case for tesseract, I guess).
I think I have been confusing myself a bit. Tabula-py does not seem to detect anything table-like in the part I am extracting (using read_pdf). The text is laid out so that one could see a table in it, but one does not have to. Using PyPDF2, I can get text similar to this:
import PyPDF2

pdfFileObj = open('filename.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
page1 = pdfReader.getPage(0)
page2 = pdfReader.getPage(1)
page2.extractText()

The output is (numbers changed to 9's for privacy reasons, preserving text structure):
Output:'Page 2 of 2How we calculated your chargesBalance forwardAmount of your last bill$999.99Amount we received on July 01, 2099 - thank you$999.99CRBalance forward$0.00Your electricity chargesYour service type is Residential Service - EnergyElectricity used this billing periodWe estimated your meter A9999999 on August 30, 2099 99999We read your meter on September 01, 2099- 999999Difference in meter readings 999999Metered usage in kilowatt-hours (99 x 999) = 9,999 kWh Demand - kWWe estimated your meter A9999999 on August 30, 2099 999Demand used in kilowatts (999 x 999 ÷ 9,999) = 99 kW Total demand in kilowatts = 99 kW (may be used for billing demand)Demand - kVAWe estimated your meter A9999999 on August 30, 2099 999Demand used in kVA (999 x 999 ÷ 9,999) = 99 kVA 99 x 99% = 99 kVA Total demand in kVA = 99 kVAElectricity: 999 kWh @ 99.9999 ¢$999.99Electricity: 9,999 kWh @ 99.9999 ¢$999.99Delivery $999.99Regulatory Charges $99.99Total of your electricity charges$999.99 {...rest removed because irrelevant and just text}
PyPDF2 seems to ignore line breaks. Sometimes, these bills change their structure in that some lines are added or removed, depending on what exact service is provided. Is there a convenient and flexible tool to extract these numbers with the right context/meaning?
I know that I could probably use plain regexes. I was hoping for something easier, more user-friendly, and faster to use, because there are quite a few different bill types for which I would have to write extraction code (and bills change their layout from time to time).
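For reference, here is a minimal sketch of what the plain-regex route could look like on the flattened text above. It keys each value on the label that precedes it rather than on position, so added or removed lines should not shift the results. The names FIELD_PATTERNS, ENERGY_LINE, extract_fields and SAMPLE are made up for this illustration, and the patterns assume the labels appear exactly as in the sample output; other bill types would need their own table.

```python
import re
from decimal import Decimal

# Hypothetical label -> pattern table; labels are copied from the sample output.
# Each pattern captures the dollar amount that directly follows its label.
AMOUNT = r"\s*\$([\d,]+\.\d{2})"
FIELD_PATTERNS = {
    "last_bill": r"Amount of your last bill" + AMOUNT,
    "balance_forward": r"Balance forward" + AMOUNT,
    "delivery": r"Delivery" + AMOUNT,
    "regulatory": r"Regulatory Charges" + AMOUNT,
    "total_electricity": r"Total of your electricity charges" + AMOUNT,
}

# Energy lines can repeat, so they are collected with findall instead of search.
ENERGY_LINE = re.compile(r"Electricity:\s*([\d,]+) kWh @ ([\d.]+) ¢\$([\d,]+\.\d{2})")


def to_decimal(raw):
    """Convert a string such as '9,999.99' to Decimal, dropping thousands separators."""
    return Decimal(raw.replace(",", ""))


def extract_fields(text):
    """Return a dict of labelled amounts; a label that is not found maps to None."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, text)
        fields[name] = to_decimal(match.group(1)) if match else None
    # One (kWh, rate in cents, amount) tuple per "Electricity:" line.
    fields["energy_lines"] = [
        (to_decimal(kwh), Decimal(rate), to_decimal(amount))
        for kwh, rate, amount in ENERGY_LINE.findall(text)
    ]
    return fields


# Shortened excerpt of the extracted text from the post, line breaks already lost.
SAMPLE = (
    "Balance forwardAmount of your last bill$999.99"
    "Amount we received on July 01, 2099 - thank you$999.99CR"
    "Balance forward$0.00"
    "Electricity: 999 kWh @ 99.9999 ¢$999.99"
    "Electricity: 9,999 kWh @ 99.9999 ¢$999.99"
    "Delivery $999.99Regulatory Charges $99.99"
    "Total of your electricity charges$999.99"
)

if __name__ == "__main__":
    for key, value in extract_fields(SAMPLE).items():
        print(key, value)
```

Keeping the patterns in a table means a new bill type is mostly a new dictionary, not new code, but it still breaks whenever the utility rewords a label.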
Thank you for any hints.
alias5000
Hey there, I created an account on this forum to ask if, in the 3 years since you posted this, you were able to create a tool that successfully accomplishes consistent and accurate data extraction from utility bill PDFs?
Would love to hear your thoughts!