Python Forum
Extracting information from bills
#4
(Feb-02-2018, 05:03 PM)alias5000 Wrote: Hi,
thank you for your reply. There is a data interface that provides most of the data in a much more readable format, and I am using it. However, some information is only available through PDFs (or, in some stupid cases, scans of printouts of those PDFs; I'm not looking at those yet. That would be a case for tesseract, I guess).

I think I have been confusing myself a bit. Tabula-py does not seem to detect anything table-like in the part I am extracting (using read_pdf). The text is laid out so that one could read it as a table, but it is not necessarily one. Using PyPDF2, I can extract the text like this:
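import PyPDF2

# Note: PyPDF2 3.x renamed these calls to PdfReader, reader.pages[i] and extract_text().
# Open in binary mode; the with-block closes the file afterwards.
with open('filename.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    page2 = pdfReader.getPage(1)  # pages are zero-indexed, so this is the second page
    text = page2.extractText()
text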
The output is (numbers changed to 9's for privacy reasons, preserving text structure):
Output:
'Page 2 of 2How we calculated your chargesBalance forwardAmount of your last bill$999.99Amount we received on July 01, 2099 - thank you$999.99CRBalance forward$0.00Your electricity chargesYour service type is Residential Service - EnergyElectricity used this billing periodWe estimated your meter A9999999 on August 30, 2099 99999We read your meter on September 01, 2099- 999999Difference in meter readings 999999Metered usage in kilowatt-hours (99 x 999) = 9,999 kWh Demand - kWWe estimated your meter A9999999 on August 30, 2099 999Demand used in kilowatts (999 x 999 ÷ 9,999) = 99 kW Total demand in kilowatts = 99 kW (may be used for billing demand)Demand - kVAWe estimated your meter A9999999 on August 30, 2099 999Demand used in kVA (999 x 999 ÷ 9,999) = 99 kVA 99 x 99% = 99 kVA Total demand in kVA = 99 kVAElectricity: 999 kWh @ 99.9999 ¢$999.99Electricity: 9,999 kWh @ 99.9999 ¢$999.99Delivery $999.99Regulatory Charges $99.99Total of your electricity charges$999.99 {...rest removed because irrelevant and just text}
PyPDF2 seems to ignore line breaks. Sometimes, these bills change their structure in that some lines are added or removed, depending on what exact service is provided. Is there a convenient and flexible tool to extract these numbers with the right context/meaning?
I know that I could probably use plain regexes. I was hoping for something easier, more user-friendly and faster to use, because there are quite a few different bill types that I would have to write extraction code for (and bills change their layout from time to time).
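For reference, the plain-regex route mentioned above can be kept fairly manageable by putting all patterns into one table, so that a new bill type only needs new table entries rather than new code. This is only a minimal sketch: the field names and the `extract_fields` helper are made up for illustration, and the label texts are taken from the sample output above, so they are assumptions about what other bills contain. A label that is missing from a bill simply yields `None`.

```python
import re
from decimal import Decimal

# A dollar amount such as $999.99 or $1,234.56 (the capture group excludes the "$").
MONEY = r"\$\s*([\d,]+\.\d{2})"

# Hypothetical field names mapped to label text seen in the sample output.
FIELD_PATTERNS = {
    "last_bill": r"Amount of your last bill\s*" + MONEY,
    "delivery": r"Delivery\s*" + MONEY,
    "regulatory": r"Regulatory Charges\s*" + MONEY,
    "total_electricity": r"Total of your electricity charges\s*" + MONEY,
    "metered_kwh": r"Metered usage in kilowatt-hours[^=]*=\s*([\d,]+)\s*kWh",
}


def extract_fields(text, patterns=FIELD_PATTERNS):
    """Return {field: Decimal or None}; None means the label was not found."""
    result = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        # Strip thousands separators before converting to Decimal.
        result[name] = Decimal(match.group(1).replace(",", "")) if match else None
    return result


# Shortened excerpt of the flattened text shown in the output above.
sample = (
    "Balance forwardAmount of your last bill$999.99Amount we received on "
    "July 01, 2099 - thank you$999.99CRBalance forward$0.00"
    "Metered usage in kilowatt-hours (99 x 999) = 9,999 kWh Demand - kW"
    "Delivery $999.99Regulatory Charges $99.99"
    "Total of your electricity charges$999.99"
)
fields = extract_fields(sample)
```

Because the patterns anchor on the label text rather than on line positions, lines being added or removed between bills should not matter as long as the labels themselves stay the same.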

Thank you for any hints.
alias5000

Hey there, I created an account on this forum to ask whether, in the years since you posted this, you were able to create a tool that extracts data from utility bill PDFs consistently and accurately.

Would love to hear your thoughts!


Messages In This Thread
Extracting information from bills - by alias5000 - Jan-27-2018, 05:48 PM
RE: Extracting information from bills - by PG_Archipelago - Jan-25-2023, 09:03 PM
