Python Forum
Extracting information from bills
#3
Hi,
Thank you for your reply. There is a data interface that provides most of the data in a much more readable format, and I am using it. However, some information is only available through PDFs (or, in some stupid cases, scans of printouts of those PDFs; I'm not looking at those yet. That would be a case for tesseract, I guess).

I think I have been confusing myself a bit. Tabula-py does not seem to detect anything table-like in the part I am extracting (using read_pdf). The text is laid out in a way that one could see a table in it, but it is not necessarily one. Using PyPDF2, I can get text similar to this:
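import PyPDF2

# PdfFileReader / getPage / extractText are the PyPDF2 < 3.0 names
with open('filename.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    page1 = pdfReader.getPage(0)
    page2 = pdfReader.getPage(1)  # pages are zero-indexed
    text = page2.extractText()    # extract while the file is still open
print(repr(text))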
The output is (numbers changed to 9's for privacy reasons, preserving text structure):
Output:
'Page 2 of 2How we calculated your chargesBalance forwardAmount of your last bill$999.99Amount we received on July 01, 2099 - thank you$999.99CRBalance forward$0.00Your electricity chargesYour service type is Residential Service - EnergyElectricity used this billing periodWe estimated your meter A9999999 on August 30, 2099 99999We read your meter on September 01, 2099- 999999Difference in meter readings 999999Metered usage in kilowatt-hours (99 x 999) = 9,999 kWh Demand - kWWe estimated your meter A9999999 on August 30, 2099 999Demand used in kilowatts (999 x 999 ÷ 9,999) = 99 kW Total demand in kilowatts = 99 kW (may be used for billing demand)Demand - kVAWe estimated your meter A9999999 on August 30, 2099 999Demand used in kVA (999 x 999 ÷ 9,999) = 99 kVA 99 x 99% = 99 kVA Total demand in kVA = 99 kVAElectricity: 999 kWh @ 99.9999 ¢$999.99Electricity: 9,999 kWh @ 99.9999 ¢$999.99Delivery $999.99Regulatory Charges $99.99Total of your electricity charges$999.99 {...rest removed because irrelevant and just text}
PyPDF2 seems to ignore line breaks. Sometimes, these bills change their structure in that some lines are added or removed, depending on what exact service is provided. Is there a convenient and flexible tool to extract these numbers with the right context/meaning?
I know that I could probably use plain regexes. I was hoping for something easier, more user-friendly, or faster to use, because there are quite a few different bill types that I would have to write extraction code for (and bills change their layout from time to time).
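For reference, the plain-regex approach I have in mind would look roughly like the sketch below. It anchors each amount to the label text in front of it, so added or removed lines should not shift the results. The field names, the LABELS table and the helper functions are made up for illustration, and the sample string is just an excerpt of the output above; other bill types would need their own labels.

```python
import re

# Excerpt of the flattened page text from the PyPDF2 output above.
text = (
    "Balance forward$0.00Your electricity charges"
    "Electricity: 999 kWh @ 99.9999 ¢$999.99"
    "Electricity: 9,999 kWh @ 99.9999 ¢$999.99"
    "Delivery $999.99Regulatory Charges $99.99"
    "Total of your electricity charges$999.99"
)

# A dollar amount such as $999.99 or $9,999.99 (one capture group).
MONEY = r"\$\s*([\d,]+\.\d{2})"

# Hypothetical field names mapped to the label that precedes each amount.
LABELS = {
    "delivery": "Delivery",
    "regulatory": "Regulatory Charges",
    "total_electricity": "Total of your electricity charges",
}


def to_number(s):
    """Convert '9,999.99' to 9999.99."""
    return float(s.replace(",", ""))


def extract_amounts(text, labels=LABELS):
    """Return {field: amount} for every label found in the text."""
    found = {}
    for field, label in labels.items():
        m = re.search(re.escape(label) + r"\s*" + MONEY, text)
        if m:  # labels missing from this bill are simply skipped
            found[field] = to_number(m.group(1))
    return found


def extract_energy_lines(text):
    """Return (kWh, rate in cents, amount) for each 'Electricity:' line."""
    pattern = re.compile(
        r"Electricity:\s*([\d,]+)\s*kWh\s*@\s*([\d.]+)\s*¢\s*" + MONEY
    )
    return [
        (to_number(kwh), float(rate), to_number(amount))
        for kwh, rate, amount in pattern.findall(text)
    ]


print(extract_amounts(text))
print(extract_energy_lines(text))
```

This works on the excerpt, but every new bill layout means another label table and more patterns, which is exactly the maintenance effort I would like to avoid.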

Thank you for any hints.
alias5000


Messages In This Thread
Extracting information from bills - by alias5000 - Jan-27-2018, 05:48 PM
RE: Extracting information from bills - by alias5000 - Feb-02-2018, 05:03 PM
