Extracting information from bills

alias5000 · Jan-27-2018, 05:48 PM

Hi everyone,
I am hoping to write some code that can extract certain information from utility bills. I am hoping that the result will be simple, reliable and easily adaptable to new bill templates.

Getting the text out of my bill (e.g. from PDFs) isn't my concern - I have found sufficient tools to make that happen (e.g. using tabula-py). However, I am looking for a library & possibly tutorial (or similar) to extract information from that text, e.g. amount of electricity used, account number, bill date, billing amount, etc. The format of that bill is mostly static, though there can be some minor variations in the way billing components are added/existent or how that affects the layout.

I understand that I could just write my own regular expressions and try to adapt them to all sorts of bills. However, I think that in the end this will require quite a bit of development time, if minor variations in the bill occur, as well as I will have to completely recreate those regexes if I want to parse a bills from a different company. [I should say that I am coming from a 'I really don't like writing regexes' standpoint].

Are there more flexible/adaptive and user-friendly ways of extracting that information?

I would greatly appreciate pointers to resources that would support this effort.

Thank you (and hello to this forum)!
alias5000

***metulburr*** · Jan-27-2018, 07:47 PM

My first question would be where are you getting this pdf from of your data?

If you log on to a website that has such bills info, they may format the information in HTML, JSON, or XML, all of which is easier to extract info from than regular text. In fact each of those formats has a specific library to do just that in python. The website would be updated constantly, where the PDF would not.

However looking at tabula, it looks like you can convert the PDF to a JSON, CSV, or Pandas DAtaFrame directly with that library. There would be no need in that case to scrape the PDF text manually then.
https://pypi.python.org/pypi/tabula-py

The tutorial you are looking for would depend on what you convert the PDF to.

CSV
https://pymotw.com/3/csv/

JSON
https://pymotw.com/3/json/

Pandas
https://pandas.pydata.org/pandas-docs/stable/10min.html

alias5000 · Feb-02-2018, 05:03 PM

Hi,
thank you for your reply. There is a data interface to get most data in a much more readable format which I am using. However, there is some information, which I can only get through PDFs, (or in some stupid cases, scans of printouts of those PDFs - I'm not looking at them, yet. This would be a case of using tesseract, I guess).

I think, I have been confusing myself a bit. Tabula-py does not seem to detect anything table like in the part I am extracting (using read_pdf). The text is written in a way that one could see a table in there, but it does not have to. Using PyPDF2, I can get a text similar to this:

import PyPDF2                                                                            
pdfFileObj = open('filename.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)    
page1 = pdfReader.getPage(0)
page2 = pdfReader.getPage(1)
page2.extractText()

The output is (numbers changed to 9's for privacy reasons, preserving text structure):

Output:
'Page 2 of 2How we calculated your chargesBalance forwardAmount of your last bill$999.99Amount we received on July 01, 2099 - thank you$999.99CRBalance forward$0.00Your electricity chargesYour service type is Residential Service - EnergyElectricity used this billing periodWe estimated your meter A9999999 on August 30, 2099 99999We read your meter on September 01, 2099- 999999Difference in meter readings 999999Metered usage in kilowatt-hours (99 x 999) = 9,999 kWh Demand - kWWe estimated your meter A9999999 on August 30, 2099 999Demand used in kilowatts (999 x 999 ÷ 9,999) = 99 kW Total demand in kilowatts = 99 kW (may be used for billing demand)Demand - kVAWe estimated your meter A9999999 on August 30, 2099 999Demand used in kVA (999 x 999 ÷ 9,999) = 99 kVA 99 x 99% = 99 kVA Total demand in kVA = 99 kVAElectricity:  999 kWh @ 99.9999 ¢$999.99Electricity:  9,999 kWh @ 99.9999 ¢$999.99Delivery  $999.99Regulatory Charges  $99.99Total of your electricity charges$999.99 {...rest removed because irrelevant and just text}

PyPDF2 seems to ignore line breaks. Sometimes, these bills change their structure in that some lines are added or removed, depending on what exact service is provided. Is there a convenient and flexible tool to extract these numbers with the right context/meaning?
I know that could probably use plain regexes. I would be hoping for something easier/more user-friendly/faster to use, because there are quite a few different bill types where I would have to write extraction code for (and bills change their layout from time to time).

Thank you for any hints.
alias5000

PG_Archipelago · Jan-25-2023, 09:03 PM

(Feb-02-2018, 05:03 PM)alias5000 Wrote: Hi,
thank you for your reply. There is a data interface to get most data in a much more readable format which I am using. However, there is some information, which I can only get through PDFs, (or in some stupid cases, scans of printouts of those PDFs - I'm not looking at them, yet. This would be a case of using tesseract, I guess).

I think, I have been confusing myself a bit. Tabula-py does not seem to detect anything table like in the part I am extracting (using read_pdf). The text is written in a way that one could see a table in there, but it does not have to. Using PyPDF2, I can get a text similar to this:
import PyPDF2                                                                            
pdfFileObj = open('filename.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)    
page1 = pdfReader.getPage(0)
page2 = pdfReader.getPage(1)
page2.extractText()
The output is (numbers changed to 9's for privacy reasons, preserving text structure):
Output:
'Page 2 of 2How we calculated your chargesBalance forwardAmount of your last bill$999.99Amount we received on July 01, 2099 - thank you$999.99CRBalance forward$0.00Your electricity chargesYour service type is Residential Service - EnergyElectricity used this billing periodWe estimated your meter A9999999 on August 30, 2099 99999We read your meter on September 01, 2099- 999999Difference in meter readings 999999Metered usage in kilowatt-hours (99 x 999) = 9,999 kWh Demand - kWWe estimated your meter A9999999 on August 30, 2099 999Demand used in kilowatts (999 x 999 ÷ 9,999) = 99 kW Total demand in kilowatts = 99 kW (may be used for billing demand)Demand - kVAWe estimated your meter A9999999 on August 30, 2099 999Demand used in kVA (999 x 999 ÷ 9,999) = 99 kVA 99 x 99% = 99 kVA Total demand in kVA = 99 kVAElectricity:  999 kWh @ 99.9999 ¢$999.99Electricity:  9,999 kWh @ 99.9999 ¢$999.99Delivery  $999.99Regulatory Charges  $99.99Total of your electricity charges$999.99 {...rest removed because irrelevant and just text} 
PyPDF2 seems to ignore line breaks. Sometimes, these bills change their structure in that some lines are added or removed, depending on what exact service is provided. Is there a convenient and flexible tool to extract these numbers with the right context/meaning?
I know that could probably use plain regexes. I would be hoping for something easier/more user-friendly/faster to use, because there are quite a few different bill types where I would have to write extraction code for (and bills change their layout from time to time).

Thank you for any hints.
alias5000

Hey there— I created an account on this forum to ask if, in the 3 years since you posted this, you were able to create a tool that successfully accomplishes consistent & accurate data extraction from utility bill PDFs?

Would love to hear your thoughts!

Extracting information from bills

User Panel Messages

Announcements