Jan-27-2018, 05:48 PM
Hi everyone,
I would like to write some code that extracts certain information from utility bills. Ideally, the result would be simple, reliable, and easily adaptable to new bill templates.
Getting the text out of my bill (e.g. from PDFs) isn't my concern - I have found sufficient tools to make that happen (e.g. tabula-py). What I am looking for is a library, and possibly a tutorial (or similar), for extracting information from that text, e.g. amount of electricity used, account number, bill date, billing amount, etc. The format of the bill is mostly static, though there can be minor variations: billing components may be added or missing, and that can shift the layout.
I understand that I could just write my own regular expressions and try to adapt them to all sorts of bills. However, I think that in the end this will require quite a bit of development time whenever minor variations in the bill occur, and I would have to completely recreate those regexes if I wanted to parse bills from a different company. [I should say that I am coming from an 'I really don't like writing regexes' standpoint].
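To make the regex route concrete, here is a minimal sketch of what I have in mind, using only Python's standard `re` module. The company name `acme_power`, the field names, the patterns, and the sample bill text are all made up for illustration and are not taken from a real bill.

```python
import re

# Hypothetical per-company templates: field name -> regex with one capture group.
# Supporting a new company means writing a whole new set of patterns.
TEMPLATES = {
    "acme_power": {
        "account_number": r"Account\s+(?:No\.?|Number)\s*:?\s*(\d+)",
        "bill_date": r"Bill\s+Date\s*:?\s*(\d{4}-\d{2}-\d{2})",
        "usage_kwh": r"([\d,]+(?:\.\d+)?)\s*kWh",
        "amount_due": r"Amount\s+Due\s*:?\s*\$?\s*([\d,]+\.\d{2})",
    },
}


def extract_fields(text, company):
    """Return a dict of field name -> captured string (None if not found)."""
    fields = {}
    for name, pattern in TEMPLATES[company].items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        fields[name] = match.group(1) if match else None
    return fields


# Made-up bill text standing in for whatever the PDF extraction step returns.
SAMPLE_BILL = """\
ACME Power
Account Number: 123456789
Bill Date: 2018-01-15
Electricity used: 1,234.5 kWh
Amount Due: $98.76
"""

result = extract_fields(SAMPLE_BILL, "acme_power")
print(result)
```

Every layout change means touching these patterns again, which is exactly the maintenance effort I would like to avoid.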
Are there more flexible/adaptive and user-friendly ways of extracting that information?
I would greatly appreciate pointers to resources that would support this effort.
Thank you (and hello to this forum)!
alias5000