Python Forum
how to extract financial data from photocopy of document


how to extract financial data from photocopy of document - angela1 - Feb-12-2020

I have a lot of company annual reports in PDF format, and they are scanned copies (an example is in link 1 below). I need to extract data from the financial statements from the PDF, such as 'revenue' and other items on page 13 of the example.

Another task is to extract names of shareholders and their shareholding figures from documents (an example is in link 2 below).

Can anybody help or tell me what is the most efficient way to do that? Thank you very much.

Link 1:
https://beta.companieshouse.gov.uk/company/08167130/filing-history/MzI0NjEwMTM5NGFkaXF6a2N4/document?format=pdf&download=0

Link 2:
https://beta.companieshouse.gov.uk/company/08167130/filing-history/MzI0MzAwMDM4N2FkaXF6a2N4/document?format=pdf&download=0


RE: how to extract financial data from photocopy of document - jim2007 - Feb-12-2020

Is it an actual PDF document or just an image embedded in a PDF? If it is only an image, there is not much you can do.

Is there a reason why you can’t use the XBRL format instead?
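One way to answer the "actual PDF or embedded image" question programmatically is to try extracting text from each page and flag the pages that come back empty. The following is a minimal sketch, assuming the third-party pypdf package is installed; the helper names and the `min_chars` threshold of 20 are illustrative choices, not part of any library.

```python
def extract_page_texts(pdf_path):
    """Return the extractable text of every page (empty string if none).

    Assumes pypdf is installed; it is imported lazily so the helper
    below can be used on its own.
    """
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


def pages_needing_ocr(page_texts, min_chars=20):
    """Return 1-based page numbers whose text layer is (almost) empty.

    A scanned page usually yields no text at all, so a small character
    threshold is enough to separate it from a page with real text.
    """
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len((text or "").strip()) < min_chars
    ]
```

If `pages_needing_ocr(extract_page_texts("report.pdf"))` lists every page, the file is most likely a scan and needs OCR. Here "report.pdf" is a placeholder path.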


RE: how to extract financial data from photocopy of document - DeaD_EyE - Feb-13-2020

These are embedded images. You need OCR to solve this problem. pytesseract is a wrapper around Tesseract. However, the results I got were very poor (maybe my own mistake?) and you get noisy data back. Preprocessing the images may help.

There is expensive software you can buy specifically to do OCR for invoices etc.
OCR stands for Optical Character Recognition.

You should look deeper into the Tesseract documentation: https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html

So yes, preprocessing of the images is needed.
You can also train new languages: https://tesseract-ocr.github.io/tessdoc/Training-Tesseract.html

I guess it's a lot of work to get good results without doing manual corrections afterwards.
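The preprocessing and OCR steps described in this post could be sketched roughly as follows. This assumes Pillow, pytesseract and a Tesseract installation are available and that each PDF page has already been exported as an image file. The threshold of 160, the 2x upscale and the `--psm 6` page segmentation mode are guesses to tune per document, and both function names are hypothetical.

```python
import re


def ocr_page(image_path, threshold=160):
    """OCR one page image after simple clean-up (settings are assumptions)."""
    # Imported lazily so parse_statement_line() works without these packages.
    from PIL import Image, ImageOps
    import pytesseract

    image = Image.open(image_path)
    gray = ImageOps.autocontrast(ImageOps.grayscale(image))
    # Enlarge the page: small glyphs in photocopies are often misread.
    gray = ImageOps.scale(gray, 2)
    # Binarise with a fixed threshold to suppress photocopy noise.
    binary = gray.point(lambda pixel: 255 if pixel > threshold else 0)
    # --psm 6 tells Tesseract to assume a single uniform block of text.
    return pytesseract.image_to_string(binary, config="--psm 6")


# One amount token such as 1,234,567 or (1,100,000) or -42.5
AMOUNT = re.compile(r"\(?-?\d[\d,]*(?:\.\d+)?\)?")


def parse_statement_line(line):
    """Split a line like 'Turnover 1,234,567 (1,100,000)' into
    (label, [amounts]); return None if the line has no label or no amounts.
    Bracketed figures are treated as negative, as is common in accounts."""
    tokens = line.split()
    amounts = []
    # Amount columns sit on the right, so consume tokens from the end.
    while tokens and AMOUNT.fullmatch(tokens[-1]):
        token = tokens.pop()
        negative = token.startswith("(") or token.startswith("-")
        value = float(token.strip("()-").replace(",", ""))
        amounts.append(-value if negative else value)
    amounts.reverse()
    label = " ".join(tokens)
    if not label or not amounts:
        return None
    return label, amounts
```

Running `parse_statement_line` over `ocr_page(...).splitlines()` gives candidate rows to check by hand. A "Notes" column in the statement would be picked up as an extra amount, so manual review is still needed.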


RE: how to extract financial data from photocopy of document - angela1 - Feb-14-2020

(Feb-12-2020, 11:31 PM)jim2007 Wrote: Is it an actual PDF document or just an image embedded in a PDF? If it is only an image, there is not much you can do.

Is there a reason why you can’t use the XBRL format instead?

It's an actual PDF document.

Where can I get the documents in XBRL format?


RE: how to extract financial data from photocopy of document - angela1 - Feb-14-2020

(Feb-12-2020, 11:31 PM)jim2007 Wrote: Is it an actual PDF document or just an image embedded in a PDF? If it is only an image, there is not much you can do.

Is there a reason why you can’t use the XBRL format instead?

I found the documents in XBRL format!

Thanks!

However, it seems that some companies do not have documents in XBRL format.
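For the companies that do have XBRL filings, the figures can be read directly with no OCR. The following is a minimal sketch, assuming the file is inline XBRL (iXBRL, i.e. well-formed XHTML in which numeric facts are wrapped in `ix:nonFraction` elements). The sample document and its fact names are made up for illustration, and the `format` attribute that real filings use for number styles is ignored here.

```python
import xml.etree.ElementTree as ET


def extract_ixbrl_facts(xhtml_text):
    """Return numeric facts found in an inline XBRL document.

    Elements are matched by local name so the sketch does not depend on
    which iXBRL namespace version the file declares.
    """
    root = ET.fromstring(xhtml_text)
    facts = []
    for element in root.iter():
        tag = element.tag
        if not isinstance(tag, str) or tag.rsplit("}", 1)[-1] != "nonFraction":
            continue
        text = "".join(element.itertext()).strip().replace(",", "")
        try:
            value = float(text)
        except ValueError:
            continue  # e.g. a dash used for zero; skipped in this sketch
        # 'scale' is a power of ten; 'sign="-"' marks a negative fact.
        value *= 10 ** int(element.get("scale", "0"))
        if element.get("sign") == "-":
            value = -value
        facts.append({
            "name": element.get("name"),
            "context": element.get("contextRef"),
            "value": value,
        })
    return facts


# Illustrative sample only; the fact names are placeholders.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ix="http://www.xbrl.org/2013/inlineXBRL">
  <body>
    <p>Turnover <ix:nonFraction name="ex:Turnover" contextRef="cur"
        unitRef="GBP" decimals="0">1,234,567</ix:nonFraction></p>
    <p>Loss <ix:nonFraction name="ex:ProfitLoss" contextRef="cur"
        unitRef="GBP" decimals="0" sign="-" scale="3">25</ix:nonFraction></p>
  </body>
</html>"""

if __name__ == "__main__":
    for fact in extract_ixbrl_facts(SAMPLE):
        print(fact)
```

Matching on the `name` attribute (for example, whichever tag the taxonomy uses for revenue) then replaces the search for 'revenue' on page 13 of the scanned report.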


RE: how to extract financial data from photocopy of document - angela1 - Feb-14-2020

(Feb-13-2020, 12:18 AM)DeaD_EyE Wrote: These are embedded images. You need OCR to solve this problem. pytesseract is a wrapper around Tesseract. However, the results I got were very poor (maybe my own mistake?) and you get noisy data back. Preprocessing the images may help.

There is expensive software you can buy specifically to do OCR for invoices etc.
OCR stands for Optical Character Recognition.

You should look deeper into the Tesseract documentation: https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html

So yes, preprocessing of the images is needed.
You can also train new languages: https://tesseract-ocr.github.io/tessdoc/Training-Tesseract.html

I guess it's a lot of work to get good results without doing manual corrections afterwards.

Thank you! I'm reading the documents now.

Is there any commercial OCR software that you would recommend?


RE: how to extract financial data from photocopy of document - jim2007 - Feb-15-2020

(Feb-14-2020, 07:52 AM)angela1 Wrote: However, it seems that some companies do not have documents in XBRL format.

That is unfortunate. Is there any chance they have them on their own websites?