How to extract financial data from a photocopy of a document
#1
I have a lot of company annual reports in PDF format, and they are scanned copies (an example is in link 1 below). I need to extract data from the financial statements from the PDF, such as 'revenue' and other items on page 13 of the example.

Another task is to extract names of shareholders and their shareholding figures from documents (an example is in link 2 below).

Can anybody help or tell me what is the most efficient way to do that? Thank you very much.

Link 1:
https://beta.companieshouse.gov.uk/compa...download=0

Link 2:
https://beta.companieshouse.gov.uk/compa...download=0
#2
Is it an actual PDF document or just an image embedded in a PDF? If it is just an embedded image, there is not much you can do.

Is there a reason why you can’t use the XBRL format instead?
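
If an XBRL version of a filing exists, the figures can be read as structured data instead of being recognised from a scan. Here is a minimal sketch using only the standard library. The sample fragment, the `ex:` namespace, and the element names `Turnover` and `ProfitLoss` are made up for illustration; real filings use a published taxonomy with its own names, and some are delivered as inline XBRL inside HTML, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written XBRL-like fragment (not taken from a real filing).
SAMPLE = """<xbrl xmlns="http://www.xbrl.org/2003/instance"
      xmlns:ex="http://example.com/taxonomy">
  <ex:Turnover contextRef="FY2019" unitRef="GBP" decimals="0">1250000</ex:Turnover>
  <ex:Turnover contextRef="FY2018" unitRef="GBP" decimals="0">1100000</ex:Turnover>
  <ex:ProfitLoss contextRef="FY2019" unitRef="GBP" decimals="0">-42000</ex:ProfitLoss>
</xbrl>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_facts(xml_text, wanted):
    """Return {(element name, contextRef): value} for the wanted element names."""
    root = ET.fromstring(xml_text)
    facts = {}
    for elem in root.iter():
        name = local_name(elem.tag)
        context = elem.get("contextRef")
        # Only elements carrying a contextRef and some text are treated as facts.
        if name in wanted and context and elem.text:
            facts[(name, context)] = float(elem.text.strip())
    return facts


facts = extract_facts(SAMPLE, {"Turnover", "ProfitLoss"})
for key, value in sorted(facts.items()):
    print(key, value)
```

The `contextRef` attribute is what tells the current-year figure apart from the prior-year one, so it is kept as part of the key.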
#3
These are embedded images. You need OCR to solve this problem. pytesseract is a wrapper around Tesseract. But the results are very poor (maybe my own mistake?) and you get noisy data back. Preprocessing the images may help.

There is expensive software you can buy specifically to do OCR for invoices etc.
OCR stands for Optical Character Recognition.

You should look deeper into the Tesseract documentation: https://tesseract-ocr.github.io/tessdoc/...ality.html

So yes, preprocessing of the images is needed.
You can also train new languages: https://tesseract-ocr.github.io/tessdoc/...eract.html

I guess it's a lot of work to get good results without doing manual corrections afterwards.
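
A rough sketch of what that could look like, not a tested pipeline. It assumes Pillow, pytesseract, and a local Tesseract install are available, and that each page has already been saved as an image file. The threshold of 160, the `--psm 6` page mode, and the helper names `ocr_page`, `to_number`, and `find_line_item` are my own choices for illustration. The parsing part is plain Python and treats numbers in parentheses as negative, which is a common convention in accounts.

```python
import re


def ocr_page(image_path, threshold=160):
    """OCR one scanned page after simple preprocessing.

    Pillow and pytesseract are imported inside the function so the
    parsing helpers below can be used without them installed.
    """
    from PIL import Image, ImageOps
    import pytesseract

    image = Image.open(image_path)
    gray = ImageOps.autocontrast(ImageOps.grayscale(image))
    # Binarize: pixels brighter than the threshold become white, the rest black.
    binary = gray.point(lambda p: 255 if p > threshold else 0)
    # --psm 6 asks Tesseract to treat the page as a single uniform block of text.
    return pytesseract.image_to_string(binary, config="--psm 6")


# Matches tokens such as 1,250,000 or (800,000) or -42.5
NUMBER = re.compile(r"\(?-?\d[\d,]*(?:\.\d+)?\)?")


def to_number(token):
    """Convert an OCR'd number token to float; parentheses or '-' mean negative."""
    negative = token.startswith("(") or "-" in token
    digits = token.strip("()-").replace(",", "")
    value = float(digits)
    return -value if negative else value


def find_line_item(text, label):
    """Return the numbers that follow `label` on the first line containing it."""
    for line in text.splitlines():
        position = line.lower().find(label.lower())
        if position == -1:
            continue
        tail = line[position + len(label):]
        values = [to_number(token) for token in NUMBER.findall(tail)]
        if values:
            return values
    return []


# Hand-typed text standing in for OCR output of a profit and loss page.
sample = """Revenue 1,250,000 1,100,000
Cost of sales (800,000) (700,000)
Profit for the year 42,000 (15,000)"""

print(find_line_item(sample, "revenue"))
print(find_line_item(sample, "Cost of sales"))
```

Real OCR output will be messier than the sample: a "Note" column puts an extra small number in front of the figures, and misread characters (O for 0, l for 1) break the number pattern, so the extracted values still need checking by hand.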
#4
(Feb-12-2020, 11:31 PM)jim2007 Wrote: Is it an actual PDF document or just an image embedded in a PDF? If it is just an embedded image, there is not much you can do.

Is there a reason why you can’t use the XBRL format instead?

It's an actual PDF document.

Where can I get the documents in XBRL format?
#5
(Feb-12-2020, 11:31 PM)jim2007 Wrote: Is it an actual PDF document or just an image embedded in a PDF? If it is just an embedded image, there is not much you can do.

Is there a reason why you can’t use the XBRL format instead?

I found XBRL format of the documents!

Thanks!

However, it seems that some companies do not have documents in XBRL format.
#6
(Feb-13-2020, 12:18 AM)DeaD_EyE Wrote: These are embedded images. You need OCR to solve this problem. pytesseract is a wrapper around Tesseract. But the results are very poor (maybe my own mistake?) and you get noisy data back. Preprocessing the images may help.

There is expensive software you can buy specifically to do OCR for invoices etc.
OCR stands for Optical Character Recognition.

You should look deeper into the Tesseract documentation: https://tesseract-ocr.github.io/tessdoc/...ality.html

So yes, preprocessing of the images is needed.
You can also train new languages: https://tesseract-ocr.github.io/tessdoc/...eract.html

I guess it's a lot of work to get good results without doing manual corrections afterwards.

Thank you! I'm reading the documents now.

Is there any commercial OCR software that you would recommend?
#7
(Feb-14-2020, 07:52 AM)angela1 Wrote: However, it seems that some companies do not have documents in XBRL format.

That is unfortunate. Is there any chance they might have them on their own websites?