Python Forum
how to extract financial data from photocopy of document
#1
I have a lot of company annual reports in PDF format, and they are scanned copies (an example is in link 1 below). I need to extract data from the financial statements in these PDFs, such as 'revenue' and other items on page 13 of the example.

Another task is to extract names of shareholders and their shareholding figures from documents (an example is in link 2 below).

Can anybody help or tell me what is the most efficient way to do that? Thank you very much.

Link 1:
https://beta.companieshouse.gov.uk/compa...download=0

Link 2:
https://beta.companieshouse.gov.uk/compa...download=0
#2
Is it an actual PDF document or just an image embedded in a PDF? If it is just an image, there is not much you can do.

Is there a reason why you can’t use the XBRL format instead?
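If the filings are available as inline XBRL (an assumption here; the thread only says "XBRL"), the figures are already tagged and can be read without any OCR. A minimal sketch using only the standard library is below. The sample fragment and the `ex:` concept names are made up for illustration, and stripping commas is a simplification of the `format` attribute handling.

```python
import xml.etree.ElementTree as ET

# Inline XBRL 1.1 namespace; older filings may use a different namespace URI.
IX_NS = "http://www.xbrl.org/2013/inlineXBRL"


def extract_facts(xhtml):
    """Return {concept name: numeric value} for every ix:nonFraction element."""
    root = ET.fromstring(xhtml)
    facts = {}
    for el in root.iter(f"{{{IX_NS}}}nonFraction"):
        text = "".join(el.itertext()).replace(",", "").strip()
        if not text or text == "-":
            continue  # empty or dash-only cells carry no number
        # "scale" is a power of ten applied to the displayed figure.
        value = float(text) * 10 ** int(el.get("scale", "0"))
        if el.get("sign") == "-":
            value = -value
        facts[el.get("name")] = value
    return facts


# Hypothetical fragment, not taken from a real filing.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ix="http://www.xbrl.org/2013/inlineXBRL">
  <body>
    <p>Turnover: <ix:nonFraction name="ex:Turnover" contextRef="cur"
       unitRef="GBP" decimals="0" scale="3">1,250</ix:nonFraction></p>
    <p>Loss: <ix:nonFraction name="ex:ProfitLoss" contextRef="cur"
       unitRef="GBP" decimals="0" sign="-">4,000</ix:nonFraction></p>
  </body>
</html>"""

print(extract_facts(SAMPLE))
```

Note that this keeps one value per concept name. A real filing usually reports the same concept for several periods, so you would probably want to key on `contextRef` as well.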
There is no passion to be found playing small - in settling for a life that is less than the one you are capable of living.
#3
These are embedded images. You need OCR to solve this problem. pytesseract is a wrapper around Tesseract, but the results I got were very poor (maybe my own mistake?) and you get noisy data back. Preprocessing the images may help.

There is expensive software you can buy specially to do OCR for invoices etc.
OCR stands for Optical Character Recognition.

You should look more closely at the Tesseract documentation: https://tesseract-ocr.github.io/tessdoc/...ality.html

So yes, preprocessing of the images is needed.
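One common preprocessing step is binarization, i.e. turning the grayscale scan into pure black and white. Below is a minimal pure-Python sketch of Otsu's method, which picks the threshold that best separates dark ink from light paper. The function names and the sample pixel values are made up for illustration; in practice the grayscale values would come from an image library such as Pillow (`Image.open(path).convert("L")`).

```python
def otsu_threshold(pixels):
    """Return the threshold t (0-255) that maximizes between-class variance.

    Class 0 holds pixels with value <= t, class 1 holds the rest.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in class 0
    sum0 = 0  # sum of intensities in class 0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue  # class 0 still empty
        if w1 == 0:
            break     # class 1 empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Map every pixel to 0 (ink) or 255 (paper)."""
    return [255 if p > threshold else 0 for p in pixels]


# Made-up sample: 100 dark "ink" pixels and 200 light "paper" pixels.
SAMPLE_PIXELS = [30] * 50 + [40] * 50 + [200] * 100 + [220] * 100

t = otsu_threshold(SAMPLE_PIXELS)
clean = binarize(SAMPLE_PIXELS, t)
print(t, clean.count(0), clean.count(255))
```

As far as I know Tesseract applies its own binarization internally, so doing it yourself mainly helps when the built-in result is poor, for example on unevenly lit photocopies.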
You can also train new languages: https://tesseract-ocr.github.io/tessdoc/...eract.html

I guess it's a lot of work to get good results without doing manual corrections afterwards.
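To give an idea of the clean-up step, here is a rough sketch that takes OCR text and keeps only the lines that look like "label followed by amounts", dropping the noise. The regular expression, the function names and the sample text are my own assumptions about how a statement might look after OCR, not something taken from the linked reports.

```python
import re

# One amount, e.g. "1,234,567", "£1,234.50" or "(800,000)".
AMOUNT = r"\(?£?\d[\d,]*(?:\.\d+)?\)?"
# A text label followed by one or more amounts, e.g. "Turnover 1,234 1,100".
LINE = re.compile(
    rf"^(?P<label>[A-Za-z][A-Za-z /&'-]*?)\s+(?P<amounts>{AMOUNT}(?:\s+{AMOUNT})*)\s*$"
)


def parse_amount(token):
    """Convert an amount token to a float; brackets mean a negative figure."""
    negative = token.startswith("(") and token.endswith(")")
    cleaned = token.strip("()").replace(",", "").replace("£", "")
    value = float(cleaned)
    return -value if negative else value


def parse_statement(ocr_text):
    """Return {label: [amounts]} for every line that matches the pattern."""
    rows = {}
    for line in ocr_text.splitlines():
        match = LINE.match(line.strip())
        if match:
            amounts = [parse_amount(tok) for tok in match.group("amounts").split()]
            rows[match.group("label").strip()] = amounts
    return rows


# Made-up OCR output with a heading and one noise line.
SAMPLE_TEXT = """Profit and loss account
Turnover 1,234,567 1,100,000
Cost of sales (800,000) (750,000)
~~ ;. noise
Gross profit 434,567 350,000
"""

print(parse_statement(SAMPLE_TEXT))
```

The input text could come from `pytesseract.image_to_string(image)`. Real OCR output will contain misread digits and split columns that a pattern like this cannot fix, so some manual checking is still needed.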
Almost dead, but too lazy to die: https://sourceserver.info
All humans together. We don't need politicians!
#4
(Feb-12-2020, 11:31 PM)jim2007 Wrote: Is it an actual PDF document or just an image embedded in a PDF? If it is just an image, there is not much you can do.

Is there a reason why you can’t use the XBRL format instead?

It's an actual PDF document.

Where can I get the XBRL-format documents?
#5
(Feb-12-2020, 11:31 PM)jim2007 Wrote: Is it an actual PDF document or just an image embedded in a PDF? If it is just an image, there is not much you can do.

Is there a reason why you can’t use the XBRL format instead?

I found the XBRL format of the documents!

Thanks!

However, it seems that some companies do not have XBRL-format documents.
#6
(Feb-13-2020, 12:18 AM)DeaD_EyE Wrote: These are embedded images. You need OCR to solve this problem. pytesseract is a wrapper around Tesseract, but the results I got were very poor (maybe my own mistake?) and you get noisy data back. Preprocessing the images may help.

There is expensive software you can buy specially to do OCR for invoices etc.
OCR stands for Optical Character Recognition.

You should look more closely at the Tesseract documentation: https://tesseract-ocr.github.io/tessdoc/...ality.html

So yes, preprocessing of the images is needed.
You can also train new languages: https://tesseract-ocr.github.io/tessdoc/...eract.html

I guess it's a lot of work to get good results without doing manual corrections afterwards.

Thank you! I'm reading the documentation now.

Is there any commercial OCR software that you would recommend?
#7
(Feb-14-2020, 07:52 AM)angela1 Wrote: However, it seems that some companies do not have XBRL-format documents.

That is unfortunate. Is there any chance they might have them on their own websites?