Feb-14-2020, 11:22 AM
(Feb-13-2020, 12:18 AM)DeaD_EyE Wrote: These are embedded images. You need OCR to solve this problem. pytesseract is a wrapper around Tesseract. But the results are very poor (maybe my own mistake?) and you get noisy data back. Maybe preprocessing the images would help.
There is expensive software you can buy specifically to do OCR for invoices etc.
OCR stands for Optical Character Recognition.
You should look deeper into the Tesseract documentation: https://tesseract-ocr.github.io/tessdoc/...ality.html
So yes, preprocessing of the images is needed.
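A rough sketch of what such preprocessing could look like before handing the image to pytesseract. This is only one possible approach, not something taken from the Tesseract docs: the function names `otsu_threshold` and `ocr_image`, the 2x upscaling, and the choice of Otsu binarization are my own assumptions, and it assumes Pillow, pytesseract and the Tesseract binary are installed.

```python
def otsu_threshold(hist):
    """Pick a black/white cut-off from a 256-bin grayscale histogram.

    Plain-Python Otsu's method: choose the threshold that maximizes the
    between-class variance of "background" and "foreground" pixels.
    """
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_bg = 0      # number of pixels with value <= t
    sum_bg = 0    # sum of pixel values <= t
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def ocr_image(path, lang="eng"):
    """Hypothetical helper: grayscale, upscale, binarize, then run OCR."""
    # Imported here so otsu_threshold() can be used without these packages.
    from PIL import Image, ImageOps
    import pytesseract

    gray = ImageOps.grayscale(Image.open(path))
    # Assumption: the scan is small, so doubling the size may help Tesseract.
    gray = gray.resize((gray.width * 2, gray.height * 2))
    t = otsu_threshold(gray.histogram())
    # Everything brighter than the threshold becomes white, the rest black.
    binary = gray.point(lambda v: 255 if v > t else 0)
    return pytesseract.image_to_string(binary, lang=lang)
```

You would call it like `print(ocr_image("invoice.png"))`, where `invoice.png` is just a placeholder file name. Whether this actually improves the output depends a lot on the images, so compare the result with and without each step.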
You can also train new languages: https://tesseract-ocr.github.io/tessdoc/...eract.html
I guess it's a lot of work to get good results back without doing manual corrections afterwards.
Thank you! I'm reading the documentation now.
Is there any commercial OCR software that you would recommend?