Python Forum
PDF readers
#9
Hi,
With a suite of tools you can make some progress, but not all the way.
1) Rotate the document so it becomes landscape: use PyPDF2.
2) Make an image (PNG) of every PDF page: use MuPDF (PyMuPDF, imported as fitz).
3) Convert each image to text as if there were gridlines: use Tesseract.
And this way you get there, almost...
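The three steps above could be sketched roughly as follows. This is only a minimal sketch under assumptions: the file names, the 300 dpi setting, the helper names and the use of the pytesseract wrapper (plus Pillow) are mine, not something fixed by the tools. The Tesseract options are a guess at what helps with column-like text; `--psm 6` treats the page as one uniform block and `preserve_interword_spaces=1` keeps runs of blanks instead of collapsing them.

```python
from pathlib import Path

# Assumed Tesseract options: page segmentation mode 6 treats the page as one
# uniform block of text, and preserve_interword_spaces keeps runs of blanks.
TESSERACT_CONFIG = "--psm 6 -c preserve_interword_spaces=1"


def png_path(out_dir, page_number):
    """Return the PNG path for a 1-based page number, e.g. out/page_0001.png."""
    return Path(out_dir) / f"page_{page_number:04d}.png"


def rotate_pdf(src, dst, degrees=90):
    """Step 1: rotate every page clockwise (degrees must be a multiple of 90)."""
    # Imported here so the other steps work without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter  # third-party: PyPDF2 2.x or later

    writer = PdfWriter()
    for page in PdfReader(src).pages:
        page.rotate(degrees)
        writer.add_page(page)
    with open(dst, "wb") as handle:
        writer.write(handle)


def pdf_to_pngs(pdf_file, out_dir, dpi=300):
    """Step 2: render each page to a PNG and return the list of paths."""
    import fitz  # third-party: PyMuPDF

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    paths = []
    with fitz.open(pdf_file) as doc:
        for number, page in enumerate(doc, start=1):
            target = png_path(out_dir, number)
            page.get_pixmap(dpi=dpi).save(str(target))
            paths.append(target)
    return paths


def ocr_png(png_file):
    """Step 3: run Tesseract on one page image and return the text."""
    import pytesseract  # third-party wrapper; needs the tesseract binary
    from PIL import Image  # third-party: Pillow

    with Image.open(png_file) as image:
        return pytesseract.image_to_string(image, config=TESSERACT_CONFIG)
```

Typical use would be `rotate_pdf("scan.pdf", "landscape.pdf")`, then `pdf_to_pngs("landscape.pdf", "pages")`, then `ocr_png(...)` on each returned path ("scan.pdf" and "pages" are placeholder names).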
The problem is that the number of blanks separating the "columns" confuses the OCR when it varies a lot.
E.g. if one person is called "John" and the one immediately underneath "Henricus Baptist Adrianus",
the OCR thinks something is missing after "John", causing a slight shift in the list index.
You can program around that in a 10-page document, not in 20,000 pages.
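To make the index shift concrete, here is a small illustration with made-up OCR output (the rows, names and spacing are hypothetical, not taken from the real document). It shows one way such a shift can appear when cells are recovered by splitting on runs of blanks: if the gap after a long name shrinks to a single blank, two cells merge and everything after them moves one index to the left.

```python
import re


def split_on_gaps(line, min_gap=2):
    """Split an OCR text line into cells wherever at least min_gap blanks occur."""
    return re.split(r" {%d,}" % min_gap, line.strip())


# Hypothetical OCR output for two rows of the same three-column table.
# In the second row only one blank survived between the long name and the
# next column, so the two cells can no longer be told apart.
row_short = "John                        Smith     1901"
row_long = "Henricus Baptist Adrianus Peeters   1899"

print(split_on_gaps(row_short))  # ['John', 'Smith', '1901']
print(split_on_gaps(row_long))   # ['Henricus Baptist Adrianus Peeters', '1899']
```

Index 1 is a surname in the first row and a year in the second, which is exactly the kind of shift that is easy to patch by hand in a few pages and hopeless in thousands.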
The only way to get a grid is to draw one, on every page, avoiding the column headers and
assuming that the document has been scanned straight, with minimal tolerance.
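Drawing such a grid on the page images could look something like this. It is a sketch under assumptions: the column x-positions, the header height and the optional row height are values you would have to measure on your own scans, the function names are mine, and Pillow is assumed for the drawing. It only works if every page is scanned straight and to the same layout.

```python
def grid_lines(width, height, column_xs, header_height=0, row_height=None):
    """Return (x0, y0, x1, y1) segments for a grid that starts below the header."""
    # Vertical separators, skipping positions that fall outside the page.
    lines = [(x, header_height, x, height) for x in column_xs if 0 <= x <= width]
    if row_height:
        # Optional horizontal lines at a fixed pitch, from the header downwards.
        y = header_height
        while y <= height:
            lines.append((0, y, width, y))
            y += row_height
    return lines


def draw_grid(png_in, png_out, column_xs, header_height=0, row_height=None):
    """Draw the grid on one page image and save the result as a new file."""
    from PIL import Image, ImageDraw  # third-party: Pillow

    with Image.open(png_in) as image:
        page = image.convert("RGB")  # independent copy, safe after closing
    draw = ImageDraw.Draw(page)
    for segment in grid_lines(page.width, page.height, column_xs,
                              header_height, row_height):
        draw.line(segment, fill="black", width=2)
    page.save(png_out)
```

For example, `draw_grid("page_0001.png", "page_0001_grid.png", [400, 900, 1300], header_height=150)` would draw three vertical separators below a 150-pixel header; all of those numbers are placeholders.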
Tell me if there is another way. (Don't try MS Word, "open PDF" -> soup, and not beautiful.)
Paul
It is more important to do the right thing, than to do the thing right.(P.Drucker)
Better is the enemy of good. (Montesquieu) = French version for 'kiss'.


Messages In This Thread
PDF readers - by DPaul - Dec-29-2022, 07:15 AM
RE: PDF readers - by Larz60+ - Dec-29-2022, 11:19 AM
RE: PDF readers - by DPaul - Dec-30-2022, 06:41 AM
RE: PDF readers - by Gribouillis - Dec-30-2022, 08:18 AM
RE: PDF readers - by DPaul - Dec-30-2022, 03:53 PM
RE: PDF readers - by DPaul - Dec-31-2022, 07:24 AM
RE: PDF readers - by Gribouillis - Dec-31-2022, 07:47 AM
RE: PDF readers - by DPaul - Dec-31-2022, 10:16 AM
RE: PDF readers - by DPaul - Jan-03-2023, 08:29 AM
RE: PDF readers - by Gribouillis - Jan-03-2023, 11:29 AM
RE: PDF readers - by DPaul - Jan-03-2023, 04:12 PM