Python Forum
reading pdfs
#1
I often get handed PDFs generated by an app like Excel or Word.
If there are gridlines separating the rows and columns, these PDFs can be "OCR"ed
with pdfplumber (and others).

Sometimes the gridlines were not printed, and you end up with clear rows
and columns, but no lines.
There is a solution for this: e.g. fitz (PyMuPDF) lets you read a page where the start and end of the
columns are defined using a "bbox" approach with pixel coordinates. Works fine!

But these are transcriptions of very old data, so now and then a name or a date is missing.
With gridlines, an empty cell is treated as "missing data".
Without gridlines, the blanks are ignored. The column vector of data just shifts up by one,
so the column ends up with a different length than the other columns on the same page.

Is anybody aware of a trick to make the "bbox" approach recognize those empty spaces?
thx,
Paul
It is more important to do the right thing, than to do the thing right.(P.Drucker)
Better is the enemy of good. (Montesquieu) = French version for 'kiss'.
#2
Have you tried adjusting the OCR settings to be more sensitive to whitespace? Increasing the sensitivity can help it pick up on those empty spaces. Alternatively, you could preprocess the PDF with an image editor to add faint gridlines before running OCR. It may be tedious but could solve the issue.
#3
(Jun-27-2024, 10:47 PM)AdamHensley Wrote: Have you tried adjusting the OCR settings to be more sensitive to whitespace? Increasing the sensitivity can help it pick up on those empty spaces. Alternatively, you could preprocess the PDF with an image editor to add faint gridlines before running OCR. It may be tedious but could solve the issue.
Yes, I thought about those.
a) I can measure the pixel distance between 2 entries, and thus detect empty spaces.
b) I can add gridlines (graphically) to the pages. Fitz will convert every page into an image.
The volumes I am dealing with make neither of those two options realistic.
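
Option a) does not have to be done by hand. A minimal sketch, assuming the rows are evenly spaced and that the top y coordinate of every entry in a column is already known (for example from the word boxes fitz returns). The names `fill_missing_rows` and `row_pitch` are made up for illustration and are not part of any library:

```python
def fill_missing_rows(entries, row_pitch, first_y=None):
    """Rebuild one column, inserting None where a row was skipped.

    entries   : list of (y_top, text) pairs for a single column
    row_pitch : assumed constant vertical distance between two rows
    first_y   : y_top of the first row on the page; defaults to the
                first entry found in this column
    """
    entries = sorted(entries)
    if not entries:
        return []
    start = entries[0][0] if first_y is None else first_y
    filled = []
    for y, text in entries:
        # Which row does this entry belong to, counted from the top?
        index = round((y - start) / row_pitch)
        # Pad with None for every row that was jumped over.
        while len(filled) < index:
            filled.append(None)
        filled.append(text)
    return filled


# Hypothetical data: the third row of this column is blank.
names = [(100.0, "Jan"), (112.1, "Piet"), (136.0, "Kees")]
print(fill_missing_rows(names, row_pitch=12))
```

Passing the same `first_y` for every column keeps the columns aligned when the very first cell of one of them is blank. Blanks at the bottom of a column are not detected this way; the result would still have to be padded to the length of the longest column.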

Very recently I have had some success with bbox pixel settings. This seems to be a grid-replacement solution,
but it does not solve the blanks issue.
thx,
Paul
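
One possible way around the blanks, sketched under assumptions: instead of reading each column separately, group the words into rows first and only then drop each word into a column. A row that has no word in a given column then keeps an empty cell. The sketch expects word tuples that start with (x0, y0, x1, y1, text), which is the layout returned by PyMuPDF's `page.get_text("words")`. The function name `words_to_grid`, the `column_edges` values and the `row_tolerance` default are hypothetical and would need tuning for real pages:

```python
def words_to_grid(words, column_edges, row_tolerance=3.0):
    """Turn word boxes into a list of rows with one cell per column.

    words         : tuples starting with (x0, y0, x1, y1, text)
    column_edges  : ascending x coordinates; column i spans
                    column_edges[i] .. column_edges[i + 1]
    row_tolerance : words whose y0 lies within this distance of the
                    row's first word are put in the same row
    """
    n_cols = len(column_edges) - 1

    # Pass 1: cluster the words into rows by their top coordinate.
    rows = []  # each item: [reference_y, [words in that row]]
    for w in sorted(words, key=lambda w: w[1]):
        if not rows or w[1] - rows[-1][0] > row_tolerance:
            rows.append([w[1], []])
        rows[-1][1].append(w)

    # Pass 2: inside each row, assign words to columns left to right.
    grid = []
    for _, row_words in rows:
        cells = [""] * n_cols  # blank cells stay as empty strings
        for x0, y0, x1, y1, text, *_ in sorted(row_words, key=lambda w: w[0]):
            x_mid = (x0 + x1) / 2
            for i in range(n_cols):
                if column_edges[i] <= x_mid < column_edges[i + 1]:
                    cells[i] = f"{cells[i]} {text}".strip()
                    break
        grid.append(cells)
    return grid


# Hypothetical page: the date is missing in the second row.
words = [
    (50, 100, 80, 110, "Jan"), (150, 100, 200, 110, "1850"),
    (50, 112, 85, 122, "Piet"),
    (50, 124.5, 85, 134, "Kees"), (150, 124, 200, 134, "1861"),
]
grid = words_to_grid(words, column_edges=[40, 140, 220])
dates = [row[1] for row in grid]
print(dates)
```

Because every row yields exactly one cell per column, all columns come out with the same length, and the blank shows up as an empty string in the position where the date is missing.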