Python Forum
reading pdfs
#1
I often get handed PDFs generated with an app like Excel or Word.
If there are gridlines separating rows and columns, these PDFs can be "OCR"ed
with pdfplumber (and others).

Sometimes they forget to print the gridlines, and you end up with clear rows
and columns, but no lines.
There is a solution for this: e.g. fitz (PyMuPDF) lets you read a page where the start and end of
each column are defined using a "bbox" approach with page coordinates. Works fine!

But these are transcriptions of very old data, so now and then a name or a date is missing.
With gridlines, an empty cell is treated as "missing data".
But without gridlines, the blanks are ignored. The column vector of data just shifts up by one,
so it no longer lines up with the other columns on the same page (different lengths).

Is anybody aware of a trick to make the "bbox" approach recognize those empty spaces?
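
To make the problem concrete, here is a minimal sketch of one possible workaround, not a confirmed solution: instead of reading each column bbox separately, take every word with its coordinates, group the words into rows by vertical position, and only then drop each word into a column by its x position. Any cell that receives no word stays None, so the columns keep the same length. The function name words_to_grid, the row_tol tolerance and the column boundaries are hypothetical and would need tuning per document. The word tuples mimic the first five fields of what PyMuPDF's page.get_text("words") returns (x0, y0, x1, y1, text).

```python
from typing import List, Optional, Sequence, Tuple

# (x0, y0, x1, y1, text): same layout as the first five fields of a
# PyMuPDF "words" tuple, e.g. [w[:5] for w in page.get_text("words")]
Word = Tuple[float, float, float, float, str]


def words_to_grid(
    words: Sequence[Word],
    col_bounds: Sequence[Tuple[float, float]],
    row_tol: float = 3.0,  # assumed tolerance; tune per document
) -> List[List[Optional[str]]]:
    """Return one list per text row, with None where a column has no word."""
    # 1. Group words into rows by the vertical centre of their boxes.
    rows: List[Tuple[float, List[Word]]] = []
    for w in sorted(words, key=lambda w: (w[1] + w[3]) / 2):
        yc = (w[1] + w[3]) / 2
        if not rows or abs(yc - rows[-1][0]) > row_tol:
            rows.append((yc, []))  # start a new row at this height
        rows[-1][1].append(w)

    # 2. Inside each row, assign words to columns by horizontal centre.
    grid: List[List[Optional[str]]] = []
    for _, row_words in rows:
        cells: List[Optional[str]] = [None] * len(col_bounds)
        for x0, _y0, x1, _y1, text in sorted(row_words, key=lambda w: w[0]):
            xc = (x0 + x1) / 2
            for i, (lo, hi) in enumerate(col_bounds):
                if lo <= xc < hi:
                    # Join multi-word cells with a space, left to right.
                    cells[i] = text if cells[i] is None else f"{cells[i]} {text}"
                    break
        grid.append(cells)
    return grid


# Made-up sample data: the second row has no name.
sample_words = [
    (10, 100, 40, 110, "Smith"), (120, 100, 160, 110, "1850"),
    (120, 120, 160, 130, "1851"),
    (10, 140, 30, 150, "Van"), (32, 140, 50, 150, "Dyke"),
    (120, 140.5, 160, 150.5, "1852"),
]
sample_cols = [(0, 100), (100, 200)]  # hypothetical column x-ranges

for row in words_to_grid(sample_words, sample_cols):
    print(row)
```

This only catches blanks in rows that contain at least one word in some other column; a row that is empty across the whole page leaves no trace to detect.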
thx,
Paul
It is more important to do the right thing, than to do the thing right.(P.Drucker)
Better is the enemy of good. (Montesquieu) = French version for 'kiss'.