Python Forum
reading pdfs - Printable Version

reading pdfs - DPaul - May-24-2024

I often get handed PDFs generated with an app like Excel or Word.
If there are gridlines separating rows and columns, these PDFs can be "OCR"ed
with pdfplumber (and others).

Sometimes they forgot to print the gridlines, and you end up with clear rows
and columns, but no lines.
There is a solution for this: e.g. fitz (PyMuPDF) lets you read a page where the start and end of the
columns are defined using a "bbox" approach with pixel coordinates. Works fine!
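
To make the "bbox" idea concrete, here is a minimal sketch in plain Python. It assumes the words are tuples shaped like the output of PyMuPDF's page.get_text("words"), i.e. (x0, y0, x1, y1, text, block_no, line_no, word_no). The helper name column_text and all coordinates are made up for illustration. As far as I know, PyMuPDF reports these coordinates in PDF points rather than pixels.

```python
def column_text(words, bbox):
    """Return the texts of all words whose centre lies inside bbox, top to bottom."""
    x0, y0, x1, y1 = bbox
    inside = [
        w for w in words
        if x0 <= (w[0] + w[2]) / 2 < x1 and y0 <= (w[1] + w[3]) / 2 < y1
    ]
    # Sort by top edge first, then by left edge, to get reading order.
    return [w[4] for w in sorted(inside, key=lambda w: (w[1], w[0]))]


# Hypothetical sample data standing in for page.get_text("words").
words = [
    (50, 100, 80, 110, "Smith", 0, 0, 0),
    (200, 100, 260, 110, "1901", 0, 0, 1),
    (50, 120, 75, 130, "Jones", 0, 1, 0),
    (200, 120, 260, 130, "1899", 0, 1, 1),
]

names = column_text(words, (40, 90, 150, 200))   # bbox of the name column
dates = column_text(words, (150, 90, 300, 200))  # bbox of the date column
print(names, dates)
```

With a real page, PyMuPDF can do the filtering itself through the clip argument, e.g. page.get_text("words", clip=fitz.Rect(40, 90, 150, 200)).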

But these are transcriptions of very old data, so now and then a name or a date is missing.
With gridlines, an empty cell is considered "missing data".
But without gridlines, the blanks are ignored. The column vector of data just shifts up by one,
so it no longer lines up with the other columns on the same page (different lengths).

Anybody aware of a trick to make the "bbox" approach recognize those empty spaces?
thx,
Paul
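
For the missing-cell problem described above, one possible approach (a sketch, not a tested recipe for any particular PDF) is to stop reading each column on its own. Instead, group all words on the page into rows by their vertical position first, then drop each word into a column by its horizontal position. A row exists as soon as any column has text in it, so a blank cell stays an empty string instead of disappearing. The function name words_to_grid, the column_bounds values and the row_tolerance default are assumptions for illustration; the word tuples follow the layout of PyMuPDF's page.get_text("words").

```python
def words_to_grid(words, column_bounds, row_tolerance=3.0):
    """Place words into a row x column grid; cells without words stay ''.

    words: tuples like (x0, y0, x1, y1, text, ...).
    column_bounds: one (x_start, x_end) pair per column, measured on your page.
    row_tolerance: max difference in y0 for words to count as the same row.
    """
    # Group words into rows, comparing each word's top edge (y0)
    # with the top edge of the first word of the current row.
    rows = []  # list of (row_y, [words in that row])
    for w in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(w[1] - rows[-1][0]) <= row_tolerance:
            rows[-1][1].append(w)
        else:
            rows.append((w[1], [w]))

    grid = []
    for _, row_words in rows:
        cells = [[] for _ in column_bounds]
        for x0, _y0, x1, _y1, text, *_rest in sorted(row_words, key=lambda w: w[0]):
            center = (x0 + x1) / 2
            for i, (start, end) in enumerate(column_bounds):
                if start <= center < end:
                    cells[i].append(text)
                    break
        # Join multi-word cells; untouched cells become empty strings.
        grid.append([" ".join(c) for c in cells])
    return grid


# Hypothetical sample: second row has no date, third row has no name.
words = [
    (50, 100, 80, 110, "Smith", 0, 0, 0),
    (200, 100, 260, 110, "1901-02-03", 0, 0, 1),
    (50, 120, 65, 130, "Van", 0, 1, 0),
    (68, 120, 90, 130, "Dyck", 0, 1, 1),
    (200, 140.5, 260, 150, "1899-12-31", 0, 2, 0),
]
column_bounds = [(40, 150), (150, 300)]

grid = words_to_grid(words, column_bounds)
for row in grid:
    print(row)
```

With a real document you would pass words = page.get_text("words") and tune column_bounds and row_tolerance to the page. This only helps when every row has text in at least one column; a completely empty row would still be lost.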