PDF readers

DPaul · (This post was last modified: Dec-29-2022, 07:15 AM by DPaul.)

Hi,
In the realm of genealogy, people have often turned their life's work into a PDF.
It can usually be read using pdfplumber, and sometimes pdfminer.
Now I have come across a huge legacy PDF (2014) with 10,000+ pages.
It is well presented, with 11 columns of information on 19th-century marriages.
I can read it all right, but the columns and rows are not delimited (e.g. with a grid).
All goes well until a piece of information is missing; that creates a huge blank
and messes up the rest of the line. Big time.
Hence 2 questions:
1) Does anybody know a piece of software that will tell me what software created the PDF in the first place?
(Old version of Excel, Quattro Pro, Access, Word...?) EDIT: PyPDF2 found that: it is GPL Ghostscript 8.15. Wow!
2) There are other PDF-reading Python modules. From anybody's experience,
which one handles blank spaces in a non-gridded row best? It should be something
that has the notion of CRLF at the end of a line.
thx,
Paul
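
For question 1, a minimal sketch of pulling the producer string out of the raw file with the standard library only. It assumes the document information dictionary is stored uncompressed and that /Producer is a plain literal string without nested parentheses, which is common in older files but not guaranteed. `find_producer` is a made-up helper name and the sample bytes are hand-written. A PDF library is the more robust route (pypdf exposes the same field as `reader.metadata.producer`).

```python
import re

# "/Producer" followed by a literal string in parentheses.
# Backslash-escaped characters are allowed inside the string.
_PRODUCER_RE = re.compile(rb"/Producer\s*\(((?:\\.|[^\\()])*)\)")


def find_producer(pdf_bytes):
    """Return the /Producer string found in raw PDF bytes, or None."""
    match = _PRODUCER_RE.search(pdf_bytes)
    if match is None:
        return None
    return match.group(1).decode("latin-1")


# Hand-written sample, not taken from the real file.
sample = b"<< /Producer (GPL Ghostscript 8.15) >>"
print(find_producer(sample))  # GPL Ghostscript 8.15
```

On a real file you would pass `open(path, "rb").read()` instead of `sample`.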

**Larz60+** · Dec-29-2022, 11:19 AM

Try PyPDF4.

pdfminer.six may work for you. It's not easy to use, but it is quite flexible. Docs here.

DPaul · Dec-30-2022, 06:41 AM

(Dec-29-2022, 11:19 AM)Larz60+ Wrote: pdfminer.six may work for you. It's not easy to use, but it is quite flexible. Docs here.

Hi Larz,
I tried it, as well as many others.
The problem is the blank spaces, e.g.:

Output:
aaa bbb ccc ddd
xxx yyy     zzz

No module seems to be able to assign zzz to the 4th column if a vertical grid is absent.
I'm considering turning the PDF pages into JPGs; then I have more pixel-counting possibilities.
Paul
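
One possible way around this, sketched under assumptions: most extractors can return each word together with its position on the page (pdfplumber's `extract_words()` returns dicts with `text`, `x0` and `top` keys, for instance), so a word can be assigned to a column by where it starts rather than by its order in the line. The column start positions, the row tolerance and the helper name `rows_from_words` are illustrative, and the word list is hand-made sample data, not output from the actual file.

```python
from bisect import bisect_right


def rows_from_words(words, column_starts, row_tolerance=2.0):
    """Build table rows from positioned words.

    words: iterable of dicts with "text", "x0" and "top" keys.
    column_starts: sorted left edges of the columns, same units as x0.
    """
    # Group words into rows by vertical position.
    rows = []  # each entry: [top_of_first_word, list_of_words]
    for word in sorted(words, key=lambda w: w["top"]):
        if not rows or word["top"] - rows[-1][0] > row_tolerance:
            rows.append([word["top"], []])
        rows[-1][1].append(word)

    # Within each row, place every word in the last column that
    # starts at or before the word's left edge.
    table = []
    for _, row_words in rows:
        cells = [[] for _ in column_starts]
        for word in sorted(row_words, key=lambda w: w["x0"]):
            col = max(bisect_right(column_starts, word["x0"]) - 1, 0)
            cells[col].append(word["text"])
        table.append([" ".join(cell) for cell in cells])
    return table


words = [
    {"text": "aaa", "x0": 10, "top": 100}, {"text": "bbb", "x0": 60, "top": 100},
    {"text": "ccc", "x0": 110, "top": 100}, {"text": "ddd", "x0": 160, "top": 100},
    {"text": "xxx", "x0": 10, "top": 112}, {"text": "yyy", "x0": 60, "top": 112},
    {"text": "zzz", "x0": 160, "top": 112},
]
for row in rows_from_words(words, column_starts=[10, 60, 110, 160]):
    print(row)
# ['aaa', 'bbb', 'ccc', 'ddd']
# ['xxx', 'yyy', '', 'zzz']
```

The column starts could be read off a header row or a fully populated row. Right-aligned or centred columns would need edges or midpoints instead of left starts.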

**Gribouillis** · Dec-30-2022, 08:18 AM

(Dec-30-2022, 06:41 AM)DPaul Wrote: I'm considering turning the PDF pages into JPGs; then I have more pixel-counting possibilities.

Output:
$ ls -l /usr/bin/*pdf*
-rwxr-xr-x 1 root root   1007 sept. 26 16:05 /usr/bin/dvipdf
-rwxr-xr-x 1 root root    698 sept. 26 16:05 /usr/bin/pdf2dsc
-rwxr-xr-x 1 root root    909 sept. 26 16:05 /usr/bin/pdf2ps
-rwxr-xr-x 1 root root  18824 sept.  6 11:32 /usr/bin/pdfattach
-rwxr-xr-x 1 root root  23032 sept.  6 11:32 /usr/bin/pdfdetach
-rwxr-xr-x 1 root root  23064 sept.  6 11:32 /usr/bin/pdffonts
-rwxr-xr-x 1 root root  39448 sept.  6 11:32 /usr/bin/pdfimages
-rwxr-xr-x 1 root root  59928 sept.  6 11:32 /usr/bin/pdfinfo
-rwxr-xr-x 1 root root  22920 sept.  6 11:32 /usr/bin/pdfseparate
-rwxr-xr-x 1 root root  35608 sept.  6 11:32 /usr/bin/pdfsig
-rwxr-xr-x 1 root root 137712 sept.  6 11:32 /usr/bin/pdftocairo
-rwxr-xr-x 1 root root 108968 sept.  6 11:32 /usr/bin/pdftohtml
-rwxr-xr-x 1 root root  35240 sept.  6 11:32 /usr/bin/pdftoppm
-rwxr-xr-x 1 root root  35360 sept.  6 11:32 /usr/bin/pdftops
-rwxr-xr-x 1 root root  43544 sept.  6 11:32 /usr/bin/pdftotext
-rwxr-xr-x 1 root root  31112 sept.  6 11:32 /usr/bin/pdfunite
-rwxr-xr-x 1 root root    272 sept. 26 16:05 /usr/bin/ps2pdf
-rwxr-xr-x 1 root root    215 sept. 26 16:05 /usr/bin/ps2pdf12
-rwxr-xr-x 1 root root    215 sept. 26 16:05 /usr/bin/ps2pdf13
-rwxr-xr-x 1 root root    215 sept. 26 16:05 /usr/bin/ps2pdf14
-rwxr-xr-x 1 root root   1078 sept. 26 16:05 /usr/bin/ps2pdfwr

You could try pdftohtml, for example; it may turn your PDF table into an HTML table.

DPaul · Dec-30-2022, 03:53 PM

Hi,
a) I tried PyMuPDF (to HTML) -> clean read, but same thing: it cannot see large blanks as missing data.
b) pdftohtml seems to require pdftotree -> not tested yet; I have not found a straightforward example.
Paul

DPaul · Dec-31-2022, 07:24 AM

There is no way a module can reconstitute a table without gridlines (especially vertical ones)
when some lines have (multiple) missing values.

Output:
aaa bbb ccc ddd eee
    ppp         qqq
    xxx yyy zzz fff

A table may look perfect, but the PDF modules do not count the number of spaces
to discover that something is missing.
We'll need to go to the pixel level.
Paul
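
If the extractor can be asked to keep the horizontal layout (for example `pdftotext -layout` from the poppler tools), the runs of spaces survive and each line can be cut at fixed character offsets. A rough sketch, assuming a monospace-like layout in which every header word marks the left edge of its column; `column_offsets` and `split_fixed_width` are hypothetical helpers, and the sample lines are the ones from this post.

```python
import re


def column_offsets(header_line):
    """Character offset at which each header word starts."""
    return [m.start() for m in re.finditer(r"\S+", header_line)]


def split_fixed_width(line, offsets):
    """Cut a line at the given offsets; empty strings mark missing values."""
    bounds = offsets[1:] + [None]  # the last column runs to the end of the line
    return [line[start:end].strip() for start, end in zip(offsets, bounds)]


lines = [
    "aaa bbb ccc ddd eee",
    "    ppp         qqq",
    "    xxx yyy zzz fff",
]
offsets = column_offsets(lines[0])
for line in lines:
    print(split_fixed_width(line, offsets))
```

With real data the offsets would shift whenever a value is wider than its header, so they may be better derived from several fully populated rows than from the header alone.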

**Gribouillis** · Dec-31-2022, 07:47 AM

(Dec-31-2022, 07:24 AM)DPaul Wrote: to discover something is missing.
We'll need to go to the pixel level.

You could perhaps look directly in the PDF file to see if there is a way to count the columns, or you could try converting it to PostScript, perhaps using the Ghostscript that created the PDF, and see if you can separate the columns in the PostScript file.

DPaul · Dec-31-2022, 10:16 AM

(Dec-31-2022, 07:47 AM)Gribouillis Wrote: using the Ghostscript that created the Pdf and see if you can separate the columns in the postscript file.

I installed Ghostscript => a strange thing that seems to live in a console application of its own.
I managed to open the PDF; unfortunately the image presented by Ghostscript 10.0 is very much degraded
compared to viewing it with Acrobat, to the point that letters are blurred.
And working with a 10,000-page PDF in this console environment is no fun.
I still think PDF to image is the option.
Paul

DPaul · Jan-03-2023, 08:29 AM

Hi,
Using a suite of tools, you can make some progress, but not all the way.

1) Rotate the document so it becomes landscape: use PyPDF2.
2) Make an image (PNG) for every PDF page: use MuPDF (fitz).
3) Read the image to text as if there were gridlines: Tesseract.
And this way you get there, almost...
Because the number of blanks that separate the "columns" confuses the OCR if the variation is great.
E.g. if one person is called "John" and, immediately underneath, another is "Henricus Baptist Adrianus",
the OCR thinks something is missing after "John", causing a slight shift in the list index.
You can program for that in a 10-page document, not in 20,000 pages.
The only way to get a grid is to draw one on every page, avoiding the column headers and
assuming that the document has been scanned straight, with minimum tolerance.
Tell me if there is another way.

(Don't try MS Word's "open PDF" -> soup, and not beautiful)
Paul
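
On the idea of drawing a grid: the grid positions might be found rather than drawn by hand, by counting the dark pixels in every pixel column of the page image and looking for wide vertical strips that are blank on all rows. A minimal sketch on a tiny hand-made binary image (1 = ink, 0 = paper). `find_gutters` and its `min_width` parameter are illustrative; a real page would first have to be rendered, thresholded to black and white, and cropped to exclude the column headers.

```python
def find_gutters(image, min_width=2):
    """Return (start, end) pixel-column ranges that contain no ink in any row.

    image: list of rows, each a list of 0/1 values (1 = dark pixel).
    The end of each range is exclusive.
    """
    width = len(image[0])
    ink_per_column = [sum(row[x] for row in image) for x in range(width)]

    gutters = []
    start = None
    # The trailing 1 is a sentinel that closes a blank run at the right edge.
    for x, ink in enumerate(ink_per_column + [1]):
        if ink == 0 and start is None:
            start = x
        elif ink != 0 and start is not None:
            if x - start >= min_width:
                gutters.append((start, x))
            start = None
    return gutters


page = [
    [1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 0, 0, 0, 0, 1, 1],  # middle value missing on this row
    [1, 0, 0, 0, 1, 1, 0, 0, 1, 0],
]
print(find_gutters(page))  # [(2, 4), (6, 8)]
```

A vertical line placed in the middle of each gutter would give the OCR step its grid. On a slightly skewed scan the "no ink at all" test would probably have to be relaxed to "almost no ink".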

**Gribouillis** · Jan-03-2023, 11:29 AM

(Jan-03-2023, 08:29 AM)DPaul Wrote: Tell me if there is another way

If we had a part of the PDF file, we could perhaps try some tools ourselves...
