Python Forum
PDF readers
#1
Hi,
In the realm of genealogy, people have often turned their life's work into a PDF.
These can usually be read with pdfplumber and sometimes pdfminer.
Now I have come across a huge legacy PDF (2014) with 10,000+ pages.
It is well presented, with 11 columns of information on 19th-century marriages.
I can read it all right, but the columns and rows are not delimited (e.g. with a grid).
All goes well until a piece of information is missing; that creates a huge blank
and messes up the rest of the line. Big time.
Hence 2 questions:
1) Does anybody know a piece of software that will tell me what software created the PDF in the first place?
(An old version of Excel, Quattro Pro, Access, Word...?) EDIT: PyPDF2 found it: GPL Ghostscript 8.15. Wow!
2) There are other PDF-reading Python modules. From anybody's experience,
which one handles blank spaces in a non-gridded row best? It should be something
that has a notion of CRLF at the end of a line.
thx,
Paul
It is more important to do the right thing, than to do the thing right.(P.Drucker)
Better is the enemy of good. (Montesquieu) = French version for 'kiss'.
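On question 1, a rough sketch of how the producer can be found even without a PDF library: the document information dictionary often contains a `/Producer (...)` entry that can be located with a raw byte scan. This assumes the entry is stored as an uncompressed, unencrypted literal string, which is not guaranteed, and nested unescaped parentheses are not handled. `find_producer` is a hypothetical helper name; a real PDF library (such as the PyPDF2 used in the edit above) is the more robust route.

```python
import re
from typing import Optional

# Matches a literal-string entry such as: /Producer (GPL Ghostscript 8.15)
# Escaped characters (\( \) \\) inside the string are tolerated.
_PRODUCER_RE = re.compile(rb"/Producer\s*\(((?:\\.|[^\\()])*)\)", re.DOTALL)


def find_producer(pdf_bytes: bytes) -> Optional[str]:
    """Return the /Producer string found in raw PDF bytes, or None.

    Assumption: the info dictionary is neither compressed nor encrypted.
    """
    match = _PRODUCER_RE.search(pdf_bytes)
    if match is None:
        return None
    raw = match.group(1)
    # Undo the backslash escapes for parentheses and the backslash itself.
    raw = re.sub(rb"\\([()\\])", rb"\1", raw)
    return raw.decode("latin-1")


def find_producer_in_file(path: str) -> Optional[str]:
    """Convenience wrapper that reads the whole file into memory."""
    with open(path, "rb") as handle:
        return find_producer(handle.read())
```

If the function returns None, the metadata is probably stored in a compressed object stream or as a hex/UTF-16 string, and a proper parser is needed.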
Reply
#2
Try PyPDF4.

pdfminer.six may work for you. It's not easy to use, but it is quite flexible. Docs here.
Reply
#3
(Dec-29-2022, 11:19 AM)Larz60+ Wrote: pdfminer.six may work for you. It's not easy to use, but it is quite flexible. Docs here.
Hi Larz,
I tried it, as well as many others.
The problem is the blank spaces, e.g.:
Output:
aaa bbb ccc ddd
xxx yyy     zzz
No module seems to be able to assign zzz to the 4th column if a vertical grid is absent.
I'm considering turning the PDF pages into JPGs, which would give me more pixel-counting possibilities.
Paul
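One way around the missing grid, as a minimal sketch: use each word's x coordinate instead of counting blanks. The function below takes word dicts with `text`, `x0` and `top` keys (the shape that pdfplumber's `extract_words()` returns) and assigns each word to a column by its left edge. The `column_starts` values are an assumption: they would have to be measured once for this particular layout, and `rows_from_words` is a hypothetical helper name.

```python
from bisect import bisect_right


def rows_from_words(words, column_starts, row_tolerance=3.0, slack=1.0):
    """Rebuild table rows from positioned words.

    words: iterable of dicts with 'text', 'x0' (left edge) and 'top'.
    column_starts: left x coordinate of every column (assumed known).
    Returns a list of rows; each row has one string per column and an
    empty string where a value is missing.
    """
    starts = sorted(column_starts)
    rows = []
    current_top = None
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        # Start a new row when the vertical position jumps.
        if current_top is None or word["top"] - current_top > row_tolerance:
            rows.append([[] for _ in starts])
            current_top = word["top"]
        # The column is the last one whose left edge is at or before x0.
        col = max(0, bisect_right(starts, word["x0"] + slack) - 1)
        rows[-1][col].append((word["x0"], word["text"]))
    # Join multi-word cells in left-to-right order.
    return [
        [" ".join(text for _, text in sorted(cell)) for cell in row]
        for row in rows
    ]


sample = [
    {"text": "aaa", "x0": 50, "top": 100}, {"text": "bbb", "x0": 150, "top": 100},
    {"text": "ccc", "x0": 250, "top": 100}, {"text": "ddd", "x0": 350, "top": 100},
    {"text": "xxx", "x0": 50, "top": 112}, {"text": "yyy", "x0": 150, "top": 112},
    {"text": "zzz", "x0": 350, "top": 112},
]
table = rows_from_words(sample, column_starts=[50, 150, 250, 350])
```

With these made-up coordinates, the second row comes out with an empty third cell and zzz in the 4th column. This only works if the columns start at the same x position on every page.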
Reply
#4
(Dec-30-2022, 06:41 AM)DPaul Wrote: I'm considering turning the PDF pages into JPGs, which would give me more pixel-counting possibilities.
Output:
$ ls -l /usr/bin/*pdf*
-rwxr-xr-x 1 root root 1007 sept. 26 16:05 /usr/bin/dvipdf
-rwxr-xr-x 1 root root 698 sept. 26 16:05 /usr/bin/pdf2dsc
-rwxr-xr-x 1 root root 909 sept. 26 16:05 /usr/bin/pdf2ps
-rwxr-xr-x 1 root root 18824 sept. 6 11:32 /usr/bin/pdfattach
-rwxr-xr-x 1 root root 23032 sept. 6 11:32 /usr/bin/pdfdetach
-rwxr-xr-x 1 root root 23064 sept. 6 11:32 /usr/bin/pdffonts
-rwxr-xr-x 1 root root 39448 sept. 6 11:32 /usr/bin/pdfimages
-rwxr-xr-x 1 root root 59928 sept. 6 11:32 /usr/bin/pdfinfo
-rwxr-xr-x 1 root root 22920 sept. 6 11:32 /usr/bin/pdfseparate
-rwxr-xr-x 1 root root 35608 sept. 6 11:32 /usr/bin/pdfsig
-rwxr-xr-x 1 root root 137712 sept. 6 11:32 /usr/bin/pdftocairo
-rwxr-xr-x 1 root root 108968 sept. 6 11:32 /usr/bin/pdftohtml
-rwxr-xr-x 1 root root 35240 sept. 6 11:32 /usr/bin/pdftoppm
-rwxr-xr-x 1 root root 35360 sept. 6 11:32 /usr/bin/pdftops
-rwxr-xr-x 1 root root 43544 sept. 6 11:32 /usr/bin/pdftotext
-rwxr-xr-x 1 root root 31112 sept. 6 11:32 /usr/bin/pdfunite
-rwxr-xr-x 1 root root 272 sept. 26 16:05 /usr/bin/ps2pdf
-rwxr-xr-x 1 root root 215 sept. 26 16:05 /usr/bin/ps2pdf12
-rwxr-xr-x 1 root root 215 sept. 26 16:05 /usr/bin/ps2pdf13
-rwxr-xr-x 1 root root 215 sept. 26 16:05 /usr/bin/ps2pdf14
-rwxr-xr-x 1 root root 1078 sept. 26 16:05 /usr/bin/ps2pdfwr
You could try pdftohtml, for example; it may turn the table in your PDF into an HTML table.
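Another tool from that listing worth a try is pdftotext with its `-layout` option, which tries to keep the physical layout by padding with spaces. A hedged sketch: if the columns then start at stable character offsets, each line can be cut like a fixed-width record, and a missing value simply becomes an empty field. The offsets are an assumption that has to be checked against the real output (proportional fonts can shift them), and both function names are hypothetical.

```python
import subprocess


def pdftotext_layout(pdf_path, first_page, last_page):
    """Run Poppler's pdftotext in layout mode and return the text lines.

    Requires the pdftotext binary on PATH; '-' sends the text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", "-f", str(first_page), "-l", str(last_page),
         pdf_path, "-"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.splitlines()


def slice_fixed_width(line, starts):
    """Cut one text line into fields at the given character offsets."""
    starts = list(starts)
    ends = starts[1:] + [None]  # the last field runs to the end of the line
    return [line[start:end].strip() for start, end in zip(starts, ends)]


# Made-up line with four 6-character columns, third value missing.
demo_line = "xxx".ljust(6) + "yyy".ljust(6) + "".ljust(6) + "zzz"
demo_fields = slice_fixed_width(demo_line, [0, 6, 12, 18])
```

Processing a few pages at a time with `-f` and `-l` keeps memory use small on a very large file.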
Reply
#5
Hi,
a) I tried PyMuPDF (to HTML) -> clean read, but same thing: it cannot see large blanks as missing data.
b) pdftohtml seems to require pdftotree -> not tested yet; I have not found a straightforward example.
Paul
Reply
#6
There is no way a module can reconstitute a table without gridlines (especially vertical ones)
when some lines have one or more missing values.
Output:
aaa bbb ccc ddd eee ppp qqq xxx yyy zzz fff
A table may look perfect, but the PDF modules do not count the number of spaces
to discover that something is missing.
We'll need to go to the pixel level.
Paul
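A sketch of what "going to the pixel level" could look like without leaving the coordinate data: project every word box on a page onto the x axis and look for vertical strips that no word touches. Those strips are the invisible gridlines. This assumes word boxes with `x0` and `x1` keys are available (from a PDF library or from OCR), that no column is empty on the whole page, and that no value overflows into the neighbouring column. The function names and the `min_gap` threshold are hypothetical.

```python
def find_column_gaps(words, min_gap=5.0):
    """Return (start, end) x intervals that no word on the page overlaps."""
    spans = sorted((w["x0"], w["x1"]) for w in words)
    gaps = []
    if not spans:
        return gaps
    covered_until = spans[0][1]
    for x0, x1 in spans[1:]:
        # An uncovered strip wider than min_gap separates two columns.
        if x0 - covered_until >= min_gap:
            gaps.append((covered_until, x0))
        covered_until = max(covered_until, x1)
    return gaps


def infer_column_starts(words, min_gap=5.0):
    """Left edge of each column: page left edge plus the end of every gap."""
    if not words:
        return []
    left = min(w["x0"] for w in words)
    return [left] + [end for _, end in find_column_gaps(words, min_gap)]


page_words = [
    {"x0": 50, "x1": 70}, {"x0": 150, "x1": 170},
    {"x0": 250, "x1": 270}, {"x0": 350, "x1": 370},   # complete row
    {"x0": 50, "x1": 72}, {"x0": 150, "x1": 168},
    {"x0": 350, "x1": 371},                            # row with a hole
]
starts = infer_column_starts(page_words)
```

Because the gaps are computed over the whole page, a single row with a hole does not hide a column as long as other rows fill it.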
Reply
#7
(Dec-31-2022, 07:24 AM)DPaul Wrote: to discover something is missing.
We'll need to go to the pixel level.
You could perhaps look directly in the PDF file to see if there is a way to count the columns. You could also try converting the file to PostScript, perhaps with Ghostscript (which created the PDF), and see if you can separate the columns in the PostScript file.
Reply
#8
(Dec-31-2022, 07:47 AM)Gribouillis Wrote: using the Ghostscript that created the Pdf and see if you can separate the columns in the postscript file.
I installed Ghostscript => a strange thing that seems to live on its own in a console application.
I managed to open the PDF; unfortunately the image rendered by Ghostscript 10.0 is very much degraded
compared to viewing it in Acrobat, to the point that the letters are blurred.
And working with a 10,000-page PDF in this console environment is no fun.
I still think PDF to image is the best option.
Paul
Reply
#9
Hi,
Using a suite of tools, you can make some progress, but not all the way.
1) Rotate the document so it becomes landscape: use PyPDF2.
2) Make an image (PNG) of every PDF page: use MuPDF (fitz).
3) Read the image to text as if there were gridlines: Tesseract.
This way you get there, almost...
The number of blanks separating the "columns" confuses the OCR if the variation is great.
E.g. if one person is called "John" and immediately underneath is "Henricus Baptist Adrianus",
the OCR thinks something is missing after "John", causing a slight shift in the list index.
You can program around that in a 10-page document, not in 20,000 pages.
The only way to get a grid is to draw one on every page, avoiding the column headers and
assuming that the document has been scanned straight, with minimal tolerance.
Tell me if there is another way. (Don't try MS Word's "open PDF" -> soup, and not beautiful.)
Paul
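One possible other way, sketched under assumptions: instead of taking Tesseract's plain text, ask it for word boxes and place each word by its left edge, so that "John" and "Henricus Baptist Adrianus" land in the same column whatever their length. The input below is assumed to be the dict that `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)` produces (keys `text`, `left`, `top`, `width`, `conf`). The column positions are made up, and both function names are hypothetical.

```python
from bisect import bisect_right


def ocr_dict_to_words(data, min_conf=0.0):
    """Turn a Tesseract data dict into a list of positioned words.

    Entries without text (layout boxes) or below min_conf are dropped.
    """
    words = []
    for text, left, top, width, conf in zip(
        data["text"], data["left"], data["top"], data["width"], data["conf"]
    ):
        # conf may arrive as int, float or string depending on the version.
        if not text.strip() or float(conf) < min_conf:
            continue
        words.append(
            {"text": text.strip(), "x0": left, "x1": left + width, "top": top}
        )
    return words


def column_index(x0, column_starts, slack=2):
    """Index of the last column whose left edge is at or before x0."""
    return max(0, bisect_right(column_starts, x0 + slack) - 1)


# Hand-made stand-in for OCR output: a name column and a year column.
fake_ocr = {
    "text": ["", "John", "Henricus", "Baptist", "Adrianus", "1851"],
    "left": [0, 100, 100, 190, 270, 400],
    "top": [0, 50, 62, 62, 62, 62],
    "width": [600, 40, 80, 70, 80, 40],
    "conf": ["-1", "96", 95.5, 91, 90, 88],
}
ocr_words = ocr_dict_to_words(fake_ocr)
columns = [column_index(w["x0"], [100, 400]) for w in ocr_words]
```

This still depends on the pages being scanned straight; a skewed page would need deskewing before the x positions can be trusted.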
Reply
#10
(Jan-03-2023, 08:29 AM)DPaul Wrote: Tell me if there is another way
If we had a part of the PDF file, we could perhaps try some tools ourselves...
Reply

