PDF readers - Printable Version
+- Python Forum (https://python-forum.io)
+-- Forum: Forum & Off Topic (https://python-forum.io/forum-23.html)
+--- Forum: Bar (https://python-forum.io/forum-27.html)
+--- Thread: PDF readers (/thread-39070.html)

PDF readers - DPaul - Dec-29-2022

Hi,
In the realm of genealogy, people have often turned their life's work into a PDF. It can usually be read using pdfplumber and sometimes pdfminer.
Now I came across a huge legacy PDF (2014) with 10,000+ pages. It is well presented, with 11 columns of information on 19th-century marriages. I can read it all right, but the columns and rows are not delimited (e.g. with a grid). All goes well until a piece of information is missing; that creates a huge blank and messes up the rest of the line. Big time.
Hence 2 questions:
1) Does anybody know a piece of software that will tell me what software created the PDF in the first place? (Old version of Excel, Quattro Pro, Access, Word...?)
EDIT: PyPDF2 found that: it is GPL Ghostscript 8.15. Wow!
2) There are other PDF-reading Python modules. From somebody's experience, which one handles blank spaces in a non-gridded row the best? It should be something that has the notion of CRLF at the end of a line.
thx,
Paul

RE: PDF readers - Larz60+ - Dec-29-2022

Try PyPDF4.
pdfminer.six may also work for you. It's not easy to use, but is quite flexible. Docs here.

RE: PDF readers - DPaul - Dec-30-2022

(Dec-29-2022, 11:19 AM)Larz60+ Wrote: pdfminer.six may work for you. It's not easy to use, but is quite flexible. Docs here.
Hi Larz,
I tried it, as well as many others. The problem is the blank spaces: e.g. no module seems to be able to assign zzz to the 4th column if a vertical grid is absent.
I'm considering turning the PDF pages into JPGs; then I have more pixel-counting possibilities.
Paul

RE: PDF readers - Gribouillis - Dec-30-2022

(Dec-30-2022, 06:41 AM)DPaul Wrote: I'm considering turning the PDF pages into JPGs; then I have more pixel-counting possibilities.
You could try pdftohtml, for example; it may turn your PDF table into an HTML table.

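The missing-value problem described above (a value such as zzz ending up in the wrong column) can also be treated as a binning problem, provided the reader reports where each word sits on the page. What follows is only a minimal sketch under that assumption, not something anyone in the thread tested. pdfplumber's `extract_words()` returns dicts carrying `text`, `x0` and `top`, and the sketch expects input in that shape. The names `rows_from_words`, `column_starts`, `row_tolerance` and `x_slack` are hypothetical, the sample coordinates are made up, and the column edges would have to be measured once from a fully populated row of the real document.

```python
from bisect import bisect_right


def rows_from_words(words, column_starts, row_tolerance=3.0, x_slack=1.0):
    """Group positioned words into rows and drop each word into a fixed column.

    words: dicts with "text", "x0" (left edge) and "top" (vertical position).
    column_starts: sorted left edges of the columns (assumed known in advance).
    """
    # Step 1: words whose "top" lies within row_tolerance of the first word
    # of the current line are treated as belonging to the same line.
    lines = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(word["top"] - lines[-1]["top"]) <= row_tolerance:
            lines[-1]["words"].append(word)
        else:
            lines.append({"top": word["top"], "words": [word]})

    # Step 2: a word goes into the last column whose left edge is at or
    # before the word's own left edge, so a missing value leaves an empty cell.
    table = []
    for line in lines:
        cells = [""] * len(column_starts)
        for word in sorted(line["words"], key=lambda w: w["x0"]):
            index = max(bisect_right(column_starts, word["x0"] + x_slack) - 1, 0)
            cells[index] = (cells[index] + " " + word["text"]).strip()
        table.append(cells)
    return table


# Made-up sample: the second row has nothing in columns 2 and 3.
words = [
    {"text": "aaa", "x0": 10.0, "top": 100.0},
    {"text": "bbb", "x0": 80.0, "top": 100.2},
    {"text": "ccc", "x0": 150.0, "top": 100.0},
    {"text": "zzz", "x0": 220.0, "top": 100.1},
    {"text": "aaa", "x0": 10.0, "top": 112.0},
    {"text": "zzz", "x0": 220.0, "top": 112.0},
]
column_starts = [10.0, 80.0, 150.0, 220.0]
table = rows_from_words(words, column_starts)
print(table)  # [['aaa', 'bbb', 'ccc', 'zzz'], ['aaa', '', '', 'zzz']]
```

This only holds up if no cell is wide enough to spill past the left edge of the next column; a long name that does so would be split across two cells.
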
RE: PDF readers - DPaul - Dec-30-2022

Hi,
a) I tried PyMuPDF (to HTML) -> clean read, but same thing: it cannot see large blanks as missing data.
b) pdftohtml seems to require pdftotree -> not tested yet; I have not found a straightforward example so far.
Paul

RE: PDF readers - DPaul - Dec-31-2022

There is no way a module can reconstitute a table without (especially vertical) gridlines where some lines have (multiple) missing values. A table may look perfect, but the PDF modules do not count the number of spaces to discover something is missing. We'll need to go to the pixel level.
Paul

RE: PDF readers - Gribouillis - Dec-31-2022

(Dec-31-2022, 07:24 AM)DPaul Wrote: to discover something is missing.
You could perhaps look directly in the PDF file to see if there is a way to count the columns, or you could also try to convert it to PostScript, perhaps using the Ghostscript that created the PDF, and see if you can separate the columns in the PostScript file.

RE: PDF readers - DPaul - Dec-31-2022

(Dec-31-2022, 07:47 AM)Gribouillis Wrote: using the Ghostscript that created the PDF, and see if you can separate the columns in the PostScript file.
I installed Ghostscript => a strange thing that seems to live in a console application on its own. I managed to open the PDF; unfortunately the image presented by Ghostscript 10.0 is very much degraded compared to viewing it with Acrobat. Degraded to the point that letters are blurred. And working with a 10,000-page PDF in this console environment is no fun.
I still think PDF to image is the option.
Paul

RE: PDF readers - DPaul - Jan-03-2023

Hi,
Using a suite of tools, you can make some progress, but not all the way.
1) Rotate the document so it becomes landscape: use PyPDF2.
2) Make an image (PNG) of every PDF page: use MuPDF (fitz).
3) Read the image to text as if there were gridlines: Tesseract.
And this way you get there, almost...
Because the number of blanks separating the "columns" confuses the OCR if the variation is great. E.g. if one person is called "John" and the one immediately underneath "Henricus Baptist Adrianus", the OCR thinks something is missing after "John", causing a slight shift in the list index. You can program for that in a 10-page document, not in 20,000 pages.
The only way to get a grid is to draw one, on every page, avoiding the column headers and assuming that the document has been scanned straight, with minimum tolerance.
Tell me if there is another way.
Paul

RE: PDF readers - Gribouillis - Jan-03-2023

(Jan-03-2023, 08:29 AM)DPaul Wrote: Tell me if there is another way
If we had a part of the PDF file, we could perhaps try some tools ourselves...
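
One way to approach the "draw a grid on every page" idea at the pixel level is a vertical projection profile: count the ink in every pixel column of the page body, and treat runs of columns that contain no ink in any row as the gutters between table columns. The following is a rough sketch only, under the same assumptions stated in the post (headers cropped away, page scanned straight). The function names `find_gutters` and `split_line` and the `min_width` parameter are hypothetical, and the "page" here is faked from fixed-width text lines as a stand-in for a thresholded page image.

```python
def find_gutters(page, min_width=3):
    """Return (start, end) x-ranges in which no row contains any ink.

    page: non-empty list of rows, each a list of 0 (white) / 1 (ink) values,
    all rows having the same length.
    """
    width = len(page[0])
    # Vertical projection: amount of ink in each pixel column.
    ink_per_column = [sum(row[x] for row in page) for x in range(width)]

    gutters = []
    start = None
    for x, ink in enumerate(ink_per_column):
        if ink == 0 and start is None:
            start = x  # an empty run begins
        elif ink != 0 and start is not None:
            if x - start >= min_width:  # ignore gaps as narrow as a word space
                gutters.append((start, x))
            start = None
    if start is not None and width - start >= min_width:
        gutters.append((start, width))
    return gutters


def split_line(line, cuts):
    """Cut one line at the given x positions and strip each piece."""
    edges = [0] + list(cuts) + [len(line)]
    return [line[a:b].strip() for a, b in zip(edges, edges[1:])]


# Stand-in for a page image: three fixed-width rows, one with a missing value.
rows = [("john", "smith", "1850"), ("henricus", "", "1851"), ("anna", "jansen", "1852")]
lines = [a.ljust(12) + b.ljust(12) + c for a, b, c in rows]
page = [[0 if ch == " " else 1 for ch in line] for line in lines]

gutters = find_gutters(page)
cuts = [(start + end) // 2 for start, end in gutters]  # grid line in mid-gutter
print(gutters)                     # [(8, 12), (18, 24)]
print(split_line(lines[1], cuts))  # ['henricus', '', '1851']
```

Because the gutters are derived from all rows of a page at once, a short "John" above a long "Henricus Baptist Adrianus" does not shift the grid. A single cell that overflows into the next column, or a skewed scan, would close a gutter and break the approach.
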