Jun-02-2021, 11:43 AM
(This post was last modified: Jun-02-2021, 11:43 AM by CaptainCsaba.)
I am fairly used to pdfminer and have also worked with PyPDF2 and pdfrw.
I have to extract each paragraph in the pdf with as much data as possible: font, size, color, bold, italic and so on. This is so I can later filter out the ones I want. Let's say I want to select all the text in this pdf that is size 16 in the Times New Roman font, has an RGB color of (255,255,255) and is bold. I have already created this part, I just have to include the headers somehow.
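To make the filtering step concrete, here is a minimal sketch of what such a filter could look like. The `TextRun` record and the `filter_runs` function are hypothetical names used only for illustration; they are not part of pdfminer, PyPDF2, pdfrw or any docx library, and the real extraction step would have to fill in these fields.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class TextRun:
    """Hypothetical record: one piece of extracted text plus its formatting."""
    text: str
    font: str
    size: float
    color: Tuple[int, int, int]  # RGB, 0-255 per channel
    bold: bool = False
    italic: bool = False


def filter_runs(runs, font=None, size=None, color=None, bold=None, italic=None):
    """Return the runs matching every criterion that is not None."""
    wanted = {"font": font, "size": size, "color": color,
              "bold": bold, "italic": italic}
    # Criteria left as None are treated as "don't care".
    criteria = {key: value for key, value in wanted.items() if value is not None}
    return [run for run in runs
            if all(getattr(run, key) == value for key, value in criteria.items())]


# Made-up sample data, only to show the call.
runs = [
    TextRun("Title", "Times New Roman", 16, (255, 255, 255), bold=True),
    TextRun("Body text", "Arial", 11, (0, 0, 0)),
]
titles = filter_runs(runs, font="Times New Roman", size=16,
                     color=(255, 255, 255), bold=True)
```

Leaving a criterion as `None` means it is ignored, so the same function can select by size alone or by any combination of properties.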
pdfminer is almost good enough, but you can't extract the color and many other properties with it. I would love for it to work, but there is just not enough data that you can extract from the LTTextLine and LTChar objects. The same is true of PyPDF2 and pdfrw.
Converting it to XML or HTML causes data loss, so those are not good alternatives either.
This is why docx is what I am looking for: it is easy to navigate and contains the most data in an easy-to-access format.
I have tried 3 methods of conversion:
1. pdf2docx pip package is slow and unreliable.
2. Calling Word in win32com is good but really slow.
3. Calling Adobe in win32com is good and also much faster than the others. This is the one I am using.
The problem is that all of them randomly create headers from the text at the top of each page. This is what I am trying to solve. Is there no way to include the headers as part of the body text? Even if there is only a way to do it in Word, for example, it could be called and automated. Does anybody have any idea?
EDIT: I am only using non-scanned pdfs. So these things should be extractable.
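One possible workaround, sketched here rather than tested against the converters above, is to read the header text straight out of the converted .docx so it can be merged back into the body data. A .docx file is a zip package, and this sketch assumes the converter stores the headers in parts named `word/header1.xml`, `word/header2.xml` and so on, which is the usual convention but not guaranteed. `header_paragraphs` is a hypothetical helper name; it only collects plain text (tabs and line breaks are ignored) and does not tell you which page a header came from.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by paragraph (w:p) and text (w:t) elements.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def header_paragraphs(docx_path):
    """Return {part name: [paragraph text, ...]} for each header part found.

    Assumes header parts follow the conventional word/headerN.xml naming.
    """
    result = {}
    with zipfile.ZipFile(docx_path) as package:
        for name in sorted(package.namelist()):
            if not re.fullmatch(r"word/header\d*\.xml", name):
                continue
            root = ET.fromstring(package.read(name))
            paragraphs = []
            for paragraph in root.iter(f"{{{W_NS}}}p"):
                # Join the text of all runs inside this paragraph.
                text = "".join(node.text or ""
                               for node in paragraph.iter(f"{{{W_NS}}}t"))
                if text.strip():
                    paragraphs.append(text)
            result[name] = paragraphs
    return result
```

Calling `header_paragraphs("converted.docx")` (the file name is a placeholder) would give the header text per part, which could then be added to the list of body paragraphs before filtering.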