Hi, can anyone suggest code I can use that will return all the raw data in a PDF (including any special tags/markup applied to the text)?
Appreciate you all.
-Jim
I've been looking at some of the PDF libraries myself, and from what I know so far, I'd suggest you take a look at PyPDF2.
(Nov-30-2022, 06:58 PM) rob101 wrote: I've been looking at some of the PDF libraries myself, and from what I know so far, I'd suggest you take a look at PyPDF2
Yes, I tried this one already, and when I used:
import PyPDF2

# Assign file
file_name = "STRIVE December Schedule -A.pdf"
doc = PyPDF2.PdfReader(file_name)

# Number of pages
pages = len(doc.pages)

# Extract and print the text of each page
for i in range(pages):
    current_page = doc.pages[i]
    text = current_page.extract_text()
    print(text)
The text returned was the "readable" text from the PDF. What I want is a level BELOW that, where I can see the raw markup/tags applied to all the text.
Ah, okay. Well, the only other one I've used is pdfrw 0.4.
I've not used it for what you're trying to do, but you may find something there that will work for you.
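For what it's worth, here is a rough sketch of one way to look at the level below the readable text without any third-party library. The text and drawing operators in a PDF live in "stream" objects, which are usually zlib-compressed (the FlateDecode filter), so scanning the file bytes for those streams and decompressing them shows the raw operators. The helper name raw_streams and the hand-built sample bytes are made up for illustration. The sketch ignores other filters, object streams, and encryption, so treat it as a starting point rather than a full parser.

```python
import re
import zlib

# Matches the bytes between each "stream" keyword and its "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)


def raw_streams(pdf_bytes):
    """Yield the bytes of every stream found in pdf_bytes (hypothetical helper).

    Streams that zlib can decompress are returned decompressed;
    anything else is returned exactly as stored in the file.
    """
    for match in STREAM_RE.finditer(pdf_bytes):
        data = match.group(1)
        try:
            yield zlib.decompress(data)
        except zlib.error:
            # Not zlib-compressed (or uses a filter this sketch ignores).
            yield data


# Tiny hand-built stand-in for a real file. With a real PDF you would
# read the bytes with open(file_name, "rb").read() instead.
content = b"BT /F1 12 Tf (Hello) Tj ET"
sample = (
    b"%PDF-1.4\n1 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
    + zlib.compress(content)
    + b"\nendstream\nendobj\n"
)

for stream in raw_streams(sample):
    print(stream.decode("latin-1"))
```

In a page's content stream, BT and ET mark the start and end of a text object, Tf selects the font and size, and Tj shows a string, which is the kind of markup that sits underneath the extracted text.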