Python Forum
Reading All The RAW Data Inside a PDF




Reading All The RAW Data Inside a PDF - NBAComputerMan - Nov-30-2022

Hi, can anyone suggest code that will return all the raw data in a PDF, including any special tags/markup applied to the text?

Appreciate you all.

-Jim


RE: Reading All The RAW Data Inside a PDF - rob101 - Nov-30-2022

I've been looking at some of the PDF libraries myself, and from what I know so far, I'd suggest you take a look at PyPDF2.


RE: Reading All The RAW Data Inside a PDF - NBAComputerMan - Nov-30-2022

(Nov-30-2022, 06:58 PM)rob101 Wrote: I've been looking at some of the PDF libraries myself, and from what I know so far, I'd suggest you take a look at PyPDF2

Yes, I tried this one already, and when I used:

import PyPDF2

# Assign file
file_name = "STRIVE December Schedule -A.pdf"

# Legacy PyPDF2 method names; later releases renamed them
doc = PyPDF2.PdfFileReader(file_name)

# Number of pages
pages = doc.getNumPages()

# Loop over page indexes (the reader object is not iterated directly)
for i in range(pages):
    current_page = doc.getPage(i)
    text = current_page.extractText()

    print(text)

The text returned was the "readable" text from the PDF. What I want is a level BELOW that, where I can see the raw markup/tags applied to all the text.


RE: Reading All The RAW Data Inside a PDF - rob101 - Nov-30-2022

Ah, okay. Well, the only other one I've used is pdfrw 0.4.

I've not used it for what you're trying to do, but you may find something there that will work for you.


RE: Reading All The RAW Data Inside a PDF - Larz60+ - Nov-30-2022

If you really want to get down to the nitty-gritty, see: https://opensource.adobe.com/dc-acrobat-sdk-docs/pdflsdk/gettingstarted.html
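
For a look at the "level below" the readable text without any third-party library, here is a minimal sketch using only the standard library. It assumes the usual PDF layout, where page content (operators such as BT, Tf, Tj and any marked-content tags) sits between the "stream" and "endstream" keywords and is most often zlib-compressed (/FlateDecode). The names STREAM_RE, extract_streams and dump_pdf_streams are made up for this example. It is deliberately naive: it does not handle encrypted files, other filters, or streams packed inside object streams, so treat it as a starting point, not a parser.

```python
import re
import zlib
from typing import List

# Naive pattern: everything between a "stream" keyword and the next "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def extract_streams(pdf_bytes: bytes) -> List[bytes]:
    """Return the body of every stream found, inflated when it is zlib data."""
    streams = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the end-of-line bytes left before "endstream"
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            # Not zlib data (uncompressed or another filter): keep the raw bytes
            data = raw.rstrip(b"\r\n")
        streams.append(data)
    return streams


def dump_pdf_streams(path: str) -> None:
    """Print each stream of the file at `path` (hypothetical helper for this sketch)."""
    with open(path, "rb") as fh:
        for number, data in enumerate(extract_streams(fh.read()), start=1):
            print(f"--- stream {number} ({len(data)} bytes) ---")
            # latin-1 never fails to decode, so binary streams print as gibberish
            print(data.decode("latin-1"))
```

Calling dump_pdf_streams("STRIVE December Schedule -A.pdf") would print every stream in the file. Text-bearing content streams show up as readable operator sequences, while fonts and images come out as binary noise.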