Python Forum

Full Version: Reading All The RAW Data Inside a PDF
Hi, can anyone suggest code that will return all the raw data in a PDF, including any special tags/markup applied to the text?

Appreciate you all.

-Jim
I've been looking at some of the PDF libraries myself, and from what I know so far, I'd suggest you take a look at PyPDF2
(Nov-30-2022, 06:58 PM) rob101 wrote: I've been looking at some of the PDF libraries myself, and from what I know so far, I'd suggest you take a look at PyPDF2

Yes, I tried this one already, and when I used:

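import PyPDF2

# Assign file
file_name = "STRIVE December Schedule -A.pdf"

# Legacy PyPDF2 names; PyPDF2 3.x renamed these to PdfReader, pages and extract_text()
doc = PyPDF2.PdfFileReader(file_name)

# Number of pages
pages = doc.getNumPages()

# Loop over page indexes (the reader object itself is not iterable page by page)
for i in range(pages):
    current_page = doc.getPage(i)
    text = current_page.extractText()

    print(text)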
The text returned was the "readable" text from the PDF. What I want is a level BELOW that, where I can see the raw markup/tags applied to all the text.
Ah, okay. Well, the only other one I've used is pdfrw 0.4.

I've not used it for what you're trying to do, but you may find something there that will work for you.
If you really want to get down to the nitty-gritty, see: https://opensource.adobe.com/dc-acrobat-...arted.html
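For a rough idea of what that lower level looks like, here is a minimal sketch that uses only the standard library. It scans the file for `stream ... endstream` sections and tries to inflate each one with zlib, which is how Flate-compressed page content is stored. The function names `iter_raw_streams` and `dump_raw_streams` are made up for this example, not part of any library. It is a naive scan, not a real parser: it ignores the `/Length` and `/Filter` entries, encryption, and any filter other than Flate, so treat it as a way to peek at the operators rather than a robust tool.

```python
import re
import zlib

# Bytes between a "stream" keyword and the next "endstream".
# The lookbehind stops the "stream" inside "endstream" from matching.
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)


def iter_raw_streams(pdf_bytes):
    """Yield each stream body, inflated when it is Flate-compressed."""
    for match in STREAM_RE.finditer(pdf_bytes):
        body = match.group(1)
        try:
            # decompressobj tolerates the end-of-line bytes that follow the data
            yield zlib.decompressobj().decompress(body)
        except zlib.error:
            # Not Flate data: return it as stored, minus the trailing line break
            yield body.rstrip(b"\r\n")


def dump_raw_streams(path):
    """Print every stream found in the PDF file at `path`."""
    with open(path, "rb") as handle:
        data = handle.read()
    for number, stream in enumerate(iter_raw_streams(data), start=1):
        print(f"--- stream {number} ---")
        # latin-1 maps every byte to a character, so decoding never fails
        print(stream.decode("latin-1"))
```

Calling `dump_raw_streams("STRIVE December Schedule -A.pdf")` should print the page content operators (for example `BT`, `Tf`, `Tj`, `ET` around text) along with fonts, images, and other binary streams, which will look like noise.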