Python Forum
Reading All The RAW Data Inside a PDF
#1
Hi, can anyone suggest code that will return all the raw data in a PDF, including any special tags or markup applied to the text?

Appreciate you all.

-Jim
Reply
#2
I've been looking at some of the PDF libraries myself, and from what I know so far, I'd suggest you take a look at PyPDF2.
Sig:
>>> import this

The UNIX philosophy: "Do one thing, and do it well."

"The danger of computers becoming like humans is not as great as the danger of humans becoming like computers." :~ Konrad Zuse

"Everything should be made as simple as possible, but not simpler." :~ Albert Einstein
Reply
#3
(Nov-30-2022, 06:58 PM)rob101 Wrote: I've been looking at some of the PDF libraries myself, and from what I know so far, I'd suggest you take a look at PyPDF2

Yes, I tried this one already, and when I used:

import PyPDF2

# Assign file
file_name = "STRIVE December Schedule -A.pdf"

# Legacy camelCase API (PyPDF2 < 3.0)
doc = PyPDF2.PdfFileReader(file_name)

# Number of pages
pages = doc.getNumPages()

# Print the extracted text of each page
for i in range(pages):
    current_page = doc.getPage(i)
    text = current_page.extractText()

    print(text)
The text returned was the "readable" text from the PDF. What I want is a level BELOW that, where I can see the raw markup/tags applied to all the text.
Larz60+ wrote Nov-30-2022, 10:55 PM:
Please post all code, output and errors (in their entirety) between their respective tags. Refer to the BBCode help topic on how to post. Use the "Preview Post" button to make sure the code is presented as you expect before hitting the "Post Reply/Thread" button.
Fixed for you this time. Please use BBCode tags on future posts.
Reply
#4
Ah, okay. Well, the only other one I've used is pdfrw 0.4.

I've not used it for what you're trying to do, but you may find something there that will work for you.
Reply
#5
If you really want to get down to the nitty-gritty, see: https://opensource.adobe.com/dc-acrobat-...arted.html
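For a first look at what sits "below" the readable text, a rough standard-library sketch like the one here may be enough. It is not taken from any of the libraries mentioned in this thread: it scans the file for "stream ... endstream" sections and tries to inflate each one with zlib. The function names (iter_raw_streams, dump_raw_streams) are made up for illustration, and it assumes the streams are either uncompressed or Flate-compressed and that the file is not encrypted. Anything else is returned as-is.

```python
import re
import zlib

# Body of each "stream ... endstream" section; the lookbehind skips the
# word "stream" when it is just the tail of "endstream".
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)


def iter_raw_streams(pdf_bytes):
    """Yield the bytes of every stream found, inflated where possible."""
    for match in STREAM_RE.finditer(pdf_bytes):
        data = match.group(1)
        try:
            # Assumption: compressed streams use Flate (zlib).
            # decompressobj tolerates the end-of-line left before "endstream".
            yield zlib.decompressobj().decompress(data)
        except zlib.error:
            # Not zlib data: uncompressed or another filter, so pass it through.
            yield data


def dump_raw_streams(path):
    """Print every stream in the PDF at `path` (hypothetical helper)."""
    with open(path, "rb") as handle:
        pdf_bytes = handle.read()
    for index, stream in enumerate(iter_raw_streams(pdf_bytes)):
        print(f"--- stream {index} ---")
        # latin-1 maps every byte to a character, so decoding never fails.
        print(stream.decode("latin-1"))
```

Calling dump_raw_streams("STRIVE December Schedule -A.pdf") should print page content streams, where text shows up between BT and ET operators with font and positioning commands around it, alongside fonts, images and other binary streams. A real parser is still needed for anything beyond inspection.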
Reply

