Python Forum
Extract text from PDF
#1
   

Hi. Newbie to Python.

I have a PDF that I review each week to make sure the records are set up correctly in Excel.

I want a script that pulls out each item's description, sell price and save price, with one item per row.

Can this be done in Python?

Thanks in advance.
#2
Hi, yes it can.
But there are many kinds of PDF, and also many Python modules that will do the job.
We need to know more.
- How are the PDFs created (Word, photocopies...)?
- How is the data arranged (plain text, columns, grids...)? Is it always laid out the same way as in the example?
"PDF" is just a type of document, with many possible approaches.
Paul

PS. I seem to recall that a similar question was asked some months ago, if not longer.
It is more important to do the right thing, than to do the thing right.(P.Drucker)
Better is the enemy of good. (Montesquieu) = French version for 'kiss'.
#3
You can do this easily. I recommend PyMuPDF, which is imported as fitz.

The tutorial at https://pymupdf.readthedocs.io/en/latest/tutorial.html has great information.

Here is a little example of how to get the text and image blocks from the manual I got with my induction cooker.

# PyMuPDF is imported under the name fitz
import fitz
from pprint import pprint

# go here for very good information on using fitz
# https://pymupdf.readthedocs.io/en/latest/tutorial.html
path2pdf = '/home/pedro/pdfs/pdfs/user_manual_ce208.pdf'

doc = fitz.open(path2pdf)
num_pages = doc.page_count # 36

# the first page of the pdf is page 0
# loop to manipulate each page in turn
# for page in doc:
# for now just get 1 page
page = doc.load_page(0)
d = page.get_text("dict") # big because contains images
blocks = d["blocks"]  # the list of block dictionaries
# text blocks are type 0
textblocks = [b for b in blocks if b["type"] == 0]
# image blocks are type 1
imgblocks = [b for b in blocks if b["type"] == 1]
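# show the full structure of the first text block
pprint(textblocks[0])
# the text of the first span on the first line of that block
print(textblocks[0]['lines'][0]['spans'][0]['text'])
# or simply get all the text on the page as one plain string
text = page.get_text('text')
print(text)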
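
Once you have the plain text, getting from there to rows for Excel could look something like the sketch below. I haven't seen your PDF, so the layout is a guess: I am assuming each item sits on one line, like "Chocolate Biscuits 200g $3.50 Save $1.00". The pattern ROW_PATTERN, the helper names parse_rows, rows_to_csv and pdf_text, and the sample text are all made up for illustration, so you will almost certainly need to adapt the regex to your real file.

```python
import csv
import io
import re

# ASSUMPTION: one item per line of extracted text, in the form
#   "<description> $<sell price> Save $<save price>"
# The real PDF layout is unknown, so this pattern is hypothetical.
ROW_PATTERN = re.compile(
    r"^(?P<description>.+?)\s+"
    r"\$(?P<sell>\d+(?:\.\d{2})?)\s+"
    r"Save\s+\$(?P<save>\d+(?:\.\d{2})?)\s*$",
    re.IGNORECASE,
)


def parse_rows(text):
    """Return (description, sell, save) tuples for every matching line."""
    rows = []
    for line in text.splitlines():
        match = ROW_PATTERN.match(line.strip())
        if match:  # lines that do not look like an item are skipped
            rows.append((match["description"], match["sell"], match["save"]))
    return rows


def rows_to_csv(rows):
    """Build CSV text (Excel can open it) with one item per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["description", "sell_price", "save_price"])
    writer.writerows(rows)
    return buffer.getvalue()


def pdf_text(path):
    """Join the plain text of every page; needs PyMuPDF installed."""
    import fitz  # imported here so the parsing above works without it

    doc = fitz.open(path)
    try:
        return "\n".join(page.get_text("text") for page in doc)
    finally:
        doc.close()


# Made-up sample text standing in for pdf_text("your_file.pdf")
sample = (
    "Weekly specials\n"
    "Chocolate Biscuits 200g $3.50 Save $1.00\n"
    "Orange Juice 2L $4 Save $0.50\n"
    "Page 1\n"
)
print(rows_to_csv(parse_rows(sample)))
```

If the description and the prices turn out to be on separate lines or in columns, the "dict" or "blocks" output shown above is probably a better starting point than plain text, because it keeps the position of every piece of text.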