Python Forum
Python Code Help - pip install PyMuPDF python-docx pillow
#1
The code below is meant to extract every paragraph that contains an asterisk, together with its associated photos, from a PDF document into a Word document.

The code runs and exports the paragraphs that contain an asterisk into a Word document, but it does not pick up only the photos associated with each paragraph; it sometimes exports photos that belong to another paragraph. How can I modify the code so that it exports ONLY the images directly below the paragraph?
import fitz  # PyMuPDF
from docx import Document
from docx.shared import Inches
import io
from PIL import Image

# Load the PDF document
pdf_document = fitz.open("Sample Home.pdf")

# Create a Word document
word_document = Document()

# Iterate through each page of the PDF
for page_num in range(pdf_document.page_count):
    page = pdf_document.load_page(page_num)
    blocks = page.get_text("blocks")

    for block in blocks:
        block_text = block[4]

        # Check if the paragraph includes an asterisk
        if '*' in block_text:
            # Add the paragraph to the Word document
            word_document.add_paragraph(block_text)

            # Extract images associated with this paragraph
            image_list = page.get_images(full=True)
            for image_index, img in enumerate(image_list):
                xref = img[0]
                base_image = pdf_document.extract_image(xref)
                image_bytes = base_image["image"]

                # Load image using PIL
                image = Image.open(io.BytesIO(image_bytes))
                image_filename = f"image_{page_num}_{image_index}.png"
                image.save(image_filename)

                # Add image to the Word document
                word_document.add_picture(image_filename, width=Inches(5))

# Save the Word document
word_document.save("Extracted_Paragraphs_and_Images.docx")
Gribouillis wrote May-31-2024, 09:28 PM:
Please post all code, output and errors (in their entirety) between their respective tags. Refer to the BBCode help topic on how to post. Use the "Preview Post" button to make sure the code is presented as you expect before hitting the "Post Reply/Thread" button.
#2
Got a sample PDF to experiment on?
#3
I do. I tried uploading the PDF to my post, but the file is too large.
#4
Just a little bit of the PDF will do, say 4 or 5 pages, as long as they contain the type of data you are looking for.

I tried your code on the PDF manual for my new induction cooker. It worked ok!

import fitz  # PyMuPDF

pdf_document = fitz.open("/home/pedro/pdfs/pdfs/user_manual_ce208.pdf")
page = pdf_document.load_page(0)
# get_text("blocks") returns one tuple per block: its coordinates, then its text
blocks = page.get_text("blocks")
for block in blocks:
    if 'Instruction manual' in str(block):
        print(type(block), block)
The above prints:

Output:
<class 'tuple'> (320.31451416015625, 132.03631591796875, 445.85235595703125, 147.72772216796875, 'Instruction manual\n', 3, 0)
<class 'tuple'> (118.06179809570312, 468.69854736328125, 195.3158721923828, 478.35479736328125, 'Instruction manual\n', 5, 0)
Not sure what the last two integers in each tuple represent!

Just looked them up, the tuple is:

Quote: (x0, y0, x1, y1, "lines in block", block_no, block_type)
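
Since every block carries its own coordinates, one way to keep only the images directly below a paragraph might be to compare positions: take the paragraph's bottom edge (y1) and keep the images whose top edge falls between it and the next text block further down. What follows is only a rough sketch under assumptions, not a tested fix for your PDF. The helper names image_boxes and images_below are made up, the sample blocks and xrefs are invented, the page is assumed to be single-column, and image_boxes assumes page.get_image_info(xrefs=True) reports a "bbox" and an "xref" for each image on the page.

```python
def image_boxes(page):
    """Return (xref, (x0, y0, x1, y1)) for each image drawn on a PyMuPDF page.

    Hypothetical helper: assumes page.get_image_info(xrefs=True) yields
    dicts that contain "xref" and "bbox" entries.
    """
    return [(info["xref"], tuple(info["bbox"]))
            for info in page.get_image_info(xrefs=True)]


def images_below(block, blocks, images):
    """Return xrefs of images lying between `block` and the next text block.

    block  -- one tuple from page.get_text("blocks")
    blocks -- all block tuples on the same page
    images -- (xref, (x0, y0, x1, y1)) pairs, e.g. from image_boxes()
    """
    bottom = block[3]  # y1 of the paragraph; y grows downward on the page
    # Top edges of the text blocks (block_type 0) that start below this one.
    lower_tops = [b[1] for b in blocks if b[6] == 0 and b[1] >= bottom]
    # The nearest one marks where this paragraph's "own" area ends.
    limit = min(lower_tops, default=float("inf"))
    return [xref for xref, (x0, y0, x1, y1) in images if bottom <= y0 < limit]


# Invented sample data in the same tuple layout as get_text("blocks").
blocks = [
    (50, 100, 500, 140, "* Roof shingles are damaged\n", 0, 0),
    (50, 400, 500, 440, "Gutters are in good condition\n", 1, 0),
]
images = [(11, (50, 150, 300, 380)), (12, (50, 450, 300, 680))]

print(images_below(blocks[0], blocks, images))  # [11]
```

In your loop, the xrefs from images_below(block, blocks, image_boxes(page)) would stand in for the full page.get_images(full=True) list, so each asterisk paragraph only pulls in the pictures sitting between it and the next paragraph. If the image boxes overlap the text boxes slightly, a small tolerance on the comparison may be needed, and a caption between the paragraph and its photo would cut the search short.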

