Python Forum

I have the code below. Its purpose is to extract the paragraphs that contain an asterisk, along with their associated photos, from a PDF document into a Word document.

The code works in that it exports the paragraphs containing an asterisk into a Word document, but it does not grab only the photos associated with each paragraph; it sometimes exports photos that belong to another paragraph. How can I modify the code so that it ONLY exports the images directly below the paragraph?

import fitz  # PyMuPDF
from docx import Document
from docx.shared import Inches
import io
from PIL import Image

# Load the PDF document
pdf_document = fitz.open("Sample Home.pdf")

# Create a Word document
word_document = Document()

# Iterate through each page of the PDF
for page_num in range(pdf_document.page_count):
    page = pdf_document.load_page(page_num)
    blocks = page.get_text("blocks")

    for block in blocks:
        block_text = block[4]

        # Check if the paragraph includes an asterisk
        if '*' in block_text:
            # Add the paragraph to the Word document
            word_document.add_paragraph(block_text)

            # Extract images associated with this paragraph
            image_list = page.get_images(full=True)
            for image_index, img in enumerate(image_list):
                xref = img[0]
                base_image = pdf_document.extract_image(xref)
                image_bytes = base_image["image"]

                # Load image using PIL
                image = Image.open(io.BytesIO(image_bytes))
                image_filename = f"image_{page_num}_{image_index}.png"
                image.save(image_filename)

                # Add image to the Word document
                word_document.add_picture(image_filename, width=Inches(5))

# Save the Word document
word_document.save("Extracted_Paragraphs_and_Images.docx")

Got a sample PDF to experiment on?

I do, I tried uploading the PDF to my post but the file size is too large.

Just a little bit of the PDF will do, say 4 or 5 pages, as long as they contain the type of data you are looking for.

I tried your code on the PDF manual for my new induction cooker. It worked ok!

import fitz  # PyMuPDF

pdf_document = fitz.open("/home/pedro/pdfs/pdfs/user_manual_ce208.pdf")
page = pdf_document.load_page(0)
# this returns tuples of the block coordinates and the text
blocks = page.get_text("blocks")
for block in blocks:
    # block[4] holds the text of the block
    if 'Instruction manual' in block[4]:
        print(type(block), block)

The above prints:

Output:
<class 'tuple'> (320.31451416015625, 132.03631591796875, 445.85235595703125, 147.72772216796875, 'Instruction manual\n', 3, 0)
<class 'tuple'> (118.06179809570312, 468.69854736328125, 195.3158721923828, 478.35479736328125, 'Instruction manual\n', 5, 0)

Not sure what the last two integers in each tuple represent!

Just looked them up, the tuple is:

Quote:
(x0, y0, x1, y1, "lines in block", block_no, block_type)
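
Since each block tuple carries its coordinates, one possible way to keep only the images directly below a paragraph is to compare positions: take the bottom edge (y1) of the asterisk paragraph and the top edge (y0) of the next text block, and keep the images whose top edge falls between the two. Below is a minimal sketch of that idea, not a tested fix for the original PDF. The function name images_below, the tol parameter and the sample coordinates are all made up for illustration. It assumes a single-column layout, that y grows downward on the page, and that you can get a rectangle for each image (I believe recent PyMuPDF versions offer page.get_image_rects(xref) for this, but check the docs for your version).

```python
def images_below(blocks, image_rects, marker="*", tol=2.0):
    """Pair each marked text block with the images between it and the next text block.

    blocks:      tuples shaped like page.get_text("blocks") output,
                 (x0, y0, x1, y1, text, block_no, block_type)
    image_rects: list of (xref, (x0, y0, x1, y1)) pairs for the same page
                 (hypothetical input; how you build it depends on your PyMuPDF version)
    tol:         slack in points, in case an image slightly overlaps the paragraph
    """
    # Keep text blocks only (block_type 0) and sort them top to bottom.
    text_blocks = sorted((b for b in blocks if b[6] == 0), key=lambda b: b[1])
    result = []
    for i, block in enumerate(text_blocks):
        if marker not in block[4]:
            continue
        top = block[3]  # bottom edge of the marked paragraph
        # Top edge of the next text block, or "end of page" if there is none.
        bottom = text_blocks[i + 1][1] if i + 1 < len(text_blocks) else float("inf")
        # An image belongs to this paragraph if its top edge lies in between.
        xrefs = [xref for xref, rect in image_rects if top - tol <= rect[1] < bottom]
        result.append((block[4], xrefs))
    return result


# Made-up page: two asterisk paragraphs, one plain paragraph, three images.
blocks = [
    (50, 100, 500, 120, "* Cracked tile in bathroom\n", 0, 0),
    (50, 400, 500, 420, "Kitchen is in good condition\n", 1, 0),
    (50, 700, 500, 720, "* Loose handrail\n", 2, 0),
]
image_rects = [
    (11, (50, 130, 300, 380)),  # directly below the first paragraph
    (12, (50, 430, 300, 680)),  # below the plain paragraph, should be skipped
    (13, (50, 730, 300, 900)),  # directly below the last paragraph
]
pairs = images_below(blocks, image_rects)
for text, xrefs in pairs:
    print(text.strip(), xrefs)
```

In the original loop, you would then call pdf_document.extract_image(xref) only for the xrefs returned for each paragraph, instead of for every image that page.get_images() reports. Photos that continue onto the next page, or captions sitting between a paragraph and its photo, would need extra handling.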

Splishsplash92

Pedroski55
