May-31-2024, 09:04 PM
(This post was last modified: May-31-2024, 09:28 PM by Gribouillis.)
I have the code below. Its purpose is to extract each paragraph that contains an asterisk, along with its associated photos, from a PDF document into a Word document.
The code runs and exports the paragraphs that contain an asterisk into a Word document, but it does not limit itself to the photos associated with each paragraph: it sometimes exports photos that belong to another paragraph. How can I modify the code so that it ONLY exports the images directly below the paragraph?
import io

import fitz  # PyMuPDF
from docx import Document
from docx.shared import Inches
from PIL import Image

# Load the PDF document
pdf_document = fitz.open("Sample Home.pdf")

# Create a Word document
word_document = Document()

# Iterate through each page of the PDF
for page_num in range(pdf_document.page_count):
    page = pdf_document.load_page(page_num)
    blocks = page.get_text("blocks")
    for block in blocks:
        block_text = block[4]
        # Check if the paragraph includes an asterisk
        if '*' in block_text:
            # Add the paragraph to the Word document
            word_document.add_paragraph(block_text)
            # Extract images associated with this paragraph
            image_list = page.get_images(full=True)
            for image_index, img in enumerate(image_list):
                xref = img[0]
                base_image = pdf_document.extract_image(xref)
                image_bytes = base_image["image"]
                # Load image using PIL
                image = Image.open(io.BytesIO(image_bytes))
                image_filename = f"image_{page_num}_{image_index}.png"
                image.save(image_filename)
                # Add image to the Word document
                word_document.add_picture(image_filename, width=Inches(5))

# Save the Word document
word_document.save("Extracted_Paragraphs_and_Images.docx")
Gribouillis wrote May-31-2024, 09:28 PM:
Please post all code, output and errors (in their entirety) between their respective tags. Refer to the BBCode help topic on how to post. Use the "Preview Post" button to make sure the code is presented as you expect before hitting the "Post Reply/Thread" button.
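One possible direction, offered as a sketch rather than a tested fix for this particular PDF: the posted loop calls `page.get_images(full=True)` for every matching paragraph, which lists every image on the page regardless of where it sits. Filtering by position instead would mean comparing bounding boxes. The helper below is hypothetical (the name `images_below_paragraph`, its parameters, and the `tolerance` value are assumptions for illustration). It takes the paragraph's bounding box, the bounding boxes of the page's text blocks, and a list of dicts with `"xref"` and `"bbox"` keys, and keeps only the images that start below the paragraph and above the next text block in the same column.

```python
def images_below_paragraph(paragraph_bbox, text_bboxes, image_infos, tolerance=2.0):
    """Return xrefs of images lying between a paragraph and the next text block.

    paragraph_bbox: (x0, y0, x1, y1) of the paragraph, y growing downward.
    text_bboxes:    (x0, y0, x1, y1) tuples for the page's text blocks.
    image_infos:    dicts with at least "xref" and "bbox" keys.
    tolerance:      slack in points for images that touch the paragraph
                    (an assumed value, tune it for the actual layout).
    """
    px0, _, px1, py1 = paragraph_bbox

    def overlaps_horizontally(bbox):
        # True when the box shares some horizontal extent with the paragraph
        return min(px1, bbox[2]) > max(px0, bbox[0])

    # Top edge of the nearest text block that starts below the paragraph
    next_tops = [
        b[1] for b in text_bboxes
        if b[1] >= py1 and overlaps_horizontally(b)
    ]
    limit = min(next_tops, default=float("inf"))

    selected = []
    for info in image_infos:
        top = info["bbox"][1]
        starts_below = top >= py1 - tolerance
        before_next_text = top < limit
        if starts_below and before_next_text and overlaps_horizontally(info["bbox"]):
            selected.append(info)

    # Keep reading order, top to bottom
    selected.sort(key=lambda info: info["bbox"][1])
    return [info["xref"] for info in selected]


if __name__ == "__main__":
    paragraph = (50, 100, 500, 140)
    texts = [(50, 20, 500, 60), paragraph, (50, 400, 500, 440)]
    images = [
        {"xref": 3, "bbox": (60, 65, 300, 95)},    # above the paragraph
        {"xref": 7, "bbox": (60, 150, 300, 380)},  # directly below it
        {"xref": 9, "bbox": (60, 450, 300, 600)},  # below the next text block
    ]
    print(images_below_paragraph(paragraph, texts, images))
```

To wire this into the posted loop, the paragraph box would be `block[:4]`, the text boxes could be taken from the same `page.get_text("blocks")` result (keeping blocks whose type field `block[6]` is 0), and the image list could come from `page.get_image_info(xrefs=True)`, which reports a `bbox` and an `xref` per image. Each returned xref would then go to `pdf_document.extract_image(xref)` as before. Entries with an xref of 0 are probably best skipped, since they cannot be extracted that way. Whether "directly below" should also require the image to sit within some maximum distance of the paragraph depends on the layout of the document.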