Isolate all images from a pdf document

Isolate all images from a pdf document - cybertooth - Oct-07-2023

Hi everyone. I am trying to extract the images from a PDF file. The problem is that the images are not all extracted separately: the PDF shows 3 images, but the code returns only 2. Images 1 and 2 come out as one image, and image 3 as the second. Can someone help me solve this issue? The PDF file is attached.

import io
import tkinter as tk
from tkinter import filedialog

import fitz  # PyMuPDF
from PIL import Image, ImageTk
 
 
def select_pdf_file():
    """Load images from a PDF file"""
    global images
 
    if file := filedialog.askopenfilename(filetypes=[("PDF", "*.pdf")]):
        images = []
        with fitz.open(file) as doc:
            for page in doc:
                for xref, *_ in page.get_images():
                    image = doc.extract_image(xref)
                    images.append(Image.open(io.BytesIO(image["image"])))
        show_image(0)


def show_image(index):
    """Display selected image"""
    global img_index
 
    canvas.delete("all")
    size = (canvas.winfo_width(), canvas.winfo_height())
    if images:
        img_index = index % len(images)
        image = images[img_index].copy()  # Keep original image
        image.thumbnail(size)
        image = ImageTk.PhotoImage(image)
        canvas.create_image(
            size[0] / 2, size[1] / 2, anchor=tk.CENTER, image=image
        )
        canvas.image = image
    else:
        canvas.create_text(
            size[0] / 2, size[1] / 2, anchor=tk.CENTER, text="No Images",
            fill="white", font=('Helvetica 15 bold')
        )


images = []
img_index = 0
root = tk.Tk()
canvas = tk.Canvas(root, bg="black")
canvas.bind("<Configure>", lambda event: show_image(img_index))
canvas.pack(padx=10, pady=10, side=tk.TOP, expand=True, fill=tk.BOTH)
 
bbar = tk.Frame(root)
bbar.pack(side=tk.TOP, fill=tk.X, padx=10, pady=(0, 10))
button = tk.Button(bbar, text="<<", command=lambda: show_image(img_index-1))
button.pack(side=tk.LEFT)
button = tk.Button(bbar, text="Select PDF", command=select_pdf_file)
button.pack(side=tk.LEFT, expand=True, fill=tk.X)

button = tk.Button(bbar, text=">>", command=lambda: show_image(img_index+1))
button.pack(side=tk.LEFT)

root.mainloop()

[attachment=2594]

RE: Isolate all images from a pdf document - Pedroski55 - Oct-07-2023

I would say, your source pdf file only contains 2 images, because this also only gives me 2 images.

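from pypdf import PdfReader  # assumed import; the original post does not show it

path2pdf = '/home/pedro/pdfs/pdfs/doc3.pdf'
savepath = '/home/pedro/pdfExtractedPages/pdf2jpg/'
reader = PdfReader(path2pdf)
page = reader.pages[0]  # first page only
count = 0
for image_file_object in page.images:
    # Prefix a counter so files with the same name do not overwrite each other
    with open(savepath + str(count) + image_file_object.name, "wb") as fp:
        fp.write(image_file_object.data)
        count += 1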

Make a pdf which definitely contains 3 images and try that!

RE: Isolate all images from a pdf document - Gribouillis - Oct-07-2023

I also tried to extract the images using the pdfimages command that comes with poppler-utils and it only extracts two images.
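
A rough way to run the same check from Python is sketched below, assuming PyMuPDF (the fitz module used in the first post). It prints the pixel size of every embedded image, using the tuple layout that PyMuPDF documents for page.get_images(). The helper names summarize_images and list_pdf_images are made up for this sketch and are not part of any library.

```python
def summarize_images(image_entries):
    """Return (xref, width, height) for each entry from page.get_images()."""
    # PyMuPDF documents each entry as a tuple starting with
    # (xref, smask, width, height, bpc, colorspace, ...)
    return [(entry[0], entry[2], entry[3]) for entry in image_entries]


def list_pdf_images(pdf_path):
    """Print the size of every image embedded in each page of a PDF."""
    # Imported here so that summarize_images can be used without PyMuPDF
    import fitz

    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            entries = page.get_images(full=True)
            for xref, width, height in summarize_images(entries):
                print(f"page {page_number}: xref={xref} size={width}x{height}")
```

Calling list_pdf_images on the attached file should list one line per embedded image. An entry that is much wider than the others would hint that several pictures were stored as one image.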

RE: Isolate all images from a pdf document - cybertooth - Oct-08-2023

(Oct-07-2023, 02:48 PM)Pedroski55 Wrote: I would say, your source pdf file only contains 2 images, because this also only gives me 2 images.
Make a pdf which definitely contains 3 images and try that!

Yes, that is exactly the problem. I get only two images, although this PDF was made from 3 images, numbered 1, 2 and 3. I am looking for a way to get all three.

RE: Isolate all images from a pdf document - Gribouillis - Oct-08-2023

(Oct-08-2023, 07:14 AM)cybertooth Wrote: while this pdf was made using 3 images

How exactly was the PDF made? Apparently it contains only 2 images.

RE: Isolate all images from a pdf document - DPaul - Oct-08-2023

Hi,
I'm interested in this post (but not contributing to a solution).
I rarely handle PDFs with pictures.
But I am wondering what kind of pictures the PdfReader can retrieve,
i.e. how did the pictures get onto the PDF page?
From Photoshop, or even MS Word "save as PDF"? ...
Surely, if it's a PDF made of photocopies with pictures on them, it can't isolate them?
Just curious, as I might encounter such a PDF one day.
thx,
Paul

RE: Isolate all images from a pdf document - Gribouillis - Oct-08-2023

(Oct-08-2023, 08:18 AM)DPaul Wrote: Surely, if it's a PDF made of photocopies with pictures on them, it can't isolate them?

The pdf contains only 2 images. The first one is a montage of the flowers numbered 1 and 2 and the second image is a single flower.

For example, the following image is a single image; it was created with the Linux command (montage is part of ImageMagick):

montage -mode concatenate b.jpg b.jpg m.jpg
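
If the goal is still to get the flowers as separate images, one option is to cut the montage apart after extraction. The sketch below assumes the tiles sit side by side in a single row and all have the same width, which may not hold for the attached PDF. The names tile_boxes and split_montage are hypothetical helpers; the image argument is expected to be a Pillow image such as the ones stored in the images list of the first post.

```python
def tile_boxes(width, height, n_tiles):
    """Crop boxes (left, upper, right, lower) for equal-width tiles in one row."""
    tile_width = width // n_tiles
    return [
        (i * tile_width, 0, (i + 1) * tile_width, height)
        for i in range(n_tiles)
    ]


def split_montage(image, n_tiles):
    """Split a side-by-side montage (a Pillow image) into its tiles."""
    width, height = image.size
    # Image.crop takes a (left, upper, right, lower) box and returns a new image
    return [image.crop(box) for box in tile_boxes(width, height, n_tiles)]
```

For the PDF in this thread, something like split_montage(images[0], 2) might recover flowers 1 and 2 as two images, provided the first extracted image really is a two-tile montage.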

RE: Isolate all images from a pdf document - DPaul - Oct-08-2023

(Oct-08-2023, 08:27 AM)Gribouillis Wrote: For example the following image is a single image, it was created with the linux command

Sure, an image is a file (jpg, png, tif, ...) of x_pixels by y_pixels, and what is on it can be anything.
My question is:
a) Paste this picture into e.g. MS Word and "save as" PDF -> I assume the PdfReader can isolate it (as one picture).
b) Print this image on a white page, then scan the page as PDF -> I assume the PdfReader cannot see the picture.
Am I correct? Or is PdfReader equipped with some magical algorithms?

thx,
Paul