Isolate all images from a pdf document

cybertooth · Oct-07-2023, 07:46 AM

Hi everyone. I am trying to isolate the images from a PDF file. The problem I am facing is that not all of the images are isolated: out of the 3 images, the code isolates only 2. Images 1 and 2 come out as one image, and image 3 as the second image. Can someone help me solve this issue? Please find the PDF file attached.

import fitz  # PyMuPDF
import io
import tkinter as tk
from tkinter import filedialog
from PIL import Image, ImageTk
 
 
def select_pdf_file():
    """Load images from a PDF file"""
    global images
 
    if file := filedialog.askopenfilename(filetypes=[("PDF", "*.pdf")]):
        images = []
        with fitz.open(file) as doc:
            for page in doc:
                for xref, *_ in page.get_images():
                    image = doc.extract_image(xref)
                    images.append(Image.open(io.BytesIO(image["image"])))
        show_image(0)


def show_image(index):
    """Display selected image"""
    global img_index
 
    canvas.delete("all")
    size = (canvas.winfo_width(), canvas.winfo_height())
    if images:
        img_index = index % len(images)
        image = images[img_index].copy()  # Keep original image
        image.thumbnail(size)
        image = ImageTk.PhotoImage(image)
        canvas.create_image(
            size[0] / 2, size[1] / 2, anchor=tk.CENTER, image=image
        )
        canvas.image = image
    else:
        canvas.create_text(
            size[0] / 2, size[1] / 2, anchor=tk.CENTER, text="No Images",
            fill="white", font=('Helvetica 15 bold')
        )


images = []
img_index = 0
root = tk.Tk()
canvas = tk.Canvas(root, bg="black")
canvas.bind("<Configure>", lambda event: show_image(img_index))
canvas.pack(padx=10, pady=10, side=tk.TOP, expand=True, fill=tk.BOTH)
 
bbar = tk.Frame(root)
bbar.pack(side=tk.TOP, fill=tk.X, padx=10, pady=(0, 10))
button = tk.Button(bbar, text="<<", command=lambda: show_image(img_index-1))
button.pack(side=tk.LEFT)
button = tk.Button(bbar, text="Select PDF", command=select_pdf_file)
button.pack(side=tk.LEFT, expand=True, fill=tk.X)

button = tk.Button(bbar, text=">>", command=lambda: show_image(img_index+1))
button.pack(side=tk.LEFT)

root.mainloop()

doc3.pdf (Size: 201.99 KB / Downloads: 168)

Pedroski55 · Oct-07-2023, 02:48 PM

I would say, your source pdf file only contains 2 images, because this also only gives me 2 images.

from pypdf import PdfReader  # assumed library; PyPDF2 3.x exposes the same name

path2pdf = '/home/pedro/pdfs/pdfs/doc3.pdf'
savepath = '/home/pedro/pdfExtractedPages/pdf2jpg/'
reader = PdfReader(path2pdf)
page = reader.pages[0]
count = 0
for image_file_object in page.images:
    with open(savepath + str(count) + image_file_object.name, "wb") as fp:
        fp.write(image_file_object.data)
        count += 1

Make a pdf which definitely contains 3 images and try that!

**Gribouillis** · Oct-07-2023, 04:18 PM

I also tried to extract the images using the pdfimages command that comes with poppler-utils and it only extracts two images.

cybertooth · Oct-08-2023, 07:14 AM

(Oct-07-2023, 02:48 PM)Pedroski55 Wrote: I would say, your source pdf file only contains 2 images, because this also only gives me 2 images.
Make a pdf which definitely contains 3 images and try that!

Yes, that is exactly the problem. I get only two images, although this PDF was made using 3 images, and they are numbered 1, 2, 3. I am looking for a way to get all three.

**Gribouillis** · Oct-08-2023, 07:31 AM

(Oct-08-2023, 07:14 AM)cybertooth Wrote: while this pdf was made using 3 images

How exactly was the PDF made? Apparently it contains only 2 images.

DPaul · Oct-08-2023, 08:18 AM

Hi,
I'm interested in this post (but not contributing to a solution).
I rarely handle PDFs with pictures.
But I am wondering what kind of pictures PdfReader can retrieve,
i.e. how did the pictures get onto the PDF page?
From Photoshop, or even MS Word's "save as PDF"?
Surely, if it's a PDF made of photocopies with pictures on them, it can't isolate them?
Just curious, as I might encounter such a PDF one day.
thx,
Paul

**Gribouillis** · Oct-08-2023, 08:27 AM (This post was last modified: Oct-08-2023, 08:37 AM by Gribouillis.)

(Oct-08-2023, 08:18 AM)DPaul Wrote: Surely, if it's a pdf made of photocopies with pictures on them, it can't isolate them ?

The pdf contains only 2 images. The first one is a montage of the flowers numbered 1 and 2 and the second image is a single flower.

For example, the following image is a single image; it was created with the Linux command

montage -mode concatenate b.jpg b.jpg m.jpg
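
If the first embedded image really is a montage of flowers 1 and 2, extraction tools will keep returning it as one picture, so one possible workaround is to crop it after extraction. The following is only a minimal sketch, assuming the montage is an even grid (two equal columns by default, which has to be checked by eye for doc3.pdf). `tile_boxes` and `split_montage` are hypothetical helper names, not part of Pillow or PyMuPDF; `split_montage` expects a Pillow `Image` object such as the ones built in `select_pdf_file`.

```python
def tile_boxes(width, height, columns=2, rows=1):
    """Return (left, upper, right, lower) crop boxes for an even grid."""
    boxes = []
    for r in range(rows):
        for c in range(columns):
            left = c * width // columns
            right = (c + 1) * width // columns
            upper = r * height // rows
            lower = (r + 1) * height // rows
            boxes.append((left, upper, right, lower))
    return boxes


def split_montage(image, columns=2, rows=1):
    """Crop a Pillow image into equal tiles (assumes an even grid)."""
    width, height = image.size
    return [image.crop(box) for box in tile_boxes(width, height, columns, rows)]
```

In the original script this could be used as `images.extend(split_montage(img))` for the image known to be a montage, instead of `images.append(img)`. The split is purely geometric; it does not detect where one flower ends and the next begins.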

DPaul · Oct-08-2023, 08:55 AM

(Oct-08-2023, 08:27 AM)Gribouillis Wrote: For example the following image is a single image, it was created with the linux command

Sure, an image is a (jpg, png, tif, ...) file with x_pixels by y_pixels, and what is on it can be anything.
My question is:
a) Paste this picture into e.g. MS Word and "save as" PDF -> I assume PdfReader can isolate it (as one picture).
b) Print this image on a white page and scan the page as PDF -> I assume PdfReader cannot see the picture.
Am I correct? Or is PdfReader equipped with some magical algorithms?

thx,
Paul
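
A hedged note on cases a) and b): extraction libraries such as pypdf and PyMuPDF return the image objects embedded in the PDF; they do not search the pixels for pictures. A scanned page is typically stored as one page-sized image, so extraction would return the whole scan rather than the picture printed on it. The sketch below is one rough way to guess whether a PDF looks scanned, by checking whether some image covers most of each page. The names `coverage` and `looks_scanned` and the 0.9 threshold are assumptions made for illustration.

```python
def coverage(image_rect, page_rect):
    """Fraction of the page area covered by the image rectangle."""
    ix0, iy0, ix1, iy1 = image_rect
    px0, py0, px1, py1 = page_rect
    # Clip the image rectangle to the page before measuring it.
    w = max(0.0, min(ix1, px1) - max(ix0, px0))
    h = max(0.0, min(iy1, py1) - max(iy0, py0))
    page_area = (px1 - px0) * (py1 - py0)
    return (w * h) / page_area if page_area else 0.0


def looks_scanned(path, threshold=0.9):
    """Guess whether every page is dominated by one page-sized image."""
    import fitz  # PyMuPDF, imported here so coverage() works without it

    with fitz.open(path) as doc:
        for page in doc:
            rects = [
                rect
                for xref, *_ in page.get_images()
                for rect in page.get_image_rects(xref)
            ]
            page_rect = tuple(page.rect)
            if not any(coverage(tuple(r), page_rect) >= threshold for r in rects):
                return False
        return doc.page_count > 0
```

This is only a heuristic: a page with a large background picture would also pass, and some scanners split a page into several layered images.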
