Python Forum
Isolate all images from a pdf document
#1
Hi everyone. I am trying to isolate the images from a pdf file. The problem I am facing is that not all of the images are isolated: out of the 3 images, the code returns only 2. Images 1 and 2 come out as one image, and image 3 as the second. Can someone help me solve this issue? Please find the pdf file attached.

import io
import tkinter as tk
from tkinter import filedialog

import fitz  # PyMuPDF
from PIL import Image, ImageTk
 
 
def select_pdf_file():
    """Load images from a PDF file"""
    global images
 
    if file := filedialog.askopenfilename(filetypes=[("PDF", "*.pdf")]):
        images = []
        with fitz.open(file) as doc:
            for page in doc:
                for xref, *_ in page.get_images():
                    image = doc.extract_image(xref)
                    images.append(Image.open(io.BytesIO(image["image"])))
        show_image(0)


def show_image(index):
    """Display selected image"""
    global img_index
 
    canvas.delete("all")
    size = (canvas.winfo_width(), canvas.winfo_height())
    if images:
        img_index = index % len(images)
        image = images[img_index].copy()  # Keep original image
        image.thumbnail(size)
        image = ImageTk.PhotoImage(image)
        canvas.create_image(
            size[0] / 2, size[1] / 2, anchor=tk.CENTER, image=image
        )
        canvas.image = image
    else:
        canvas.create_text(
            size[0] / 2, size[1] / 2, anchor=tk.CENTER, text="No Images",
            fill="white", font=('Helvetica 15 bold')
        )


images = []
img_index = 0
root = tk.Tk()
canvas = tk.Canvas(root, bg="black")
canvas.bind("<Configure>", lambda event: show_image(img_index))
canvas.pack(padx=10, pady=10, side=tk.TOP, expand=True, fill=tk.BOTH)
 
bbar = tk.Frame(root)
bbar.pack(side=tk.TOP, fill=tk.X, padx=10, pady=(0, 10))
button = tk.Button(bbar, text="<<", command=lambda: show_image(img_index-1))
button.pack(side=tk.LEFT)
button = tk.Button(bbar, text="Select PDF", command=select_pdf_file)
button.pack(side=tk.LEFT, expand=True, fill=tk.X)

button = tk.Button(bbar, text=">>", command=lambda: show_image(img_index+1))
button.pack(side=tk.LEFT)

root.mainloop()

Attachment: doc3.pdf (Size: 201.99 KB)
#2
I would say, your source pdf file only contains 2 images, because this also only gives me 2 images.

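from pypdf import PdfReader  # assumed: the post did not show its import

path2pdf = '/home/pedro/pdfs/pdfs/doc3.pdf'
savepath = '/home/pedro/pdfExtractedPages/pdf2jpg/'
reader = PdfReader(path2pdf)
page = reader.pages[0]  # first page only
count = 0
for image_file_object in page.images:
    # Prefix each file name with a counter so files cannot overwrite each other
    with open(savepath + str(count) + image_file_object.name, "wb") as fp:
        fp.write(image_file_object.data)
        count += 1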
Make a pdf which definitely contains 3 images and try that!
#3
I also tried to extract the images using the pdfimages command that comes with poppler-utils and it only extracts two images.
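
One way to check this from Python is to print the pixel size of every embedded image. A montage of two photos would typically show up as a single entry that is roughly twice as wide as one photo. The following is a minimal sketch, assuming PyMuPDF (fitz) is installed; list_embedded_images and format_report are hypothetical helper names, not part of any library.

```python
def list_embedded_images(pdf_path):
    """Return one dict per embedded image: page, xref, width, height, ext."""
    import fitz  # PyMuPDF; imported here so format_report works without it

    infos = []
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            for xref, *_ in page.get_images():
                image = doc.extract_image(xref)
                infos.append({
                    "page": page_number,
                    "xref": xref,
                    "width": image["width"],
                    "height": image["height"],
                    "ext": image["ext"],
                })
    return infos


def format_report(infos):
    """Turn the image records into readable lines, one per image."""
    return [
        f"page {i['page']}: xref {i['xref']}, "
        f"{i['width']}x{i['height']} px ({i['ext']})"
        for i in infos
    ]
```

Calling print("\n".join(format_report(list_embedded_images("doc3.pdf")))) should then list one line per stored image, which makes it easy to see how many images the file really holds.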
#4
(Oct-07-2023, 02:48 PM)Pedroski55 Wrote: I would say, your source pdf file only contains 2 images, because this also only gives me 2 images.

path2pdf = '/home/pedro/pdfs/pdfs/doc3.pdf'
savepath = '/home/pedro/pdfExtractedPages/pdf2jpg/'
reader = PdfReader(path2pdf)
page = reader.pages[0]
count = 0
for image_file_object in page.images:
    with open(savepath + str(count) + image_file_object.name, "wb") as fp:
        fp.write(image_file_object.data)
        count += 1
Make a pdf which definitely contains 3 images and try that!

Yes, that is exactly the problem. I get only two images, while this pdf was made using 3 images, and they are numbered 1, 2, 3. I am looking for a way to get all three.
#5
(Oct-08-2023, 07:14 AM)cybertooth Wrote: while this pdf was made using 3 images
How exactly was the pdf made? Apparently it contains only 2 images.
#6
Hi,
I'm interested in this post (but not contributing to a solution).
I rarely handle pdfs with pictures.
But I am wondering what kind of pictures PdfReader can retrieve,
i.e. how did the pictures get onto the pdf page?
From Photoshop, or even MS Word "save as pdf"?
Surely, if it's a pdf made of photocopies with pictures on them, it can't isolate them ?
Just curious, as I might encounter such a pdf one day.
thx,
Paul
It is more important to do the right thing, than to do the thing right.(P.Drucker)
Better is the enemy of good. (Montesquieu) = French version for 'kiss'.
#7
(Oct-08-2023, 08:18 AM)DPaul Wrote: Surely, if it's a pdf made of photocopies with pictures on them, it can't isolate them ?
The pdf contains only 2 images. The first one is a montage of the flowers numbered 1 and 2 and the second image is a single flower.

For example the following image is a single image, it was created with the linux command
Output:
montage -mode concatenate b.jpg b.jpg m.jpg
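
If the flowers really were joined like this before the pdf was made, the only way to get them back as separate pictures is to cut the extracted montage apart again. The following is a minimal sketch, assuming the tiles sit side by side and have equal width (the layout of the montage in doc3.pdf is an assumption here); tile_boxes and split_montage are hypothetical helper names, and split_montage expects a Pillow image or any object with size and crop().

```python
def tile_boxes(width, height, columns):
    """Return (left, upper, right, lower) crop boxes for equal-width tiles."""
    if columns < 1:
        raise ValueError("columns must be at least 1")
    tile_width = width // columns
    boxes = []
    for i in range(columns):
        left = i * tile_width
        # Let the last tile absorb any leftover pixels from integer division.
        right = width if i == columns - 1 else left + tile_width
        boxes.append((left, 0, right, height))
    return boxes


def split_montage(image, columns=2):
    """Crop an image into `columns` side-by-side tiles."""
    width, height = image.size
    return [image.crop(box) for box in tile_boxes(width, height, columns)]
```

In the viewer from post #1, the montage could be passed through split_montage and the resulting tiles added with images.extend(...) instead of images.append(...). This only works when you already know which image is a montage and how many tiles it has.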

Attached Files: Thumbnail(s)
#8
(Oct-08-2023, 08:27 AM)Gribouillis Wrote: For example the following image is a single image, it was created with the linux command
Sure, an image is a (jpg, png, tif, ...) file with x_pixels by y_pixels, and what is on it can be anything.
My question is:
a) Paste this picture into e.g. MS Word and "save as" pdf -> I assume PdfReader can isolate it (as one pic).
b) Print this image on a white page, scan the page as pdf -> I assume PdfReader cannot see the picture.
Am I correct? Or is PdfReader equipped with some magical algorithms?

thx,
Paul
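
On case b): a scanner normally stores each page as one page-sized bitmap, so an extractor would return the whole page as a single image, not the picture printed on it. A rough way to spot such pages is to check whether a single image covers almost the entire page. The following is a minimal sketch under that assumption; coverage and looks_like_scan are hypothetical helper names, rectangles are plain (x0, y0, x1, y1) tuples obtained from whatever PDF library you use, and the 0.9 threshold is an arbitrary choice.

```python
def coverage(image_rect, page_rect):
    """Fraction of the page area covered by the image rectangle (0.0 to 1.0)."""
    ix0, iy0, ix1, iy1 = image_rect
    px0, py0, px1, py1 = page_rect
    # Clip the image rectangle to the page before measuring it.
    overlap_w = max(0.0, min(ix1, px1) - max(ix0, px0))
    overlap_h = max(0.0, min(iy1, py1) - max(iy0, py0))
    page_area = (px1 - px0) * (py1 - py0)
    return (overlap_w * overlap_h) / page_area if page_area > 0 else 0.0


def looks_like_scan(image_rects, page_rect, threshold=0.9):
    """Guess that a page is a scan if a single image covers most of it."""
    return (
        len(image_rects) == 1
        and coverage(image_rects[0], page_rect) >= threshold
    )
```

For an A4 page of 595 x 842 points, looks_like_scan([(0, 0, 595, 842)], (0, 0, 595, 842)) is True, while a small picture placed on a text page gives False.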