Python Forum
Comparing PDFs
#1
Hi there,
I want to start off by saying I have 0 experience with coding, and I'm mostly asking to see whether this is even possible.

I want to create an app that compares one or more child PDFs against a master PDF, highlights any differences in each child PDF, and saves the processed child PDFs with those differences visibly highlighted.
I've seen tools that compare PDFs and output the differences as text, but highlighting the changes directly in each PDF would suit my needs better.
This is what I've got so far. It opens a basic GUI that lets you select a master and one child, but nothing seems to happen when I hit the Compare button.

import tkinter as tk
from tkinter import filedialog, messagebox
import fitz


class PDFCompare:
    def __init__(self, master):
        self.master = master
        master.title("PDF Compare")
        
        self.master_file = None
        self.child_file = None
        self.result_file = None
        
        self.master_label = tk.Label(master, text="Master PDF:")
        self.master_label.grid(row=0, column=0, sticky="w")
        self.master_button = tk.Button(master, text="Select", command=self.select_master_pdf)
        self.master_button.grid(row=0, column=1, sticky="w")

        self.child_label = tk.Label(master, text="Child PDF:")
        self.child_label.grid(row=1, column=0, sticky="w")
        self.child_button = tk.Button(master, text="Select", command=self.select_child_pdf)
        self.child_button.grid(row=1, column=1, sticky="w")

        self.compare_button = tk.Button(master, text="Compare", command=self.compare_pdfs)
        self.compare_button.grid(row=2, column=0, sticky="w")

    def select_master_pdf(self):
        self.master_file = filedialog.askopenfilename(title="Select Master PDF", filetypes=[("PDF Files", "*.pdf")])

    def select_child_pdf(self):
        self.child_file = filedialog.askopenfilename(title="Select Child PDF", filetypes=[("PDF Files", "*.pdf")])

    def compare_pdfs(self):
        # askopenfilename() returns an empty string when the dialog is cancelled.
        if not self.master_file or not self.child_file:
            messagebox.showerror("Error", "Please select both master and child PDFs.")
            return

        try:
            master_doc = fitz.open(self.master_file)
            child_doc = fitz.open(self.child_file)
        except Exception:
            messagebox.showerror("Error", "Failed to open PDF files.")
            return

        result_doc = fitz.open()

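        for parent_page in master_doc:
            # page.number is already 0-based, so it indexes the matching child page directly.
            child_page = child_doc[parent_page.number]
            # PyMuPDF's Page does not appear to provide a compare() method, so this call
            # raises AttributeError. Tkinter only prints that to the console, which is
            # why the button seems to do nothing.
            result = parent_page.compare(child_page)
            if result:
                diff_rects = result[0].rects
                for rect in diff_rects:
                    # add_highlight_annot() is a Page method, not a Document method.
                    highlight = result_doc.add_highlight_annot(rect)
                    highlight.update()
            else:
                # insert_pdf() expects a Document, not a Page.
                result_doc.insert_pdf(parent_page)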

        if not result_doc:
            messagebox.showwarning("Warning", "No differences found.")
            return

        output_file = filedialog.asksaveasfilename(title="Save Output PDF", filetypes=[("PDF Files", "*.pdf")])
        if not output_file:
            # The save dialog was cancelled.
            return
        if not output_file.endswith(".pdf"):
            output_file += ".pdf"

        try:
            result_doc.save(output_file)
            messagebox.showinfo("Success", "Comparison complete. Results saved to {}".format(output_file))
        except Exception:
            messagebox.showerror("Error", "Failed to save output file.")


root = tk.Tk()
app = PDFCompare(root)
root.mainloop()
Some additional info:
All the PDFs I need to compare are one page long; however, I could have 20 versions of the same page, one being the master and the other 19 being "child" PDFs.
An example of the parent PDF
[Image: HOd83vV.th.png]
and an example of the child PDF
[Image: HOd8FyB.th.png]
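A minimal sketch of the highlighting step described above, assuming the PDFs contain extractable text (not scans), that every file is a single page, and that PyMuPDF is installed. The function names changed_word_indices and highlight_differences are made up for illustration. It diffs the word lists of the two pages with difflib and highlights, in the child, every word that was changed or added relative to the master. Words that were deleted from the child have no position there, so this sketch does not mark them.

```python
import difflib


def changed_word_indices(master_words, child_words):
    """Return indices into child_words that were replaced or inserted
    relative to master_words."""
    matcher = difflib.SequenceMatcher(None, master_words, child_words, autojunk=False)
    changed = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            changed.extend(range(j1, j2))
    return changed


def highlight_differences(master_path, child_path, output_path):
    """Save a copy of the child PDF with changed words highlighted.

    Returns the number of highlighted words. Only page 0 is compared.
    """
    import fitz  # PyMuPDF; imported here so the diff helper works without it

    with fitz.open(master_path) as master_doc, fitz.open(child_path) as child_doc:
        # Each "words" entry is (x0, y0, x1, y1, word, block_no, line_no, word_no).
        master_words = master_doc[0].get_text("words")
        child_page = child_doc[0]
        child_words = child_page.get_text("words")

        indices = changed_word_indices(
            [w[4] for w in master_words],
            [w[4] for w in child_words],
        )
        for i in indices:
            x0, y0, x1, y1 = child_words[i][:4]
            child_page.add_highlight_annot(fitz.Rect(x0, y0, x1, y1))

        child_doc.save(output_path)
    return len(indices)
```

For the 1 master and 19 children case, highlight_differences could simply be called once per child file in a loop, writing each result to a new file name.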
Reply
#2
Given that you "have 0 experience with coding", I do wonder where the code that you have came from, but that's an aside.

My approach would differ from the one that you have taken...

First, I'd only start reading the PDF file if I knew that it had been altered. For that, I'd generate an MD5 hash digest for each of the two files; if the digests match, the files are byte-for-byte the same and there is nothing to compare. Then, for files that are clearly different, do whatever needs to be done.
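A rough sketch of that pre-check using only the standard library. The helper names file_digest and changed_children are hypothetical. Note that a different digest only tells you the bytes differ; two PDFs that look identical can still hash differently (for example because of metadata), so this only filters out exact copies.

```python
import hashlib


def file_digest(path, algorithm="md5", chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_children(master_path, child_paths):
    """Return the child paths whose bytes differ from the master file."""
    master_digest = file_digest(master_path)
    return [path for path in child_paths if file_digest(path) != master_digest]
```

Only the paths returned by changed_children would then need to be opened and compared in detail.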

The only PDF library that I've used is pdfrw 0.4, which is really easy to work with. My motivation (and a working app; see "PDF file split and copy") was the need to split images from one PDF file and add them to another PDF file. The files do not (in my case) have any text; just photo images.
Sig:
>>> import this

The UNIX philosophy: "Do one thing, and do it well."

"The danger of computers becoming like humans is not as great as the danger of humans becoming like computers." :~ Konrad Zuse

"Everything should be made as simple as possible, but not simpler." :~ Albert Einstein
Reply
#3
As rob101 said, if you have 0 experience with coding, I don't think you can create this app. It is much harder than "hitting the compare button". A very experienced programmer is needed.
Reply
#4
(Mar-31-2023, 01:23 PM)rob101 Wrote: Given that you "have 0 experience with coding", I do wonder where the code that you have came from, but that's an aside.

Appreciate the feedback.
I've been playing around with ChatGPT to try to work through the coding for this. However, my end goal is proving to be too far outside the scope of what it is capable of so far, and clearly beyond what I can put together without any prior experience.
Think I'll need to either start learning/practicing Python or wait until someone comes out with a solution.

Thanks again for the feedback.
Reply
#5
(Mar-31-2023, 03:28 PM)CaseCRS Wrote: Think I'll need to either start learning/practicing Python or wait until someone comes out with a solution.

Thanks again for the feedback.

You're very welcome. One of my motivations for learning Python was the wish to be able to code something as and when the need arose. Since then, it [Python] has become an ever deeper 'rabbit hole': a journey that is incredibly satisfying, sometimes frustrating, but overwhelmingly enjoyable. I'd recommend (time permitting) that you learn to code with Python.
Reply
#6
I am not the "very experienced" programmer that Gribouillis talks about,
but I am confronted with text mixed with images/drawings every day (in a totally different context).
It is a problem, because all the OCR software that I use is good at recognising text, not shapes or images.
Furthermore, it is bad at estimating white space between objects, which is essential for your project.

I see 2 possibilities:
With Python: compare the 2 images on a pixel-by-pixel basis. For this, the 2 scans would need to be exactly the same,
or you would need to calibrate them on a fixed object in the image. Feasible, but you would be duplicating my second solution.

Without Python: you could do this in 2 minutes using something like Photoshop. Put master and child in 2 layers,
highlight the differences. Done.
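To give an idea of the pixel-by-pixel route, here is a small sketch in plain Python. It assumes both pages have already been rendered to grayscale images of exactly the same size, passed in as raw bytes with one byte per pixel; the function name diff_bounding_box and the threshold value are my own choices, not anything standard. It returns the box that encloses every pixel that differs noticeably, which is the area you would then highlight.

```python
def diff_bounding_box(pixels_a, pixels_b, width, height, threshold=16):
    """Return (x0, y0, x1, y1) enclosing all pixels whose grayscale values
    differ by more than threshold, or None if the images match."""
    expected = width * height
    if len(pixels_a) != expected or len(pixels_b) != expected:
        raise ValueError("both images must hold width * height grayscale bytes")

    xs, ys = [], []
    for index, (a, b) in enumerate(zip(pixels_a, pixels_b)):
        if abs(a - b) > threshold:
            xs.append(index % width)   # column of the differing pixel
            ys.append(index // width)  # row of the differing pixel

    if not xs:
        return None
    # The right and bottom edges are exclusive, like most rectangle types.
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)
```

With PyMuPDF, the input could probably come from page.get_pixmap(colorspace=fitz.csGRAY), using the pixmap's samples, width and height, but I have not tested that part here. A single box is crude if there are several separate changes; splitting the page into tiles and boxing each tile would be the next step.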
Paul
It is more important to do the right thing, than to do the thing right.(P.Drucker)
Better is the enemy of good. (Montesquieu) = French version for 'kiss'.
Reply

