Comparing PDFs

CaseCRS · (This post was last modified: Mar-31-2023, 10:38 AM by CaseCRS.)

Hi there,
I want to start off by saying I have 0 experience with coding, and I am mainly asking to see whether this would be possible.

I want to create an app that compares one or many child PDFs to a master PDF, highlights any differences in each child PDF, and then saves the processed child PDFs with the differences visibly highlighted.
I've seen tools that compare PDFs and output the differences as text, but highlighting the changes directly in each PDF would suit my needs better.
This is what I've got so far. It opens a basic GUI that lets you select a master and 1 child, but it doesn't seem to do anything when I hit the compare button.

import tkinter as tk
from tkinter import filedialog, messagebox
import fitz


class PDFCompare:
    def __init__(self, master):
        self.master = master
        master.title("PDF Compare")
        
        self.master_file = None
        self.child_file = None
        self.result_file = None
        
        self.master_label = tk.Label(master, text="Master PDF:")
        self.master_label.grid(row=0, column=0, sticky="w")
        self.master_button = tk.Button(master, text="Select", command=self.select_master_pdf)
        self.master_button.grid(row=0, column=1, sticky="w")

        self.child_label = tk.Label(master, text="Child PDF:")
        self.child_label.grid(row=1, column=0, sticky="w")
        self.child_button = tk.Button(master, text="Select", command=self.select_child_pdf)
        self.child_button.grid(row=1, column=1, sticky="w")

        self.compare_button = tk.Button(master, text="Compare", command=self.compare_pdfs)
        self.compare_button.grid(row=2, column=0, sticky="w")

    def select_master_pdf(self):
        self.master_file = filedialog.askopenfilename(title="Select Master PDF", filetypes=[("PDF Files", "*.pdf")])

    def select_child_pdf(self):
        self.child_file = filedialog.askopenfilename(title="Select Child PDF", filetypes=[("PDF Files", "*.pdf")])

    def compare_pdfs(self):
        if self.master_file is None or self.child_file is None:
            messagebox.showerror("Error", "Please select both master and child PDFs.")
            return

        try:
            master_doc = fitz.open(self.master_file)
            child_doc = fitz.open(self.child_file)
        except:
            messagebox.showerror("Error", "Failed to open PDF files.")
            return

        result_doc = fitz.open()

        for parent_page in master_doc:
            child_page = child_doc[int(parent_page.number) - 1]
            result = parent_page.compare(child_page)
            if result:
                diff_rects = result[0].rects
                for rect in diff_rects:
                    highlight = result_doc.add_highlight_annot(rect)
                    highlight.update()
            else:
                result_doc.insert_pdf(parent_page)

        if not result_doc:
            messagebox.showwarning("Warning", "No differences found.")
            return

        output_file = filedialog.asksaveasfilename(title="Save Output PDF", filetypes=[("PDF Files", "*.pdf")])
        if not output_file.endswith(".pdf"):
            output_file += ".pdf"

        try:
            result_doc.save(output_file)
            messagebox.showinfo("Success", "Comparison complete. Results saved to {}".format(output_file))
        except:
            messagebox.showerror("Error", "Failed to save output file.")


root = tk.Tk()
app = PDFCompare(root)
root.mainloop()
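
A note on the comparison step in the code above: as far as the PyMuPDF documentation goes, pages do not appear to have a `compare` method, and `add_highlight_annot` belongs to a page rather than a document. If so, the callback would raise an exception that Tkinter prints to the console instead of showing in the window, which would explain why nothing seems to happen. The sketch below shows one possible text-based alternative, assuming the PDFs contain extractable text (not scans) and have a single page. The names `changed_word_indices` and `highlight_differences` are made up for illustration.

```python
import difflib


def changed_word_indices(master_words, child_words):
    """Return the indices of child words that were replaced or inserted."""
    matcher = difflib.SequenceMatcher(a=master_words, b=child_words, autojunk=False)
    changed = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "delete" is skipped: words missing from the child have no
        # position in the child page that could be highlighted.
        if tag in ("replace", "insert"):
            changed.extend(range(j1, j2))
    return changed


def highlight_differences(master_path, child_path, output_path):
    """Save a copy of the child PDF with changed words highlighted (first page only)."""
    import fitz  # PyMuPDF; imported here so the helper above runs without it

    master_doc = fitz.open(master_path)
    child_doc = fitz.open(child_path)
    try:
        # Each word is a tuple: (x0, y0, x1, y1, text, block_no, line_no, word_no)
        master_words = master_doc[0].get_text("words")
        child_page = child_doc[0]
        child_words = child_page.get_text("words")

        indices = changed_word_indices(
            [word[4] for word in master_words],
            [word[4] for word in child_words],
        )
        for index in indices:
            child_page.add_highlight_annot(fitz.Rect(child_words[index][:4]))
        child_doc.save(output_path)
    finally:
        master_doc.close()
        child_doc.close()
    return len(indices)
```

For 20 versions of the same page, `highlight_differences` could be called once per child file in a loop, writing each result to its own output file.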

Some additional info:
All the PDFs I need to compare are 1 page long; however, I could have 20 versions of the same page, 1 being the master and the other 19 being "child" PDFs.
[Image: an example of the parent PDF]

[Image: an example of the child PDF]

rob101 · Mar-31-2023, 01:23 PM

Given that you "have 0 experience with coding", I do wonder where the code that you have came from, but that's an aside.

My approach would differ from the one that you have taken...

First, I'd only start reading the PDF file if I knew that it had been altered. For that, I'd generate an MD5 hash digest for each of the two files; if the digests are the same, then the files are also the same. Then, for files that are clearly different, do whatever needs to be done.
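
The hash check described above could look something like the following sketch. It is only a pre-filter: identical digests mean the files are byte-for-byte the same, but two PDFs that look identical can still hash differently (for example, when only embedded metadata such as timestamps differs). The function names `file_digest` and `changed_children` are made up for illustration.

```python
import hashlib


def file_digest(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def changed_children(master_path, child_paths):
    """Return the child paths whose contents differ from the master file."""
    master_digest = file_digest(master_path)
    return [path for path in child_paths if file_digest(path) != master_digest]
```

Only the files returned by `changed_children` would then need to be opened with a PDF library.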

The only PDF library that I've used is pdfrw 0.4, which is really easy to work with. My motivation (and a working app; see PDF file split and copy) was the need to split images from one PDF file and add them to another PDF file. The files do not (in my case) have any text; just photo images.

Gribouillis · Mar-31-2023, 01:45 PM

As rob101 said, if you have 0 experience with coding, I don't think you can create this app. It is much harder than "hitting the compare button". A very experienced programmer is needed.

CaseCRS · Mar-31-2023, 03:28 PM

(Mar-31-2023, 01:23 PM)rob101 Wrote: Given that you "have 0 experience with coding", I do wonder where the code that you have came from, but that's an aside.

My approach would differ from the one that you have taken...

First, I'd only start reading the PDF file if I knew that it had been altered. For that, I'd generate an MD5 hash digest for each of the two files; if the digests are the same, then the files are also the same. Then, for files that are clearly different, do whatever needs to be done.

The only PDF library that I've used is pdfrw 0.4, which is really easy to work with. My motivation (and a working app; see PDF file split and copy) was the need to split images from one PDF file and add them to another PDF file. The files do not (in my case) have any text; just photo images.

Appreciate the feedback.
I've been playing around with ChatGPT to try and work through the coding for this; however, my end goal is proving to be too far outside the scope of what it is capable of so far, and clearly beyond my attempts to put something together without any prior experience.
Think I'll need to either start learning/practicing Python or wait until someone comes out with a solution.

Thanks again for the feedback.

rob101 · Mar-31-2023, 03:49 PM

(Mar-31-2023, 03:28 PM)CaseCRS Wrote: Think I'll need to either start learning/practicing Python or wait until someone comes out with a solution.

Thanks again for the feedback.

You're very welcome. One of the motivating factors for me to learn Python was the wish to be able to code something as and when the need arose. Since then, it [Python] has become an ever deeper 'rabbit hole': a journey that is incredibly satisfying, sometimes frustrating, but overwhelmingly enjoyable. Time permitting, I'd recommend that you learn to code with Python.

DPaul · Apr-01-2023, 05:46 AM

I am not the "very experienced" programmer that Gribouillis talks about,
but I am confronted with text mixed with images/drawings every day (in a totally different context).
It is a problem, because all the OCR software that I use is good at recognising text, not shapes or images.
Furthermore, it is bad at estimating white space between objects, which is essential for your project.

I see 2 possibilities:
With Python: compare the 2 images on a pixel-by-pixel basis. For this, the 2 scans would need to be exactly the same,
or you could calibrate them on a fixed object in the image. Feasible, but you would be duplicating my second solution.

Without Python: you could do this in 2 minutes using something like Photoshop. Put master and child in 2 layers,
highlight the differences. Done.
Paul
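
The pixel-by-pixel comparison in the first possibility can be sketched as follows. This is a minimal illustration, not a full solution: it assumes both pages have already been rendered to equally sized grayscale images, represented here as plain nested lists so that the example has no dependencies. The name `diff_bounding_box` and the `tolerance` parameter are made up for illustration.

```python
def diff_bounding_box(master_pixels, child_pixels, tolerance=0):
    """Return (left, top, right, bottom) enclosing all differing pixels, or None.

    Both arguments are equally sized 2-D lists of grayscale values (0-255).
    The right and bottom edges are exclusive. A pixel counts as different
    when the two values differ by more than `tolerance`.
    """
    if len(master_pixels) != len(child_pixels) or any(
        len(m_row) != len(c_row)
        for m_row, c_row in zip(master_pixels, child_pixels)
    ):
        raise ValueError("images must have the same dimensions")

    xs, ys = [], []
    for y, (m_row, c_row) in enumerate(zip(master_pixels, child_pixels)):
        for x, (m_value, c_value) in enumerate(zip(m_row, c_row)):
            if abs(m_value - c_value) > tolerance:
                xs.append(x)
                ys.append(y)

    if not xs:
        return None
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


# A 4x3 white "page" and a copy with two dark pixels.
master = [[255] * 4 for _ in range(3)]
child = [row[:] for row in master]
child[1][2] = 0
child[2][3] = 0
print(diff_bounding_box(master, child))  # (2, 1, 4, 3)
```

On real images, Pillow's `ImageChops.difference(a, b).getbbox()` gives a comparable bounding box. As noted above, this only works if the two images line up exactly; a small `tolerance` can absorb scanner noise but not a shifted or rotated scan.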
