Python Forum
Merging multiple pdf's - Printable Version




Merging multiple pdf's - jonathan2582 - Apr-30-2020

Hi, is there a way to merge multiple PDFs according to their names (i.e. if the first seven characters of the name are equal)? I have thousands of PDFs that have been split and need to merge them back. Here are some of the files as an example:

02-020694.pdf
02-020694-0.pdf
02-020694-1.pdf
04-011200.pdf
04-011200-0.pdf
Thanks


RE: Merging multiple pdf's - Larz60+ - May-01-2020

you can use package pypdf (you'll need to install with pip install pypdf from command line)
there's an example here: https://stackoverflow.com/a/3444735


RE: Merging multiple pdf's - jonathan2582 - May-01-2020

(May-01-2020, 12:36 AM)Larz60+ Wrote: you can use package pypdf (you'll need to install with pip install pypdf from command line)
there's an example here: https://stackoverflow.com/a/3444735

I have it installed but not sure how to do the pattern recognition, as in the name characters.


RE: Merging multiple pdf's - Larz60+ - May-01-2020

did you look at the link?


RE: Merging multiple pdf's - jonathan2582 - May-01-2020

I did and still nothing, is it even possible? I'm very new to python.


RE: Merging multiple pdf's - Larz60+ - May-01-2020

The link I provided merges several pdfs into one. Isn't that what you wanted?


RE: Merging multiple pdf's - DPaul - May-02-2020

It seems that the link provided by Larz60+ does everything you would need.
Maybe your problem is the file name pattern matching.
That would be easy if you can confirm that all filenames should match on the first 9 chars + ("-" + "suffix") + ".pdf"
If len 9 is not the rule, we need to look at more data.
Paul


RE: Merging multiple pdf's - DeaD_EyE - May-02-2020

I polished the code a bit.
Instead of PdfFileReader and PdfFileWriter, use PdfFileMerger.
To close many open files, use the ExitStack from contextlib.
Use argparse for a nice command line interface.


#!/usr/bin/env python3

"""
Merge PDF-Files

The order of supplied input files is preserved
The output_file is automatically removed from the list of input_files
"""
from argparse import ArgumentParser
# https://docs.python.org/3/library/argparse.html#argparse.ArgumentParser
from contextlib import ExitStack
# https://docs.python.org/3/library/contextlib.html#contextlib.ExitStack
from pathlib import Path
# https://docs.python.org/3/library/pathlib.html#basic-use


from PyPDF2 import PdfFileMerger
# https://pythonhosted.org/PyPDF2/
# use PdfFileMerger instead of Writer


def pdf_cat(input_files, output_file):
    # remove the output_file from input_files, which
    # could happen if using globbing "*.pdf" for input files
    input_files = [file for file in input_files if file != output_file]
    with ExitStack() as stack, open(output_file, "wb") as fd_out:
        # stack and fd_out are handled by the context manager;
        # leaving the with block calls their __exit__ methods,
        # which closes all files (input_files, fd_out)
        # even if an exception occurs
        open_pdf_files = [
            stack.enter_context(open(fname, "rb"))
            for fname in input_files
        ]
        # each opened input file was registered on the ExitStack,
        # so the stack now holds all open input files
        merger = PdfFileMerger(strict=False)
        for file in open_pdf_files:
            print("Processing", file.name)
            merger.append(file)
        print("Writing to", fd_out.name)
        merger.write(fd_out)
        # some broken or big PDF files
        # seem to block merger.write
        # Look at PyPDF3 or PyPDF4, which are forks;
        # maybe some of these bugs have been fixed there


if __name__ == '__main__':
    parser = ArgumentParser(description=__doc__)
    parser.add_argument("input_pdfs", nargs="+", type=Path, help="Input PDF Files")
    parser.add_argument("output_pdf", type=Path, help="Output PDF File")
    args = parser.parse_args()
    pdf_cat(args.input_pdfs, args.output_pdf)
    
The order of the merged pages depends on the order of your input files.
If you use shell globbing, the files may arrive in lexical order, but you should not rely on it.
In addition, lexical order can be the wrong order for some filename patterns.

In general, the sort order can be changed with the key parameter of the built-in sorted function. The key is a callable that returns a number, a tuple of numbers, or a str.

Let's invent a crazy filename pattern that looks hard to sort:
Month_Year_Id_sometext.pdf
Let's say we want to sort by (year, month) and ignore the id.
We have the _ as separator between the fields.

def key_func(filename):
    try:
        month, year, _id, name = filename.split("_", maxsplit=3)
        # maxsplit=3 yields at most 4 fields, so the last field
        # (the name) may contain further underscores without being split
        month, year = int(month), int(year)
    except ValueError:
        # raised if the filename has too few underscores
        # or if month or year is not an int;
        # not sortable, so these files sort first
        return (0, 0)
    else:
        # sort first by year, then by month if the years are equal
        return (year, month)


file_names = [
    "12_2020_54_sometext.pdf",
    "11_2019_14_sometext.pdf",
    "12_2021_41_sometext.pdf",
    "10_2020_12_sometext.pdf"
]
print(sorted(file_names, key=key_func))
print("Big first")
print(sorted(file_names, key=key_func, reverse=True))
Sometimes the filename has no delimiter that could be used to split the data.
Sometimes it follows a different pattern, for example an ISO 8601 timestamp.
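
For a filename without a usable delimiter, a regular expression can pull out the part to sort by. This is only a sketch: the filenames and the timestamp layout (ISO 8601 basic format, e.g. 20200502T101500) are made up for illustration, and timestamp_key is a hypothetical helper name.

```python
import re
from datetime import datetime

# Assumption: the filename contains an ISO 8601 basic timestamp
# such as 20200502T101500 somewhere in it.
TIMESTAMP = re.compile(r"(\d{8}T\d{6})")


def timestamp_key(filename):
    match = TIMESTAMP.search(filename)
    if match is None:
        # no timestamp found, sort these files first
        return datetime.min
    try:
        return datetime.strptime(match.group(1), "%Y%m%dT%H%M%S")
    except ValueError:
        # digits matched, but they are not a valid date/time
        return datetime.min


file_names = [
    "scan20200502T101500.pdf",
    "scan20190101T235959.pdf",
    "notes.pdf",
    "scan20200430T080000.pdf",
]
print(sorted(file_names, key=timestamp_key))
```

Files without a recognizable timestamp end up at the front, the same way key_func returns (0, 0) for names it cannot parse.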

You can combine the merger and the sorting function for filenames and change it for your task.
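
For the filenames from the first post, the grouping could look roughly like this. It is a sketch, not tested on real data: it assumes every group shares the first 9 characters (as in 02-020694) and that the parts carry a numeric suffix like -0, -1. group_by_prefix and part_key are hypothetical helper names, and the output name in the commented-out call is only an example.

```python
from collections import defaultdict
from pathlib import Path

PREFIX_LEN = 9  # assumption: names like "02-020694" identify a group


def part_key(path, prefix_len=PREFIX_LEN):
    # "02-020694.pdf" -> "", "02-020694-1.pdf" -> "1"
    suffix = path.stem[prefix_len:].lstrip("-")
    if not suffix:
        return (0, 0, "")            # file without suffix comes first
    if suffix.isdigit():
        return (1, int(suffix), "")  # then numeric suffixes, as numbers
    return (2, 0, suffix)            # anything else goes last


def group_by_prefix(paths, prefix_len=PREFIX_LEN):
    groups = defaultdict(list)
    for path in map(Path, paths):
        groups[path.name[:prefix_len]].append(path)
    # sort the parts inside each group
    return {
        prefix: sorted(parts, key=lambda p: part_key(p, prefix_len))
        for prefix, parts in groups.items()
    }


names = [
    "02-020694-1.pdf",
    "04-011200.pdf",
    "02-020694.pdf",
    "02-020694-0.pdf",
    "04-011200-0.pdf",
]
groups = group_by_prefix(names)
for prefix, parts in groups.items():
    print(prefix, [p.name for p in parts])
    # with pdf_cat from the script above, for example:
    # pdf_cat(parts, Path("merged") / (prefix + ".pdf"))
```

In a real run the names would come from something like Path(".").glob("*.pdf"). Writing the merged files into a separate directory keeps them from being picked up as input on the next run.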