Python Forum
Merging multiple PDFs
#1
Hi, is there a way to merge multiple PDFs according to their names (i.e. if the first seven characters of the name are equal)? I have thousands of PDFs that have been split and need to merge them back. Here are some of the files as an example:

02-020694.pdf
02-020694-0.pdf
02-020694-1.pdf
04-011200.pdf
04-011200-0.pdf
Thanks
#2
You can use the pypdf package (install it with pip install pypdf from the command line).
There's an example here: https://stackoverflow.com/a/3444735
#3
(May-01-2020, 12:36 AM)Larz60+ Wrote: you can use package pypdf (you'll need to install with pip install pypdf from command line)
there's an example here: https://stackoverflow.com/a/3444735

I have it installed, but I'm not sure how to do the pattern matching on the characters of the file name.
#4
did you look at the link?
#5
I did and still nothing, is it even possible? I'm very new to python.
#6
The link I provided merges several pdfs into one. Isn't that what you wanted?
#7
It seems that the link provided by Larz60+ does everything you would need.
Maybe your problem is the file name pattern matching.
That would be easy if you can confirm that all filenames should match on the first 9 chars + ("-" + "suffix") + ".pdf"
If len 9 is not the rule, we need to look at more data.
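
For the matching itself, here is a minimal sketch, assuming the 9-character rule really holds for all files. It only groups the names and merges nothing. The helper name prefix and its length parameter are made up for illustration.

```python
from itertools import groupby


def prefix(name, length=9):
    """Return the leading characters that related files are assumed to share."""
    return name[:length]


names = [
    "02-020694.pdf",
    "02-020694-0.pdf",
    "02-020694-1.pdf",
    "04-011200.pdf",
    "04-011200-0.pdf",
]

# groupby only groups neighbouring items, so the names must be sorted first
groups = {
    key: list(members)
    for key, members in groupby(sorted(names), key=prefix)
}

for key, members in groups.items():
    print(key, members)
```

Note that plain sorting puts 02-020694.pdf after 02-020694-0.pdf and 02-020694-1.pdf, because "-" sorts before ".". If the base file has to come first, the order inside each group needs its own sort key.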
Paul
#8
I polished the code a bit.
Instead of PdfFileReader and PdfFileWriter, use PdfFileMerger.
To close many open files, use the ExitStack from contextlib.
Use argparse for a nice command line interface.


#!/usr/bin/env python3

"""
Merge PDF-Files

The order of supplied input files is preserved
The output_file is automatically removed from the list of input_files
"""
import sys
from argparse import ArgumentParser
#https://docs.python.org/3/library/argparse.html#argparse.ArgumentParser
from contextlib import ExitStack
# https://docs.python.org/3/library/contextlib.html#contextlib.ExitStack
from pathlib import Path
# https://docs.python.org/3/library/pathlib.html#basic-use


from PyPDF2 import PdfFileMerger
# https://pythonhosted.org/PyPDF2/
# use PdfFileMerger instead of PdfFileWriter
# note: PyPDF2 3.0 and later renamed this class to PdfMerger


def pdf_cat(input_files, output_file):
    # remove the output_file from input_files, which
    # could happen if using globbing "*.pdf" for input files
    input_files = [file for file in input_files if file != output_file]
    with ExitStack() as stack, open(output_file, "wb") as fd_out:
        # stack and fd_out are handled by the context manager
        # leaving the with block calls the __exit__ methods,
        # which close all files (input_files, fd_out)
        # if an exception occurs, the files are still closed
        open_pdf_files = [
            stack.enter_context(open(fname, "rb"))
            for fname in input_files
        ]
        # the list comprehension above opens each input file and adds it
        # to the stack, so the ExitStack closes all of them on exit
        merger = PdfFileMerger(strict=False)
        for file in open_pdf_files:
            print("Processing", file.name)
            merger.append(file)
        print("Writing to", fd_out.name)
        merger.write(fd_out)
        # some broken or big pdf files
        # seem to block merger.write
        # Look at PyPDF3 or PyPDF4, which are forks;
        # maybe some bugs have been fixed there


if __name__ == '__main__':
    parser = ArgumentParser(description=__doc__)
    parser.add_argument("input_pdfs", nargs="+", type=Path, help="Input PDF Files")
    parser.add_argument("output_pdf", type=Path, help="Output PDF File")
    args = parser.parse_args()
    pdf_cat(args.input_pdfs, args.output_pdf)
    
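The merge order depends on the order of your input.
If you use shell globbing, shells such as bash expand *.pdf in sorted (lexical) order, but other ways of listing files, for example Python's glob module, make no such guarantee.
In addition, lexical order can be the wrong order for some file name patterns.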
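
A small sketch of where lexical order goes wrong, using made-up file names with a running number. The natural_key helper is an illustration, not part of any library.

```python
import re


def natural_key(name):
    """Split a name into text and digit chunks so the digits compare as ints."""
    parts = re.split(r"(\d+)", name)
    # with a capturing group, the odd indexes always hold the digit chunks
    return [int(part) if i % 2 else part for i, part in enumerate(parts)]


names = ["scan-10.pdf", "scan-2.pdf", "scan-1.pdf"]

lexical = sorted(names)
natural = sorted(names, key=natural_key)

print(lexical)  # ['scan-1.pdf', 'scan-10.pdf', 'scan-2.pdf']
print(natural)  # ['scan-1.pdf', 'scan-2.pdf', 'scan-10.pdf']
```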

In general, the sort order can be changed with the key parameter of the built-in sorted function. The key is a callable that returns a number, a str, or a tuple of them for each element.

Let's invent a crazy file name pattern that looks hard to sort:
Month_Year_Id_sometext.pdf
Let's say we want to sort by (year, month) and ignore the id.
The _ separates the fields.

def key_func(filename):
    try:
        month, year, id, name = filename.split("_", maxsplit=3)
        # maxsplit=3 yields at most 4 parts, so the trailing text
        # may itself contain underscores without being split further
        month, year = int(month), int(year)
    except ValueError:
        # raised if there are too few "_" in the filename
        # or if month or year is not an int
        # not sortable: wrong file name pattern?
        return (0, 0)
    else:
        return (year, month)
        # sort first by year, then by month if the year is equal


file_names = [
    "12_2020_54_sometext.pdf",
    "11_2019_14_sometext.pdf",
    "12_2021_41_sometext.pdf",
    "10_2020_12_sometext.pdf"
]
print(sorted(file_names, key=key_func))
print("Big first")
print(sorted(file_names, key=key_func, reverse=True))
Sometimes the file name has no delimiter that could be used to split the data.
Sometimes the names follow a different pattern, for example an ISO 8601 timestamp.
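
A sketch for the timestamp case, assuming made-up file names that contain a compact ISO 8601 timestamp such as 20200501T093000 somewhere in the name, with no delimiter around it. The regular expression pulls the timestamp out, and names without a valid timestamp sort first.

```python
import re
from datetime import datetime

# compact ISO 8601 form: YYYYMMDD, a literal T, then HHMMSS
TIMESTAMP = re.compile(r"\d{8}T\d{6}")


def timestamp_key(filename):
    match = TIMESTAMP.search(filename)
    if match is None:
        return datetime.min  # no timestamp found, sort first
    try:
        return datetime.strptime(match.group(), "%Y%m%dT%H%M%S")
    except ValueError:
        return datetime.min  # digits that are not a valid date or time


file_names = [
    "invoice20200501T093000.pdf",
    "scan20191224T180000.pdf",
    "letter20200501T083000.pdf",
]
ordered = sorted(file_names, key=timestamp_key)
print(ordered)
```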

You can combine the merger and the sorting function for filenames and change it for your task.
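
For the file names from the first post, the combination could look roughly like this. It is a sketch under assumptions: related files share the first 9 characters, the base file should come first, and the numbered parts follow in numeric order. The names suffix_key and merge_plan are made up, and the merge call is only hinted at in a comment.

```python
from collections import defaultdict
from pathlib import Path


def suffix_key(path, prefix_len=9):
    """Order inside one group: base file, then numbered parts, then the rest."""
    rest = path.stem[prefix_len:].lstrip("-")
    if not rest:
        return (0, 0, "")          # base file, e.g. 02-020694.pdf
    if rest.isdecimal():
        return (1, int(rest), "")  # numbered part, compared as a number
    return (2, 0, rest)            # anything else goes last, alphabetically


def merge_plan(paths, prefix_len=9):
    """Map each shared prefix to its ordered list of input files."""
    groups = defaultdict(list)
    for path in map(Path, paths):
        groups[path.stem[:prefix_len]].append(path)
    return {
        prefix: sorted(members, key=lambda p: suffix_key(p, prefix_len))
        for prefix, members in sorted(groups.items())
    }


plan = merge_plan([
    "02-020694-10.pdf",
    "02-020694-2.pdf",
    "02-020694.pdf",
    "04-011200-0.pdf",
    "04-011200.pdf",
])
for prefix, members in plan.items():
    print(prefix, "<-", [member.name for member in members])
    # here you could call e.g. pdf_cat(members, Path("merged") / (prefix + ".pdf"))
```

Writing the merged files to a separate directory (the "merged" folder in the comment is only an example) keeps them from being picked up as input on the next run.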