I polished the code a bit.
Instead of PdfFileReader and PdfFileWriter, use PdfFileMerger.
To close many open files at once, use ExitStack from contextlib.
Use argparse for a nice command-line interface.
#!/usr/bin/env python3
"""
Merge PDF files.

The order of the supplied input files is preserved.
The output file is automatically removed from the list of input files.
"""
from argparse import ArgumentParser
# https://docs.python.org/3/library/argparse.html#argparse.ArgumentParser
from contextlib import ExitStack
# https://docs.python.org/3/library/contextlib.html#contextlib.ExitStack
from pathlib import Path
# https://docs.python.org/3/library/pathlib.html#basic-use
from PyPDF2 import PdfFileMerger
# https://pythonhosted.org/PyPDF2/
# use PdfFileMerger instead of reader/writer


def pdf_cat(input_files, output_file):
    # Remove the output file from the input files, which could
    # happen when globbing with "*.pdf" for the input files.
    input_files = [file for file in input_files if file != output_file]
    with ExitStack() as stack, open(output_file, "wb") as fd_out:
        # stack and fd_out are handled by the context managers.
        # Leaving the with block calls their __exit__ methods,
        # which close all files (input_files and fd_out).
        # If an exception occurs, the files are still closed.
        open_pdf_files = [
            stack.enter_context(open(fname, "rb"))
            for fname in input_files
        ]
        # Each input file is opened and registered on the stack,
        # so the ExitStack holds all open files in its collection.
        merger = PdfFileMerger(strict=False)
        for file in open_pdf_files:
            print("Processing", file.name)
            merger.append(file)
        print("Writing to", fd_out.name)
        merger.write(fd_out)
        # Some broken or very big PDF files seem to block
        # merger.write. Have a look at PyPDF3 or PyPDF4, which
        # are forks; maybe some of these bugs have been fixed there.


if __name__ == "__main__":
    parser = ArgumentParser(description=__doc__)
    parser.add_argument("input_pdfs", nargs="+", type=Path, help="Input PDF files")
    parser.add_argument("output_pdf", type=Path, help="Output PDF file")
    args = parser.parse_args()
    pdf_cat(args.input_pdfs, args.output_pdf)
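The cleanup behaviour described in the comments can be demonstrated without real files, using in-memory streams as a small sketch:

```python
from contextlib import ExitStack
import io

streams = [io.StringIO(text) for text in ("a", "b", "c")]
with ExitStack() as stack:
    for stream in streams:
        # enter_context registers each stream for cleanup on exit
        stack.enter_context(stream)
    print(all(not s.closed for s in streams))  # True: everything open inside
print(all(s.closed for s in streams))  # True: everything closed outside
```

The same pattern applies to the merger above: however many input files are opened, they are all guaranteed to be closed when the with block is left.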
Which order is used depends on your input.
If you use shell globbing, the result may happen to be in lexical order, but usually that is not guaranteed.
In addition, some patterns sort wrong in lexical order (for example, "10" comes before "9").
In general the sort order can be controlled with the key parameter of the sorted built-in function. The key is a callable that returns a number, a string, or a tuple of such values.
Let's invent a crazy file-name pattern that looks hard to sort:
Month_Year_Id_sometext.pdf
Let's say we want to sort by (year, month) and ignore the id.
We have the _ as delimiter between the fields.
def key_func(filename):
    try:
        month, year, id_, name = filename.split("_", maxsplit=3)
        # maxsplit=3 splits at most three times, so the trailing
        # name part may itself contain underscores without
        # breaking the unpacking into four values
        month, year = int(month), int(year)
    except ValueError:
        # Raised if there are too few underscores in the filename,
        # or if month or year is not an int: the filename does not
        # follow the pattern, so treat it as unsortable.
        return (0, 0)
    else:
        # Sort first by year, then by month if the years are equal.
        return (year, month)


file_names = [
    "12_2020_54_sometext.pdf",
    "11_2019_14_sometext.pdf",
    "12_2021_41_sometext.pdf",
    "10_2020_12_sometext.pdf",
]

print(sorted(file_names, key=key_func))
print("Big first")
print(sorted(file_names, key=key_func, reverse=True))
Sometimes you don't have a delimiter in the filename that could be used to split the data.
Sometimes the names follow a different pattern, for example an ISO 8601 timestamp.
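A key function for such timestamps can be sketched like this; the report_YYYY-MM-DD.pdf pattern is made up for illustration:

```python
import re

# Hypothetical filenames carrying an ISO 8601 date (YYYY-MM-DD)
file_names = [
    "report_2021-03-15.pdf",
    "report_2019-12-01.pdf",
    "report_2020-07-30.pdf",
]

def iso_key(filename):
    # ISO 8601 dates sort correctly even as plain strings,
    # so the matched text itself can serve as the sort key.
    match = re.search(r"\d{4}-\d{2}-\d{2}", filename)
    return match.group(0) if match else ""

print(sorted(file_names, key=iso_key))
# ['report_2019-12-01.pdf', 'report_2020-07-30.pdf', 'report_2021-03-15.pdf']
```

Filenames without a recognizable date get the empty string as key and therefore sort to the front.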
You can combine the merger and the sorting function for filenames and adapt them to your task.
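Putting both parts together might look like the following sketch; the key function mirrors the one above but works on pathlib.Path objects, and the commented pdf_cat call refers to the function defined earlier:

```python
from pathlib import Path

def key_func(path):
    # Same Month_Year_Id_sometext.pdf idea as above,
    # applied to pathlib.Path objects via their .name
    try:
        month, year, _rest = path.name.split("_", maxsplit=2)
        return (int(year), int(month))
    except ValueError:
        return (0, 0)

# In a real run these would come from Path(".").glob("*.pdf")
pdf_paths = [
    Path("12_2020_54_sometext.pdf"),
    Path("11_2019_14_sometext.pdf"),
    Path("10_2020_12_sometext.pdf"),
]
sorted_paths = sorted(pdf_paths, key=key_func)
print([p.name for p in sorted_paths])
# In the real script you would now call, for example:
# pdf_cat(sorted_paths, Path("merged.pdf"))
```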