Merging multiple pdf's - Python Forum, General Coding Help (/thread-26415.html)
Merging multiple pdf's - jonathan2582 - Apr-30-2020

Hi, is there a way to merge multiple PDFs according to their names (i.e. if the first seven characters of the name are equal)? I have thousands of PDFs that have been split and need to merge them back. Here are some of the files as an example:

02-020694.pdf
02-020694-0.pdf
02-020694-1.pdf
04-011200.pdf
04-011200-0.pdf

Thanks

RE: Merging multiple pdf's - Larz60+ - May-01-2020

You can use the package pypdf (you'll need to install it with pip install pypdf from the command line). There's an example here: https://stackoverflow.com/a/3444735

RE: Merging multiple pdf's - jonathan2582 - May-01-2020

(May-01-2020, 12:36 AM) Larz60+ Wrote: you can use package pypdf (you'll need to install with

I have it installed, but I'm not sure how to do the pattern recognition on the characters of the name.

RE: Merging multiple pdf's - Larz60+ - May-01-2020

Did you look at the link?

RE: Merging multiple pdf's - jonathan2582 - May-01-2020

I did and still nothing. Is it even possible? I'm very new to Python.

RE: Merging multiple pdf's - Larz60+ - May-01-2020

The link I provided merges several PDFs into one. Isn't that what you wanted?

RE: Merging multiple pdf's - DPaul - May-02-2020

It seems that the link provided by Larz60+ does everything you need. Maybe your problem is the file name pattern matching. That would be easy if you can confirm that all filenames should match on the first 9 chars + ("-" + "suffix") + ".pdf". If a length of 9 is not the rule, we need to look at more data.

Paul

RE: Merging multiple pdf's - DeaD_EyE - May-02-2020

I polished the code a bit. Instead of PdfFileReader and PdfFileWriter, use PdfFileMerger. To close many open files, use ExitStack from contextlib. Use argparse for a nice command line interface.
#!/usr/bin/env python3
"""
Merge PDF-Files

The order of supplied input files is preserved
The output_file is automatically removed from the list of input_files
"""

from argparse import ArgumentParser
# https://docs.python.org/3/library/argparse.html#argparse.ArgumentParser
from contextlib import ExitStack
# https://docs.python.org/3/library/contextlib.html#contextlib.ExitStack
from pathlib import Path
# https://docs.python.org/3/library/pathlib.html#basic-use

from PyPDF2 import PdfFileMerger
# https://pythonhosted.org/PyPDF2/
# use PdfFileMerger instead of Writer
# (newer PyPDF2 releases call this class PdfMerger)


def pdf_cat(input_files, output_file):
    # remove the output_file from input_files, which
    # could happen if using globbing "*.pdf" for input files
    input_files = [file for file in input_files if file != output_file]
    with ExitStack() as stack, open(output_file, "wb") as fd_out:
        # stack and fd_out are handled by the context manager
        # leaving the context manager calls the __exit__ methods,
        # which close all files (input_files, fd_out)
        # if an exception occurs, the files are still closed
        open_pdf_files = [
            stack.enter_context(open(fname, "rb"))
            for fname in input_files
        ]
        # each input file is opened and added to the stack of the
        # context manager, so the stack holds all open files
        merger = PdfFileMerger(strict=False)
        for file in open_pdf_files:
            print("Processing", file.name)
            merger.append(file)
        print("Writing to", fd_out.name)
        merger.write(fd_out)
        # some broken or big pdf files
        # seem to block merger.write
        # Look for PyPDF3 or PyPDF4, which are forks;
        # maybe some bugs have been fixed there


if __name__ == '__main__':
    parser = ArgumentParser(description=__doc__)
    parser.add_argument("input_pdfs", nargs="+", type=Path, help="Input PDF Files")
    parser.add_argument("output_pdf", type=Path, help="Output PDF File")
    args = parser.parse_args()
    pdf_cat(args.input_pdfs, args.output_pdf)

Which order is used depends on your input. If you use shell globbing, the output may be sorted in lexical order.
Usually it's not. In addition, some patterns sort incorrectly in lexical order. In general, the sort order can be changed with the key parameter of the built-in sorted function. The key is a callable that returns a number, a tuple of numbers, or a str.

Let's invent a crazy file-name pattern that looks hard to sort: Month_Year_Id_sometext.pdf. Let's say we want to sort by (year, month) and ignore the id. We have the _ as the separator between the fields.

def key_func(filename):
    try:
        month, year, _id, name = filename.split("_", maxsplit=3)
        # maxsplit=3 allows only 3 splits (4 fields), so the trailing
        # text may itself contain an underscore without being split
        month, year = int(month), int(year)
    except ValueError:
        # raised if there are too few _ in the filename
        # raised if month or year is not an int
        # not sortable, wrong file-name pattern?
        return (0, 0)
    else:
        # sort first by year, then by month if year is equal
        return (year, month)


file_names = [
    "12_2020_54_sometext.pdf",
    "11_2019_14_sometext.pdf",
    "12_2021_41_sometext.pdf",
    "10_2020_12_sometext.pdf"
]

print(sorted(file_names, key=key_func))
print("Big first")
print(sorted(file_names, key=key_func, reverse=True))

Sometimes a filename has no delimiter that could be used to split the data. Sometimes the names follow a different pattern, for example an ISO 8601 timestamp. You can combine the merger and the sorting function for filenames and adapt them to your task.
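For the file names in the original question, the grouping step could look roughly like the sketch below. It assumes DPaul's rule holds (the first 9 characters identify the document and an optional "-N" suffix numbers the parts). The names group_by_prefix, part_key and PREFIX_LEN are made up for this illustration and are not part of any library.

```python
from collections import defaultdict
from pathlib import Path

PREFIX_LEN = 9  # assumption: names look like "02-020694[-N].pdf"


def part_key(path, prefix_len=PREFIX_LEN):
    """Sort key inside one group: base file first, then -0, -1, ... numerically."""
    suffix = path.stem[prefix_len:].lstrip("-")
    if not suffix:
        return (0, 0)            # "02-020694.pdf" has no suffix
    if suffix.isdigit():
        return (1, int(suffix))  # "-10" sorts after "-9"
    return (2, 0)                # unexpected suffix: put it last


def group_by_prefix(paths, prefix_len=PREFIX_LEN):
    """Return {prefix: [paths sharing that prefix, in merge order]}."""
    groups = defaultdict(list)
    for path in map(Path, paths):
        groups[path.stem[:prefix_len]].append(path)
    return {
        prefix: sorted(files, key=part_key)
        for prefix, files in sorted(groups.items())
    }


names = [
    "02-020694-1.pdf", "04-011200.pdf", "02-020694.pdf",
    "02-020694-0.pdf", "04-011200-0.pdf",
]
for prefix, files in group_by_prefix(names).items():
    print(prefix, [f.name for f in files])
```

Each list of files could then be handed to pdf_cat from the earlier post, together with an output path of your choice per prefix. If the output files are written into the same folder, pick a name that the next run would not pick up as an input again.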