PDF Manipulation

9156088686 · Sep-27-2023, 09:27 PM

Hi guys,

My goal is pretty simple: to split a PDF page 2/3 the way down and move the bottom third on top.

I've been trying multiple ways to figure this out using PyPDF2 and haven't been able to crack it. Whenever I test/debug by making changes to my code I get very unexpected results even after going over the documentation. I feel like I'm misunderstanding something fundamental about what's going on here so any help would be much appreciated.

I am currently doing it in two steps (which probably isn't the best approach, but I'm not sure how to combine them without saving to a file in between).

Step 1: Crop the page and save the split pages as a new PDF

        # Get the page dimensions
        width = page.mediabox.width
        height = page.mediabox.height

        # Calculate the bottom third of the page
        bottom = height / 3

        # Crop the bottom third and add it as a new page
        page.mediabox.lower_left = (0, 0)
        page.mediabox.upper_right = (width, bottom)
        PageTop = writer.add_page(page)

        # Crop the top two thirds and add it as a new page
        page.mediabox.lower_left = (0, bottom)
        page.mediabox.upper_right = (width, height)
        PageBottom = writer.add_page(page)

Step 2: Add a new blank page and merge the two cropped pages

        # Calculate dimensions
        max_width = max(page1.mediabox.width, page2.mediabox.width)
        sum_height = page1.mediabox.height + page2.mediabox.height

        # Create a new blank page with the calculated dimensions
        new_page = PageObject.create_blank_page(width = max_width, height = sum_height)
        
        # Add 1st page (bottom 3rd of the original) to the top
        new_page.merge_page(page1)
        
        # Add 2nd page (top 2/3 of the original) underneath
        page2.add_transformation(Transformation().translate(0, page2.mediabox.height))
        new_page.merge_page(page2)
        
        # Add the merged page to the output
        output_pdf.add_page(new_page)

Thanks for your time.

DPaul · Sep-28-2023, 05:43 AM

Hi,
This is not the complete code. Please show all of it, plus the error trace.
I do these operations all the time and, depending on what is on the page, it should be possible.
1) What is on the page? It must be a drawing, because splitting purely on a pixel basis is not ideal for text.
2) Cropping code on line 6: maybe an int() might be better; decimals can be a show stopper.
Paul

9156088686 · Sep-28-2023, 08:24 PM

Thanks for your response Paul,

I see what you mean about the decimals possibly becoming an issue when I divide, good point! I do have text on the page, but the cropping seems to be working. My main issue is that the merge doesn't behave as I expect. No matter what I try, I can't get the two parts of the page back together and displayed how I'd like.

Here's my complete code:

from PyPDF2 import PdfReader, PdfWriter, PageObject, Transformation

file_path = "C:\\Users\\User\\Desktop\\PDF_SCRIPT\\"
input_file = file_path + "Input.pdf"
cropped_file = file_path + "Cropped.pdf"
merged_file = file_path + "Merged.pdf"

# Open the source PDF file
with open(input_file, "rb") as in_f:
    reader = PdfReader(in_f)
    writer = PdfWriter()

    # Loop through the pages
    for i in range(len(reader.pages)):
        page = reader.pages[i]      

        # Get the page dimensions
        width = page.mediabox.width
        height = page.mediabox.height

        # Calculate the bottom third of the page
        bottom = height / 3

        # Crop the bottom third and add it as a new page
        page.mediabox.lower_left = (0, 0)
        page.mediabox.upper_right = (width, bottom)
        PageTop = writer.add_page(page)

        # Crop the top two thirds and add it as a new page
        page.mediabox.lower_left = (0, bottom)
        page.mediabox.upper_right = (width, height)
        PageBottom = writer.add_page(page)

# Save the output PDF file
with open(cropped_file, "wb") as out_f:
    writer.write(out_f)


# Create a PdfReader object
input_pdf = PdfReader(open(cropped_file, "rb"), strict=False)

# Create a PdfWriter object
output_pdf = PdfWriter()

# Loop through every other page in the input PDF file
for i in range(0, len(input_pdf.pages), 2):

    # Check if the current page index is within the range of pages
    if i < len(input_pdf.pages):
    
        # Get the first and second page objects from the input PDF file
        page1 = input_pdf.pages[i]
        page2 = input_pdf.pages[i+1]

        # Calculate dimensions
        max_width = max(page1.mediabox.width, page2.mediabox.width)
        sum_height = page1.mediabox.height + page2.mediabox.height
         
        # Create a new blank page with the calculated dimensions
        new_page = PageObject.create_blank_page(width = max_width, height = sum_height)
                 
        # Add 1st page (bottom 3rd of the original) to the top
        new_page.merge_page(page1)
                 
        # Add 2nd page (top 2/3 of the original) underneath
        page2.add_transformation(Transformation().translate(0, page2.mediabox.height))
        new_page.merge_page(page2)
                 
        # Add the merged page to the output
        output_pdf.add_page(new_page)

# Open the output file and write the output PDF file to it
with open(merged_file, "wb") as out_file:
    output_pdf.write(out_file)

I'm not receiving any errors, but I'm not getting the result I'm looking for.
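
For reference, the vertical shifts the merge step would need can be worked out independently of PyPDF2. This is a minimal sketch, assuming the default PDF coordinate system (origin at the bottom-left corner, y increasing upward) and assuming that changing the mediabox leaves the page content where it was. swap_offsets is a hypothetical helper, not part of any library.

```python
def swap_offsets(page_height):
    """Return (dy_bottom_strip, dy_top_strip) in PDF units.

    Assumes PDF user space: origin at the bottom-left, y grows upward.
    The bottom strip covers y in [0, h/3], the top strip y in [h/3, h].
    """
    bottom_height = page_height / 3
    top_height = page_height - bottom_height

    # The bottom third must end up at the top of the page:
    # [0, h/3] -> [2h/3, h], i.e. a shift upward by the top strip's height.
    dy_bottom_strip = top_height

    # The top two thirds must end up at the bottom of the page:
    # [h/3, h] -> [0, 2h/3], i.e. a shift downward by the bottom strip's height.
    dy_top_strip = -bottom_height

    return dy_bottom_strip, dy_top_strip


# US Letter is 792 points tall.
print(swap_offsets(792))  # -> (528.0, -264.0)
```

Under these assumptions the bottom third needs a positive shift of two thirds of the page height and the top part a negative shift of one third, whereas the code above shifts only the top part, and shifts it upward. The values would be passed as the y argument of Transformation().translate().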

DPaul · (This post was last modified: Sep-29-2023, 07:58 AM by DPaul.)

Hi,
I tried your program, modified it somewhat, and yes, cropping works and so does re-assembly, but the PageObject always seems to remember its old coordinates, so I get a perfect original back every time.
There are many Python PDF modules; all have their merits and drawbacks.
I use PyPDF2 only for rotating PDF pages by 90°, when the PDF is delivered in portrait and I need to OCR it in landscape.
Sometimes I also have your particular problem of having to rearrange pages.
What I do is:
1) Cut the PDF up into PNG files using PyMuPDF (pip install PyMuPDF; the module is imported as fitz).
2) Now that you have your individual pages, rearrange the 2/3 and 1/3 parts using PIL (Pillow).
Image pixels are easily counted and the parts are reassembled in a new image, which is exactly what you are doing.
3) If need be, you can merge the PNGs into one PDF using img2pdf.
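
Step 2 can be sketched roughly as follows. This is only an illustration, assuming each page has already been rendered to an image and that Pillow is installed; swap_boxes and move_bottom_third_to_top are hypothetical helper names. Image coordinates start at the top-left corner with y increasing downward, so the bottom third begins at two thirds of the height.

```python
def swap_boxes(width, height):
    """Compute crop boxes and paste positions for moving the bottom third up.

    Returns (bottom_box, top_box, bottom_dest, top_dest) in pixel coordinates,
    with boxes given as (left, upper, right, lower) the way PIL expects.
    """
    split = (2 * height) // 3              # first pixel row of the bottom third
    bottom_box = (0, split, width, height)
    top_box = (0, 0, width, split)
    bottom_dest = (0, 0)                   # bottom third goes to the very top
    top_dest = (0, height - split)         # top part starts right below it
    return bottom_box, top_box, bottom_dest, top_dest


def move_bottom_third_to_top(image):
    """Return a new image with the bottom third moved above the rest."""
    # Imported here so the geometry helper above works without Pillow.
    from PIL import Image

    width, height = image.size
    bottom_box, top_box, bottom_dest, top_dest = swap_boxes(width, height)

    result = Image.new(image.mode, image.size)
    result.paste(image.crop(bottom_box), bottom_dest)
    result.paste(image.crop(top_box), top_dest)
    return result
```

A page saved as scan-1.png could then be processed with move_bottom_third_to_top(Image.open("scan-1.png")) and the result saved under a new name. Taking the sizes from the image itself avoids having to assume a page size or resolution.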

As stated elsewhere, PDFs are a tricky business.
Paul

9156088686 · (This post was last modified: Oct-16-2023, 02:13 AM by 9156088686.)

Thanks for sending me down the right track, it helped me a lot! My full code is below for reference.

When I run it on a small sample PDF with 10 pages, it runs fine. However, when I run it on the full 700-page PDF, it seems to run into issues. I haven't waited it out to know whether it's just taking a really long time or there's some other issue, but it hangs on this line:

images = convert_from_path(file_path, dpi=dpi, poppler_path=poppler_path)

Am I doing this in an inefficient way? Is there some way to prepare the PDF file, optimize the function, or use a different function so it doesn't take so long to convert without losing the quality?

Full code:

import io

import fitz
from PIL import Image
from pdf2image import convert_from_path

poppler_path = r"C:\Program Files\poppler-23.08.0\poppler-23.08.0\Library\bin" 
file_path = "Input.pdf"
file_out = "Output.pdf"
dpi = 300
width, height = int(8.5*dpi), int(11*dpi)
y1, y2 = int((height)/3), int((2*height)/3)

new_pdf = fitz.open()

images = convert_from_path(file_path, dpi=dpi, poppler_path=poppler_path)

for i, page_image in enumerate(images):

    # Create blank image
    new_image = Image.new('RGB', (width, height))    

    # Crop page sections
    top = page_image.crop((0, 0, width, y2))
    bottom = page_image.crop((0, y2, width, height))
    
    # Paste the "bottom 3rd" image starting at the top
    new_image.paste(bottom, (0, 0))
    
    # Paste the "top 2/3rds" image at a 3rd the way down
    new_image.paste(top, (0, y1))

    # Create a new page in the new_pdf
    new_page = new_pdf.new_page(width=width, height=height)

    # Convert the PIL image object to a bytes object
    imagebytes = io.BytesIO()
    new_image.save(imagebytes, "JPEG")

    # Insert the image to the new page
    new_page.insert_image(fitz.Rect(0, 0, width, height), stream=imagebytes.getvalue())

new_pdf.save(file_out)

Update:
I waited for the code to finish; it errored out and returned this:

MemoryError
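
A rough estimate suggests why this happens. convert_from_path returns all pages in a single list, so every rendered page of the 700-page file is held in memory at the same time. The sketch below assumes US Letter pages and uncompressed 8-bit RGB data (3 bytes per pixel), the same assumptions the code above makes; estimate_raster_bytes is a hypothetical helper.

```python
def estimate_raster_bytes(pages, dpi, width_in=8.5, height_in=11.0, channels=3):
    """Estimate the bytes needed to hold all pages as uncompressed rasters."""
    width_px = int(width_in * dpi)
    height_px = int(height_in * dpi)
    return pages * width_px * height_px * channels


per_page = estimate_raster_bytes(1, 300)
total = estimate_raster_bytes(700, 300)

print(f"{per_page / 2**20:.1f} MiB per page")
print(f"{total / 2**30:.1f} GiB for 700 pages")
```

At 300 dpi this comes to roughly 24 MiB per page and about 16 GiB for the whole document, before counting the extra copies made while cropping and pasting.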

DPaul · (This post was last modified: Oct-16-2023, 06:22 AM by DPaul.)

(Oct-16-2023, 02:13 AM)9156088686 Wrote: I waited for the code to finish; it errored out and returned this.

The error message is probably a bit longer, but the cause looks obvious: you seem to do it all in memory.
I do loads of these and have no memory problems, because I physically save every page
as a PNG. As I mentioned in my previous post, once you have swapped the 1/3 and the 2/3 into a new image,
you could convert all these new images into a PDF and delete the intermediate PNGs.
PNGs let you find the number of pixels (x and y) immediately via PIL's Image, as you do,
directly rather than via the DPI (dots per inch) detour, which should be PPI (pixels per inch), by the way.
Here is my code to slice a PDF into separate images that are easily manipulated.

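import glob
import os

import fitz  # PyMuPDF

# pdf_path (the folder holding the PDFs) is set elsewhere in the script,
# and the 'data' output folder is expected to exist already.
for pdffile in glob.glob(os.path.join(pdf_path, '*.pdf')):
    doc = fitz.open(pdffile)
    zoom = 4  # 4x the default 72 dpi, i.e. 288 dpi
    mat = fitz.Matrix(zoom, zoom)
    # Render and save one page at a time, so only one pixmap is in memory
    for i in range(doc.page_count):
        img = os.path.join(os.curdir, 'data', f'scan-{i+1}.png')
        page = doc.load_page(i)
        pix = page.get_pixmap(matrix=mat)
        pix.save(img)
    doc.close()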

Paul
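
The same page-at-a-time idea could also be applied to the pdf2image code from the earlier post. This is a minimal sketch, assuming the installed pdf2image version provides pdfinfo_from_path and accepts the first_page and last_page arguments of convert_from_path; page_batches and iter_page_images are hypothetical helper names, and the batch size of 10 is arbitrary.

```python
def page_batches(total_pages, batch_size):
    """Yield (first_page, last_page) pairs, 1-based and inclusive."""
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def iter_page_images(pdf_path, dpi=300, batch_size=10, poppler_path=None):
    """Yield (page_number, PIL image) while rendering only one batch at a time."""
    # Imported here so page_batches can be used without pdf2image installed.
    from pdf2image import convert_from_path, pdfinfo_from_path

    info = pdfinfo_from_path(pdf_path, poppler_path=poppler_path)
    total_pages = int(info["Pages"])

    for first, last in page_batches(total_pages, batch_size):
        # Only the pages of this batch are rendered and kept in memory.
        images = convert_from_path(
            pdf_path,
            dpi=dpi,
            first_page=first,
            last_page=last,
            poppler_path=poppler_path,
        )
        for offset, image in enumerate(images):
            yield first + offset, image
```

Each image yielded by iter_page_images can be rearranged and written to the output right away, so memory use stays near one batch of pages instead of growing with the length of the document.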
