Python Forum

Full Version: a collection of pdf-files (copy of a book) in disorder: solving with pikepdf
Hello dear Python friends,

(I added an update below.)

First of all, I hope you're well and all is okay in your hometown.

I have a collection of 330 pages (a copy of a book) and have separated the pages with MuPDF. Unfortunately the pages are not in linear order, so I need to reorder them to get the right layout for printing.

The question: how can I achieve this?

Should I use a PDF program to cut and rearrange the pages, or should I stick to a Pythonic way?

I heard about pikepdf: It provides a Pythonic wrapper around the C++ PDF content transformation library, QPDF. Python + QPDF ... Extract content from a PDF such as text or images.

PDFDocEncoding

Quote:The PDF specification defines PDFDocEncoding, a character encoding used only in PDFs. This encoding matches ASCII for code points 32 through 126 (0x20 to 0x7e). At all other code points, it is not ASCII and cannot be treated as equivalent. If you look at a PDF in a binary file viewer (hex editor), a string surrounded by parentheses such as (Hello World) is usually using PDFDocEncoding.
When pikepdf is imported, it automatically registers "pdfdoc" as a codec with the standard library, so that it may be used in string and byte conversions. cf. https://pikepdf.readthedocs.io/en/latest...oding.html
https://github.com/pikepdf/pikepdf
pikepdf.readthedocs.io/
https://pypi.org/project/pikepdf/
Released: May 21, 2021
version: pikepdf 2.12.1


Well, this sounds very good. Do you think I can solve my issue with that?
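For illustration, here is a minimal sketch of how a reordering with pikepdf might look. It assumes the disorder repeats in blocks of four pages in the way the sequence 1, 4, 3, 2, 5 suggests, which is only a guess until the real order has been checked. The function names `block_order` and `reorder_pdf`, the file names, and the pattern `(0, 3, 2, 1)` are all hypothetical.

```python
from typing import List, Sequence


def block_order(n_pages: int, pattern: Sequence[int]) -> List[int]:
    """Return 0-based source indices that apply `pattern` to each block of pages.

    `pattern` must be a permutation of range(len(pattern)). A trailing
    partial block is kept in its original order.
    """
    size = len(pattern)
    if sorted(pattern) != list(range(size)):
        raise ValueError("pattern must be a permutation of 0..len(pattern)-1")
    order: List[int] = []
    full = n_pages - n_pages % size  # number of pages covered by full blocks
    for start in range(0, full, size):
        order.extend(start + offset for offset in pattern)
    order.extend(range(full, n_pages))
    return order


def reorder_pdf(src_path: str, dst_path: str, pattern: Sequence[int]) -> None:
    """Write a copy of src_path with its pages rearranged block by block."""
    import pikepdf  # imported here so block_order() can be tried without pikepdf

    with pikepdf.Pdf.open(src_path) as src:
        dst = pikepdf.Pdf.new()
        for index in block_order(len(src.pages), pattern):
            dst.pages.append(src.pages[index])
        # Save while the source is still open, since the pages come from it.
        dst.save(dst_path)


# Hypothetical usage; adjust the pattern after inspecting the real page order:
# reorder_pdf("scrambled.pdf", "reordered.pdf", pattern=(0, 3, 2, 1))
```

Keeping the index calculation separate from the file handling makes it easy to print the computed order for a few pages and compare it with the book before writing any output.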


Update, the background:


To explain things a bit more: I ran into these issues while using mutool and MuPDF.

I'm running this on MX Linux and trying to work with the latest release of the MuPDF library.

My findings: if I cut the document into pieces (A5), I get strange results: the page numbering (the pagination) gets completely lost.

1, 4, 3, 2, 5, and so forth, and this is awful.

By the way, these are the commands I run:

    mutool poster -x 2 input.pdf output.pdf
This tells mutool to divide each page into two parts along the X axis.
The cut therefore runs down the middle from top to bottom, so that two equal halves are created, one on the left and one on the right.

You can split a document into individual pages with pdftk:


  pdftk input.pdf burst
The output files appear in the same directory as pg_0001.pdf, pg_0002.pdf, etc.

What goes wrong here?


See the dataset: https://www.file-upload.net/download-142...7.pdf.html

What I want: to cut this into A5 pages. Note: the A5 format is 148 mm wide and 210 mm high.
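As a quick sanity check of the target size, the A5 dimensions can be converted to PDF points (1 point = 1/72 inch, the default PDF unit). The sketch below assumes the scanned sheets are A4 landscape (297 mm x 210 mm), which is not stated in the thread; the helper name `mm_to_points` is made up for this example.

```python
MM_PER_INCH = 25.4
POINTS_PER_INCH = 72  # default PDF user space: 1 point = 1/72 inch


def mm_to_points(mm: float) -> float:
    """Convert millimetres to PDF points."""
    return mm / MM_PER_INCH * POINTS_PER_INCH


# Assumption: each scanned sheet is A4 landscape, 297 mm x 210 mm.
sheet_width_mm, sheet_height_mm = 297.0, 210.0

# Cutting once along the X axis halves the width and keeps the height.
half_width_mm = sheet_width_mm / 2

print(f"half sheet: {half_width_mm} mm x {sheet_height_mm} mm")
print(f"in points:  {mm_to_points(half_width_mm):.2f} x {mm_to_points(sheet_height_mm):.2f}")
print(f"A5 nominal: {mm_to_points(148):.2f} x {mm_to_points(210):.2f}")
```

Under that assumption each half is 148.5 mm wide, half a millimetre more than nominal A5, which normally does not matter for printing.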

I use the commands from these resources:

https://www.mankier.com/1/mupdf
https://mupdf.com/docs/


any ideas?
Without knowing how the pages are ordered improperly, there's no way I could say whether or not a particular utility will solve your particular issue. That said, I've used PyPDF2 before, and it seems pretty competent. https://pythonhosted.org/PyPDF2/
On Linux, here is how I would do it:

  1. call the command pdfseparate source.pdf page-%d.pdf. This creates the files page-1.pdf, page-2.pdf, etc., one for each page.
  2. call a Python script to rename the files in the correct order, for example tmp-001.pdf, tmp-002.pdf
  3. call the command pdfunite tmp-*.pdf target.pdf to create the reordered book.
These programs are part of the Poppler utilities https://poppler.freedesktop.org/
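The renaming script in step 2 could be sketched roughly as follows. This is only an illustration: the names `rename_plan` and `apply_plan` are made up, and the example order 1, 4, 3, 2, 5 is taken from the question rather than from the real file. The target names are zero-padded because the shell expands tmp-*.pdf in lexicographic order, so tmp-010.pdf must sort after tmp-009.pdf.

```python
from pathlib import Path
from typing import Dict, Sequence


def rename_plan(order: Sequence[int], width: int = 3) -> Dict[str, str]:
    """Map 'page-N.pdf' names to zero-padded 'tmp-NNN.pdf' names.

    order[i] is the 1-based number of the separated page that should
    become page i + 1 of the reordered book.
    """
    if sorted(order) != list(range(1, len(order) + 1)):
        raise ValueError("order must contain each page number exactly once")
    return {
        f"page-{src}.pdf": f"tmp-{pos:0{width}d}.pdf"
        for pos, src in enumerate(order, start=1)
    }


def apply_plan(directory: Path, plan: Dict[str, str]) -> None:
    """Rename the files inside `directory` according to `plan`."""
    for old_name, new_name in plan.items():
        (directory / old_name).rename(directory / new_name)


# Hypothetical usage for the first five pages:
# apply_plan(Path("."), rename_plan([1, 4, 3, 2, 5]))
```

Because the new names use a different prefix than the old ones, no file is overwritten during the renaming. A width of 3 is enough for the 330 pages mentioned in the question.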
Hello to you both, good day dear nilamo and Gribouillis.


Many, many thanks for your replies. I am very happy to hear from you. Your ideas seem very helpful, and I will try out your approaches.


Note: I added an update describing how I ran into these issues:

I ran into these issues with mutool and MuPDF.

I'm running this on MX Linux and trying to work with the latest release of the MuPDF library.

If I cut the document into pieces (A5), I get strange results: the page numbering (the pagination) gets completely lost.

1, 4, 3, 2, 5, and so forth.


By the way, these are the commands I run:

    mutool poster -x 2 input.pdf output.pdf
...see more above.
Note: again, I am very happy to see your ideas. They seem very helpful, and I will try out your approaches.


I will come back and report all my findings.

Dear nilamo and Gribouillis, many thanks.
Have a great day.

apollo