Reading a copy-protected PDF

CaptainCsaba · Oct-03-2019, 07:06 AM

Hi!

At work we process PDF reports that we receive from several corporate clients. One of them is very restrictive about handing out information: its PDFs come with almost every restriction available. They need a password to open, you can't copy text from them or convert them, and nearly every restriction is switched on when I check them in Adobe Acrobat. I tried loading the text with PyPDF2 and PDFMiner, both of which I use regularly, but no luck. Recently I had the idea that maybe I could read the text from memory or use some command-line tools, but I have honestly never done anything like this and would like to stick to Python if possible.

We need a loop that gets data from multiple files and puts it into a CSV, so it should be something that runs in the background, reads the text and then pulls out the needed substring. That part I can write easily, but I need the text to work with.

So the question is: How do you get the text from a PDF which has all these restrictions?

- Password protection to open the file (we always get the password, so this should be no problem)
- Changing the document
- Document Assembly
- Content Copying
- Page Extraction
- Commenting
- Filling of form fields
- Signing
- Template page creation

The only two permissions allowed are:

- Printing
- Copying Content for Accessibility
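
For reference, the background loop described above could look roughly like the sketch below. It uses only the standard library and deliberately leaves the text extraction as a function you pass in, since that is the open question of this thread. The names `find_field` and `reports_to_csv`, the column headers, and the example pattern are all made up for illustration.

```python
import csv
import re
from pathlib import Path
from typing import Callable, Iterable, Optional


def find_field(text: str, pattern: str) -> Optional[str]:
    """Return the first capture group of `pattern` in `text`, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


def reports_to_csv(
    pdf_paths: Iterable[Path],
    extract_text: Callable[[Path], str],
    pattern: str,
    csv_path: Path,
) -> int:
    """Write one (file name, value) row per PDF and return the row count."""
    rows = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "value"])
        for path in pdf_paths:
            value = find_field(extract_text(path), pattern)
            # An empty cell marks a report where the pattern was not found.
            writer.writerow([path.name, value or ""])
            rows += 1
    return rows
```

Whatever ends up reading the protected files only has to fit the `extract_text` slot, for example `reports_to_csv(paths, my_extractor, r"Total:\s*(\d+)", Path("out.csv"))`, where `my_extractor` is a placeholder for your own function.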

***ichabod801*** · Oct-03-2019, 12:43 PM

This sounds like a management problem to me. Tell your boss you need them to get on corporation X to provide machine readable pdfs, or you will have no choice but to back burner processing their pdfs.

***snippsat*** · (This post was last modified: Oct-03-2019, 02:02 PM by snippsat.)

Try pikepdf

Quote: pikepdf.Pdf.open() can open almost all types of encrypted PDF!
Just provide the password= keyword argument.

It's based on QPDF, which has very good support for all PDF encryption methods.
qpdf is also a command-line tool, so you can use it just to decrypt.

qpdf --decrypt --password=PASSWORD in_doc.pdf out_decrypt.pdf
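
If you want to stay in Python, the same qpdf call could be driven through `subprocess`. This is only a sketch: it assumes the `qpdf` executable is installed and on your PATH, and the function names `build_qpdf_command` and `decrypt_with_qpdf` are made up here. Note that a password passed on the command line may be visible to other users of the machine.

```python
import subprocess
from typing import List


def build_qpdf_command(password: str, src: str, dst: str) -> List[str]:
    """Build the argument list for decrypting `src` into `dst` with qpdf."""
    return ["qpdf", f"--password={password}", "--decrypt", src, dst]


def decrypt_with_qpdf(password: str, src: str, dst: str) -> None:
    """Run qpdf and raise RuntimeError if it reports an error."""
    result = subprocess.run(build_qpdf_command(password, src, dst))
    # qpdf exits with 0 on success and 3 on success with warnings.
    if result.returncode not in (0, 3):
        raise RuntimeError(f"qpdf failed with exit code {result.returncode}")
```

Passing the arguments as a list avoids shell quoting problems when the password contains spaces or special characters.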

CaptainCsaba · Oct-03-2019, 02:11 PM

(Oct-03-2019, 12:43 PM)ichabod801 Wrote: This sounds like a management problem to me. Tell your boss you need them to get on corporation X to provide machine readable pdfs, or you will have no choice but to back burner processing their pdfs.

We have tried already; the problem is that we are not in a position to ask for such things (they have a monopoly on the data we need).

I'm going to try pikepdf (I had not heard of it before, thanks for the info) and I'll come back tomorrow to share the results.

***ichabod801*** · Oct-03-2019, 03:06 PM

(Oct-03-2019, 02:11 PM)CaptainCsaba Wrote: We have tried already; the problem is that we are not in a position to ask for such things

Yeah, I was afraid that might be the situation. Been there, done that.

CaptainCsaba · (This post was last modified: Oct-04-2019, 12:43 PM by CaptainCsaba.)

Hey. Unfortunately we don't have permission to use QPDF, although it seemed useful. I tried pikepdf, as it should work (since it uses the same code). I used the following code. Without the password argument the file would not open, because it needs a password. With the password, PowerShell crashes when I run the script. Strange...

import pikepdf

# Open the encrypted file with its password, then save a copy
pdf = pikepdf.open('test.pdf', password='abc123')
pdf.save('test2.pdf')
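
One variation that might be worth trying is sketched below: decrypt into memory with pikepdf and hand the result straight to pdfminer.six, so no unprotected copy is written to disk. It assumes both packages are installed, and `text_from_protected_pdf` is just a name picked for this example. It will not explain the PowerShell crash by itself; running the script with `python -X faulthandler` may print a traceback if the interpreter is crashing hard.

```python
import io


def text_from_protected_pdf(path: str, password: str) -> str:
    """Decrypt `path` in memory with pikepdf, then extract text with pdfminer.six."""
    # Imported here so that a missing package surfaces as an ImportError
    # when the function is called, not when the module is loaded.
    import pikepdf
    from pdfminer.high_level import extract_text

    buffer = io.BytesIO()
    try:
        with pikepdf.open(path, password=password) as pdf:
            # By default save() writes the copy without the original encryption.
            pdf.save(buffer)
    except pikepdf.PasswordError as exc:
        raise ValueError(f"wrong or missing password for {path}") from exc

    buffer.seek(0)
    return extract_text(buffer)
```

Usage would be something like `text = text_from_protected_pdf('test.pdf', 'abc123')`, after which the usual substring search can run on `text`.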

Caslenty · (This post was last modified: Oct-25-2021, 07:06 AM by Caslenty.)

To be honest, I am not keen on using QPDF to remove the permission password from the PDF. The program is too complicated for me because the command line involved is beyond my knowledge.
However, I think these programs are very good and easy to use.
https://lightpdf.com/unlock-pdf
https://pdfcandy.com/unlock-pdf.html
https://www.bestpdfpasswordremover.com/
