Python Forum
Reading a copy-protected PDF
#1
Hi!

At work we process PDF reports that we receive from multiple corporate clients. One client is very restrictive about handing out information: their PDFs come with almost every restriction a PDF can have. They need a password to open, you can't copy text from them or convert them, and Adobe Acrobat shows nearly every restriction enabled. I tried loading the text with PyPDF2 and PDFMiner, both of which I use on a regular basis, but had no luck. Recently I had the idea that maybe I could read the text from memory or use some command-line tools, but I have honestly never done anything like this and would like to stick to Python if possible.

We need a loop that gets data from multiple files and writes it to a CSV, so it should be something that runs in the background, reads the text and then extracts the needed substring. That part I can write easily, but I need the text to work with.

So the question is: How do you get the text from a PDF which has all these restrictions?

- Password protection to open the file (we always get the password, so this should be no problem)
- Changing the document
- Document Assembly
- Content Copying
- Page Extraction
- Commenting
- Filling of form fields
- Signing
- Template page creation

The only two tools allowed are:

- Printing
- Copying Content for Accessibility
Reply
#2
This sounds like a management problem to me. Tell your boss you need them to get on corporation X to provide machine readable pdfs, or you will have no choice but to back burner processing their pdfs.
Craig "Ichabod" O'Brien - xenomind.com
I wish you happiness.
Recommended Tutorials: BBCode, functions, classes, text adventures
Reply
#3
Try pikepdf
Quote:pikepdf.Pdf.open() can open almost all types of encrypted PDF!
Just provide the password= keyword argument.
It's based on QPDF, which has very good support for all PDF encryption methods.
qpdf is also a command-line tool, so you can use it on its own just to decrypt:
qpdf --decrypt --password=PASSWORD in_doc.pdf out_decrypt.pdf
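
For the batch job described in the first post, the pikepdf route could be sketched roughly as below. This is only a sketch under assumptions: pikepdf and pdfminer.six are installed, the password is known, and the helper names (extract_text_with_pikepdf, collect_rows, write_csv), the file names, and the regular expression are hypothetical placeholders, not anything from the thread. The idea is to let pikepdf write an unencrypted copy into memory and then hand that copy to the text extractor.

```python
import csv
import re
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


def extract_text_with_pikepdf(pdf_path: str, password: str) -> str:
    """Decrypt with pikepdf, then pass the plain copy to pdfminer.six."""
    # Third-party imports are local so the rest of the sketch can be
    # used (and tested) without them installed.
    import io
    import pikepdf
    from pdfminer.high_level import extract_text

    buffer = io.BytesIO()
    with pikepdf.open(pdf_path, password=password) as pdf:
        # save() writes the document without encryption by default.
        pdf.save(buffer)
    buffer.seek(0)
    return extract_text(buffer)


def collect_rows(
    pdf_paths: Iterable[str],
    pattern: str,
    extract: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Return (file name, first captured group or '') for each PDF."""
    rows = []
    for path in pdf_paths:
        text = extract(str(path))
        match = re.search(pattern, text)
        rows.append((Path(path).name, match.group(1) if match else ""))
    return rows


def write_csv(rows: Iterable[Tuple[str, str]], csv_path: str) -> None:
    """Write the collected rows to a CSV file with a header line."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "value"])
        writer.writerows(rows)
```

A call might look like `rows = collect_rows(paths, r"Total:\s*(\d+)", lambda p: extract_text_with_pikepdf(p, "PASSWORD"))` followed by `write_csv(rows, "out.csv")`, where the pattern and password are stand-ins for whatever the real reports need. Passing the extractor in as a function keeps the loop independent of the PDF library, so it can be swapped if pikepdf turns out not to work on these files.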
Reply
#4
(Oct-03-2019, 12:43 PM)ichabod801 Wrote: This sounds like a management problem to me. Tell your boss you need them to get on corporation X to provide machine readable pdfs, or you will have no choice but to back burner processing their pdfs.

We have already tried. The problem is that we are not in a position to ask for such things (they have a monopoly on the data we need).

I'm going to try pikepdf (I had not heard of it before, thanks for the info) and I'll return tomorrow to share the results.
Reply
#5
(Oct-03-2019, 02:11 PM)CaptainCsaba Wrote: We have tried already, problem is that we are not in a position to be able to ask for such things

Yeah, I was afraid that might be the situation. Been there, done that.
Reply
#6
Hey. Unfortunately we don't have permission to use QPDF, although it seemed useful. I tried pikepdf, as it should work (since it uses the same code). I used the following code. Without the password argument, the file would not open because it requires a password. With the password added, PowerShell crashes when I run the script. Strange...

import pikepdf

pdf = pikepdf.open('test.pdf', password='abc123')
pdf.save('test2.pdf')
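
One way to narrow down a failure like this, as a rough sketch only: wrap the same two calls so that any Python-level error is reported instead of ending the script silently. The function name try_decrypt and the returned messages are hypothetical; pikepdf.PasswordError is the exception pikepdf raises for a wrong password. This will not catch a hard crash inside the underlying C++ library.

```python
def try_decrypt(src: str, dst: str, password: str) -> str:
    """Attempt the open-and-save step and describe what happened."""
    try:
        import pikepdf
    except ImportError as exc:
        return f"pikepdf could not be imported: {exc}"

    try:
        with pikepdf.open(src, password=password) as pdf:
            # Saving without an encryption argument drops the encryption.
            pdf.save(dst)
    except pikepdf.PasswordError:
        return "wrong password"
    except Exception as exc:
        # Anything else (missing file, damaged PDF, ...) is reported by type.
        return f"{type(exc).__name__}: {exc}"
    return "ok"
```

Calling `print(try_decrypt('test.pdf', 'test2.pdf', 'abc123'))` would then show which step failed. If the interpreter itself dies, running the script with `python -X faulthandler script.py` asks Python to dump a traceback on a low-level fault, which may show where the crash happens.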
Reply
#7
To be honest, I am not keen on using QPDF to remove the permission password from the PDF. The program is too complicated for me because the command-line usage is beyond my knowledge.
However, I think these programs are very good and easy to use:
https://lightpdf.com/unlock-pdf
https://pdfcandy.com/unlock-pdf.html
https://www.bestpdfpasswordremover.com/
Reply

