Oct-03-2019, 07:06 AM
Hi!
At work we process PDF reports that we receive from multiple corporate clients. One of these clients is very restrictive about handing out information: their PDFs come with almost every restriction a PDF can have. They need a password to open, you can't copy text from them or convert them, and nearly every permission shows as "Not Allowed" when I check them in Adobe Acrobat.
I tried loading the text with PyPDF2 and PDFMiner, both of which I work with on a regular basis, but had no luck. Recently I had the idea that maybe I could read the text from memory or use some command line tools, but I honestly have never done anything like this and would like to stick to Python if possible. We need a loop that gets data from multiple files and puts it into a CSV, so it should be something that runs in the background, reads the text and then pulls out the needed substring. That part I can write easily, but I need the text to work with.
So the question is: How do you get the text from a PDF which has all these restrictions?
- Password protection to open the file (we always get the password, so this should be no problem)
- Changing the document
- Document Assembly
- Content Copying
- Page Extraction
- Commenting
- Filling of form fields
- Signing
- Template page creation
The only two actions allowed are:
- Printing
- Copying Content for Accessibility
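To make clear what I need, here is a rough sketch of the loop I have in mind. It is only an outline under some assumptions: `extract_text` is a hypothetical placeholder for the part I can't get working (something that takes a file path and the password and returns the text), and `find_value`, `reports_to_csv`, the regex pattern and the CSV column names are made up for illustration.

```python
import csv
import re
from pathlib import Path
from typing import Callable, Iterable, Optional


def find_value(text: str, pattern: str) -> Optional[str]:
    """Return the first capture group of `pattern` in `text`, or None if absent."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


def reports_to_csv(
    pdf_paths: Iterable[Path],
    extract_text: Callable[[Path, str], str],  # hypothetical: the missing piece
    password: str,
    pattern: str,
    csv_path: Path,
) -> int:
    """Write one (file name, extracted value) row per PDF and return the row count."""
    rows = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "value"])  # illustrative column names
        for pdf_path in pdf_paths:
            text = extract_text(pdf_path, password)
            # Leave the cell empty when the pattern is not found in this report.
            writer.writerow([pdf_path.name, find_value(text, pattern) or ""])
            rows += 1
    return rows
```

So whatever the solution is, it only has to fit where `extract_text` goes; the rest of the loop stays the same.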