Python Forum
PyPDF2 Hanging When Trying to Open Corrupted PDF
#1
I'm using the PyPDF2 library to cycle through several thousand PDFs each day, search for specific text in the top-most part of each PDF that indicates the file can't be opened natively in Adobe Reader, and move those files to a different directory.

I then use the free bioPDF utility "Acrowrap.exe" to re-print these bad PDFs so that they can later be opened successfully in Adobe Reader.

I've never had any trouble opening PDF documents with PyPDF2, but today I hit one specific PDF that absolutely will not open with the library. When the code tries to open this one document, it just hangs forever and never errors out. I've tried putting the code snippet below in a Try... Except... block, but that doesn't catch any errors. The code stalls once it attempts to open the PDF; it never continues and never gives me an error message. This one PDF is definitely corrupted: I can't open it in Adobe Acrobat Reader or manually in Chrome. Both tell me that the file is corrupted and cannot be repaired.

Here's a very short snippet of my code that's been running fine for months except for hanging permanently on this 1 PDF document just this a.m.:

import PyPDF2

# Create a file object for the original PDF (path and file are set earlier in the script)
pdfFileObj = open(path + file, 'rb')

# Create a PDF reader object from the open file
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
Any thoughts on how to work around this corrupt PDF (and potential future ones), so that my code simply skips corrupted files and keeps going, would be greatly appreciated.
Reply
#2
You can find some more detail on other methods for extracting with PyPDF2 here: https://www.blog.pythonlibrary.org/2018/...th-python/.

In addition, Mike Driscoll (author of the above page) recently finished a book on ReportLab, and uses it to reconstruct PDFs.
Reply
#3
Quote: where my code will just successfully skip these corrupted files and keep on going
Look up Python's try and except. In addition to Larz60+'s suggestion, try converting the file to some other format and back again with either ImageMagick or PythonMagick, to see if that removes or corrects the corrupted part.
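
A minimal sketch of the try/except idea, in case it helps. The names open_with_pypdf2 and process_folder are made up for illustration, and it assumes the same PdfFileReader class used in the first post (PyPDF2 before 3.0). Note that this only skips a file once PyPDF2 actually raises an error; it does nothing about a call that takes a very long time to fail.

```python
import os


def open_with_pypdf2(pdf_path):
    """Open one PDF with PyPDF2 and return its page count."""
    # Imported here so the loop below can be tried out without PyPDF2 installed.
    import PyPDF2
    with open(pdf_path, "rb") as pdf_file:
        return PyPDF2.PdfFileReader(pdf_file).getNumPages()


def process_folder(folder, open_pdf=open_with_pypdf2):
    """Try every .pdf in folder; return (readable, skipped) lists of file names."""
    readable, skipped = [], []
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(".pdf"):
            continue
        try:
            open_pdf(os.path.join(folder, name))
        except Exception as error:  # broad on purpose: skip on any failure to open
            print(f"Skipping {name}: {error}")
            skipped.append(name)
        else:
            readable.append(name)
    return readable, skipped
```

Calling process_folder on the directory of incoming PDFs returns the file names that opened and the ones that were skipped, so the skipped ones can be moved or logged afterwards.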
Reply
#4
(Nov-08-2018, 09:16 PM)woooee Wrote: try converting it with either ImageMagick or PythonMagick to some other format and then back again
You could also install poppler-utils on your computer and use tools such as pdfseparate or pdfunite to transform the PDF file in various ways.
Reply
#5
Okay, so I wasn't waiting long enough for the Try... Except block to catch the file open error. I ran it again and gave it about 10-15 min. and it finally did get into the "Except" portion of the Try... Except block and I was able to skip on to the next PDF file that wasn't damaged.

Thanks for everyone's suggestions.

Still surprised that it takes 10-15 min. for the code to figure out that it simply cannot open the damaged PDF document, but that's what happens.
Reply
#6
After a few minutes (fewer than 5), cancel it and go on. If the file hasn't been read by then, it's not going to be. If you want, you can keep a list of the bad file names and try them again after all the other files have been processed. Also, web browsers are supposed to handle mangled PDF files. I've never tried, so I don't know.
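
One way to sketch the "cancel it and go on" idea is to do the open in a separate Python process and let subprocess.run enforce the time limit, since the PyPDF2 call itself has no timeout. Everything here is an assumption for illustration: the helper names, the 300-second default, and the PdfFileReader class from the first post (PyPDF2 before 3.0).

```python
import subprocess
import sys

# Script run in the child process: try to parse the PDF named in argv[1].
PROBE = (
    "import sys, PyPDF2\n"
    "with open(sys.argv[1], 'rb') as f:\n"
    "    PyPDF2.PdfFileReader(f)\n"
)


def finishes_in_time(command, timeout_seconds):
    """Return True if command exits with status 0 before the timeout."""
    try:
        result = subprocess.run(
            command,
            timeout=timeout_seconds,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        return False
    return result.returncode == 0


def pdf_is_readable(pdf_path, timeout_seconds=300):
    """Check one PDF in a child process so a stalled parse cannot block the caller."""
    return finishes_in_time([sys.executable, "-c", PROBE, pdf_path], timeout_seconds)


def split_good_and_bad(pdf_paths, timeout_seconds=300):
    """Return (good, bad) path lists; bad holds files that failed or timed out."""
    good, bad = [], []
    for pdf_path in pdf_paths:
        if pdf_is_readable(pdf_path, timeout_seconds):
            good.append(pdf_path)
        else:
            bad.append(pdf_path)
    return good, bad
```

The bad list is the "list of bad file names" to retry or re-print later. Starting a new interpreter for each file adds overhead, so with several thousand PDFs a day it may be worth checking only the files that are already taking too long.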
Reply
#7
You can ask the author about a timeout; I don't see one in the docs. Author email is [email protected]
Reply

