Python Forum (https://python-forum.io), General Coding Help
Thread: Need help to open PDF file and Export to text file (/thread-5536.html)

Need help to open PDF file and Export to text file - ratna_ain - Oct-09-2017

Hi All,

I need to open a PDF file with Adobe Reader and save it as a text file using SendKeys:
- File : ALT+F
- Save to Others : H
- Text : X

This is my code to open the file and send the keys:

    import win32com.client
    import os
    from sys import argv

    shell = win32com.client.Dispatch("WScript.Shell")
    filename = "C:\RATNA\temp\TU1-2.pdf"
    os.chdir('C:\\RATNA\\temp')
    os.system('"C:\\Program Files (x86)\\Adobe\\Reader 11.0\\Reader\\AcroRd32.exe" TU1-2.pdf' )
    shell.AppActivate('Acrobat.exe')
    shell.SendKeys("%{f}",0)
    shell.SendKeys("H", 0)
    shell.SendKeys("X", 0)

The problem with this code is that the keys are sent only after I close the PDF file.

Thank You

RE: Need help to open PDF file and Export to text file - nilamo - Oct-09-2017

os.system will block until the call completes. If you use something that doesn't block, such as the subprocess module, it might work.
https://docs.python.org/3/library/subprocess.html#subprocess.run

RE: Need help to open PDF file and Export to text file - buran - Oct-09-2017

Python has plenty of packages that can convert PDF to text (extract text from a PDF) natively, without sending keys to an external application. Just to name a few (in no particular order, i.e. not as a recommendation):
- textract
- PDFminer (python2) and its pdf2txt tool; also pdfminer.six, a fork with python2/3 support
- slate, a wrapper around PDFminer

RE: Need help to open PDF file and Export to text file - ratna_ain - Oct-10-2017

(Oct-09-2017, 05:49 PM)buran Wrote: python has plenty of packages that allow to convert pdf to text (extract text from pdf) in a native way, not by sending keys to external application.

Yeah, I have tried all of them, including Apache Tika. The problem is that when I use those packages, some files (around 10%) are not extracted correctly. Example: the title is extracted in the middle of the content, but in the PDF the title is actually at the top.

When I extract the same files manually with Adobe Reader, the title is extracted correctly. So my idea is to use Adobe Reader for the files with this problem. We cannot do them one by one because the number of files is high.
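
For reference, nilamo's non-blocking suggestion could be sketched roughly as below. This is only a sketch under assumptions, not a tested Adobe Reader recipe: it uses subprocess.Popen rather than subprocess.run, because run also waits for the program to exit; the helper names (launch_without_blocking, wait_and_activate, send_export_keys, export_with_reader) are made up for illustration; and the window title passed to AppActivate is a guess that has to match whatever Reader shows in its title bar. The key sequence is copied from the first post.

```python
import subprocess
import time


def launch_without_blocking(command):
    """Start an external program and return at once, unlike os.system()."""
    return subprocess.Popen(command)


def wait_and_activate(shell, window_title, timeout=30.0, poll=0.5):
    """Retry AppActivate until it reports success or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if shell.AppActivate(window_title):
            return True
        time.sleep(poll)
    return False


def send_export_keys(shell, pause=0.5):
    """Send the menu keystrokes from the first post, one step at a time."""
    for keys in ("%{f}", "H", "X"):
        shell.SendKeys(keys, 0)
        time.sleep(pause)  # give each menu time to open before the next key


def export_with_reader(reader_exe, pdf_path, window_title):
    """Windows only: assumes pywin32 and Adobe Reader are installed."""
    import win32com.client  # imported here so the helpers above load anywhere

    shell = win32com.client.Dispatch("WScript.Shell")
    proc = launch_without_blocking([reader_exe, pdf_path])
    if not wait_and_activate(shell, window_title):
        proc.terminate()
        raise RuntimeError("Reader window did not appear in time")
    send_export_keys(shell)
    return proc
```

On Windows it would be called with the paths from the first post, for example export_with_reader(r"C:\Program Files (x86)\Adobe\Reader 11.0\Reader\AcroRd32.exe", r"C:\RATNA\temp\TU1-2.pdf", "TU1-2.pdf"), where the last argument is an assumed window title. Polling AppActivate instead of sleeping a fixed time is a design choice so the keys are only sent once the window has actually taken focus; any dialog that follows the menu (file name, confirmation) is not handled here.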