Jul-06-2021, 05:07 PM
(This post was last modified: Jul-06-2021, 05:33 PM by deanhystad.)
Do you have code that works for one file? I would turn that into a function and call it for each PDF file. Once the function exists, calling it for every file is trivial.
I have not tested this code at all, but provide it as an example of how I would split up this task.
```python
import os

from pytesseract import pytesseract
from pdf2image import convert_from_path

# Windows install paths from the original post; adjust for your system.
POPPLER = r'C:\Program Files\poppler-0.68.0_x86\poppler-0.68.0\bin'
pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'


def pdf_to_text(pdf_file, text_file, lang='fra'):
    """Use OCR to convert PDF file to text."""
    # pdf2image renders the PDF as one image per page.
    images = convert_from_path(pdf_file, poppler_path=POPPLER)
    # utf-8 so accented characters do not depend on the platform default.
    with open(text_file, 'w', encoding='utf-8') as out_file:
        for image in images:
            text = pytesseract.image_to_string(image, lang=lang)
            out_file.write(text.replace("\n", " "))


def pdf_files_to_text(folder, out_folder=None):
    """Using OCR convert all PDF files in folder to text files.

    Text files are saved in out_folder (defaults to folder) using the same
    name with the extension changed to .txt
    """
    if out_folder is None:
        out_folder = folder
    for file in os.listdir(folder):
        # Only PDF files are converted (the check was '.png' before).
        if file.endswith('.pdf'):
            text_name = os.path.splitext(file)[0] + '.txt'
            # os.path.join supplies the separator that folder+file lacked.
            pdf_to_text(os.path.join(folder, file),
                        os.path.join(out_folder, text_name))


if __name__ == '__main__':
    pdf_files_to_text(os.getcwd())
```

First I solve the real problem: converting a PDF file to text. This is a useful task, so I write a function that does it.
Next I write a function that will convert all the pdf files in a folder into text files. This too is a useful task, so I write it as a function that can be called by other code.
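If you want to check the folder loop without running OCR at all, one option is to pass the per-file converter in as an argument. This is only a sketch: `pdf_text_pairs` and `convert_folder` are names I made up for illustration, and it assumes you only want the PDF files directly inside the folder (no subfolders).

```python
from pathlib import Path


def pdf_text_pairs(folder, out_folder=None):
    """Yield (pdf_path, text_path) for every PDF directly inside folder.

    Hypothetical helper, not part of pytesseract or pdf2image.
    """
    folder = Path(folder)
    out_folder = folder if out_folder is None else Path(out_folder)
    for pdf_path in sorted(folder.iterdir()):
        # Compare the suffix in lower case so "SCAN.PDF" is picked up too.
        if pdf_path.is_file() and pdf_path.suffix.lower() == '.pdf':
            yield pdf_path, out_folder / (pdf_path.stem + '.txt')


def convert_folder(folder, convert, out_folder=None):
    """Call convert(pdf_file, text_file) for each PDF; return the text paths.

    convert is any callable taking two path strings, for example a
    single-file OCR function.
    """
    written = []
    for pdf_path, text_path in pdf_text_pairs(folder, out_folder):
        convert(str(pdf_path), str(text_path))
        written.append(text_path)
    return written
```

With the single-file function from the post you would call something like `convert_folder(os.getcwd(), pdf_to_text)`. While debugging you can pass a dummy function that just prints its two arguments, which shows which files would be converted and where the text would go.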