Python Forum
Several pdf files to text
#5
Do you have code that works for one file? I would turn that into a function and call it once for each PDF file. Once the single-file function works, looping over the files is trivial.

I have not tested this code at all, but provide it as an example of how I would split up this task.
import os
from pytesseract import pytesseract
from pdf2image import convert_from_path

POPPLER = r'C:\Program Files\poppler-0.68.0_x86\poppler-0.68.0\bin'
pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

def pdf_to_text(pdf_file, text_file, lang='fra'):
    """Use OCR to convert PDF file to text."""
    images = convert_from_path(pdf_file, poppler_path=POPPLER)
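    # Write UTF-8 explicitly so accented characters are saved correctly.
    with open(text_file, 'w', encoding='utf-8') as out_file:
        for image in images:
            text = pytesseract.image_to_string(image, lang=lang)
            # Flatten the line breaks within each page into spaces.
            out_file.write(text.replace("\n", " "))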

def pdf_files_to_text(folder, out_folder=None):
    """Using OCR convert all PDF files in folder to text files.  Text
    files are saved in out_folder (defaults to folder) using same name
    with extension changed to .txt
    """
    if out_folder is None:
        out_folder = folder
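    for file in os.listdir(folder):
        # Only process PDF files, whatever the case of the extension.
        if file.lower().endswith('.pdf'):
            # Build paths with os.path.join so the separator is always present.
            text_name = os.path.splitext(file)[0] + '.txt'
            pdf_to_text(os.path.join(folder, file),
                        os.path.join(out_folder, text_name))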

if __name__ == '__main__':
    pdf_files_to_text(os.getcwd())
First I solve the real problem: converting a PDF file to text. This is a useful task, so I write a function that does this task.

Next I write a function that will convert all the PDF files in a folder into text files. This too is a useful task, so I write it as a function that can be called by other code.
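To show why the split is handy, here is a rough sketch of the folder step on its own, with the single-file converter passed in as an argument. The names pdf_output_pairs and convert_folder are made up for this example and are not part of pytesseract or pdf2image. Like the code above, I would treat it as a starting point rather than a finished solution.

```python
from pathlib import Path


def pdf_output_pairs(folder, out_folder=None):
    """Return (pdf_path, text_path) pairs for every PDF file in folder."""
    folder = Path(folder)
    out_folder = Path(out_folder) if out_folder is not None else folder
    pairs = []
    for pdf_path in sorted(folder.iterdir()):
        # Accept .pdf, .PDF, and so on, but skip directories.
        if pdf_path.is_file() and pdf_path.suffix.lower() == '.pdf':
            pairs.append((pdf_path, out_folder / (pdf_path.stem + '.txt')))
    return pairs


def convert_folder(folder, convert, out_folder=None):
    """Call convert(pdf_file, text_file) for each PDF file in folder.

    convert is any function taking two path strings.  A file that raises
    an exception is recorded and skipped so one bad PDF does not stop
    the whole batch.  Returns a list of (file_name, exception) tuples.
    """
    failures = []
    for pdf_path, text_path in pdf_output_pairs(folder, out_folder):
        try:
            convert(str(pdf_path), str(text_path))
        except Exception as error:
            failures.append((pdf_path.name, error))
    return failures
```

With the pdf_to_text function from the post, the call would be something like convert_folder(os.getcwd(), pdf_to_text). Because the converter is just an argument, you can try the folder logic with a dummy function before Tesseract and Poppler are set up.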


Messages In This Thread
Several pdf files to text - by mfernandes - Jul-05-2021, 08:56 PM
RE: Several pdf files to text - by mfernandes - Jul-05-2021, 09:02 PM
RE: Several pdf files to text - by Pedroski55 - Jul-06-2021, 02:54 AM
RE: Several pdf files to text - by mfernandes - Jul-06-2021, 11:10 AM
RE: Several pdf files to text - by deanhystad - Jul-06-2021, 05:07 PM
RE: Several pdf files to text - by mfernandes - Jul-06-2021, 06:38 PM
RE: Several pdf files to text - by deanhystad - Jul-06-2021, 07:25 PM
RE: Several pdf files to text - by Pedroski55 - Jul-06-2021, 11:42 PM
RE: Several pdf files to text - by mfernandes - Jul-07-2021, 08:14 PM
RE: Several pdf files to text - by deanhystad - Jul-07-2021, 09:25 PM
RE: Several pdf files to text - by Pedroski55 - Jul-07-2021, 11:39 PM
