Several pdf files to text

mfernandes · Jul-07-2021, 08:14 PM

Thank you all for your suggestions and your time. I apologize for the late reply; I have only now had time to answer.
deanhystad, about "Your code then goes to the trouble of creating a bunch of temporary files just to read the images back into memory. Why? Is there not a function where you can pass in the image directly?" The reason I did it this way is that I have been searching for code online to build my own. I tried to adapt your code so that it runs over a whole folder, but I got an error (the code does work for a single file).
Pedroski55, the code does not run for me. I also tried removing def myApp():, and then I only get the last file in the folder.
I have now tried it this way:

from PIL import Image
import pytesseract
from pdf2image import convert_from_path
import os

pdf_dir = "where the pdfs are"
image_dir = "where the image files are being saved"

for filename in os.listdir(pdf_dir):
    # os.path.join adds the path separator between folder and file name
    filepath = os.path.join(pdf_dir, filename)
    pages = convert_from_path(filepath, poppler_path=r'C:\Program Files\poppler-0.68.0_x86\poppler-0.68.0\bin')
    stem = os.path.splitext(filename)[0]
    # one image per page, numbered from 1: name_1.jpg, name_2.jpg, ...
    for image_counter, page in enumerate(pages, start=1):
        file = os.path.join(image_dir, stem + "_" + str(image_counter) + ".jpg")
        page.save(file, 'JPEG')

# All pdfs were converted into images. Given that my pdfs are 2 pages, 2 images are created. Now I will convert the images into text.

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

def main():
    # path for the folder for getting the raw images
    path ="where the images are saved"
  
    # path for the folder for getting the output
    tempPath ="where I want to save the text files"
  
    # iterating the images inside the folder
    for imageName in os.listdir(path):
              
        inputPath = os.path.join(path, imageName)
        img = Image.open(inputPath)
  
        # applying ocr using pytesseract for python
        text = pytesseract.image_to_string(img, lang ="fra")
        text = text.replace("\n", " ")
  
        # removing the extension (e.g. .jpg) from the image name
        imageName = os.path.splitext(imageName)[0]
  
        # join the folder and the file name as separate arguments
        fullTempPath = os.path.join(tempPath, imageName + ".txt")
        print(text)
        
        # saving the text for every image in a separate .txt file
        with open(fullTempPath, "w", encoding="utf-8") as file1:
            file1.write(text)
  
if __name__ == '__main__':
    main()

This code works. The problem is that, because my PDFs have 2 pages each, I get 2 text files per PDF, e.g. "name_1" and "name_2". Is it possible to merge the text files that start with the same name? Or to slightly amend the code so that I get just 1 text file per PDF?
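
One possible way to do the merge, as a minimal sketch rather than a definitive answer: group the per-page text files by the part of the name before the final "_<number>", sort each group by page number, and write one combined file per PDF. The function name merge_page_texts and the two folder arguments are hypothetical, and the sketch assumes the files follow the "name_1.txt", "name_2.txt" pattern produced by the script above. Files are copied as raw bytes, so it does not matter which text encoding they were written with.

```python
import os
import re
from collections import defaultdict

# Matches "<name>_<page number>.txt", e.g. "report_2.txt" (assumed naming pattern)
PAGE_FILE = re.compile(r"^(?P<stem>.+)_(?P<page>\d+)\.txt$")


def merge_page_texts(text_dir, merged_dir):
    """Join name_1.txt, name_2.txt, ... from text_dir into merged_dir/name.txt."""
    pages_by_stem = defaultdict(list)
    for entry in os.listdir(text_dir):
        match = PAGE_FILE.match(entry)
        if match:  # files that do not follow the pattern are skipped
            pages_by_stem[match["stem"]].append((int(match["page"]), entry))

    os.makedirs(merged_dir, exist_ok=True)
    merged_paths = []
    for stem, pages in pages_by_stem.items():
        parts = []
        # sort by the numeric page number, so page 10 comes after page 2
        for _, entry in sorted(pages):
            with open(os.path.join(text_dir, entry), "rb") as page_file:
                parts.append(page_file.read())
        merged_path = os.path.join(merged_dir, stem + ".txt")
        with open(merged_path, "wb") as merged_file:
            merged_file.write(b"\n".join(parts))  # one line break between pages
        merged_paths.append(merged_path)
    return merged_paths
```

It could be called once after main() has finished, for example merge_page_texts("where I want to save the text files", "where the merged files should go"). Writing the merged files to a separate folder keeps them from being mixed up with the per-page files. Another option would be to skip the intermediate files entirely: pytesseract.image_to_string also accepts the PIL images returned by convert_from_path, so the text of all pages of a PDF could be collected in a list and written once per PDF.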
