Jul-07-2021, 08:14 PM
Thank you all for your suggestions and your time. I apologize for the delay; I have only now been able to reply.
deanhystad, about "Your code then goes to the trouble of creating a bunch of temporary files just to read the images back into memory. Why? Is there not a function where you can pass in the image directly?": I did it this way because I have been piecing my code together from examples I found online. I tried to adapt your code so that it runs over a whole folder, but I got an error (it does work for a single file).
Pedroski55, the code does not run for me. I also tried removing def myApp():, and then I only get the last file in the folder.
I have now tried it this way:
from PIL import Image
import pytesseract
import sys
from pdf2image import convert_from_path
import os
import string

for filename in os.listdir("where the pdfs are"):
    filepath = ("where the pdfs are" + filename)
    pages = convert_from_path(filepath, poppler_path=r'C:\Program Files\poppler-0.68.0_x86\poppler-0.68.0\bin')
    image_counter = 1
    for page in pages:
        file = "where the image files are being saved" + str(os.path.splitext(filename)[0]) + "_" + str(image_counter) + ".jpg"
        page.save(file, 'JPEG')
        image_counter = image_counter + 1

# All pdfs were converted into images. Given that my pdfs are 2 pages, 2 images are created.
# Now I will convert the images into text.
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

def main():
    # path for the folder for getting the raw images
    path = "where the images are saved"
    # path for the folder for getting the output
    tempPath = "where I want to save the text files"
    # iterating the images inside the folder
    for imageName in os.listdir(path):
        inputPath = os.path.join(path, imageName)
        img = Image.open(inputPath)
        # applying ocr using pytesseract for python
        text = pytesseract.image_to_string(img, lang="fra")
        text = text.replace("\n", " ")
        # for removing the .jpg from the imagePath
        imageName = imageName[0:-4]
        fullTempPath = os.path.join(tempPath+imageName+".txt")
        print(text)
        # saving the text for every image in a separate .txt file
        file1 = open(fullTempPath, "w")
        file1.write(text)
        file1.close()

if __name__ == '__main__':
    main()

This code works. The problem is that, because my PDFs have 2 pages each, I get 2 text files per PDF, e.g. "name_1" and "name_2". Is it possible to merge the text files that start with the same name? Or to amend the code slightly so that I get just 1 text file per PDF?
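One direction I am considering for the merge, as a rough sketch rather than a finished solution: group the text files by the part of the name before the final "_<number>" and write each group into a single file, in page order. The names merge_pages, PAGE_FILE and merged_dir are made up for this illustration, and it assumes the text files are named like "name_1.txt", "name_2.txt", as my code above produces them.

```python
import os
import re
from collections import defaultdict

# Hypothetical pattern: a base name, an underscore, a page number, then ".txt".
PAGE_FILE = re.compile(r"^(?P<base>.+)_(?P<page>\d+)\.txt$")


def merge_pages(text_dir, merged_dir):
    """Join the per-page .txt files in text_dir into one file per base name.

    Returns the sorted list of merged file paths that were written.
    """
    pages_by_base = defaultdict(list)
    for name in os.listdir(text_dir):
        match = PAGE_FILE.match(name)
        if match:  # files that do not look like "base_N.txt" are skipped
            pages_by_base[match["base"]].append((int(match["page"]), name))

    os.makedirs(merged_dir, exist_ok=True)
    written = []
    for base, pages in pages_by_base.items():
        out_path = os.path.join(merged_dir, base + ".txt")
        # No encoding is passed, so files are read and written with the same
        # platform default that open(..., "w") used when the pages were saved.
        with open(out_path, "w") as out:
            # Sort by the integer page number so page 10 comes after page 2.
            for _, name in sorted(pages):
                with open(os.path.join(text_dir, name)) as page:
                    out.write(page.read().strip() + "\n")
        written.append(out_path)
    return sorted(written)
```

I would call it after main() has finished, for example merge_pages("where I want to save the text files", "where the merged files should go"). Writing to a separate folder keeps the merged files from being mixed in with the per-page ones.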