Python Forum
Several pdf files to text
#9
Thank you all for your suggestions and time. I apologize for the late reply; I have only now been able to answer.
deanhystad, regarding "Your code then goes to the trouble of creating a bunch of temporary files just to read the images back into memory. Why? Is there not a function where you can pass in the image directly?": I did it this way because I have been piecing my code together from examples I found online. I tried to adapt your code so that it runs over the whole folder, but I got an error (the code does work for a single file).
Pedroski55, your code does not run for me. I also tried removing def myApp():, and then I only get output for the last file in the folder.
I have now tried it this way:
from PIL import Image
import pytesseract
from pdf2image import convert_from_path
import os

for filename in os.listdir("where the pdfs are"):
    filepath = os.path.join("where the pdfs are", filename)
    pages = convert_from_path(filepath, poppler_path=r'C:\Program Files\poppler-0.68.0_x86\poppler-0.68.0\bin')
    # save each page as name_1.jpg, name_2.jpg, ...
    for image_counter, page in enumerate(pages, start=1):
        file = os.path.join("where the image files are being saved", os.path.splitext(filename)[0] + "_" + str(image_counter) + ".jpg")
        page.save(file, 'JPEG')

# All PDFs have now been converted into images. Since each of my PDFs has 2 pages, 2 images are created per PDF. Next, the images are converted into text.

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

def main():
    # path for the folder for getting the raw images
    path ="where the images are saved"
  
    # path for the folder for getting the output
    tempPath ="where I want to save the text files"
  
    # iterating the images inside the folder
    for imageName in os.listdir(path):
              
        inputPath = os.path.join(path, imageName)
        img = Image.open(inputPath)
  
        # applying ocr using pytesseract for python
        text = pytesseract.image_to_string(img, lang ="fra")
        text = text.replace("\n", " ")
  
        # for removing the .jpg from the imageName
        imageName = os.path.splitext(imageName)[0]
  
        fullTempPath = os.path.join(tempPath, imageName + ".txt")
        print(text)
        
        # saving the text for every image in a separate .txt file
        with open(fullTempPath, "w") as file1:
            file1.write(text)
  
if __name__ == '__main__':
    main()
This code works. The problem is that, because my PDFs have 2 pages each, I get 2 text files per PDF, e.g. "name_1" and "name_2". Is it possible to merge the text files that start with the same name? Or to slightly amend the code so that I get just 1 text file per PDF?
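To make the first option concrete, here is a minimal sketch of merging the per-page text files after the OCR step. It uses only the standard library and assumes the files are named "name_1.txt", "name_2.txt" and so on, as produced above. The function name merge_page_files and its parameters are made up for illustration, and encoding=None simply reuses the platform default that the plain open(..., "w") call above also uses.

```python
import os
import re
from collections import defaultdict


def merge_page_files(text_dir, output_dir, encoding=None):
    """Merge name_1.txt, name_2.txt, ... from text_dir into output_dir/name.txt.

    Hypothetical helper: assumes page files end in "_<page number>.txt".
    Returns the sorted list of merged file paths.
    """
    page_pattern = re.compile(r"^(?P<base>.+)_(?P<page>\d+)\.txt$")
    pages_by_base = defaultdict(list)

    # Group the page files by the part of the name before "_<number>.txt".
    for file_name in os.listdir(text_dir):
        match = page_pattern.match(file_name)
        if match:
            pages_by_base[match["base"]].append((int(match["page"]), file_name))

    os.makedirs(output_dir, exist_ok=True)
    merged_paths = []
    for base, pages in pages_by_base.items():
        merged_path = os.path.join(output_dir, base + ".txt")
        with open(merged_path, "w", encoding=encoding) as merged:
            # Sort numerically so that page 10 comes after page 2.
            for _, file_name in sorted(pages):
                part_path = os.path.join(text_dir, file_name)
                with open(part_path, encoding=encoding) as part:
                    merged.write(part.read().strip() + "\n")
        merged_paths.append(merged_path)
    return sorted(merged_paths)
```

Calling merge_page_files("where I want to save the text files", "some other folder") after main() would leave the per-page files untouched and write one combined file per PDF. The second option, collecting the text of all pages of one PDF in a list and writing it once, would avoid the intermediate files altogether, but it means restructuring the two loops.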

Messages In This Thread
Several pdf files to text - by mfernandes - Jul-05-2021, 08:56 PM
RE: Several pdf files to text - by mfernandes - Jul-05-2021, 09:02 PM
RE: Several pdf files to text - by Pedroski55 - Jul-06-2021, 02:54 AM
RE: Several pdf files to text - by mfernandes - Jul-06-2021, 11:10 AM
RE: Several pdf files to text - by deanhystad - Jul-06-2021, 05:07 PM
RE: Several pdf files to text - by mfernandes - Jul-06-2021, 06:38 PM
RE: Several pdf files to text - by deanhystad - Jul-06-2021, 07:25 PM
RE: Several pdf files to text - by Pedroski55 - Jul-06-2021, 11:42 PM
RE: Several pdf files to text - by mfernandes - Jul-07-2021, 08:14 PM
RE: Several pdf files to text - by deanhystad - Jul-07-2021, 09:25 PM
RE: Several pdf files to text - by Pedroski55 - Jul-07-2021, 11:39 PM
