Several pdf files to text

mfernandes · Jul-05-2021, 08:56 PM

Dear Python community,
I have several pdf files in a folder and I would like to convert all of them into text files. This link explains how to prepare the code for one pdf file: https://www.geeksforgeeks.org/python-rea...cognition/.
Before coding, it was necessary to install tesseract (https://pypi.org/project/pytesseract/) and poppler (https://poppler.freedesktop.org/).
I am trying to prepare my code for several pdf files:

from PIL import Image
import pytesseract
import sys
from pdf2image import convert_from_path
import os
import string

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

def main():
    # path for the folder for getting the pdfs
    path="C:/Users/mydirectory"
    # path for the folder for getting the output
    tempPath ="C:/Users/mydirectory (2)"
    
    for imageName in os.listdir(path):
        pages = convert_from_path(path, poppler_path=r'C:\Program Files\poppler-0.68.0_x86\poppler-0.68.0\bin')
        image_counter = 1
        for page in pages:
            filename = "page_"+str(os.image_counter)+".png"
            page.save(filename, 'PNG')
            image_counter = image_counter + 1
        filelimit = image_counter-1
        for i in range(1, filelimit+1):
            filename="page_"+str(i)+".png"
        inputPath=os.path.join(path, imageName)
        text = pt.image_to_string(Image.open(filename), lang ="fra")        
        text = text.replace("\n", " ")
        fullTempPath = os.path.join(tempPath, 'time_'+imageName+".txt")
        file1 = open(fullTempPath, "w")
        file1.write(text)
        file1.close() 
  
if __name__ == '__main__':
    main()

However, I am obtaining the following message "PDFPageCountError: Unable to get page count.
I/O Error: Couldn't open file 'C:\Users\mydirectory': No error."
Thank you!

mfernandes · Jul-05-2021, 09:02 PM

I apologize, but I only noticed after posting that this thread was published in "Code sharing" when I intended to post it in "General coding help". Please delete it.

Pedroski55 · Jul-06-2021, 02:54 AM

In a previous job my gf needed to convert a lot of old exams that were image pdfs to text.

I made this, it is using chi_sim, not French, but it works well. Probably needs tidying up a bit, I am not an expert, but it works!

If you need to convert many pdf files, just put this in a big loop and read fileforocr from a list.

import os
from pdf2jpg import pdf2jpg
from PIL import Image
import pytesseract

# set your source and destination and choose your PDF first
source = '/a/path/'
destination = '/another/path/'

# if you want to convert many PDFs, make loop here
fileforocr = input('Enter the path and name of the pdf to ocr ... ' )

# crack the PDF open

def splitPDF(aPDF, source, destination):
    print('Splitting the PDF to individual jpgs ... ')
    outputName = aPDF.split('.')
    savename = outputName[0]    
    # images is a list
    images = pdf2image.convert_from_path(source + aPDF)
    i=1
    for image in images:
        image.save(destination + savename + str(i) + '.jpg', 'JPEG')
        i+=1           
    print('PDF split to .jpgs and all saved in: ' + destination + '\n\n')

# get the jpgs
jpgFiles = os.listdir(convertPdfJpg)
jpgFiles.sort()

saveFilepath = '/home/pedro/babystuff/ocr_textfiles/'
saveFilename = input('Enter a name to save the text to ... ')
bookCopy = open(saveFilepath + saveFilename, 'a')

# this works fine
for i in range( len(jpgFiles)):
    chiText1 = pytesseract.image_to_string(Image.open(convertPdfJpg + jpgFiles[i]), lang='chi_sim')
    print('Page ' + str(i + 1) + ' done')
    bookCopy.write(chiText1)
    print('Next loop coming up')

bookCopy.close()

mfernandes · Jul-06-2021, 11:10 AM

Thank you Pedroski55 for your code. I tried to run it and python responds with "/a/path", so I entered the name of a random pdf that is in that folder, and I obtained the following error:

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-5d01fbb89ebc> in <module>()
     30 
     31 # get the jpgs
---> 32 jpgFiles = os.listdir(convertPdfJpg)
     33 jpgFiles.sort()
     34 

NameError: name 'convertPdfJpg' is not defined

From what I understand, this code will ask me to enter the name of each pdf file in my folder, when I would like to convert everything at once. I forgot to mention it, but I would like my text files to have the same names as my pdf files.

**deanhystad** · (This post was last modified: Jul-06-2021, 05:33 PM by deanhystad.)

Do you have code that works for 1 file? I would convert that to a function and call the function for each pdf file. The process of calling a function to process a file will be trivial.

I have not tested this code at all, but provide it as an example of how I would split up this task.

import sys
import os
from pytesseract import pytesseract
from pdf2image import convert_from_path

POPPLER = r'C:\Program Files\poppler-0.68.0_x86\poppler-0.68.0\bin'
pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

def pdf_to_text(pdf_file, text_file, lang='fra'):
    """Use OCR to convert PDF file to text."""
    images = convert_from_path(pdf_file, poppler_path=POPPLER)
    with open(text_file, 'w') as out_file:
        for image in images:
            text = pytesseract.image_to_string(image, lang =lang)
            out_file.write(text.replace("\n", " "))

def pdf_files_to_text(folder, out_folder=None):
    """Using OCR convert all PDF files in folder to text files.  Text
    files are saved in out_folder (defaults to folder) using same name
    with extension changed to .txt
    """
    if out_folder is None:
        out_folder = folder
    for file in os.listdir(folder):
        if file.endswith('.png'):
            pdf_to_text(folder+file, out_folder+file[:-3]+'txt')

if __name__ == '__main__':
    pdf_files_to_text(os.getcwd())

First I solve the real problem: converting a PDF file to text. This is a useful task, so I write a function that does this task.

Next I write a function that will convert all the pdf files in a folder into text files. This too is a useful task, so I write it as a function that can be called by other code.

mfernandes · (This post was last modified: Jul-06-2021, 06:39 PM by mfernandes.)

Thank you deanhystad for your suggestions.
Yes, I have code that works for 1 file, but I do not know how to convert it to handle all the pdfs (the 1st code that I posted was my attempt at going from 1 pdf to all pdfs). I started using python very recently. Here is the code for 1 pdf:

from PIL import Image
import pytesseract
import sys
from pdf2image import convert_from_path
import os
import string

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
  
# Path of the pdf
PDF_file = "where I have the pdfs/Same-title.pdf"

pages = convert_from_path(PDF_file, poppler_path=r'C:\Program Files\poppler-0.68.0_x86\poppler-0.68.0\bin')

image_counter = 1
  
# Iterate through all the pages stored above
for page in pages:
    # Declaring filename for each page of PDF as png. Also works with jpg
    filename = "where I want to save the images"+"page_"+str(image_counter)+".png"
      
    # Save the image of the page in system
    page.save(filename, 'PNG')
  
    # Increment the counter to update filename
    image_counter = image_counter + 1
    
    filelimit = image_counter-1
    
outfile="where I want to save the text files/Same-title.txt"
  
# Open the file in append mode so that all contents of all images are added to the same file
f = open(outfile, "a")
  
# Iterate from 1 to total number of pages
for i in range(1, filelimit + 1):
    filename = "page_"+str(i)+".png"
    # Recognize the text as string in image using pytesseract
    text = str(((pytesseract.image_to_string(Image.open(filename), lang='fra'))))
    text = text.replace("\n", " ") 

    # Finally, write the processed text to the file.
    f.write(text)
    
f.close()

I also tried to use your code and added after "POPPLER":

pdf_file="C:/Users/Desktop/path for getting the pdfs"
folder="C:/Users/Desktop/path to save the text files"

but I obtained the following error: "PDFPageCountError: Unable to get page count.
I/O Error: Couldn't open file 'C:\Userspage_1.png': No error."

**deanhystad** · (This post was last modified: Jul-06-2021, 07:25 PM by deanhystad.)

This does nothing useful:

pdf_file="C:/Users/Desktop/path for getting the pdfs"
folder="C:/Users/Desktop/path to save the text files"

The pdf_file you define here is not the pdf_file argument used by pdf_to_text(). Same goes for folder not being the argument passed to pdf_files_to_text().
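
To make the point about arguments concrete, here is a minimal sketch. The function body is a stand-in that only reports what it received (an assumption for illustration, it is not the OCR code), and the second path is made up.

```python
def pdf_files_to_text(folder):
    """Stand-in for the real function: just report which folder it was given."""
    return f"would convert PDFs in: {folder}"


# A module-level name that happens to match the parameter name.
# It is never passed to the function, so the function never sees it.
folder = "C:/some/other/place"  # hypothetical path

# The function only works with the value supplied in the call itself.
result = pdf_files_to_text("C:/Users/Desktop/path for getting the pdfs")
print(result)
```

Assigning to a variable called folder at the top of the file changes nothing inside the function; only the value in the call matters.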

As I said, I have not tested any of this code. I don't want to install poppler and I have no need for pytesseract. But if you did want to try running my code you either need to run this file while in the "C:/Users/Desktop/path for getting the pdfs" folder or modify the last line to be this.

    pdf_files_to_text("C:/Users/Desktop/path for getting the pdfs")
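
One more thing to watch when passing a folder in: the function as posted builds its paths with folder+file, so the folder string has to end with a separator. The 'C:\Userspage_1.png' in the error message above looks like that problem. A small sketch of the difference, using made-up names:

```python
import os

folder = "pdfs"          # hypothetical folder name, no trailing separator
file = "Same-title.pdf"  # hypothetical file name

# Plain concatenation glues the two names together.
glued = folder + file

# os.path.join inserts the separator that belongs between them.
joined = os.path.join(folder, file)

# os.path.splitext swaps the extension without counting characters.
text_name = os.path.splitext(file)[0] + ".txt"

print(glued)
print(joined)
print(text_name)
```

Using os.path.join (or pathlib) means it no longer matters whether the folder was typed with a trailing slash.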

Why are you creating image files from images only to open them back up to create images? This creates a list of images.

pages = convert_from_path(PDF_file, poppler_path=r'C:\Program Files\poppler-0.68.0_x86\poppler-0.68.0\bin')

Your code then goes to the trouble of creating a bunch of temporary files just to read the images back into memory. Why? Is there not a function where you can pass in the image directly?
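
For what it is worth, pytesseract.image_to_string() accepts a PIL image object as well as a file name, and convert_from_path() returns a list of PIL images, so the temporary files can be skipped. A minimal sketch, not run against real PDFs; the function name ocr_pdf_in_memory and its defaults are my own assumptions:

```python
def ocr_pdf_in_memory(pdf_file, text_file, lang="fra", poppler_path=None):
    """OCR every page of one PDF into a single text file, without temp images."""
    # Imported inside the function so this file can be loaded even where
    # the OCR packages are not installed.
    import pytesseract
    from pdf2image import convert_from_path

    # One PIL image per page of the PDF.
    pages = convert_from_path(pdf_file, poppler_path=poppler_path)

    with open(text_file, "w", encoding="utf-8") as out:
        for page in pages:
            # The page image goes straight to tesseract; nothing is saved to disk.
            text = pytesseract.image_to_string(page, lang=lang)
            out.write(text.replace("\n", " "))
```

Because all pages are written to the same open file, a 2 page PDF ends up as one text file.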

If you do need to save the images to files for some image processing reason, I suggest translating one image at a time.

import pytesseract
import sys
from pdf2image import convert_from_path
import os

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

PDF_file = "where I have the pdfs/Same-title.pdf"
outfile="where I want to save the text files/Same-title.txt"
tempfile = "where I can create a file for temporary use/tempfile.png"

pages = convert_from_path(PDF_file, poppler_path=r'C:\Program Files\poppler-0.68.0_x86\poppler-0.68.0\bin')
 
with open(outfile, "w") as f:
    for page in pages:
        page.save(tempfile, 'PNG')

        # Recognize the text as string in image using pytesseract
        text = str(pytesseract.image_to_string(tempfile, lang='fra'))  # Why were there extra () here?
        text = text.replace("\n", " ") 
    
        # Finally, write the processed text to the file.
        f.write(text)

Now you don't have to worry about keeping track of page numbers.

The next step is to clean things up a bit and turn most of that code into a function you can call. I suggest passing in the filenames for the pdf file and the text file.

from PIL import Image
import pytesseract
import sys
from pdf2image import convert_from_path
import os

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

PDF_file = "where I have the pdfs/Same-title.pdf"
outfile="where I want to save the text files/Same-title.txt"
tempfile = "where I can create a file for temporary use/tempfile.png"

def convert_pdf_to_text(pdf_file, text_file):
    pages = convert_from_path(pdf_file, poppler_path=r'C:\Program Files\poppler-0.68.0_x86\poppler-0.68.0\bin')
    
    with open(text_file, "w") as f:
        for page in pages:
            page.save(tempfile, 'PNG')

            # Recognize the text as string in image using pytesseract
            text = str(((pytesseract.image_to_string(tempfile, lang='fra'))))
            text = text.replace("\n", " ") 
        
            # Finally, write the processed text to the file.
            f.write(text)

convert_pdf_to_text(PDF_file, outfile)

Now that you have a function that does all the conversion, all you need to do is call the function for each file in a folder.
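
That last step could look roughly like the sketch below. It assumes a working single-file function is passed in as convert (for example convert_pdf_to_text above); convert_folder and its parameter names are hypothetical. pathlib takes care of the path separators and of swapping the extension, so each text file gets the same name as its PDF.

```python
from pathlib import Path


def convert_folder(pdf_folder, text_folder, convert):
    """Call convert(pdf_path, text_path) once for every .pdf in pdf_folder.

    Returns the names of the text files that were requested.
    """
    pdf_folder = Path(pdf_folder)
    text_folder = Path(text_folder)
    # Create the output folder if it does not exist yet.
    text_folder.mkdir(parents=True, exist_ok=True)

    done = []
    for pdf_path in sorted(pdf_folder.iterdir()):
        if pdf_path.suffix.lower() != ".pdf":
            continue  # skip images, text files, sub-folders and so on
        # Same base name as the PDF, with the extension changed to .txt
        text_path = text_folder / (pdf_path.stem + ".txt")
        convert(str(pdf_path), str(text_path))
        done.append(text_path.name)
    return done
```

It would be called as convert_folder("where I have the pdfs", "where I want to save the text files", convert_pdf_to_text).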

Pedroski55 · (This post was last modified: Jul-07-2021, 04:15 AM by Pedroski55.)

Hi again, just had some free time, so I tidied up my pdf to text program.

You just need to change the paths, I don't use Windows, so I am not too sure about the correct format.

Then you can paste this in your Idle shell and enter myApp()

Works well for me! The girlfriend might need it again someday, have to keep her happy

def myApp():
    import os    
    import pdf2image
    from PIL import Image
    import pytesseract

    # set your paths

    source = '/home/pedro/babystuff/pdf2text/'
    destination_jpg = '/home/pedro/babystuff/pdf2jpg/'
    save_text_path = '/home/pedro/babystuff/ocr_textfiles/'

    # get the pdf files

    files = os.listdir(source)
    mypdfs = []

    # maybe there are some other files in there, so only get .pdf files
    for f in files:
        if f.endswith('.pdf'):
            mypdfs.append(f)

    # get rid of the jpg files after reading them

    def junkjpgs(path):
        print('Clearing out the folders we use, in case there is anything in there ... ')
        pics = os.listdir(path)
        if len(pics) == 0:
            print('Nothing in ' + path + '\n\n')
            return
        for file in pics:
            os.remove(path + file)
        print('ALL files removed from: ' + path + '\n\n')

    # in case there are any old jpg files in the jpg folder
    junkjpgs(destination_jpg)

    # crack the PDF open

    def splitPDF(aPDF, source, destination):
        print('Splitting the PDF to individual jpgs ... ')
        outputName = aPDF.split('.')
        savename = outputName[0]    
        # images is a list
        images = pdf2image.convert_from_path(source + aPDF)
        i=1
        for image in images:
            image.save(destination + savename + str(i) + '.jpg', 'JPEG')
            i+=1           
        print('PDF split to .jpgs and all saved in: ' + destination + '\n\n')
        savetextname = savename + '.txt'
        return savetextname

    def convert2text(name):
        # get the jpgs
        jpgFiles = os.listdir(destination_jpg)
        jpgFiles.sort()
        this_text = open(save_text_path + name, 'a')
        # this works fine
        for i in range(len(jpgFiles)):
            chiText1 = pytesseract.image_to_string(Image.open(destination_jpg + jpgFiles[i]), lang='chi_sim')
            print('Page ' + str(i + 1) + ' done')
            this_text.write(chiText1)
            print('Next loop coming up')
        this_text.close()
        print('removing the jpgs ... ')
        junkjpgs(destination_jpg)
        print('finished this PDF ... ')
        

    for f in mypdfs:
        text_name = splitPDF(f, source, destination_jpg)
        convert2text(text_name)

mfernandes · Jul-07-2021, 08:14 PM

Thank you all for your suggestions and time. I apologize, I was only able to answer now.
deanhystad, about "Your code then goes to the trouble of creating a bunch of temporary files just to read the images back into memory. Why? Is there not a function where you can pass in the image directly?": the reason for doing it this way is that I have been searching for code online to build my own. I tried to start from your code to write a version that runs on the whole folder, but I got an error (the code works for just one file).
Pedroski55, the code does not run for me. I also tried removing def myApp():, and then I only get the last file in the folder.
I have now tried it this way:

from PIL import Image
import pytesseract
import sys
from pdf2image import convert_from_path
import os
import string

for filename in os.listdir("where the pdfs are"):    
    filepath = ("where the pdfs are" + filename)
    pages = convert_from_path(filepath, poppler_path=r'C:\Program Files\poppler-0.68.0_x86\poppler-0.68.0\bin')
    image_counter = 1
    for page in pages:
        file = "where the image files are being saved" + str(os.path.splitext(filename)[0]) + "_" + str(image_counter)+ ".jpg"
        page.save(file, 'JPEG')
        image_counter = image_counter + 1

# All pdfs were converted into images. Given that my pdfs are 2 pages, 2 images are created. Now I will convert the images into text.

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

def main():
    # path for the folder for getting the raw images
    path ="where the images are saved"
  
    # path for the folder for getting the output
    tempPath ="where I want to save the text files"
  
    # iterating the images inside the folder
    for imageName in os.listdir(path):
              
        inputPath = os.path.join(path, imageName)
        img = Image.open(inputPath)
  
        # applying ocr using pytesseract for python
        text = pytesseract.image_to_string(img, lang ="fra")
        text = text.replace("\n", " ")
  
        # for removing the .jpg from the imagePath
        imageName = imageName[0:-4]
  
        fullTempPath = os.path.join(tempPath+imageName+".txt")
        print(text)
        
        # saving the  text for every image in a separate .txt file
        file1 = open(fullTempPath, "w")
        file1.write(text)
        file1.close() 
  
if __name__ == '__main__':
    main()

This code works, the problem is that because my pdfs are 2 pages, I am obtaining 2 text files, e.g."name_1" "name_2". Is it possible to merge two text files that start with the same name? Or slightly amend the code, so that I just obtain 1 text file?

**deanhystad** · (This post was last modified: Jul-07-2021, 09:43 PM by deanhystad.)

You cannot write code by cut and paste. You need to understand what the code does, otherwise you have no ability to debug the code when there are errors. And there are always errors.

When you say my code works for 1 file, what code do you mean? This code?

def convert_pdf_to_text(pdf_file, text_file):
    pages = convert_from_path(pdf_file, poppler_path=r'C:\Program Files\poppler-0.68.0_x86\poppler-0.68.0\bin')
     
    with open(text_file, "w") as f:
        for page in pages:
            page.save(tempfile, 'PNG')
 
            # Recognize the text as string in image using pytesseract
            text = str(((pytesseract.image_to_string(tempfile, lang='fra'))))
            text = text.replace("\n", " ") 
         
            # Finally, write the processed text to the file.
            f.write(text)

This code is designed to only work for 1 file. If you want it to work for multiple files you call it multiple times. You could cut and paste from my earlier post.

def pdf_files_to_text(folder, out_folder=None):
    """Using OCR convert all PDF files in folder to text files.  Text
    files are saved in out_folder (defaults to folder) using same name
    with extension changed to .txt
    """
    if out_folder is None:
        out_folder = folder
    for file in os.listdir(folder):
        if file.endswith('.pdf'):
            convert_pdf_to_text(folder+file, out_folder+file[:-3]+'txt')

This function should call convert_pdf_to_text() for each pdf file in "folder". It did contain an error in line 9 where it mistakenly looked for files ending with '.png' instead of '.pdf'. This would probably raise an exception because poppler would not know how to translate a .png file. This kind of problem would be immediately obvious to me. I'd look at the error message and wonder "Why am I passing a .png file to poppler?" This problem is not obvious to you because you do not understand the program you are trying to write. You don't understand the errors, not because you are stupid, but because you did not go through the steps of designing the code. Problem understanding happens during the design stage and you skipped immediately to coding. That makes debugging really hard.

The error in my function is partially on purpose. I don't have a convenient folder full of .pdf files to use to test my code, but I do have a folder full of png and jpg files. To test the pdf_files_to_text() function I modified things slightly:

def pdf_files_to_text(folder, out_folder=None):
    """Using OCR convert all PDF files in folder to text files.  Text
    files are saved in out_folder (defaults to folder) using same name
    with extension changed to .txt
    """
    if out_folder is None:
        out_folder = folder
    for file in os.listdir(folder):
        if file.endswith('.png'):
            print(folder+file, out_folder+file[:-3]+'txt')
            # convert_pdf_to_text(folder+file, out_folder+file[:-3]+'txt')

This printed out all the png files in the folder along with their associated txt files. I removed the print and uncommented the convert, but forgot to set the extension back to ".pdf". A simple silly error.

I suggest you do something similar with your code. If the conversion works for 1 file but throws an exception when trying to convert multiple files, break the problem in half to determine where the error occurs. If you comment out the call to convert_pdf_to_text() and the program still crashes, that is strong evidence that the error is not in the file translator but in the file loop. If the error goes away, and the code works fine for a single file, look at the arguments passed to convert_pdf_to_text(). That is how I debug code: I break the program into smaller and smaller pieces until I have the error isolated down to a few lines of code.
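
One way to look at the arguments is a small checker called in place of the converter. This is a sketch only; check_arguments is a hypothetical helper, and the three checks are just the ones that seem relevant to this thread.

```python
import os


def check_arguments(pdf_path, text_path):
    """Return a list of problems with the two paths (empty list: they look sane)."""
    problems = []
    # A folder, or a path glued together without a separator, fails here.
    if not os.path.isfile(pdf_path):
        problems.append(f"not an existing file: {pdf_path!r}")
    # Catches image or text files being handed to the PDF converter.
    if not pdf_path.lower().endswith(".pdf"):
        problems.append(f"not a .pdf name: {pdf_path!r}")
    # The text file can only be created inside a folder that exists.
    out_dir = os.path.dirname(text_path) or "."
    if not os.path.isdir(out_dir):
        problems.append(f"output folder missing: {out_dir!r}")
    return problems
```

Inside the file loop, print(check_arguments(pdf_path, text_path)) in place of the real conversion shows quickly whether the loop is building the paths you expect.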
