Jul-06-2021, 02:54 AM
In a previous job my gf needed to convert a lot of old exams that were image pdfs to text.
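I made this. It uses chi_sim (Simplified Chinese), not French, but it works well; for French, change lang to 'fra' (you need that language data installed for Tesseract). It probably needs tidying up a bit, I am not an expert, but it works!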
If you need to convert many pdf files, just put this in a big loop and read fileforocr from a list.
import os
from pdf2image import convert_from_path
from PIL import Image
import pytesseract

# set your source and destination and choose your PDF first
source = '/a/path/'
destination = '/another/path/'

# if you want to convert many PDFs, make loop here
# (the file is looked up inside source, so enter just the file name)
fileforocr = input('Enter the name of the pdf to ocr ... ')

# crack the PDF open
def splitPDF(aPDF, source, destination):
    print('Splitting the PDF to individual jpgs ... ')
    savename = os.path.splitext(aPDF)[0]
    # images is a list of PIL images, one per page
    images = convert_from_path(source + aPDF)
    for i, image in enumerate(images, start=1):
        # zero-pad the page number so the sorted file names stay in page order
        image.save(destination + savename + str(i).zfill(4) + '.jpg', 'JPEG')
    print('PDF split to .jpgs and all saved in: ' + destination + '\n\n')

splitPDF(fileforocr, source, destination)

# get the jpgs
jpgFiles = [f for f in os.listdir(destination) if f.endswith('.jpg')]
jpgFiles.sort()

saveFilepath = '/home/pedro/babystuff/ocr_textfiles/'
saveFilename = input('Enter a name to save the text to ... ')

# this works fine
with open(saveFilepath + saveFilename, 'a', encoding='utf-8') as bookCopy:
    for i, jpgFile in enumerate(jpgFiles):
        chiText1 = pytesseract.image_to_string(Image.open(destination + jpgFile), lang='chi_sim')
        print('Page ' + str(i + 1) + ' done')
        bookCopy.write(chiText1)
        print('Next loop coming up')
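
For the "big loop" over many PDFs, something like the following might do. It is only a rough sketch, not tested on real exams: the function names (list_pdfs, text_name, ocr_pdf, ocr_folder) are made up for illustration, and it assumes pdf2image and pytesseract are installed and that each PDF should end up as one .txt file with the same base name. It also skips saving the .jpgs and passes the page images straight to pytesseract.

```python
import os

def list_pdfs(source):
    """Return the names of the PDF files in source, sorted."""
    return sorted(f for f in os.listdir(source) if f.lower().endswith('.pdf'))

def text_name(pdf_name):
    """Name of the text file for a given PDF, e.g. exam1.pdf -> exam1.txt."""
    return os.path.splitext(pdf_name)[0] + '.txt'

def ocr_pdf(pdf_path, lang='chi_sim'):
    """OCR every page of one PDF and return the text as a single string."""
    # imported here so the helpers above work without the OCR libraries
    from pdf2image import convert_from_path
    import pytesseract
    pages = convert_from_path(pdf_path)  # list of PIL images, one per page
    return '\n'.join(pytesseract.image_to_string(page, lang=lang) for page in pages)

def ocr_folder(source, destination, lang='chi_sim'):
    """OCR all PDFs in source and write one text file per PDF to destination."""
    for pdf_name in list_pdfs(source):
        text = ocr_pdf(os.path.join(source, pdf_name), lang)
        out_path = os.path.join(destination, text_name(pdf_name))
        with open(out_path, 'w', encoding='utf-8') as out:
            out.write(text)
        print(pdf_name + ' done')
```

Then call ocr_folder('/a/path/', '/another/path/') with your own folders.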