Several pdf files to text

Pedroski55 · Jul-06-2021, 02:54 AM

In a previous job my girlfriend needed to convert a lot of old exams from image-only PDFs to text.

I wrote this script for that. It uses Tesseract's chi_sim (Simplified Chinese) language data, not French, so change the lang argument to suit your documents (fra for French). It probably needs tidying up a bit, I am not an expert, but it works well!

If you need to convert many PDF files, just put this in a big loop and read fileforocr from a list.

import os
import pdf2image
from PIL import Image
import pytesseract

# set your source and destination folders first (both must end with a slash)
source = '/a/path/'
destination = '/another/path/'

# if you want to convert many PDFs, make the loop here
fileforocr = input('Enter the file name of the PDF in the source folder to OCR ... ')

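# crack the PDF open: save each page as a numbered .jpg in destination

def splitPDF(aPDF, source, destination):
    print('Splitting the PDF into individual jpgs ... ')
    # drop the .pdf extension to get the base name for the page images
    savename = os.path.splitext(aPDF)[0]
    # images is a list of PIL images, one per page
    images = pdf2image.convert_from_path(source + aPDF)
    for i, image in enumerate(images, start=1):
        # zero-pad the page number so a plain sort keeps the pages in order
        image.save(destination + savename + str(i).zfill(4) + '.jpg', 'JPEG')
    print('PDF split to .jpgs and all saved in: ' + destination + '\n\n')

splitPDF(fileforocr, source, destination)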

# get the page jpgs saved in destination
jpgFiles = [f for f in os.listdir(destination) if f.endswith('.jpg')]
jpgFiles.sort()

saveFilepath = '/home/pedro/babystuff/ocr_textfiles/'
saveFilename = input('Enter a name to save the text to ... ')

# OCR each page in order and append the text to one file
with open(saveFilepath + saveFilename, 'a', encoding='utf-8') as bookCopy:
    for i, jpgFile in enumerate(jpgFiles):
        chiText1 = pytesseract.image_to_string(Image.open(destination + jpgFile), lang='chi_sim')
        print('Page ' + str(i + 1) + ' done')
        bookCopy.write(chiText1)
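
For the "big loop" over many PDFs mentioned at the top, here is a rough sketch of how it could look. It is not the script I actually ran: the function names find_pdfs, ocr_pdf and convert_folder are made up for this example, it passes the page images straight to pytesseract instead of saving jpgs first, and it assumes pdf2image, pytesseract and the chi_sim language data are installed.

```python
import os


def find_pdfs(folder):
    """Return the PDF file names in folder, sorted alphabetically."""
    return sorted(name for name in os.listdir(folder)
                  if name.lower().endswith('.pdf'))


def ocr_pdf(pdf_path, lang='chi_sim'):
    """OCR one PDF and return the text of all its pages as one string."""
    # imported here so the other helpers work without the OCR packages
    import pdf2image
    import pytesseract
    # convert_from_path returns a list of PIL images, one per page
    pages = pdf2image.convert_from_path(pdf_path)
    return '\n'.join(pytesseract.image_to_string(page, lang=lang)
                     for page in pages)


def convert_folder(source, text_folder, ocr=ocr_pdf):
    """OCR every PDF in source and write one .txt file per PDF."""
    os.makedirs(text_folder, exist_ok=True)
    written = []
    for name in find_pdfs(source):
        text = ocr(os.path.join(source, name))
        out_path = os.path.join(text_folder,
                                os.path.splitext(name)[0] + '.txt')
        with open(out_path, 'w', encoding='utf-8') as textfile:
            textfile.write(text)
        written.append(out_path)
    return written
```

You would call it with something like convert_folder('/a/path/', '/home/pedro/babystuff/ocr_textfiles/'). The ocr parameter is only there so you can swap in another function, for example one that uses a different lang.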
