Several pdf files to text

Pedroski55 · (This post was last modified: Jul-07-2021, 04:15 AM by Pedroski55.)

Hi again, I just had some free time, so I tidied up my pdf to text program.

You just need to change the paths. I don't use Windows, so I am not too sure about the correct path format there.

Then you can paste the code below into your IDLE shell and enter myApp()

Works well for me! The girlfriend might need it again someday, have to keep her happy

def myApp():
    import os    
    import pdf2image
    from PIL import Image
    import pytesseract

    # set your paths

    source = '/home/pedro/babystuff/pdf2text/'
    destination_jpg = '/home/pedro/babystuff/pdf2jpg/'
    save_text_path = '/home/pedro/babystuff/ocr_textfiles/'

    # get the pdf files

    files = os.listdir(source)
    mypdfs = []

    # maybe there are some other files in there, so only get .pdf files
    for f in files:
        if f.lower().endswith('.pdf'):
            mypdfs.append(f)

    # get rid of the jpg files after reading them

    def junkjpgs(path):
        print('Clearing out the folders we use, in case there is anything in there ... ')
        pics = os.listdir(path)
        if len(pics) == 0:
            print('Nothing in ' + path + '\n\n')
            return
        for file in pics:
            os.remove(path + file)
        print('ALL files removed from: ' + path + '\n\n')

    # in case there are any old jpg files in the jpg folder
    junkjpgs(destination_jpg)

    # crack the PDF open

    def splitPDF(aPDF, source, destination):
        print('Splitting the PDF to individual jpgs ... ')
        # strip only the final extension, so file names containing dots stay intact
        savename = os.path.splitext(aPDF)[0]
        # images is a list
        images = pdf2image.convert_from_path(source + aPDF)
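        # zero-pad the page number so a plain text sort of the file names
        # keeps page order (otherwise page 10 sorts before page 2)
        for i, image in enumerate(images, start=1):
            image.save(destination + savename + str(i).zfill(4) + '.jpg', 'JPEG')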
        print('PDF split to .jpgs and all saved in: ' + destination + '\n\n')
        savetextname = savename + '.txt'
        return savetextname

    def convert2text(name):
        # get the jpgs
        jpgFiles = os.listdir(destination_jpg)
        jpgFiles.sort()
        # append mode: running this again on the same PDF adds to the existing text file
        # utf-8 so the Chinese text is written the same way on any system
        with open(save_text_path + name, 'a', encoding='utf-8') as this_text:
            for i in range(len(jpgFiles)):
                chiText1 = pytesseract.image_to_string(Image.open(destination_jpg + jpgFiles[i]), lang='chi_sim')
                print('Page ' + str(i + 1) + ' done')
                this_text.write(chiText1)
                print('Next loop coming up')
        print('removing the jpgs ... ')
        junkjpgs(destination_jpg)
        print('finished this PDF ... ')
        

    for f in mypdfs:
        text_name = splitPDF(f, source, destination_jpg)
        convert2text(text_name)
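
On the Windows path question: a minimal sketch, using only the standard library, of building the three folders with pathlib so the separator is handled for you on either system. The helpers make_paths and list_pdfs and the idea of a single base folder are made up for illustration; they are not part of the program above.

```python
from pathlib import Path

def make_paths(base):
    # base is a hypothetical root folder; swap in your own
    base = Path(base)
    return {
        'source': base / 'pdf2text',
        'destination_jpg': base / 'pdf2jpg',
        'save_text_path': base / 'ocr_textfiles',
    }

def list_pdfs(folder):
    # only real files whose extension is .pdf, ignoring upper/lower case
    return sorted(p for p in Path(folder).iterdir()
                  if p.is_file() and p.suffix.lower() == '.pdf')
```

With something like paths = make_paths('/home/pedro/babystuff'), a file inside a folder is then paths['source'] / 'myfile.pdf' rather than string addition. If a library insists on a plain string, wrapping the path in str() should do.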

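One thing to watch with jpgFiles.sort(): it sorts the names as text, so unpadded page numbers put page 10 ahead of page 2 once a PDF has ten or more pages, and the text would come out in the wrong order. A small sketch of the effect and two ways around it. The helper page_key and the file names are invented for the example.

```python
import re

def page_key(filename):
    # sort on the run of digits just before .jpg, read as a number
    match = re.search(r'(\d+)\.jpg$', filename)
    return int(match.group(1)) if match else 0

# eleven invented page names: scan1.jpg ... scan11.jpg
names = ['scan' + str(i) + '.jpg' for i in range(1, 12)]

text_order = sorted(names)                # plain text sort
page_order = sorted(names, key=page_key)  # numeric sort on the page number
padded = sorted('scan' + str(i).zfill(4) + '.jpg' for i in range(1, 12))

print(text_order[:3])   # ['scan1.jpg', 'scan10.jpg', 'scan11.jpg']
print(page_order[:3])   # ['scan1.jpg', 'scan2.jpg', 'scan3.jpg']
print(padded[:3])       # ['scan0001.jpg', 'scan0002.jpg', 'scan0003.jpg']
```

Either fix works: pad the numbers when saving, or sort with a key when reading. Padding is probably the simpler of the two, because the files then also show up in page order in a file manager.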