Python Forum
Several pdf files to text
#8
Hi again. I had some free time, so I tidied up my PDF-to-text program.

You just need to change the paths. I don't use Windows, so I am not sure about the correct path format there.

Then you can paste this into your IDLE shell and enter myApp().

Works well for me! The girlfriend might need it again someday, and I have to keep her happy.

def myApp():
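    import os
    # pdf2image needs poppler installed on the system
    import pdf2image
    from PIL import Image
    # pytesseract needs the Tesseract program plus the chi_sim language data
    import pytesseract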

    # set your paths

    source = '/home/pedro/babystuff/pdf2text/'
    destination_jpg = '/home/pedro/babystuff/pdf2jpg/'
    save_text_path = '/home/pedro/babystuff/ocr_textfiles/'

    # get the pdf files

    files = os.listdir(source)
    mypdfs = []

    # maybe there are some other files in there, so only get .pdf files
    for f in files:
        # lower() so that 'scan.PDF' is picked up as well
        if f.lower().endswith('.pdf'):
            mypdfs.append(f)

    # get rid of the jpg files after reading them

    def junkjpgs(path):
        print('Clearing out the folders we use, in case there is anything in there ... ')
        pics = os.listdir(path)
        if len(pics) == 0:
            print('Nothing in ' + path + '\n\n')
            return
        for file in pics:
            # skip subfolders, os.remove() only works on files
            if os.path.isfile(path + file):
                os.remove(path + file)
        print('ALL files removed from: ' + path + '\n\n')

    # in case there are any old jpg files in the jpg folder
    junkjpgs(destination_jpg)

    # crack the PDF open

    def splitPDF(aPDF, source, destination):
        print('Splitting the PDF to individual jpgs ... ')
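        # splitext copes with extra dots in the file name, e.g. 'my.scan.pdf'
        savename = os.path.splitext(aPDF)[0]
        # images is a list of PIL images, one per page
        images = pdf2image.convert_from_path(source + aPDF)
        for i, image in enumerate(images, start=1):
            # zero-pad the page number so sort() keeps page 10 after page 9
            image.save(destination + savename + str(i).zfill(4) + '.jpg', 'JPEG')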
        print('PDF split to .jpgs and all saved in: ' + destination + '\n\n')
        savetextname = savename + '.txt'
        return savetextname

    def convert2text(name):
        # get the jpgs
        jpgFiles = os.listdir(destination_jpg)
        jpgFiles.sort()
        # utf-8 so the Chinese text is written the same on every platform
        with open(save_text_path + name, 'a', encoding='utf-8') as this_text:
            for i, jpg in enumerate(jpgFiles, start=1):
                chiText1 = pytesseract.image_to_string(Image.open(destination_jpg + jpg), lang='chi_sim')
                this_text.write(chiText1)
                print('Page ' + str(i) + ' done')
        print('removing the jpgs ... ')
        junkjpgs(destination_jpg)
        print('finished this PDF ... ')
        

    for f in mypdfs:
        text_name = splitPDF(f, source, destination_jpg)
        convert2text(text_name)
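
On the Windows path question: the program above glues folder and file name together with +, which only works if every path ends in a slash. A rough sketch of one way around that is pathlib, which picks the right separator for the platform. Note that make_paths and list_pdfs are made-up helper names for illustration, not part of the program above, and the three folder names are just the ones from my setup.

```python
from pathlib import Path


def make_paths(base):
    """Build the three working folders under one base folder (hypothetical helper)."""
    base = Path(base)
    paths = {
        'source': base / 'pdf2text',
        'destination_jpg': base / 'pdf2jpg',
        'save_text_path': base / 'ocr_textfiles',
    }
    # create any folder that does not exist yet
    for folder in paths.values():
        folder.mkdir(parents=True, exist_ok=True)
    return paths


def list_pdfs(folder):
    """Return only the .pdf files in folder, sorted by name (hypothetical helper)."""
    return sorted(p for p in Path(folder).iterdir() if p.suffix.lower() == '.pdf')
```

On Windows you would pass the base folder as a raw string, something like make_paths(r'C:\Users\you\babystuff') (that folder is only an example), so the backslashes are not read as escape characters. If a library wants a plain string rather than a Path, wrap the path in str().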


Messages In This Thread
Several pdf files to text - by mfernandes - Jul-05-2021, 08:56 PM
RE: Several pdf files to text - by mfernandes - Jul-05-2021, 09:02 PM
RE: Several pdf files to text - by Pedroski55 - Jul-06-2021, 02:54 AM
RE: Several pdf files to text - by mfernandes - Jul-06-2021, 11:10 AM
RE: Several pdf files to text - by deanhystad - Jul-06-2021, 05:07 PM
RE: Several pdf files to text - by mfernandes - Jul-06-2021, 06:38 PM
RE: Several pdf files to text - by deanhystad - Jul-06-2021, 07:25 PM
RE: Several pdf files to text - by Pedroski55 - Jul-06-2021, 11:42 PM
RE: Several pdf files to text - by mfernandes - Jul-07-2021, 08:14 PM
RE: Several pdf files to text - by deanhystad - Jul-07-2021, 09:25 PM
RE: Several pdf files to text - by Pedroski55 - Jul-07-2021, 11:39 PM
