Python Forum
Several pdf files to text
#11
Oh, that's a shame. It must be a Windows thing.

I tried it by putting 3 old exam pdfs in the source directory and letting it run.

I get good results.

Here is the same thing as a program. It runs perfectly for me in a bash terminal.

#! /usr/bin/python3

import os    
import pdf2image
from PIL import Image
import pytesseract


# set your paths

source = '/home/pedro/babystuff/pdf2text/'
destination_jpg = '/home/pedro/babystuff/pdf2jpg/'
save_text_path = '/home/pedro/babystuff/ocr_textfiles/'

# get rid of the jpg files after reading them

def junkjpgs(path):
    print('Clearing out the folders we use, in case there is anything in there ... ')
    pics = os.listdir(path)
    if len(pics) == 0:
        print('Nothing in ' + path + '\n\n')
        return
    for file in pics:
        # os.path.join works whether or not path ends with a slash
        os.remove(os.path.join(path, file))
    print('ALL files removed from: ' + path + '\n\n')

# crack the PDF open

def splitPDF(aPDF, source, destination):
    print('Splitting the PDF to individual jpgs ... ')
    # splitext keeps any dots inside the name, e.g. 'exam.2019.pdf' -> 'exam.2019'
    savename = os.path.splitext(aPDF)[0]
    # images is a list of PIL images, one per page
    images = pdf2image.convert_from_path(source + aPDF)
    for i, image in enumerate(images, start=1):
        # zero-pad the page number so a sort by name keeps page 10 after page 2
        image.save(destination + savename + str(i).zfill(4) + '.jpg', 'JPEG')
    print('PDF split to .jpgs and all saved in: ' + destination + '\n\n')
    savetextname = savename + '.txt'
    return savetextname

def convert2text(name):
    # get the jpgs, sorted by name so the pages stay in order
    jpgFiles = sorted(os.listdir(destination_jpg))
    # append mode: running the script again adds to an existing text file
    # utf-8 so the Chinese text is written the same way on every platform
    with open(save_text_path + name, 'a', encoding='utf-8') as this_text:
        for i, jpg in enumerate(jpgFiles, start=1):
            # lang='chi_sim' needs the Simplified Chinese Tesseract data installed
            chiText1 = pytesseract.image_to_string(Image.open(destination_jpg + jpg), lang='chi_sim')
            this_text.write(chiText1)
            print('Page ' + str(i) + ' done')
    print('removing the jpgs ... ')
    junkjpgs(destination_jpg)
    print('finished this PDF ... ')
    
if __name__ == '__main__':

    # in case there are any old jpgs in the jpg folder
    junkjpgs(destination_jpg)

    # get the pdf files

    files = os.listdir(source)
    mypdfs = []

    # maybe there are some other files in there

    for f in files:
        if f.lower().endswith('.pdf'):
            mypdfs.append(f)

    # ocr the jpgs

    for f in mypdfs:
        text_name = splitPDF(f, source, destination_jpg)
        convert2text(text_name)
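
One thing to watch in the script above: the page images are named with a running number and then sorted by name, and a plain string sort puts page 10 before page 2 unless the numbers are zero-padded. Here is a minimal sketch of a numeric-aware sort key that avoids this, plus a way to build the text file name that survives extra dots in the PDF name. The helper names natural_key and text_name_for are my own, made up for illustration; they are not part of pdf2image or pytesseract.

```python
import re
from pathlib import Path


def natural_key(filename):
    """Split a name into text and integer chunks so 'exam10' sorts after 'exam2'."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r'(\d+)', filename)]


def text_name_for(pdf_name):
    """Hypothetical helper: swap the final extension for '.txt', keeping inner dots."""
    return Path(pdf_name).stem + '.txt'


if __name__ == '__main__':
    pages = ['exam10.jpg', 'exam2.jpg', 'exam1.jpg']
    print(sorted(pages))                   # plain string sort
    print(sorted(pages, key=natural_key))  # numeric-aware sort
    print(text_name_for('exam.2019.pdf'))
```

To use it in the script, you could call jpgFiles.sort(key=natural_key) in convert2text instead of the plain sort. This only matters once a PDF has 10 or more pages.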