Python Forum
Several pdf files to text
#3
In a previous job, my girlfriend needed to convert a lot of old exams from image PDFs to text.

I made this. It uses chi_sim (Simplified Chinese), not French, so for French pass lang='fra' instead and make sure the French language data is installed for Tesseract. It probably needs tidying up a bit, I am not an expert, but it works!

If you need to convert many PDF files, just put this in a big loop and read fileforocr from a list.

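import os
import pdf2image  # pip install pdf2image; also needs poppler installed
from PIL import Image
import pytesseract  # needs the tesseract program and the chi_sim language data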

# set your source and destination and choose your PDF first
source = '/a/path/'
destination = '/another/path/'

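# if you want to convert many PDFs, make the loop here
# enter just the file name: it gets joined to source below
fileforocr = input('Enter the name of the PDF to OCR ... ')

# crack the PDF open
def splitPDF(aPDF, source, destination):
    print('Splitting the PDF into individual jpgs ... ')
    # file name without the .pdf extension
    savename = os.path.splitext(aPDF)[0]
    # images is a list of PIL images, one per page
    images = pdf2image.convert_from_path(os.path.join(source, aPDF))
    for i, image in enumerate(images, start=1):
        # zero-pad the page number so page 10 sorts after page 2
        image.save(os.path.join(destination, savename + str(i).zfill(3) + '.jpg'), 'JPEG')
    print('PDF split to .jpgs and all saved in: ' + destination + '\n\n')

# the function has to be called, or there are no jpgs to read below
splitPDF(fileforocr, source, destination)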

# get the jpgs (they were saved in destination)
jpgFiles = [f for f in os.listdir(destination) if f.endswith('.jpg')]
jpgFiles.sort()

saveFilepath = '/home/pedro/babystuff/ocr_textfiles/'
saveFilename = input('Enter a name to save the text to ... ')

# 'a' appends, so running again with the same name adds to the old text
with open(saveFilepath + saveFilename, 'a', encoding='utf-8') as bookCopy:
    for i, jpgFile in enumerate(jpgFiles, start=1):
        chiText1 = pytesseract.image_to_string(Image.open(os.path.join(destination, jpgFile)), lang='chi_sim')
        bookCopy.write(chiText1)
        print('Page ' + str(i) + ' done')
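
For the "big loop" over many PDFs, here is a rough sketch of how it could look. It is not tested on real exams, so treat it as a starting point. The names list_pdfs, ocr_pdf and convert_folder are made up for this example, and it assumes pdf2image (with poppler) and pytesseract (with the chi_sim data) are installed. It passes the page images straight to pytesseract instead of saving jpgs first, so the pages stay in order without any sorting.

```python
import os


def list_pdfs(folder):
    """Return the PDF file names in folder, sorted alphabetically."""
    return sorted(f for f in os.listdir(folder) if f.lower().endswith('.pdf'))


def ocr_pdf(pdf_path, lang='chi_sim'):
    """OCR one PDF and return its text, page after page."""
    # Imported here so the other helpers work without the OCR stack installed.
    import pdf2image
    import pytesseract
    pages = pdf2image.convert_from_path(pdf_path)  # list of PIL images
    return '\n'.join(pytesseract.image_to_string(page, lang=lang) for page in pages)


def convert_folder(source, text_folder, ocr=ocr_pdf):
    """Write one .txt file per PDF found in source; return the paths written."""
    os.makedirs(text_folder, exist_ok=True)
    written = []
    for name in list_pdfs(source):
        text = ocr(os.path.join(source, name))
        out_path = os.path.join(text_folder, os.path.splitext(name)[0] + '.txt')
        with open(out_path, 'w', encoding='utf-8') as f:
            f.write(text)
        written.append(out_path)
    return written
```

You would call it like convert_folder('/a/path/', '/another/path/'). The ocr argument is only there so you can swap in another function, for example one that uses lang='fra' for French.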


Messages In This Thread
Several pdf files to text - by mfernandes - Jul-05-2021, 08:56 PM
RE: Several pdf files to text - by mfernandes - Jul-05-2021, 09:02 PM
RE: Several pdf files to text - by Pedroski55 - Jul-06-2021, 02:54 AM
RE: Several pdf files to text - by mfernandes - Jul-06-2021, 11:10 AM
RE: Several pdf files to text - by deanhystad - Jul-06-2021, 05:07 PM
RE: Several pdf files to text - by mfernandes - Jul-06-2021, 06:38 PM
RE: Several pdf files to text - by deanhystad - Jul-06-2021, 07:25 PM
RE: Several pdf files to text - by Pedroski55 - Jul-06-2021, 11:42 PM
RE: Several pdf files to text - by mfernandes - Jul-07-2021, 08:14 PM
RE: Several pdf files to text - by deanhystad - Jul-07-2021, 09:25 PM
RE: Several pdf files to text - by Pedroski55 - Jul-07-2021, 11:39 PM
