Several pdf files to text

**deanhystad** · (This post was last modified: Jul-06-2021, 07:25 PM by deanhystad.)

This does nothing useful:

pdf_file="C:/Users/Desktop/path for getting the pdfs"
folder="C:/Users/Desktop/path to save the text files"

The pdf_file you define here is not the pdf_file argument used by pdf_to_text(). Same goes for folder not being the argument passed to pdf_files_to_text().
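A tiny illustration of the point, using a made-up stand-in function (`pdf_to_text_stub` is hypothetical, not the real function from my earlier post; it only shows the scoping):

```python
# Module-level variable: assigning it has no effect on the function below.
pdf_file = "C:/Users/Desktop/path for getting the pdfs"

def pdf_to_text_stub(pdf_file):
    # Inside the function, pdf_file is the parameter; it hides the module-level name.
    return f"converting {pdf_file}"

# The function only sees what is passed in the call.
result = pdf_to_text_stub("report.pdf")
print(result)  # prints: converting report.pdf
```

The path assigned at the top never shows up in the result, because the function only uses the value it is called with.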

As I said, I have not tested any of this code. I don't want to install poppler and I have no need for pytesseract. But if you want to try running my code, you need to either run this file from the "C:/Users/Desktop/path for getting the pdfs" folder or change the last line to this:

    pdf_files_to_text("C:/Users/Desktop/path for getting the pdfs")

Why are you creating image files from images only to open them back up to create images? This creates a list of images.

pages = convert_from_path(PDF_file, poppler_path=r'C:\Program Files\poppler-0.68.0_x86\poppler-0.68.0\bin')

Your code then goes to the trouble of creating a bunch of temporary files just to read the images back into memory. Why? Is there not a function where you can pass in the image directly?
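As far as I know, pytesseract.image_to_string() accepts a PIL image object as well as a filename, and convert_from_path() returns PIL images, so the pages can be handed over directly. Here is an untested sketch of that idea. The function name `ocr_pages` and the `ocr` parameter are my own inventions; the `ocr` hook is only there so the loop can be tried without Tesseract installed.

```python
def ocr_pages(pages, ocr=None, lang="fra"):
    """Return the text of all pages as one string, with no temporary files."""
    if ocr is None:
        # Imported lazily so the loop can be exercised without Tesseract installed.
        import pytesseract

        def ocr(image):
            # Pass the in-memory image straight to pytesseract.
            return pytesseract.image_to_string(image, lang=lang)

    # Same newline handling as the original script, one page at a time.
    return "".join(ocr(page).replace("\n", " ") for page in pages)

# Intended use (assumes poppler and Tesseract are installed and configured):
#   pages = convert_from_path(PDF_file, poppler_path=...)
#   text = ocr_pages(pages)
```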

If you do need the image files on disk for some image processing reason, I suggest converting one image at a time and reusing a single temporary file.

import pytesseract
from pdf2image import convert_from_path

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

PDF_file = "where I have the pdfs/Same-title.pdf"
outfile = "where I want to save the text files/Same-title.txt"
tempfile = "where I can create a file for temporary use/tempfile.png"

# Convert every page of the pdf into an image.
pages = convert_from_path(PDF_file, poppler_path=r'C:\Program Files\poppler-0.68.0_x86\poppler-0.68.0\bin')

with open(outfile, "w") as f:
    for page in pages:
        # Reuse one temporary file; each page overwrites the previous one.
        page.save(tempfile, 'PNG')

        # Recognize the text in the image as a string using pytesseract
        text = str(pytesseract.image_to_string(tempfile, lang='fra'))  # Why were there extra () here?
        text = text.replace("\n", " ")

        # Finally, write the processed text to the file.
        f.write(text)

Now you don't have to worry about keeping track of page numbers.

The next step is to clean things up a bit and turn most of that code into a function you can call. I suggest passing in the filenames for the pdf file and the text file.

import pytesseract
from pdf2image import convert_from_path

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

PDF_file = "where I have the pdfs/Same-title.pdf"
outfile = "where I want to save the text files/Same-title.txt"
tempfile = "where I can create a file for temporary use/tempfile.png"

def convert_pdf_to_text(pdf_file, text_file):
    """Write the text recognized in pdf_file to text_file."""
    pages = convert_from_path(pdf_file, poppler_path=r'C:\Program Files\poppler-0.68.0_x86\poppler-0.68.0\bin')

    with open(text_file, "w") as f:
        for page in pages:
            # tempfile is the module-level path defined above; it is reused for every page.
            page.save(tempfile, 'PNG')

            # Recognize the text in the image as a string using pytesseract
            text = str(pytesseract.image_to_string(tempfile, lang='fra'))
            text = text.replace("\n", " ")

            # Finally, write the processed text to the file.
            f.write(text)

convert_pdf_to_text(PDF_file, outfile)

Now that you have a function that does all the conversion, all you need to do is call the function for each file in a folder.
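Something like this could do the looping. Again untested against real pdfs; `convert_folder` and its parameters are names I made up, and I pass the conversion function in as an argument so the loop does not care how the conversion is done.

```python
from pathlib import Path

def convert_folder(pdf_folder, text_folder, convert):
    """Call convert(pdf_path, text_path) for every .pdf file in pdf_folder."""
    text_folder = Path(text_folder)
    # Create the output folder if it does not exist yet.
    text_folder.mkdir(parents=True, exist_ok=True)

    converted = []
    for pdf_path in sorted(Path(pdf_folder).glob("*.pdf")):
        # Same-title.pdf becomes Same-title.txt in the output folder.
        text_path = text_folder / (pdf_path.stem + ".txt")
        convert(str(pdf_path), str(text_path))
        converted.append(text_path)
    return converted

# Intended use with the function defined above:
#   convert_folder("where I have the pdfs",
#                  "where I want to save the text files",
#                  convert_pdf_to_text)
```

Each text file gets the same title as its pdf, which matches the Same-title.pdf / Same-title.txt naming used above.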
