Python Forum
Several pdf files to text
#7
This does nothing useful:
pdf_file="C:/Users/Desktop/path for getting the pdfs"
folder="C:/Users/Desktop/path to save the text files"
The pdf_file you define here is not the pdf_file argument used by pdf_to_text(), so assigning it has no effect on what the function receives. The same goes for folder, which is not the argument passed to pdf_files_to_text().
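A minimal sketch of why the assignment has no effect. The function name describe is made up for illustration and is not from the thread: a parameter is bound to whatever the caller passes, regardless of a same-named variable defined elsewhere in the file.

```python
# Module-level variable; its name happens to match the parameter below.
pdf_file = "C:/Users/Desktop/path for getting the pdfs"


def describe(pdf_file):
    # Inside the function, pdf_file is the argument supplied by the caller,
    # not the module-level variable defined above.
    return f"converting {pdf_file}"


result = describe("report.pdf")
print(result)
```

The module-level pdf_file is never read here; only the value given in the call reaches the function.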

As I said, I have not tested any of this code. I don't want to install poppler and I have no need for pytesseract. But if you do want to try running my code, you need to either run this file from inside the "C:/Users/Desktop/path for getting the pdfs" folder or change the last line to this:
    pdf_files_to_text("C:/Users/Desktop/path for getting the pdfs")
Why are you creating image files from images, only to open them back up to create images? This call already returns a list of images:
pages = convert_from_path(PDF_file, poppler_path=r'C:\Program Files\poppler-0.68.0_x86\poppler-0.68.0\bin')
Your code then goes to the trouble of creating a bunch of temporary files just to read the images back into memory. Why? Is there not a function where you can pass in the image directly?
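As far as I know, pytesseract.image_to_string() accepts a PIL image as well as a filename, so the pages returned by convert_from_path() could be passed in directly. A rough sketch, not tested against real PDFs: pages_to_text and its ocr parameter are names I made up so the loop can be tried without Tesseract installed.

```python
def pages_to_text(pages, ocr):
    """Run ocr on each page image in memory and join the results.

    pages: iterable of page images (e.g. the list from convert_from_path).
    ocr:   callable taking one image and returning its text. This is a
           hypothetical parameter; with pytesseract it could be
           lambda img: pytesseract.image_to_string(img, lang='fra')
    """
    chunks = []
    for page in pages:
        text = ocr(page)                        # no temporary file needed
        chunks.append(text.replace("\n", " "))  # flatten line breaks
    return "".join(chunks)


# Stand-in "pages" and OCR function so the sketch runs anywhere.
fake_pages = ["page one\n", "page two\n"]
print(pages_to_text(fake_pages, ocr=lambda img: img.upper()))
```

Passing the OCR step in as a callable is only there to keep the example runnable; in a real script you would call pytesseract directly inside the loop.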

If you do need the temporary files for some image-processing reason, I suggest converting one image at a time and reusing a single temporary file:
import pytesseract
from pdf2image import convert_from_path

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

PDF_file = "where I have the pdfs/Same-title.pdf"
outfile="where I want to save the text files/Same-title.txt"
tempfile = "where I can create a file for temporary use/tempfile.png"

pages = convert_from_path(PDF_file, poppler_path=r'C:\Program Files\poppler-0.68.0_x86\poppler-0.68.0\bin')
 
with open(outfile, "w", encoding="utf-8") as f:  # utf-8 so accented characters and ligatures can be written
    for page in pages:
        page.save(tempfile, 'PNG')

        # Recognize the text in the image as a string using pytesseract
        text = str(pytesseract.image_to_string(tempfile, lang='fra'))  # Why were there extra () here?
        text = text.replace("\n", " ") 
    
        # Finally, write the processed text to the file.
        f.write(text)
Now you don't have to worry about keeping track of page numbers.

The next step is to clean things up a bit and turn most of that code into a function you can call. I suggest passing in the filenames for the pdf file and the text file.
import pytesseract
from pdf2image import convert_from_path

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

PDF_file = "where I have the pdfs/Same-title.pdf"
outfile="where I want to save the text files/Same-title.txt"
tempfile = "where I can create a file for temporary use/tempfile.png"

def convert_pdf_to_text(pdf_file, text_file):
    pages = convert_from_path(pdf_file, poppler_path=r'C:\Program Files\poppler-0.68.0_x86\poppler-0.68.0\bin')
    
    with open(text_file, "w", encoding="utf-8") as f:  # utf-8 so accented characters and ligatures can be written
        for page in pages:
            page.save(tempfile, 'PNG')

            # Recognize the text in the image as a string using pytesseract
            text = str(pytesseract.image_to_string(tempfile, lang='fra'))
            text = text.replace("\n", " ") 
        
            # Finally, write the processed text to the file.
            f.write(text)

convert_pdf_to_text(PDF_file, outfile)
Now that you have a function that does all the conversion, all you need to do is call the function for each file in a folder.
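One way to do that, sketched with pathlib and untested on real PDFs. convert_folder and its convert parameter are hypothetical names; in practice convert would be the convert_pdf_to_text function. Each text file takes its name from the PDF it came from.

```python
from pathlib import Path


def convert_folder(pdf_folder, text_folder, convert):
    """Call convert(pdf_path, text_path) for every .pdf file in pdf_folder.

    Returns the names of the text files that were requested, in sorted order.
    """
    out_dir = Path(text_folder)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the output folder if missing
    written = []
    for pdf_path in sorted(Path(pdf_folder).glob("*.pdf")):
        # Same-title.pdf -> Same-title.txt in the output folder
        text_path = out_dir / (pdf_path.stem + ".txt")
        convert(str(pdf_path), str(text_path))
        written.append(text_path.name)
    return written
```

With the function from this post, the call would look something like convert_folder("where I have the pdfs", "where I want to save the text files", convert_pdf_to_text).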
Messages In This Thread
Several pdf files to text - by mfernandes - Jul-05-2021, 08:56 PM
RE: Several pdf files to text - by mfernandes - Jul-05-2021, 09:02 PM
RE: Several pdf files to text - by Pedroski55 - Jul-06-2021, 02:54 AM
RE: Several pdf files to text - by mfernandes - Jul-06-2021, 11:10 AM
RE: Several pdf files to text - by deanhystad - Jul-06-2021, 05:07 PM
RE: Several pdf files to text - by mfernandes - Jul-06-2021, 06:38 PM
RE: Several pdf files to text - by deanhystad - Jul-06-2021, 07:25 PM
RE: Several pdf files to text - by Pedroski55 - Jul-06-2021, 11:42 PM
RE: Several pdf files to text - by mfernandes - Jul-07-2021, 08:14 PM
RE: Several pdf files to text - by deanhystad - Jul-07-2021, 09:25 PM
RE: Several pdf files to text - by Pedroski55 - Jul-07-2021, 11:39 PM
