Jul-06-2021, 07:25 PM
(This post was last modified: Jul-06-2021, 07:25 PM by deanhystad.)
This does nothing useful:
```python
pdf_file = "C:/Users/Desktop/path for getting the pdfs"
folder = "C:/Users/Desktop/path to save the text files"
```
The pdf_file you define here is not the pdf_file argument used by pdf_to_text(). The same goes for folder: it is not the argument passed to pdf_files_to_text().
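To show why those two assignments have no effect, here is a small sketch. The pdf_to_text() below is only a stand-in that returns a string, not the real conversion function, and "report.pdf" is a made-up filename.

```python
# A module-level (global) variable. Nothing ever passes it to the function.
pdf_file = "C:/Users/Desktop/path for getting the pdfs"


def pdf_to_text(pdf_file):
    # Inside the function, pdf_file is whatever the caller passed in.
    # It hides the global variable of the same name.
    return f"converting {pdf_file}"


# Hypothetical filename, used only for illustration.
result = pdf_to_text("report.pdf")
print(result)
```

The function only ever sees the value supplied in the call, so defining a global with the same name as the parameter changes nothing.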
As I said, I have not tested any of this code. I don't want to install poppler and I have no need for pytesseract. But if you want to try running my code, you need to either run this file from inside the "C:/Users/Desktop/path for getting the pdfs" folder or change the last line to this:
```python
pdf_files_to_text("C:/Users/Desktop/path for getting the pdfs")
```
Why are you creating image files from images only to open them back up to create images? This creates a list of images:
```python
pages = convert_from_path(PDF_file, poppler_path=r'C:\Program Files\poppler-0.68.0_x86\poppler-0.68.0\bin')
```
Your code then goes to the trouble of creating a bunch of temporary files just to read the images back into memory. Why? Is there not a function where you can pass in the image directly?
If you need to do so for some image processing reason I suggest translating one image at a time.
```python
import pytesseract
import sys
from pdf2image import convert_from_path
import os

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

PDF_file = "where I have the pdfs/Same-title.pdf"
outfile = "where I want to save the text files/Same-title.txt"
tempfile = "where I can create a file for temporary use/tempfile.png"

pages = convert_from_path(PDF_file, poppler_path=r'C:\Program Files\poppler-0.68.0_x86\poppler-0.68.0\bin')

with open(outfile, "w") as f:
    for page in pages:
        page.save(tempfile, 'PNG')
        # Recognize the text in the image as a string using pytesseract
        text = str(pytesseract.image_to_string(tempfile, lang='fra'))  # Why were there extra () here?
        text = text.replace("\n", " ")
        # Finally, write the processed text to the file.
        f.write(text)
```
Now you don't have to worry about keeping track of page numbers.
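On the question of passing the image in directly: as far as I know, pytesseract.image_to_string() accepts a PIL image as well as a filename, and convert_from_path() returns PIL images, so the temporary file may not be needed at all. Here is a rough sketch of that idea, not tested against Tesseract. The names pages_to_text and ocr_french are mine, not part of either library. The OCR call is passed in as a parameter so the loop can be tried without Tesseract installed.

```python
def pages_to_text(pages, text_file, ocr):
    """Write the OCR text of each page to text_file, one page at a time.

    pages: iterable of page images (for example the list from convert_from_path)
    ocr:   callable that takes one page and returns its text as a string
    """
    with open(text_file, "w", encoding="utf-8") as f:
        for page in pages:
            # No temporary PNG: the page object goes straight to the OCR callable.
            text = ocr(page).replace("\n", " ")
            f.write(text)


def ocr_french(page):
    """Hypothetical helper: OCR one PIL image as French text."""
    # Imported here so pages_to_text() can be used without pytesseract installed.
    import pytesseract
    return pytesseract.image_to_string(page, lang="fra")
```

With both libraries installed, the call would look something like pages_to_text(convert_from_path(PDF_file, poppler_path=...), outfile, ocr_french), using the same PDF_file and outfile as above.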
The next step is to clean things up a bit and turn most of that code into a function you can call. I suggest passing in the filenames for the pdf file and the text file.
```python
from PIL import Image
import pytesseract
import sys
from pdf2image import convert_from_path
import os

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

PDF_file = "where I have the pdfs/Same-title.pdf"
outfile = "where I want to save the text files/Same-title.txt"
tempfile = "where I can create a file for temporary use/tempfile.png"

def convert_pdf_to_text(pdf_file, text_file):
    pages = convert_from_path(pdf_file, poppler_path=r'C:\Program Files\poppler-0.68.0_x86\poppler-0.68.0\bin')
    with open(text_file, "w") as f:
        for page in pages:
            page.save(tempfile, 'PNG')
            # Recognize the text in the image as a string using pytesseract
            text = str(pytesseract.image_to_string(tempfile, lang='fra'))
            text = text.replace("\n", " ")
            # Finally, write the processed text to the file.
            f.write(text)

convert_pdf_to_text(PDF_file, outfile)
```
Now that you have a function that does all the conversion, all you need to do is call the function for each file in a folder.
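Something along these lines might do for the folder loop. Like the rest, this is a sketch I have not run against real PDFs. The signature is my own guess at what pdf_files_to_text() could look like: the convert and out_folder parameters are assumptions, with convert standing in for a function such as convert_pdf_to_text(pdf_file, text_file).

```python
from pathlib import Path


def pdf_files_to_text(folder, convert, out_folder=None):
    """Call convert(pdf_path, text_path) for every .pdf file in folder.

    Text files go to out_folder, or next to the PDFs if out_folder is None.
    Returns the list of text file paths that were requested.
    """
    folder = Path(folder)
    out_folder = Path(out_folder) if out_folder else folder
    converted = []
    # sorted() only makes the processing order predictable.
    for pdf_path in sorted(folder.glob("*.pdf")):
        # Same base name as the PDF, with a .txt extension.
        text_path = out_folder / (pdf_path.stem + ".txt")
        convert(str(pdf_path), str(text_path))
        converted.append(text_path)
    return converted
```

A call would then be pdf_files_to_text("C:/Users/Desktop/path for getting the pdfs", convert_pdf_to_text). Passing the folder in as an argument means it no longer matters which folder you run the file from.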