Feb-16-2022, 06:28 PM
Hi,
I am working on a project to OCR text from TIFF images. The code below works fine on individual images, but I am looking for a way to process images in batch from their respective subfolders and write the OCR output in hOCR format.
Example:
There are several subfolders on the D drive containing TIFF images. Each image needs to be passed through OCR one by one, with the output written to the E drive in the same directory tree as on the D drive.
D:\subfolder\Subfolder1\tiff image to E:\subfolder\Subfolder1\hOCR file
Please suggest how to tweak the code to achieve this.
My code:
from PIL import Image
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"
image = Image.open(r"C:\Users\multipage.tiff")
config = "--oem 3 --psm 6"
txt = ''
for frame in range(image.n_frames):
    image.seek(frame)
    txt += pytesseract.image_to_string(image, config=config, lang='eng') + '\n'
print(txt)
with open(r"C:\Users\multipage_output.txt", mode='w') as f:
    f.write(txt)

Thanks!
Joe
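
One possible way to approach the batch requirement described above is sketched below. It is a minimal sketch under a few assumptions, not a definitive implementation: the function names `ocr_frames_to_hocr` and `batch_ocr` are hypothetical, each frame of a multipage TIFF is written to its own file with an assumed `_pageNNN` suffix, and `pytesseract.image_to_pdf_or_hocr` is used in place of `image_to_string` so that the output is hOCR rather than plain text. The folder walk uses `pathlib` and rebuilds each file's relative path under the output root.

```python
from pathlib import Path


def ocr_frames_to_hocr(tiff_path, lang="eng", config="--oem 3 --psm 6"):
    """Return one hOCR document (bytes) per frame of a TIFF file."""
    # Imported lazily so the folder-walking logic can be tried without Tesseract.
    from PIL import Image
    import pytesseract

    pages = []
    with Image.open(tiff_path) as image:
        for frame in range(getattr(image, "n_frames", 1)):
            image.seek(frame)
            pages.append(
                pytesseract.image_to_pdf_or_hocr(
                    image, extension="hocr", lang=lang, config=config
                )
            )
    return pages


def batch_ocr(src_root, dst_root, ocr=ocr_frames_to_hocr):
    """OCR every TIFF under src_root, mirroring the folder tree under dst_root."""
    src_root, dst_root = Path(src_root), Path(dst_root)
    written = []
    for tiff_path in sorted(src_root.rglob("*")):
        if not tiff_path.is_file():
            continue
        if tiff_path.suffix.lower() not in {".tif", ".tiff"}:
            continue
        # Same relative location as the source file, but rooted at dst_root.
        out_dir = dst_root / tiff_path.parent.relative_to(src_root)
        out_dir.mkdir(parents=True, exist_ok=True)
        pages = ocr(tiff_path)
        for number, hocr in enumerate(pages, start=1):
            # Assumed naming: single-frame files keep their stem,
            # multipage files get a _pageNNN suffix per frame.
            suffix = "" if len(pages) == 1 else f"_page{number:03d}"
            out_file = out_dir / f"{tiff_path.stem}{suffix}.hocr"
            out_file.write_bytes(hocr)
            written.append(out_file)
    return written
```

With the folders from the example, this would be called as `batch_ocr(r"D:\subfolder", r"E:\subfolder")` after setting `pytesseract.pytesseract.tesseract_cmd` as in the original script. The `ocr` parameter is there only so the folder mirroring can be checked with a stand-in function before running Tesseract over the whole drive.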