Oct-29-2022, 06:49 AM
Hi,
Not a coding problem, but:
I have inherited zillions of scanned documents (prayer cards and the like),
produced by a scanner that outputs them as TIFFs with maximum compression.
Both Photoshop and pytesseract report this compression type as "unsupported".
The TIF is viewable (e.g. in IrfanView) but not OCR-able.
So I have a workaround:
Open the TIF (with cv2 or PIL), save it as a JPG, read the JPG and OCR it. Works like a charm.
    from PIL import Image
    img = Image.open("xxx.tif")
    img.save("xxx.jpg")
    # ... now read the jpg, do the OCR etc. ... and finally delete the jpg again!
As I have zillions and zillions of these, each time I do this operation
it takes extra processing time, as opposed to reading a "good" scan that I can OCR immediately.
Question: can I somehow transform the TIF into a JPG in memory?
As far as I understand, the jpg is created by writing it to the disk.
I would save 1 write, 1 read and 1 delete, times a zillion, if I can somehow do it in memory.
thx,
Paul
Edit: I can also open the TIF, write it back to disk as a TIF, and then OCR it (the file is about 4 times larger). The question remains the same.
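
For what it's worth, here is a minimal sketch of what the in-memory route might look like, assuming Pillow can decode the TIF (as the workaround above suggests it can). The idea is to save into an io.BytesIO buffer instead of a file on disk. The helper name tiff_to_jpeg_in_memory and the quality value of 90 are placeholders for illustration, not anything taken from a library.

```python
import io

from PIL import Image


def tiff_to_jpeg_in_memory(src, quality=90):
    """Decode an image with Pillow and re-encode it as JPEG in a RAM buffer.

    `src` may be a file path or a binary file-like object.
    Returns a new PIL Image backed by the in-memory JPEG bytes.
    """
    with Image.open(src) as img:
        # Convert bilevel, palette or alpha images to a mode that is
        # safe to store as JPEG before encoding.
        if img.mode not in ("L", "RGB"):
            img = img.convert("L" if img.mode == "1" else "RGB")
        buf = io.BytesIO()
        # An explicit format is required because there is no file
        # extension for Pillow to infer it from.
        img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)  # rewind so the buffer can be read from the start
    return Image.open(buf)
```

The returned object is an ordinary PIL Image, so it could be handed to the OCR step without any temporary file of my own, e.g. pytesseract.image_to_string(tiff_to_jpeg_in_memory("xxx.tif")). Since pytesseract accepts PIL Image objects, it may even be enough to pass the image straight from Image.open and skip the JPEG re-encoding, but I have not verified that with these particular scans.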