Python Forum

OCR again
Hi,
Not a coding problem, but:
I have inherited zillions of scanned documents (prayer cards & the like),
produced by a scanner that outputs them as tiffs, with maximum compression.
Both Photoshop and pytesseract say this compression type is "unsupported".
The tif is viewable (e.g. in IrfanView) but not OCR-able.
So I have a workaround:
Open the tif (with cv2 or PIL), save it as a jpg, read the jpg and ocr it. Works like a charm.
from PIL import Image

img = Image.open("xxx.tif")
img.save("xxx.jpg")
# ... now read the jpg, do the OCR etc. ... and finally delete the jpg again!
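A minimal sketch of that workaround end to end, assuming pytesseract for the OCR step; the function name and the use of tempfile are illustrative, not the original code:

import os
import tempfile
from PIL import Image
import pytesseract

def ocr_via_temp_jpg(tif_path):
    # PIL can decode the "exotic" tif, so re-encode it as a temporary jpg
    with Image.open(tif_path) as img:
        fd, jpg_path = tempfile.mkstemp(suffix=".jpg")
        os.close(fd)
        img.save(jpg_path, format="JPEG")
    try:
        # OCR the temporary jpg (tesseract reads plain jpgs fine)...
        text = pytesseract.image_to_string(jpg_path)
    finally:
        # ...and finally delete it again
        os.remove(jpg_path)
    return text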
As I have zillions and zillions of these, each time I do this operation,
it takes extra processing time, as opposed to reading a "good" scan, which I can OCR immediately.
Question: can I somehow transform the TIF into a JPG in memory?
As far as I understand, the jpg is created by writing it to the disk.
I would economise 1 write and 1 read and 1 delete, times a zillion, if I somehow can do it in memory.
thx,
Paul

Edit: I can also open the TIF, write it to disk as a TIF, and I can OCR that (file size about 4 times larger). The question remains the same.
Can't you convert all the tifs into jpg outside of the program by invoking e.g. Imagemagick's convert command?

Another trick would be to use a ramdisk as the target device.
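A hedged sketch of that outside-the-program route from Python, assuming ImageMagick's convert binary is on the PATH (folder names are placeholders):

import subprocess
from pathlib import Path

src_dir = Path("scans")        # placeholder folder with the compressed tifs
dst_dir = Path("converted")    # placeholder output folder
dst_dir.mkdir(exist_ok=True)

for tif in src_dir.glob("*.tif"):
    jpg = dst_dir / (tif.stem + ".jpg")
    # "convert" is the ImageMagick 6 command; ImageMagick 7 also accepts "magick"
    subprocess.run(["convert", str(tif), str(jpg)], check=True)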
(Oct-29-2022, 08:16 AM)Gribouillis Wrote: Can't you convert all the tifs into jpg outside of the program by invoking e.g. Imagemagick's convert command?
Another trick would be to use a ramdisk as the target device.
@Gribouillis:
1) Yes, I can convert all the tiffs outside the program: tried that (using batch conversion in XnView, or even Python itself ...).
But: say the extremely compressed tiffs take 1 terabyte, anything converted takes 4 TB. That is only temporary, and
it also takes a very long time to do. Not practical.
2) Ramdrive: not sure how to do that, will look it up. Thanks.

I've been looking at this for some time, and as you read about OCR and pytesseract,
most of the examples use cv2 to open images.
a) It is faster than PIL (written in C?)
b) It has more features
c) the Image feature of PIL conflicts with tkinter if you're not careful.

BUT: as I discovered in the past hour, PIL seems to have one thing over cv2: it can (apparently) read more types of tif compression.

What I have to do now is test whether the slower PIL, which can read the "exotic" compressions, would be faster or slower than e.g. a ramdrive
approach with cv2.
Paul
(Oct-29-2022, 08:39 AM)DPaul Wrote: Ramdrive: not sure how to do that, will look it up. Thanks.
If your OS is Linux, I can help you do that.
(Oct-29-2022, 08:39 AM)DPaul Wrote: the extreme compressed tiffs take 1 terabyte, anything converted takes 4 TB
In ImageMagick's conversion, you can pass a "quality" argument which reduces the size of the converted image, e.g.
convert -quality 20 spam.tif spam.jpg
Also there are Python bindings to ImageMagick, such as wand, that can save the converted image directly to a file-like object, which means in memory. Perhaps PIL can do the same?
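A minimal sketch of that wand route, writing the jpg to a BytesIO object instead of disk (file name and quality are placeholders; wand needs ImageMagick installed underneath):

import io
from wand.image import Image as WandImage

buf = io.BytesIO()
with WandImage(filename="spam.tif") as img:
    img.format = "jpeg"
    img.compression_quality = 20   # same idea as convert -quality 20
    img.save(file=buf)             # written to the BytesIO object, not to disk
jpg_bytes = buf.getvalue()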
@Gribouillis:
ImageMagick: never used it. As a photographer I use Adobe's products mostly (Lightroom, PS, PSE...) + IrfanView, XnView...
But I can try ImageMagick, on two conditions:
a) it can open the extremely compressed tifs (Adobe can't)
b) it converts fast enough to be a viable option.

Linux: we talked about that in another post. Sad

Let's try ImageMagick (what a name!) first.
Paul
Edit: ImageMagick reads the compressed tiff, but with a warning. It senses that something is not quite right
with the compression (TIFF warning 950). But it does open the document and displays the content.
Saving as jpg doubles the size; saving as tif = size x 10!
If it is possible to do this "ramdrive" on Windows, I'll pursue that option rather than ImageMagick.
Paul
It seems that with PIL you can save images to BytesIO directly. A search engine yields results such as this one
import io
from PIL import Image

im = Image.open('test.jpg')
im_resize = im.resize((500, 500))
buf = io.BytesIO()
im_resize.save(buf, format='JPEG')
byte_im = buf.getvalue()
Same with opencv
import cv2

im = cv2.imread('test.jpg')
im_resize = cv2.resize(im, (500, 500))

is_success, im_buf_arr = cv2.imencode(".jpg", im_resize)
byte_im = im_buf_arr.tobytes()

# or using BytesIO
# io_buf = io.BytesIO(im_buf_arr)
# byte_im = io_buf.getvalue()
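Either way the jpg only ever exists as bytes. A hedged sketch of handing that to the OCR step, assuming pytesseract (which accepts a PIL Image object as well as a file path):

import io
from PIL import Image
import pytesseract

# byte_im is the in-memory jpg produced by either snippet above
text = pytesseract.image_to_string(Image.open(io.BytesIO(byte_im)))

In fact, since pytesseract accepts a PIL Image directly, the explicit jpg encoding may not be needed at all once PIL has managed to open the tif.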
So I'm not sure you need a ramdisk after all...
Ok, let's give that a try, anything to shave a few tenths of a second off the OCR process.
But later on. Manchester City plays this afternoon.
Always observe the difference between "urgent" and "important". Cool
Thanks again for all the help,
I'll report the results.
Paul
Conclusion:
1) Man City won, and they did not even need Haaland.
2) Although I do not fully understand what is going on, I tested both routines on a "compressed tif".
To clarify things, I wrapped them in a try ... except Exception as e ...

a) The PIL version went through without any message; we know it can handle the tifs.
b) The opencv version crashed:
OpenCV(4.6.0) D:\a\opencv-python\opencv-python\opencv\modules\imgproc\src\resize.cpp:4052: error: (-215:Assertion failed) !ssize.empty() in function 'cv::resize'
What to do ?
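For what it's worth, that assertion usually means cv2.imread() could not decode the file and returned None (it does not raise), so cv2.resize() was handed an empty image. A small guard makes the failure explicit (the path is a placeholder):

import cv2

im = cv2.imread("xxx.tif")
if im is None:
    # cv2 could not decode this compression type; fall back to the PIL route here
    raise ValueError("cv2.imread() could not decode the tif")
im_resize = cv2.resize(im, (500, 500))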
They say that opencv is faster than PIL. If so, is it significant for a zillion documents?
There is more to do than "open": also crop, resize, etc., depending on the case.
A small batch of jpgs done with cv2 yields 1.945 seconds / document (on average).
I'll test a PIL version of the same program, and decide whether
it's worth pursuing the in-memory swapping.
Paul
I did a PIL and a cv2 version, processing the same representative batch of 309 jpg scans.
This is not a scientific benchmark, but the on-the-fly conclusion is that PIL came out 1/10 of a second faster
(on average) than cv2, per document. (This includes opening, and cropping as needed.)
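Not the exact harness used here, but a minimal sketch of how such a per-document average might be measured (the process_one callable stands for either the PIL or the cv2 routine):

import time

def average_seconds_per_doc(paths, process_one):
    # time the whole batch and divide by the number of documents
    start = time.perf_counter()
    for p in paths:
        process_one(p)
    return (time.perf_counter() - start) / len(paths)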

That is without any memory-swapping trick in the cv2 version, which the compressed tifs would need.
There can be only one conclusion, considering the effort that would be needed on the cv2 side:
PIL is the way to go in this case.
thx,
Paul
For a CPU-intensive task, why not use additional cores?

But since it is reading a lot of files from disk, and given the capabilities of today's machines, I wonder if this is more of an IO-bound task than a CPU-bound one, in which case multiple threads would do instead of multiple CPU cores.
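A hedged sketch of the thread-pool variant, reusing something like the ocr_via_temp_jpg() sketch from earlier in the thread (the worker count and folder are placeholders; swap in concurrent.futures.ProcessPoolExecutor if the work turns out to be CPU-bound after all):

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

paths = sorted(Path("scans").glob("*.tif"))   # placeholder folder of scanned documents

# threads are usually enough when most of the time is spent waiting on disk
# and on the external tesseract process
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(ocr_via_temp_jpg, paths))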