Python Forum
OCR again
#11
(Oct-31-2022, 08:41 AM)wavic Wrote: For a CPU-intensive task, why not use additional cores?
Hi, anything to make it go faster!
But this may be a bit complicated?
What is CPU-intensive? I just checked, and the CPU usage goes to 15% and stays there
while doing the OCR on 10 documents, one after the other.

The posts above lead me to conclude that only one I/O operation is needed: reading the compressed tifs, one by one.
The rest happens in memory.
The OCR results are written to file later on, but that is peanuts.

The only possible saving is in the processing, currently between 1.5 and 2 seconds per scan.
Should I use multiple threads?
thx,
Paul
It is more important to do the right thing, than to do the thing right.(P.Drucker)
Better is the enemy of good. (Montesquieu) = French version for 'kiss'.
#12
It costs you nothing to try. One import, one with statement and one method call.

You can try both ThreadPoolExecutor and ProcessPoolExecutor: run each over 100 test files and measure the time needed to finish.
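A minimal harness for that comparison, with worker as a sleep stand-in for the real per-file OCR work and 100 dummy items in place of the test files:

import time
import concurrent.futures

def worker(item):
    time.sleep(0.1)  # stand-in for the real convert-and-OCR work on one file

def timed_run(executor_cls, items):
    start = time.perf_counter()
    with executor_cls() as executor:
        list(executor.map(worker, items))  # drain the iterator so all tasks finish
    return time.perf_counter() - start

if __name__ == '__main__':  # the guard is required for ProcessPoolExecutor
    items = range(100)  # stand-in for the 100 test files
    for cls in (concurrent.futures.ThreadPoolExecutor,
                concurrent.futures.ProcessPoolExecutor):
        print(f'{cls.__name__}: {timed_run(cls, items):.2f} s')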
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
#13
In theory, the ThreadPoolExecutor looks promising.

The scans are organised in batches (roughly by village).
Normally the user would say: do village_a(...), village_b(...), village_c(...),
and the program OCRs them one after the other.

Of course, if the three ran simultaneously, the advantages are obvious.
Making that work is something else, because we need to start
the same suite of functions three times with different parameters.
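Something like this is what I imagine, with process_village as a hypothetical wrapper around the existing suite of functions:

import concurrent.futures

def process_village(name):
    ...  # convert, OCR and post-process every scan of that village

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    # submit the same function three times with different parameters
    futures = [executor.submit(process_village, name)
               for name in ('village_a', 'village_b', 'village_c')]
    for future in concurrent.futures.as_completed(futures):
        future.result()  # re-raises any exception from that batch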
thx,
Paul

Edit: it would seem that it won't work:
Output:
Use ThreadPoolExecutor when:
  • Your tasks can be defined by a pure function that has no state or side effects.
  • Your task can fit within a single Python function, likely making it simple and easy to understand.
#14
Alright, I see it that way. Simplified:

import pathlib
import concurrent.futures

def convert_img(img_obj):
    return new_img  # placeholder: converted image kept in memory, not written to disk

def do_ocr(image_data):
    return document  # placeholder: the extracted text

def worker(path):
    # one complete task: read, convert, OCR and save a single file
    with open(path, 'rb') as file_obj:
        converted = convert_img(file_obj)

    document = do_ocr(converted)

    # write the text next to the tif instead of overwriting it
    with open(path.with_suffix('.txt'), 'w') as doc:
        doc.write(document)

images = pathlib.Path('path_to_folder').glob('**/*.tif')  # recursive; returns a generator

with concurrent.futures.ThreadPoolExecutor() as executor:
    _ = executor.map(worker, images)

Should it work? I think it's I/O-bound, so threads are used here.
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
Reply
#15
(Oct-31-2022, 05:51 PM)wavic Wrote: Alright, I see it that way. Simplified:
I've done some simple threading, but this is new to me.
Of course, it's worth pursuing, because we are currently planning on processing
8,000 to 10,000 scans per day.
If that could be doubled, we could be ready before Xmas.
(Although scans are added all the time, by the thousands.)

If I get this example right, you are processing multiple images in parallel,
extracting the text, and writing it to disk.
This is a different approach from what I currently have, but it could be accommodated.

Only one question: I don't understand how many images are processed simultaneously.
thx for your time,
Paul
#16
It depends on the time needed per image, memory consumption, and the maximum number of threads allowed.
You can limit them to any number you want by passing max_workers as a parameter.

with concurrent.futures.ThreadPoolExecutor(max_workers=30) as executor:
    _ = executor.map(worker, images)
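If you don't pass max_workers at all, ThreadPoolExecutor picks a default on its own; since Python 3.8 that default is min(32, os.cpu_count() + 4), which is how many images run at once in the example above:

import os

# Default pool size used by ThreadPoolExecutor() since Python 3.8.
default_workers = min(32, (os.cpu_count() or 1) + 4)
print(f'{os.cpu_count()} cores -> {default_workers} threads by default')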
The concurrent.futures module makes it easy to code a task like this.

Your goal is to put everything into a single function so you can map it over the data.
One function, one task: pack all the steps into one function and hand it to the executor.

However, even if you ran 10,000 threads and the PC had enough memory to handle them, it might not perform the way you expect.

I called it worker here, but you can call it whatever you want. No memory sharing, no complicated stuff. Just processing an image.

I remember doing some web scraping a long time ago: I had to fetch almost 250 web pages and extract some data.
I ran tests with different numbers of threads and had to limit them to a certain number for optimal performance.

So it's good to run some tests and see what happens. More is not always better.
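A rough way to run such a test, reusing the worker function and the folder layout from the example above (both assumed, not tested here):

import time
import pathlib
import concurrent.futures

def worker(path):
    ...  # the per-file convert + OCR function from the earlier example

# Sweep a few pool sizes and time a full run of each;
# 'path_to_folder' stands in for the real scan folder.
for n in (4, 8, 16, 32):
    images = list(pathlib.Path('path_to_folder').glob('**/*.tif'))
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=n) as executor:
        list(executor.map(worker, images))
    print(f'{n:3d} workers: {time.perf_counter() - start:.1f} s')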

I found this video. I watched it a while ago and it helped me a lot to get threading clear in my head.

https://www.youtube.com/watch?v=IEEhzQoKtQU
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
Reply
#17
Ok, I'm sort of convinced that this would be a performance boost.
I'm less convinced that I'll manage to make it work, because it is not just
running the OCR and getting text back: there are several extra steps to format it into a
user-searchable dataset.
My original idea was wrong: don't run 3 village threads in parallel; instead,
run all the OCRs of one village in parallel.
I'll give it a try. If you don't hear from me, call the police.
Paul
#18
Did the test with 100 tifs.
I limited the procedure to producing the OCR text result
and writing it to disk, without the formatting.
But I think the OCR part is the most resource-consuming anyway.
My PC crashed a number of times before I got the import right:
import concurrent.futures
I got results, but they are hard to believe.
No multithreading: 90 seconds.
With ThreadPoolExecutor: 13 seconds, a factor of 7 faster!
No max_workers parameter.
Although I don't fully understand exactly what is going on inside this ThreadPoolExecutor,
I know enough to simplify the procedure further for my needs.
Two days ago I was trying to shave off tenths of a second...
Thanks to all who helped me on this, a big step forward!
Paul
#19
Glad to hear that. And I'm sorry I didn't mention the import part.

In case you are not familiar with generators in Python, take a look at some tutorials on the web. They can save a lot of memory.
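A toy illustration, with 'path_to_folder' as a placeholder:

import pathlib

# A list builds every path in memory at once:
all_paths = list(pathlib.Path('path_to_folder').glob('**/*.tif'))

# The bare glob() call is a generator: it yields one path at a time,
# so even a folder with millions of scans costs almost no memory.
for path in pathlib.Path('path_to_folder').glob('**/*.tif'):
    print(path)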

Also, swapping ThreadPoolExecutor for ProcessPoolExecutor in the with statement is simple enough, so you can try additional CPU cores instead of threads and compare the performance.

Just to be sure that threads are the way to go. Or not; it could be the CPUs.

with concurrent.futures.ProcessPoolExecutor() as executor:
    _ = executor.map(worker, images)
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
Reply
#20
Ok, tried ProcessPoolExecutor.
Found out from
https://superfastpython.com/processpoole...on-errors/
that you need to start it from an if __name__ == '__main__': guard.
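So the pool creation has to live under that guard, because each worker process re-imports the script on start-up; a sketch of how I set it up (paths are placeholders):

import pathlib
import concurrent.futures

def worker(path):
    ...  # convert + OCR one file, as before

if __name__ == '__main__':
    # without this guard, every spawned worker process would re-run
    # this block on import and try to start its own pool
    images = pathlib.Path('path_to_folder').glob('**/*.tif')
    with concurrent.futures.ProcessPoolExecutor() as executor:
        list(executor.map(worker, images))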
It is actually 1 second slower than ThreadPoolExecutor.
If I specify e.g. max_workers=10, it even takes a few seconds more.
So I'll stick with Thread !
thx,
Paul