Python Forum
Multiprocessing
#1
Hi,
Thanks to info gathered on this site, we have been able to speed up OCR activities considerably.
We have processed over 320,000 documents in a month, using these 2 lines:
    with concurrent.futures.ThreadPoolExecutor(4) as executor:
        _ = executor.map(worker, images)
After a while you become aware of some limitations:
1) 16 GB of RAM is a must.
2) You should not use ALL the cores; in fact I use (number of cores - 2),
because if you use too many cores, OCR errors start to pile up that disappear if you use one core less.
3) Most importantly, you cannot process more than 4,000 documents in one run,
because you get some type of out-of-memory error.
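One common way around limitation 3 is to split the run into fixed-size batches, so the memory held by one batch can be reclaimed before the next starts. A minimal sketch of that idea, where `worker` and the batch size of 4,000 are placeholders standing in for the real OCR call and the observed limit:

```python
import concurrent.futures

def worker(image):
    # placeholder for the real OCR call (e.g. a pytesseract invocation)
    return len(image)

def run_in_batches(images, batch_size=4000, max_workers=4):
    """Process images in fixed-size batches so the memory used by one
    batch can be released before the next batch begins."""
    results = []
    for start in range(0, len(images), batch_size):
        batch = images[start:start + batch_size]
        with concurrent.futures.ThreadPoolExecutor(max_workers) as executor:
            # executor.map preserves input order in its results
            results.extend(executor.map(worker, batch))
    return results

print(run_in_batches(["ab", "cde", "f"], batch_size=2))  # [2, 3, 1]
```

This does not explain *why* the 4,000-document ceiling exists, but it would let an overnight 100,000-document run complete as 25 smaller runs.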
Question:
(Typical machine = 16 GB RAM, 6 cores, 4 used for the OCR + logic.)
Does anyone know of a (theoretical) formula relating RAM, cores, document count, ... that would explain
why I can't do more than 4,000 documents at a time? The current speed is about 5,000-6,000 docs/hour.
Some would like to do overnight runs with 100,000 documents, but that is not feasible,
unless there are other considerations that I do not appreciate at this time.
thx,
Paul
It is more important to do the right thing, than to do the thing right. (P. Drucker)
Better is the enemy of good. (Montesquieu) = French version of 'KISS'.
#2
Seems to be a difficult subject, but I'm still testing various options...
If I use all 6 cores like this (on the same PC as above):
    with ProcessPoolExecutor(6) as executor:
        _ = executor.map(worker, images)
- I can use all 6 cores with no problems.
- It runs 30% faster than the ThreadPoolExecutor above. That is significant.

Any ideas why ?
Is this method dodgy, or sound python programming?
thx,
Paul
#3
Quote: Any ideas why?

The Python interpreter is effectively single-threaded: because of the GIL, only one Python statement can execute at a time (there is ongoing work on a no-GIL build). C extensions can release the GIL, but pure Python code cannot, so threads don't run CPU-bound Python code in parallel. With ProcessPoolExecutor, separate Python processes are started instead of threads, and each one runs on its own core. The downside is communication: all arguments and results must be serialized (pickled) between processes. As long as that serialization cost is small compared to the work each task does, processes will help.
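A small illustration of this trade-off (names here are illustrative, not from the OCR code): a CPU-bound function holds the GIL in a thread but runs truly in parallel in a separate process, and `map`'s `chunksize` parameter batches arguments into fewer pickle round trips, reducing the communication overhead:

```python
from concurrent.futures import ProcessPoolExecutor

def cpu_bound(n):
    # pure-Python CPU work: serialized by the GIL in threads,
    # but genuinely parallel across processes
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    tasks = [10_000] * 100
    with ProcessPoolExecutor() as executor:
        # chunksize=25 sends 25 arguments per pickle round trip
        # instead of one, cutting inter-process communication cost
        totals = list(executor.map(cpu_bound, tasks, chunksize=25))
    print(len(totals))  # 100
```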
Almost dead, but too lazy to die: https://sourceserver.info
All humans together. We don't need politicians!
Reply
#4
(Feb-28-2023, 05:37 PM)DeaD_EyE Wrote: But if the input is smaller than the output, processes will help.
OK, thanks, we'll keep that in mind.
Paul