Multiprocessing
Hi,
Thanks to info gathered on this site, we have been able to speed up OCR activities considerably.
We have processed over 320,000 documents in a month, using essentially these two lines:
    import concurrent.futures

    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        _ = executor.map(worker, images)
After a while you become aware of some limitations:
1) 16 GB of RAM is a must.
2) You should not use ALL the cores; I use (number of cores - 2), because if you use too many cores, OCR errors start to pile up that disappear if you use one core less (see the sketch after this list).
3) Most importantly, you cannot process more than 4,000 documents in one run, because you get some kind of out-of-memory error.
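For point 2, a minimal sketch of deriving the worker count from the machine; os.cpu_count() is the standard library call, and worker and images are the names from the snippet above:

    import concurrent.futures
    import os

    # Leave two cores free for the OS and the rest of the logic;
    # never drop below one worker.
    max_workers = max(1, (os.cpu_count() or 1) - 2)

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        _ = executor.map(worker, images)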
Question:
(Typical machine = 16 GB RAM, 6 cores, 4 used for the OCR + logic.)
Does anyone know of a (theoretical) formula relating RAM, cores, documents, ... that would explain why I can't do more than 4,000 documents at a time? The current speed is about 5,000-6,000 docs/hour.
Some would like to do overnight runs with 100,000 documents, but that is not feasible unless there are other considerations that I do not appreciate at this time.
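One thing that may explain the 4,000-document ceiling: executor.map() consumes the input iterable eagerly and submits every item up front, so all pending tasks (and any buffered results) sit in memory at once. A minimal sketch of batching, assuming worker and images from the snippet above (BATCH_SIZE is a guess you would tune on the real machine):

    import concurrent.futures

    BATCH_SIZE = 1_000  # assumption: tune until memory use stays flat

    def process_in_batches(worker, images, batch_size=BATCH_SIZE, max_workers=4):
        # Run the OCR in fixed-size batches so only one batch is in flight.
        for start in range(0, len(images), batch_size):
            batch = images[start:start + batch_size]
            with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
                # Drain the iterator so results can be freed before the next batch.
                for _ in executor.map(worker, batch):
                    pass

With batches the total run size should stop mattering; in principle an overnight run of 100,000 documents is just 100 batches of 1,000.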
thx,
Paul
It is more important to do the right thing, than to do the thing right.(P.Drucker)
Better is the enemy of good. (Montesquieu) = the French version of 'KISS'.