Python Forum

Full Version: OCR again
Hi,
Currently rewriting and testing for a number of different document formats,
and the speed is beyond expectation!
Still, I would like to make 1 improvement, and I am looking for a suggestion:

When I launch the function "worker", it does its job.
I can print the start time, and when it is done, the end time, but in between
the user has no clue how long the batch is going to take.
I could print the tif filename from within "worker" (it contains a sequence number),
but due to the multiprocessing the output appears in erratic order, which is not very nice.

I could also be very arrogant and print a "predicted end time", based on some test runs
and the average time per document, knowing how many there are in the batch.

Something more clever ?
thx,
Paul
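One common approach (a sketch only, not Paul's actual code; "worker" and the file list are placeholders): count completed futures with concurrent.futures.as_completed and print a running count plus a rough ETA extrapolated from the average time so far.

```python
import time
from concurrent.futures import ProcessPoolExecutor, as_completed

def worker(path):
    time.sleep(0.05)          # stand-in for the real OCR + analysis
    return path

def eta_seconds(elapsed, done, total):
    """Extrapolate remaining time from the average time per finished item."""
    return elapsed / done * (total - done)

def run_batch(paths, max_workers=4):
    """Run worker over all paths, printing progress and a rough ETA."""
    start = time.time()
    done = 0
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(worker, p) for p in paths]
        for _ in as_completed(futures):
            done += 1
            eta = eta_seconds(time.time() - start, done, len(paths))
            # \r rewrites the same line, so the counter updates in place
            print(f"\rdone {done}/{len(paths)}, about {eta:.0f} s left",
                  end="", flush=True)
    print()
    return done

if __name__ == "__main__":
    run_batch([f"page_{i}.tif" for i in range(10)], max_workers=4)
```

The estimate is only as good as the assumption that all documents take roughly the same time, but it updates itself as the batch progresses, so it gets better near the end.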
For whom it may concern: I ran tests on a batch of 100 tifs,
with OCR and rather intricate analysis of the OCR text. The PC has 6 cores.
Sensitivity analysis on max_workers =
1 : 197 seconds
2 : 97 seconds
3 : 69 seconds
4 : 54 seconds # more or less linear : 1000 tifs = 581 seconds (4 cores)
5 : 47 seconds
6 : 43 seconds
7 : 40 seconds
8 : 38 seconds
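A sweep like the one above can be reproduced with a small timing loop (a sketch, with "worker" standing in for the real OCR function):

```python
import time
from concurrent.futures import ProcessPoolExecutor

def worker(path):
    time.sleep(0.01)          # stand-in for OCR + text analysis
    return path

def time_batch(paths, max_workers):
    """Return elapsed seconds for one max_workers setting."""
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        # consume the iterator so we actually wait for all results
        list(pool.map(worker, paths))
    return time.perf_counter() - start

if __name__ == "__main__":
    paths = [f"page_{i}.tif" for i in range(100)]
    for n in range(1, 9):
        print(f"max_workers={n}: {time_batch(paths, n):.1f} s")
```

The diminishing returns past the physical core count (6 here) are expected: the extra workers only help by keeping cores busy while others wait on I/O.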

But once you start on larger batches of tifs, like 6000, you run into problems:
"Too many open files", although I close everything I can possibly close during runtime.
Any ideas?
Paul

Edit: the "too many open files" error can be overcome by opening the image with a with... statement.
That closes the file automatically at the end of the block.
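The fix in a nutshell (a dependency-free sketch using a plain file handle; Pillow's Image.open supports the same context-manager pattern, so the real worker would use `with Image.open(path) as img:` instead):

```python
import os
import tempfile

def worker(path):
    # Opening inside a with-block guarantees the handle is released when
    # the block exits, even if the OCR/analysis step raises an exception.
    # Without it, thousands of still-open handles eventually hit the
    # per-process open-file limit ("Too many open files").
    with open(path, "rb") as fh:
        data = fh.read()      # stand-in for OCR work on the image
    return len(data)

if __name__ == "__main__":
    # demo on a throwaway file
    with tempfile.NamedTemporaryFile(delete=False, suffix=".tif") as tmp:
        tmp.write(b"fake tif bytes")
    print(worker(tmp.name))   # the handle is already closed here
    os.remove(tmp.name)
```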
Again, to whom it may concern:
There are obviously more parameters that influence a sensitivity "test" like this.
The previous test was performed on an i5 processor, 16 GB RAM, 6 cores.

My laptop has an AMD Ryzen, 16 GB RAM, 8 cores. (Same batch of 100 tifs.)
max_workers =
4 = 95 seconds
5 = 79 seconds
6 = 68 seconds
7, 8 = 53 seconds

So more cores make up for a slower processor.
Paul