Python Forum
math (stats) question
#1
Hi,
We processed our first batch of 25,000 prayer cards. Awesome!
But the quality of the OCR result depends on a lot of factors;
probably the most important are the print quality of the card and the font used.
In older cards, e.g., an "ij" is sometimes read as "u", etc.
So I want to estimate the quality by studying a sample of the 25,000.
The question is: how big should the sample be, and what does it tell me?
There are tables of course, but they deal mostly with human behaviour (surveys).
Can I apply them in this situation?
Population: 25,000
Confidence level: 95%
Margin of error: 5%
Sample size: 379
I remember that in statistics nothing is what it seems, so can I apply these formulas just like that?
It is difficult to give a "size" to a mistake ("u" instead of "ij" is no problem), so we simply grade each card as "acceptable" or "not acceptable".
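For what it's worth, table values like the ones above can usually be reproduced with Cochran's sample-size formula plus a finite population correction. The sketch below is only a sanity check under assumptions: it reads the 5% figure as the margin of error, takes the worst-case proportion p = 0.5, and the function name sample_size is made up for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size(population, confidence=0.95, margin=0.05, p=0.5):
    """Sample size for estimating a proportion (hypothetical helper).

    Assumes simple random sampling and the worst-case proportion p = 0.5.
    """
    # Two-sided critical value of the standard normal (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Cochran's formula for an infinite population.
    n0 = z * z * p * (1 - p) / margin ** 2
    # Finite population correction for a batch of known size.
    n = n0 / (1 + (n0 - 1) / population)
    return ceil(n)


if __name__ == '__main__':
    print(sample_size(25_000))  # 379
```

Nothing in the formula is specific to people or surveys; it only assumes that the sampled cards are drawn at random from the batch and that each card gets a yes/no grade.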
thx,
Paul
It is more important to do the right thing, than to do the thing right.(P.Drucker)
Better is the enemy of good. (Montesquieu) = French version for 'kiss'.
#2
You can compute a confidence interval for the proportion of bad cards, using the normal approximation:
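from numpy import sqrt
from scipy import stats


def confidence_interval(number, size, confidence):
    """Normal-approximation confidence interval for a proportion.

    number: count of bad cards in the sample, size: sample size,
    confidence: confidence level, e.g. 0.95.
    """
    # The normal approximation needs enough bad cards and enough good cards.
    if number < 15 or number > size - 15:
        raise ValueError(
            'Invalid value to approximate with the normal law')
    tolerance = 1 - confidence
    # Two-sided critical value (about 1.96 for 95%).
    k = stats.norm.ppf(1 - 0.5 * tolerance)
    f = number / size  # observed frequency of bad cards
    radius = k * sqrt(f * (1 - f) / size)
    return f - radius, f + radius


if __name__ == '__main__':
    a, b = confidence_interval(50, 379, 0.95)
    print(f'Confidence interval: {a:.1%}, {b:.1%}')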
Output:
Confidence interval: 9.8%, 16.6%
This tells me that if 50 cards out of a sample of 379 have bad results, the 95% confidence interval for the proportion of bad cards in the whole batch runs from 9.8% to 16.6%.
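Note that the function above raises an error when fewer than 15 cards in the sample are bad (or fewer than 15 are good), because the normal approximation is unreliable there. If that happens, one commonly used alternative is the Wilson score interval. This is only a minimal sketch and not part of the code above; the name wilson_interval is hypothetical, and it uses the standard library instead of scipy.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(number, size, confidence=0.95):
    """Wilson score interval for a proportion (hypothetical helper)."""
    # Two-sided critical value of the standard normal.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    f = number / size  # observed frequency of bad cards
    denom = 1 + z * z / size
    # The centre is pulled slightly towards 0.5 compared with f.
    centre = (f + z * z / (2 * size)) / denom
    radius = z * sqrt(f * (1 - f) / size + z * z / (4 * size * size)) / denom
    return centre - radius, centre + radius


if __name__ == '__main__':
    # A small count that the normal-approximation version would reject.
    a, b = wilson_interval(5, 379)
    print(f'Wilson interval: {a:.1%}, {b:.1%}')
```

For counts well above 15 the two methods give similar intervals, so the choice matters mainly when bad cards are rare in the sample.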
#3
OK, thanks, let's try it !
Paul

Edit: It is difficult to grade a card by levels of "bad", because OCR errors are not always in plain sight.
But I can grade a card as "very bad" if the OCR has failed.
Using these numbers with some caution, I can say that pytesseract does a fine job: a narrow confidence interval at a high confidence level.