math (stats) question

**DPaul** · (This post was last modified: Nov-12-2022, 08:16 AM by DPaul.)

Hi,
We processed our first batch of 25,000 prayer cards. Awesome!
But the quality of the result depends on a lot of factors;
probably the most important are the print quality of the card and the font used.
In older cards, e.g., an "ij" is sometimes read as "u", etc.
So I want to estimate the quality by studying a sample of the 25,000.
The question is: how big a sample do I need, and what does it tell me?
There are tables, of course, but they deal mostly with human behaviour,
and can I apply them in this situation?
Population: 25,000
Confidence level: 95%
Margin of error (fault %): 5
Sample size: 379
I remember that in statistics nothing is what it seems, so can I apply these formulas just like that?
It is difficult to give a "size" to a mistake. A "u" instead of an "ij" is no problem; we just classify each card as "acceptable or not".
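
For reference, the 379 in such tables is consistent with the usual sample-size formula for a proportion with a finite population correction. Here is a minimal sketch, assuming a simple random sample, a yes/no outcome per card, and the worst-case proportion p = 0.5; the function name `sample_size` and its parameters are only illustrative. Nothing in the formula is specific to people, and the 5% is the half-width of the resulting interval, not the error rate itself.

```python
import math
from statistics import NormalDist


def sample_size(population, confidence=0.95, margin=0.05, p=0.5):
    """Sample size for estimating a proportion, with finite population correction."""
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Size needed for a very large population; p = 0.5 is the worst case.
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    # Shrink it for a finite population of the given size.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


print(sample_size(25_000))  # 379
```

The cards in the sample should be drawn at random from the whole batch, otherwise the interval computed from them says little about the other cards.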
thx,
Paul

**Gribouillis** · (This post was last modified: Nov-15-2022, 07:20 PM by Gribouillis.)

You can compute a confidence interval for the proportion of bad cards, using the normal approximation to the binomial distribution:

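from numpy import sqrt
from scipy import stats


def confidence_interval(number, size, confidence):
    """Normal-approximation interval for number bad cards out of size sampled."""
    # The normal approximation is poor when either count is small.
    if number < 15 or number > size - 15:
        raise ValueError(
            'Invalid value to approximate with the normal law')
    tolerance = 1 - confidence
    # Two-sided critical value (about 1.96 for 95% confidence).
    k = stats.norm.ppf(1 - 0.5 * tolerance)
    # Observed proportion of bad cards in the sample.
    f = number / size
    # Half-width of the interval: k standard errors.
    radius = k * sqrt(f * (1 - f) / size)
    return f - radius, f + radius


if __name__ == '__main__':
    a, b = confidence_interval(50, 379, 0.95)
    print(f'Confidence interval: {a:.1%}, {b:.1%}')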

Output:
Confidence interval: 9.8%, 16.6%

This tells me that if 50 of the 379 sampled cards have bad results, the 95% confidence interval for the proportion of bad cards in the whole batch runs from 9.8% to 16.6%.
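
The function above refuses counts below 15, which is exactly the case when OCR works well and only a few cards are bad. One possible alternative for that situation is the Wilson score interval, which is a different technique from the normal approximation used above and stays usable for small counts. This is only a sketch; `wilson_interval` is an illustrative name and it uses the standard library rather than scipy.

```python
import math
from statistics import NormalDist


def wilson_interval(number, size, confidence=0.95):
    """Wilson score interval for a proportion (number bad out of size sampled)."""
    if size <= 0 or not 0 <= number <= size:
        raise ValueError('need size > 0 and 0 <= number <= size')
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    f = number / size
    denom = 1 + z ** 2 / size
    # The centre is pulled slightly towards 50% compared with f itself.
    centre = (f + z ** 2 / (2 * size)) / denom
    radius = z * math.sqrt(f * (1 - f) / size + z ** 2 / (4 * size ** 2)) / denom
    # Clamp to [0, 1] to guard against rounding at the extremes.
    return max(0.0, centre - radius), min(1.0, centre + radius)


low, high = wilson_interval(5, 379)
print(f'Wilson interval: {low:.1%}, {high:.1%}')
```

For larger counts such as 50 out of 379, both methods give similar intervals, so the choice mainly matters when very few cards in the sample are bad.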

**DPaul** · (This post was last modified: Nov-17-2022, 07:16 AM by DPaul.)

OK, thanks, let's try it!
Paul

Edit: It is difficult to grade a card by levels of "bad", because OCR errors are not always in plain sight.
But I can classify a card as "very bad" if OCR has failed.
Using these numbers with some caution, I can say that pytesseract does a fine job: a small confidence interval at a high confidence level.
