Mixed types of numeric data in the statistics module

stevendaprano · May-16-2022, 02:49 AM

Users of the statistics module, how often do you use it with heterogeneous data (mixed numeric types)?

Currently most of the functions try hard to honour homogeneous data, e.g. if your data is Decimal or Fraction, you will (usually) get Decimal or Fraction results:

>>> import statistics
>>> from decimal import Decimal
>>> from fractions import Fraction
>>> statistics.variance([Decimal('0.5'), Decimal(2)/3, Decimal(5)/2])
Decimal('1.231481481481481481481481481')
>>> statistics.variance([Fraction(1, 2), Fraction(2, 3), Fraction(5, 2)])
Fraction(133, 108)

With mixed types, the functions usually try to coerce the values into a sensible common type, honouring subclasses:

>>> class MyFloat(float):
...     def __repr__(self):
...         return "MyFloat(%s)" % super().__repr__()
...
>>> statistics.mean([1.5, 2.25, MyFloat(1.0), 3.125, 1.75])
MyFloat(1.925)

but that's harder than you might expect, and the extra complexity carries a significant performance cost. Not all combinations are supported, either (Decimal is particularly difficult).
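To check which mixes your own Python accepts, a small probe along these lines may help. It is only a sketch: `probe_mean` is a hypothetical helper name, not part of the statistics module, and what it reports for mixed inputs is exactly the undocumented behaviour under discussion, so it may differ between versions.

```python
import statistics
from decimal import Decimal
from fractions import Fraction


def probe_mean(data):
    """Return the type name of statistics.mean(data), or of the exception raised.

    Hypothetical helper for illustration; not part of the statistics module.
    """
    try:
        return type(statistics.mean(data)).__name__
    except Exception as exc:  # report the failure instead of propagating it
        return type(exc).__name__


samples = {
    "all Decimal": [Decimal("0.5"), Decimal("1.5")],
    "all Fraction": [Fraction(1, 2), Fraction(3, 2)],
    "int + float": [1, 2.5],
    "float + Fraction": [1.5, Fraction(1, 2)],
    "float + Decimal": [1.5, Decimal("0.5")],
}

for label, data in samples.items():
    print(f"{label:>18}: {probe_mean(data)}")
```

The homogeneous rows should report their own type. The mixed rows are the ones worth watching, since a combination the module cannot coerce may show up as an exception name rather than a result type.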

If you are a user of statistics, how important to you is the ability to mix numeric types, in the same data set?

Which combinations do you care about?

Would you be satisfied with a rule that said that the statistics functions expect homogeneous data and that the result of calling the functions on mixed types is not guaranteed?

This has also been discussed here.

**Gribouillis** · (This post was last modified: May-16-2022, 07:47 AM by Gribouillis.)

I don't use the statistics module very often, but here is my take on this:

I don't think the ability to mix numeric types is very important, in the sense that the user of the module can take responsibility for homogenizing the data. In a sense it is more flexible.
The «result is not guaranteed» part worries me. It means that the occasional user has a certain probability to get a wrong result because they introduced some inhomogeneity in the data. For example, if you do data[3] = 0 while data contains decimals, would this lead to an unpredictable result? I would prefer an exception in that case.
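Both ideas in this post can be done on the caller's side today. The sketch below assumes two hypothetical helpers, `require_homogeneous` and `as_fractions`, neither of which exists in the statistics module. The first raises an exception on mixed input. The second homogenizes the data by converting every value exactly to Fraction, which works for int, float, Decimal and Fraction (but not for NaN or infinities).

```python
import statistics
from decimal import Decimal
from fractions import Fraction


def require_homogeneous(data):
    """Return data as a list, raising TypeError if it mixes numeric types.

    Hypothetical helper; exact type match, so subclasses count as different.
    """
    values = list(data)
    kinds = {type(v) for v in values}
    if len(kinds) > 1:
        names = ", ".join(sorted(k.__name__ for k in kinds))
        raise TypeError(f"mixed numeric types: {names}")
    return values


def as_fractions(data):
    """Convert finite int, float, Decimal and Fraction values exactly to Fraction."""
    return [Fraction(v) for v in data]


data = [Decimal("0.5"), Decimal("1.5"), Decimal("2.5"), Decimal("3.5")]
data[3] = 0  # an int sneaks into otherwise Decimal data

try:
    statistics.mean(require_homogeneous(data))
except TypeError as exc:
    print("rejected:", exc)

# Alternatively, homogenize first so the result type is predictable.
print(statistics.mean(as_fractions(data)))
```

Converting to Fraction keeps the arithmetic exact, at the price of getting a Fraction back. Converting to float instead would be faster but lossy for Decimal and Fraction inputs.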

stevendaprano · May-23-2022, 02:15 AM

Thanks for the feedback, and sorry for the delay in responding.

(May-16-2022, 07:45 AM)Gribouillis Wrote: The «result is not guaranteed» part worries me. It means that the occasional user has a certain probability to get a wrong result because they introduced some inhomogeneity in the data.

The numeric value of the result should be correct (within the limits of the data types involved), or else it would count as a bug. It is only the output type of the result which may be surprising if you mix input types. So if you have a mix of float, Fraction, Decimal, subclasses of each, etc., you may not be able to easily guess the output type. But the output value should be the same regardless.
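A quick way to see the difference between value and type is to compare a mixed-type result against an exact reference computed in Fraction. This is a sketch only. The variable names are illustrative, and the printed type is whatever your Python version happens to choose, so it is not asserted here.

```python
import math
import statistics
from fractions import Fraction

mixed = [1, 2.5, Fraction(1, 2)]  # int, float and Fraction in one data set

# Exact reference: convert everything to Fraction before averaging.
exact = sum(Fraction(x) for x in mixed) / len(mixed)

result = statistics.mean(mixed)

# The type may be surprising; the value should agree with the reference.
print(type(result).__name__, result)
print(math.isclose(float(result), float(exact)))
```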

The statistics module doesn't actually document the rule it uses to work out the "best" output type, but it is complicated and requires a lot of work. (Possibly more work than actually computing the numeric value!) I hope that, by simplifying that rule, I can speed up the statistics functions. But that may mean that computations which today return one type may return a different type in the future.

Technically that is not documented behaviour, but I need to get an estimate of how many people are relying on the current behaviour and will notice the change.
