Commit 11c79531 authored by Raymond Hettinger's avatar Raymond Hettinger Committed by GitHub

bpo-36018: Add the NormalDist class to the statistics module (GH-11973)

parent 64d6cc82
...@@ -467,6 +467,201 @@ A single exception is defined: ...@@ -467,6 +467,201 @@ A single exception is defined:
Subclass of :exc:`ValueError` for statistics-related exceptions. Subclass of :exc:`ValueError` for statistics-related exceptions.
:class:`NormalDist` objects
===========================
A :class:`NormalDist` is a a composite class that treats the mean and standard
deviation of data measurements as a single entity. It is a tool for creating
and manipulating normal distributions of a random variable.
Normal distributions arise from the `Central Limit Theorem
<https://en.wikipedia.org/wiki/Central_limit_theorem>`_ and have a wide range
of applications in statistics, including simulations and hypothesis testing.
.. class:: NormalDist(mu=0.0, sigma=1.0)
Returns a new *NormalDist* object where *mu* represents the `arithmetic
mean <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of data and *sigma*
represents the `standard deviation
<https://en.wikipedia.org/wiki/Standard_deviation>`_ of the data.
If *sigma* is negative, raises :exc:`StatisticsError`.
.. attribute:: mu
The mean of a normal distribution.
.. attribute:: sigma
The standard deviation of a normal distribution.
.. attribute:: variance
A read-only property representing the `variance
<https://en.wikipedia.org/wiki/Variance>`_ of a normal
distribution. Equal to the square of the standard deviation.
.. classmethod:: NormalDist.from_samples(data)
Class method that makes a normal distribution instance
from sample data. The *data* can be any :term:`iterable`
and should consist of values that can be converted to type
:class:`float`.
If *data* does not contain at least two elements, raises
:exc:`StatisticsError` because it takes at least one point to estimate
a central value and at least two points to estimate dispersion.
.. method:: NormalDist.samples(n, seed=None)
Generates *n* random samples for a given mean and standard deviation.
Returns a :class:`list` of :class:`float` values.
If *seed* is given, creates a new instance of the underlying random
number generator. This is useful for creating reproducible results,
even in a multi-threading context.
.. method:: NormalDist.pdf(x)
Using a `probability density function (pdf)
<https://en.wikipedia.org/wiki/Probability_density_function>`_,
compute the relative likelihood that a random sample *X* will be near
the given value *x*. Mathematically, it is the ratio ``P(x <= X <
x+dx) / dx``.
Note the relative likelihood of *x* can be greater than `1.0`. The
probability for a specific point on a continuous distribution is `0.0`,
so the :func:`pdf` is used instead. It gives the probability of a
sample occurring in a narrow range around *x* and then dividing that
probability by the width of the range (hence the word "density").
.. method:: NormalDist.cdf(x)
Using a `cumulative distribution function (cdf)
<https://en.wikipedia.org/wiki/Cumulative_distribution_function>`_,
compute the probability that a random sample *X* will be less than or
equal to *x*. Mathematically, it is written ``P(X <= x)``.
Instances of :class:`NormalDist` support addition, subtraction,
multiplication and division by a constant. These operations
are used for translation and scaling. For example:
.. doctest::
>>> temperature_february = NormalDist(5, 2.5) # Celsius
>>> temperature_february * (9/5) + 32 # Fahrenheit
NormalDist(mu=41.0, sigma=4.5)
Dividing a constant by an instance of :class:`NormalDist` is not supported.
Since normal distributions arise from additive effects of independent
variables, it is possible to `add and subtract two normally distributed
random variables
<https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables>`_
represented as instances of :class:`NormalDist`. For example:
.. doctest::
>>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
>>> drug_effects = NormalDist(0.4, 0.15)
>>> combined = birth_weights + drug_effects
>>> f'mu={combined.mu :.1f} sigma={combined.sigma :.1f}'
'mu=3.1 sigma=0.5'
.. versionadded:: 3.8
:class:`NormalDist` Examples and Recipes
----------------------------------------
A :class:`NormalDist` readily solves classic probability problems.
For example, given `historical data for SAT exams
<https://blog.prepscholar.com/sat-standard-deviation>`_ showing that scores
are normally distributed with a mean of 1060 and standard deviation of 192,
determine the percentage of students with scores between 1100 and 1200:
.. doctest::
>>> sat = NormalDist(1060, 195)
>>> fraction = sat.cdf(1200) - sat.cdf(1100)
>>> f'{fraction * 100 :.1f}% score between 1100 and 1200'
'18.2% score between 1100 and 1200'
To estimate the distribution for a model than isn't easy to solve
analytically, :class:`NormalDist` can generate input samples for a `Monte
Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_ of the
model:
.. doctest::
>>> n = 100_000
>>> X = NormalDist(350, 15).samples(n)
>>> Y = NormalDist(47, 17).samples(n)
>>> Z = NormalDist(62, 6).samples(n)
>>> model_simulation = [x * y / z for x, y, z in zip(X, Y, Z)]
>>> NormalDist.from_samples(model_simulation) # doctest: +SKIP
NormalDist(mu=267.6516398754636, sigma=101.357284306067)
Normal distributions commonly arise in machine learning problems.
Wikipedia has a `nice example with a Naive Bayesian Classifier
<https://en.wikipedia.org/wiki/Naive_Bayes_classifier>`_. The challenge
is to guess a person's gender from measurements of normally distributed
features including height, weight, and foot size.
The `prior probability <https://en.wikipedia.org/wiki/Prior_probability>`_ of
being male or female is 50%:
.. doctest::
>>> prior_male = 0.5
>>> prior_female = 0.5
We also have a training dataset with measurements for eight people. These
measurements are assumed to be normally distributed, so we summarize the data
with :class:`NormalDist`:
.. doctest::
>>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
>>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
>>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
>>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
>>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
>>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])
We observe a new person whose feature measurements are known but whose gender
is unknown:
.. doctest::
>>> ht = 6.0 # height
>>> wt = 130 # weight
>>> fs = 8 # foot size
The posterior is the product of the prior times each likelihood of a
feature measurement given the gender:
.. doctest::
>>> posterior_male = (prior_male * height_male.pdf(ht) *
... weight_male.pdf(wt) * foot_size_male.pdf(fs))
>>> posterior_female = (prior_female * height_female.pdf(ht) *
... weight_female.pdf(wt) * foot_size_female.pdf(fs))
The final prediction is awarded to the largest posterior -- this is known as
the `maximum a posteriori
<https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation>`_ or MAP:
.. doctest::
>>> 'male' if posterior_male > posterior_female else 'female'
'female'
.. ..
# This modelines must appear within the last ten lines of the file. # This modelines must appear within the last ten lines of the file.
kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8; kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;
...@@ -278,6 +278,32 @@ Added :func:`statistics.fmean` as a faster, floating point variant of ...@@ -278,6 +278,32 @@ Added :func:`statistics.fmean` as a faster, floating point variant of
:func:`statistics.mean()`. (Contributed by Raymond Hettinger and :func:`statistics.mean()`. (Contributed by Raymond Hettinger and
Steven D'Aprano in :issue:`35904`.) Steven D'Aprano in :issue:`35904`.)
Added :class:`statistics.NormalDist`, a tool for creating
and manipulating normal distributions of a random variable.
(Contributed by Raymond Hettinger in :issue:`36018`.)
::
>>> temperature_feb = NormalDist.from_samples([4, 12, -3, 2, 7, 14])
>>> temperature_feb
NormalDist(mu=6.0, sigma=6.356099432828281)
>>> temperature_feb.cdf(3) # Chance of being under 3 degrees
0.3184678262814532
>>> # Relative chance of being 7 degrees versus 10 degrees
>>> temperature_feb.pdf(7) / temperature_feb.pdf(10)
1.2039930378537762
>>> el_nino = NormalDist(4, 2.5)
>>> temperature_feb += el_nino # Add in a climate effect
>>> temperature_feb
NormalDist(mu=10.0, sigma=6.830080526611674)
>>> temperature_feb * (9/5) + 32 # Convert to Fahrenheit
NormalDist(mu=50.0, sigma=12.294144947901014)
>>> temperature_feb.samples(3) # Generate random samples
[7.672102882379219, 12.000027119750287, 4.647488369766392]
tokenize tokenize
-------- --------
......
...@@ -76,7 +76,7 @@ A single exception is defined: StatisticsError is a subclass of ValueError. ...@@ -76,7 +76,7 @@ A single exception is defined: StatisticsError is a subclass of ValueError.
""" """
__all__ = [ 'StatisticsError', __all__ = [ 'StatisticsError', 'NormalDist',
'pstdev', 'pvariance', 'stdev', 'variance', 'pstdev', 'pvariance', 'stdev', 'variance',
'median', 'median_low', 'median_high', 'median_grouped', 'median', 'median_low', 'median_high', 'median_grouped',
'mean', 'mode', 'harmonic_mean', 'fmean', 'mean', 'mode', 'harmonic_mean', 'fmean',
...@@ -85,11 +85,13 @@ __all__ = [ 'StatisticsError', ...@@ -85,11 +85,13 @@ __all__ = [ 'StatisticsError',
import collections import collections
import math import math
import numbers import numbers
import random
from fractions import Fraction from fractions import Fraction
from decimal import Decimal from decimal import Decimal
from itertools import groupby from itertools import groupby
from bisect import bisect_left, bisect_right from bisect import bisect_left, bisect_right
from math import hypot, sqrt, fabs, exp, erf, tau
...@@ -694,3 +696,155 @@ def pstdev(data, mu=None): ...@@ -694,3 +696,155 @@ def pstdev(data, mu=None):
return var.sqrt() return var.sqrt()
except AttributeError: except AttributeError:
return math.sqrt(var) return math.sqrt(var)
## Normal Distribution #####################################################
class NormalDist:
'Normal distribution of a random variable'
# https://en.wikipedia.org/wiki/Normal_distribution
# https://en.wikipedia.org/wiki/Variance#Properties
__slots__ = ('mu', 'sigma')
def __init__(self, mu=0.0, sigma=1.0):
'NormalDist where mu is the mean and sigma is the standard deviation'
if sigma < 0.0:
raise StatisticsError('sigma must be non-negative')
self.mu = mu
self.sigma = sigma
@classmethod
def from_samples(cls, data):
'Make a normal distribution instance from sample data'
if not isinstance(data, (list, tuple)):
data = list(data)
xbar = fmean(data)
return cls(xbar, stdev(data, xbar))
def samples(self, n, seed=None):
'Generate *n* samples for a given mean and standard deviation'
gauss = random.gauss if seed is None else random.Random(seed).gauss
mu, sigma = self.mu, self.sigma
return [gauss(mu, sigma) for i in range(n)]
def pdf(self, x):
'Probability density function: P(x <= X < x+dx) / dx'
variance = self.sigma ** 2.0
if not variance:
raise StatisticsError('pdf() not defined when sigma is zero')
return exp((x - self.mu)**2.0 / (-2.0*variance)) / sqrt(tau * variance)
def cdf(self, x):
'Cumulative density function: P(X <= x)'
if not self.sigma:
raise StatisticsError('cdf() not defined when sigma is zero')
return 0.5 * (1.0 + erf((x - self.mu) / (self.sigma * sqrt(2.0))))
@property
def variance(self):
'Square of the standard deviation'
return self.sigma ** 2.0
def __add__(x1, x2):
if isinstance(x2, NormalDist):
return NormalDist(x1.mu + x2.mu, hypot(x1.sigma, x2.sigma))
return NormalDist(x1.mu + x2, x1.sigma)
def __sub__(x1, x2):
if isinstance(x2, NormalDist):
return NormalDist(x1.mu - x2.mu, hypot(x1.sigma, x2.sigma))
return NormalDist(x1.mu - x2, x1.sigma)
def __mul__(x1, x2):
return NormalDist(x1.mu * x2, x1.sigma * fabs(x2))
def __truediv__(x1, x2):
return NormalDist(x1.mu / x2, x1.sigma / fabs(x2))
def __pos__(x1):
return x1
def __neg__(x1):
return NormalDist(-x1.mu, x1.sigma)
__radd__ = __add__
def __rsub__(x1, x2):
return -(x1 - x2)
__rmul__ = __mul__
def __eq__(x1, x2):
if not isinstance(x2, NormalDist):
return NotImplemented
return (x1.mu, x2.sigma) == (x2.mu, x2.sigma)
def __repr__(self):
return f'{type(self).__name__}(mu={self.mu!r}, sigma={self.sigma!r})'
if __name__ == '__main__':
# Show math operations computed analytically in comparsion
# to a monte carlo simulation of the same operations
from math import isclose
from operator import add, sub, mul, truediv
from itertools import repeat
g1 = NormalDist(10, 20)
g2 = NormalDist(-5, 25)
# Test scaling by a constant
assert (g1 * 5 / 5).mu == g1.mu
assert (g1 * 5 / 5).sigma == g1.sigma
n = 100_000
G1 = g1.samples(n)
G2 = g2.samples(n)
for func in (add, sub):
print(f'\nTest {func.__name__} with another NormalDist:')
print(func(g1, g2))
print(NormalDist.from_samples(map(func, G1, G2)))
const = 11
for func in (add, sub, mul, truediv):
print(f'\nTest {func.__name__} with a constant:')
print(func(g1, const))
print(NormalDist.from_samples(map(func, G1, repeat(const))))
const = 19
for func in (add, sub, mul):
print(f'\nTest constant with {func.__name__}:')
print(func(const, g1))
print(NormalDist.from_samples(map(func, repeat(const), G1)))
def assert_close(G1, G2):
assert isclose(G1.mu, G1.mu, rel_tol=0.01), (G1, G2)
assert isclose(G1.sigma, G2.sigma, rel_tol=0.01), (G1, G2)
X = NormalDist(-105, 73)
Y = NormalDist(31, 47)
s = 32.75
n = 100_000
S = NormalDist.from_samples([x + s for x in X.samples(n)])
assert_close(X + s, S)
S = NormalDist.from_samples([x - s for x in X.samples(n)])
assert_close(X - s, S)
S = NormalDist.from_samples([x * s for x in X.samples(n)])
assert_close(X * s, S)
S = NormalDist.from_samples([x / s for x in X.samples(n)])
assert_close(X / s, S)
S = NormalDist.from_samples([x + y for x, y in zip(X.samples(n),
Y.samples(n))])
assert_close(X + Y, S)
S = NormalDist.from_samples([x - y for x, y in zip(X.samples(n),
Y.samples(n))])
assert_close(X - Y, S)
...@@ -5,9 +5,11 @@ approx_equal function. ...@@ -5,9 +5,11 @@ approx_equal function.
import collections import collections
import collections.abc import collections.abc
import copy
import decimal import decimal
import doctest import doctest
import math import math
import pickle
import random import random
import sys import sys
import unittest import unittest
...@@ -2025,6 +2027,181 @@ class TestStdev(VarianceStdevMixin, NumericTestCase): ...@@ -2025,6 +2027,181 @@ class TestStdev(VarianceStdevMixin, NumericTestCase):
expected = math.sqrt(statistics.variance(data)) expected = math.sqrt(statistics.variance(data))
self.assertEqual(self.func(data), expected) self.assertEqual(self.func(data), expected)
class TestNormalDist(unittest.TestCase):
def test_slots(self):
nd = statistics.NormalDist(300, 23)
with self.assertRaises(TypeError):
vars(nd)
self.assertEqual(nd.__slots__, ('mu', 'sigma'))
def test_instantiation_and_attributes(self):
nd = statistics.NormalDist(500, 17)
self.assertEqual(nd.mu, 500)
self.assertEqual(nd.sigma, 17)
self.assertEqual(nd.variance, 17**2)
# default arguments
nd = statistics.NormalDist()
self.assertEqual(nd.mu, 0)
self.assertEqual(nd.sigma, 1)
self.assertEqual(nd.variance, 1**2)
# error case: negative sigma
with self.assertRaises(statistics.StatisticsError):
statistics.NormalDist(500, -10)
def test_alternative_constructor(self):
NormalDist = statistics.NormalDist
data = [96, 107, 90, 92, 110]
# list input
self.assertEqual(NormalDist.from_samples(data), NormalDist(99, 9))
# tuple input
self.assertEqual(NormalDist.from_samples(tuple(data)), NormalDist(99, 9))
# iterator input
self.assertEqual(NormalDist.from_samples(iter(data)), NormalDist(99, 9))
# error cases
with self.assertRaises(statistics.StatisticsError):
NormalDist.from_samples([]) # empty input
with self.assertRaises(statistics.StatisticsError):
NormalDist.from_samples([10]) # only one input
def test_sample_generation(self):
NormalDist = statistics.NormalDist
mu, sigma = 10_000, 3.0
X = NormalDist(mu, sigma)
n = 1_000
data = X.samples(n)
self.assertEqual(len(data), n)
self.assertEqual(set(map(type, data)), {float})
# mean(data) expected to fall within 8 standard deviations
xbar = statistics.mean(data)
self.assertTrue(mu - sigma*8 <= xbar <= mu + sigma*8)
# verify that seeding makes reproducible sequences
n = 100
data1 = X.samples(n, seed='happiness and joy')
data2 = X.samples(n, seed='trouble and despair')
data3 = X.samples(n, seed='happiness and joy')
data4 = X.samples(n, seed='trouble and despair')
self.assertEqual(data1, data3)
self.assertEqual(data2, data4)
self.assertNotEqual(data1, data2)
# verify that subclass type is honored
class NewNormalDist(NormalDist):
pass
nnd = NewNormalDist(200, 5)
self.assertEqual(type(nnd), NewNormalDist)
def test_pdf(self):
NormalDist = statistics.NormalDist
X = NormalDist(100, 15)
# Verify peak around center
self.assertLess(X.pdf(99), X.pdf(100))
self.assertLess(X.pdf(101), X.pdf(100))
# Test symmetry
self.assertAlmostEqual(X.pdf(99), X.pdf(101))
self.assertAlmostEqual(X.pdf(98), X.pdf(102))
self.assertAlmostEqual(X.pdf(97), X.pdf(103))
# Test vs CDF
dx = 2.0 ** -10
for x in range(90, 111):
est_pdf = (X.cdf(x + dx) - X.cdf(x)) / dx
self.assertAlmostEqual(X.pdf(x), est_pdf, places=4)
# Error case: variance is zero
Y = NormalDist(100, 0)
with self.assertRaises(statistics.StatisticsError):
Y.pdf(90)
def test_cdf(self):
NormalDist = statistics.NormalDist
X = NormalDist(100, 15)
cdfs = [X.cdf(x) for x in range(1, 200)]
self.assertEqual(set(map(type, cdfs)), {float})
# Verify montonic
self.assertEqual(cdfs, sorted(cdfs))
# Verify center
self.assertAlmostEqual(X.cdf(100), 0.50)
# Error case: variance is zero
Y = NormalDist(100, 0)
with self.assertRaises(statistics.StatisticsError):
Y.cdf(90)
def test_same_type_addition_and_subtraction(self):
NormalDist = statistics.NormalDist
X = NormalDist(100, 12)
Y = NormalDist(40, 5)
self.assertEqual(X + Y, NormalDist(140, 13)) # __add__
self.assertEqual(X - Y, NormalDist(60, 13)) # __sub__
def test_translation_and_scaling(self):
NormalDist = statistics.NormalDist
X = NormalDist(100, 15)
y = 10
self.assertEqual(+X, NormalDist(100, 15)) # __pos__
self.assertEqual(-X, NormalDist(-100, 15)) # __neg__
self.assertEqual(X + y, NormalDist(110, 15)) # __add__
self.assertEqual(y + X, NormalDist(110, 15)) # __radd__
self.assertEqual(X - y, NormalDist(90, 15)) # __sub__
self.assertEqual(y - X, NormalDist(-90, 15)) # __rsub__
self.assertEqual(X * y, NormalDist(1000, 150)) # __mul__
self.assertEqual(y * X, NormalDist(1000, 150)) # __rmul__
self.assertEqual(X / y, NormalDist(10, 1.5)) # __truediv__
with self.assertRaises(TypeError):
y / X
def test_equality(self):
NormalDist = statistics.NormalDist
nd1 = NormalDist()
nd2 = NormalDist(2, 4)
nd3 = NormalDist()
self.assertNotEqual(nd1, nd2)
self.assertEqual(nd1, nd3)
# Test NotImplemented when types are different
class A:
def __eq__(self, other):
return 10
a = A()
self.assertEqual(nd1.__eq__(a), NotImplemented)
self.assertEqual(nd1 == a, 10)
self.assertEqual(a == nd1, 10)
# All subclasses to compare equal giving the same behavior
# as list, tuple, int, float, complex, str, dict, set, etc.
class SizedNormalDist(NormalDist):
def __init__(self, mu, sigma, n):
super().__init__(mu, sigma)
self.n = n
s = SizedNormalDist(100, 15, 57)
nd4 = NormalDist(100, 15)
self.assertEqual(s, nd4)
# Don't allow duck type equality because we wouldn't
# want a lognormal distribution to compare equal
# to a normal distribution with the same parameters
class LognormalDist:
def __init__(self, mu, sigma):
self.mu = mu
self.sigma = sigma
lnd = LognormalDist(100, 15)
nd = NormalDist(100, 15)
self.assertNotEqual(nd, lnd)
def test_pickle_and_copy(self):
nd = statistics.NormalDist(37.5, 5.625)
nd1 = copy.copy(nd)
self.assertEqual(nd, nd1)
nd2 = copy.deepcopy(nd)
self.assertEqual(nd, nd2)
nd3 = pickle.loads(pickle.dumps(nd))
self.assertEqual(nd, nd3)
def test_repr(self):
nd = statistics.NormalDist(37.5, 5.625)
self.assertEqual(repr(nd), 'NormalDist(mu=37.5, sigma=5.625)')
# === Run tests === # === Run tests ===
......
Add statistics.NormalDist, a tool for creating and manipulating normal
distributions of random variable. Features a composite class that treats
the mean and standard deviation of measurement data as single entity.
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment