Kirill Smelkov / cpython / Commits / 11c79531

Commit 11c79531 authored Feb 23, 2019 by Raymond Hettinger, committed by GitHub on Feb 23, 2019
bpo-36018: Add the NormalDist class to the statistics module (GH-11973)
parent 64d6cc82
Showing 5 changed files with 556 additions and 1 deletion (+556 -1)
Doc/library/statistics.rst (+195 -0)
Doc/whatsnew/3.8.rst (+26 -0)
Lib/statistics.py (+155 -1)
Lib/test/test_statistics.py (+177 -0)
Misc/NEWS.d/next/Library/2019-02-21-15-47-00.bpo-36018.qt7QUe.rst (+3 -0)
Doc/library/statistics.rst (view file @ 11c79531)
...
...
@@ -467,6 +467,201 @@ A single exception is defined:
Subclass of :exc:`ValueError` for statistics-related exceptions.
:class:`NormalDist` objects
===========================
A :class:`NormalDist` is a composite class that treats the mean and standard
deviation of data measurements as a single entity. It is a tool for creating
and manipulating normal distributions of a random variable.
Normal distributions arise from the `Central Limit Theorem
<https://en.wikipedia.org/wiki/Central_limit_theorem>`_ and have a wide range
of applications in statistics, including simulations and hypothesis testing.
.. class:: NormalDist(mu=0.0, sigma=1.0)
Returns a new *NormalDist* object where *mu* represents the `arithmetic
mean <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of data and *sigma*
represents the `standard deviation
<https://en.wikipedia.org/wiki/Standard_deviation>`_ of the data.
If *sigma* is negative, raises :exc:`StatisticsError`.
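A minimal illustrative example (not part of the commit's text; the output
follows from the class definition below):

.. doctest::

>>> NormalDist(mu=100, sigma=15)
NormalDist(mu=100, sigma=15)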
.. attribute:: mu
The mean of a normal distribution.
.. attribute:: sigma
The standard deviation of a normal distribution.
.. attribute:: variance
A read-only property representing the `variance
<https://en.wikipedia.org/wiki/Variance>`_ of a normal
distribution. Equal to the square of the standard deviation.
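For instance, the variance is derived directly from *sigma* (an
illustrative example):

.. doctest::

>>> NormalDist(100, 15).variance
225.0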
.. classmethod:: NormalDist.from_samples(data)
Class method that makes a normal distribution instance
from sample data. The *data* can be any :term:`iterable`
and should consist of values that can be converted to type
:class:`float`.
If *data* does not contain at least two elements, raises
:exc:`StatisticsError` because it takes at least one point to estimate
a central value and at least two points to estimate dispersion.
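An illustrative sketch (the sample values are arbitrary; the estimates
follow from :func:`fmean` and :func:`stdev`):

.. doctest::

>>> nd = NormalDist.from_samples([2, 4, 4, 4, 5, 5, 7, 9])
>>> nd.mu
5.0
>>> round(nd.sigma, 3)
2.138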
.. method:: NormalDist.samples(n, seed=None)
Generates *n* random samples for a given mean and standard deviation.
Returns a :class:`list` of :class:`float` values.
If *seed* is given, creates a new instance of the underlying random
number generator. This is useful for creating reproducible results,
even in a multi-threading context.
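For example, identical seeds reproduce identical samples (an illustrative
sketch; the seed value is arbitrary):

.. doctest::

>>> X = NormalDist(10, 2.5)
>>> X.samples(3, seed=42) == X.samples(3, seed=42)
True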
.. method:: NormalDist.pdf(x)
Using a `probability density function (pdf)
<https://en.wikipedia.org/wiki/Probability_density_function>`_,
compute the relative likelihood that a random sample *X* will be near
the given value *x*. Mathematically, it is the ratio ``P(x <= X <
x+dx) / dx``.
Note the relative likelihood of *x* can be greater than ``1.0``. The
probability for a specific point on a continuous distribution is ``0.0``,
so the :func:`pdf` is used instead. It computes the probability of a
sample occurring in a narrow range around *x* and then divides that
probability by the width of the range (hence the word "density").
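For example, the density of the standard normal distribution peaks at
``1 / sqrt(2 * pi)``, and a sufficiently narrow distribution has a density
above ``1.0`` (illustrative values computed from the formula above):

.. doctest::

>>> round(NormalDist(0, 1).pdf(0), 4)
0.3989
>>> NormalDist(0, 0.25).pdf(0) > 1.0
True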
.. method:: NormalDist.cdf(x)
Using a `cumulative distribution function (cdf)
<https://en.wikipedia.org/wiki/Cumulative_distribution_function>`_,
compute the probability that a random sample *X* will be less than or
equal to *x*. Mathematically, it is written ``P(X <= x)``.
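For example, half of the distribution lies at or below the mean, and
roughly 68% falls within one standard deviation of it (illustrative
values):

.. doctest::

>>> X = NormalDist()
>>> X.cdf(0)
0.5
>>> round(X.cdf(1) - X.cdf(-1), 4)
0.6827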
Instances of :class:`NormalDist` support addition, subtraction,
multiplication and division by a constant. These operations
are used for translation and scaling. For example:
.. doctest::
>>> temperature_february = NormalDist(5, 2.5) # Celsius
>>> temperature_february * (9/5) + 32 # Fahrenheit
NormalDist(mu=41.0, sigma=4.5)
Dividing a constant by an instance of :class:`NormalDist` is not supported.
Since normal distributions arise from additive effects of independent
variables, it is possible to `add and subtract two normally distributed
random variables
<https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables>`_
represented as instances of :class:`NormalDist`. For example:
.. doctest::
>>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
>>> drug_effects = NormalDist(0.4, 0.15)
>>> combined = birth_weights + drug_effects
>>> f'mu={combined.mu :.1f} sigma={combined.sigma :.1f}'
'mu=3.1 sigma=0.5'
.. versionadded:: 3.8
:class:`NormalDist` Examples and Recipes
----------------------------------------
A :class:`NormalDist` readily solves classic probability problems.
For example, given `historical data for SAT exams
<https://blog.prepscholar.com/sat-standard-deviation>`_ showing that scores
are normally distributed with a mean of 1060 and a standard deviation of 195,
determine the percentage of students with scores between 1100 and 1200:
.. doctest::
>>> sat = NormalDist(1060, 195)
>>> fraction = sat.cdf(1200) - sat.cdf(1100)
>>> f'{fraction * 100 :.1f}% score between 1100 and 1200'
'18.2% score between 1100 and 1200'
To estimate the distribution for a model that isn't easy to solve
analytically, :class:`NormalDist` can generate input samples for a `Monte
Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_ of the
model:
.. doctest::
>>> n = 100_000
>>> X = NormalDist(350, 15).samples(n)
>>> Y = NormalDist(47, 17).samples(n)
>>> Z = NormalDist(62, 6).samples(n)
>>> model_simulation = [x * y / z for x, y, z in zip(X, Y, Z)]
>>> NormalDist.from_samples(model_simulation) # doctest: +SKIP
NormalDist(mu=267.6516398754636, sigma=101.357284306067)
Normal distributions commonly arise in machine learning problems.
Wikipedia has a `nice example with a Naive Bayesian Classifier
<https://en.wikipedia.org/wiki/Naive_Bayes_classifier>`_. The challenge
is to guess a person's gender from measurements of normally distributed
features including height, weight, and foot size.
The `prior probability <https://en.wikipedia.org/wiki/Prior_probability>`_ of
being male or female is 50%:
.. doctest::
>>> prior_male = 0.5
>>> prior_female = 0.5
We also have a training dataset with measurements for eight people. These
measurements are assumed to be normally distributed, so we summarize the data
with :class:`NormalDist`:
.. doctest::
>>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
>>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
>>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
>>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
>>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
>>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])
We observe a new person whose feature measurements are known but whose gender
is unknown:
.. doctest::
>>> ht = 6.0 # height
>>> wt = 130 # weight
>>> fs = 8 # foot size
The posterior is the product of the prior times each likelihood of a
feature measurement given the gender:
.. doctest::
>>> posterior_male = (prior_male * height_male.pdf(ht) *
... weight_male.pdf(wt) * foot_size_male.pdf(fs))
>>> posterior_female = (prior_female * height_female.pdf(ht) *
... weight_female.pdf(wt) * foot_size_female.pdf(fs))
The final prediction is awarded to the largest posterior -- this is known as
the `maximum a posteriori
<https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation>`_ or MAP:
.. doctest::
>>> 'male' if posterior_male > posterior_female else 'female'
'female'
..
# This modeline must appear within the last ten lines of the file.
kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;
Doc/whatsnew/3.8.rst (view file @ 11c79531)
...
...
@@ -278,6 +278,32 @@ Added :func:`statistics.fmean` as a faster, floating point variant of
:func:`statistics.mean()`. (Contributed by Raymond Hettinger and
Steven D'Aprano in :issue:`35904`.)
Added :class:`statistics.NormalDist`, a tool for creating
and manipulating normal distributions of a random variable.
(Contributed by Raymond Hettinger in :issue:`36018`.)
::
>>> temperature_feb = NormalDist.from_samples([4, 12, -3, 2, 7, 14])
>>> temperature_feb
NormalDist(mu=6.0, sigma=6.356099432828281)
>>> temperature_feb.cdf(3) # Chance of being under 3 degrees
0.3184678262814532
>>> # Relative chance of being 7 degrees versus 10 degrees
>>> temperature_feb.pdf(7) / temperature_feb.pdf(10)
1.2039930378537762
>>> el_nino = NormalDist(4, 2.5)
>>> temperature_feb += el_nino # Add in a climate effect
>>> temperature_feb
NormalDist(mu=10.0, sigma=6.830080526611674)
>>> temperature_feb * (9/5) + 32 # Convert to Fahrenheit
NormalDist(mu=50.0, sigma=12.294144947901014)
>>> temperature_feb.samples(3) # Generate random samples
[7.672102882379219, 12.000027119750287, 4.647488369766392]
tokenize
--------
...
...
Lib/statistics.py (view file @ 11c79531)
...
...
@@ -76,7 +76,7 @@ A single exception is defined: StatisticsError is a subclass of ValueError.
"""
__all__
=
[
'StatisticsError'
,
__all__
=
[
'StatisticsError'
,
'NormalDist'
,
'pstdev'
,
'pvariance'
,
'stdev'
,
'variance'
,
'median'
,
'median_low'
,
'median_high'
,
'median_grouped'
,
'mean'
,
'mode'
,
'harmonic_mean'
,
'fmean'
,
...
...
@@ -85,11 +85,13 @@ __all__ = [ 'StatisticsError',
 import collections
 import math
 import numbers
+import random

 from fractions import Fraction
 from decimal import Decimal
 from itertools import groupby
 from bisect import bisect_left, bisect_right
+from math import hypot, sqrt, fabs, exp, erf, tau
...
...
@@ -694,3 +696,155 @@ def pstdev(data, mu=None):
        return var.sqrt()
    except AttributeError:
        return math.sqrt(var)
## Normal Distribution #####################################################

class NormalDist:
    'Normal distribution of a random variable'
    # https://en.wikipedia.org/wiki/Normal_distribution
    # https://en.wikipedia.org/wiki/Variance#Properties

    __slots__ = ('mu', 'sigma')

    def __init__(self, mu=0.0, sigma=1.0):
        'NormalDist where mu is the mean and sigma is the standard deviation'
        if sigma < 0.0:
            raise StatisticsError('sigma must be non-negative')
        self.mu = mu
        self.sigma = sigma

    @classmethod
    def from_samples(cls, data):
        'Make a normal distribution instance from sample data'
        if not isinstance(data, (list, tuple)):
            data = list(data)
        xbar = fmean(data)
        return cls(xbar, stdev(data, xbar))

    def samples(self, n, seed=None):
        'Generate *n* samples for a given mean and standard deviation'
        gauss = random.gauss if seed is None else random.Random(seed).gauss
        mu, sigma = self.mu, self.sigma
        return [gauss(mu, sigma) for i in range(n)]

    def pdf(self, x):
        'Probability density function:  P(x <= X < x+dx) / dx'
        variance = self.sigma ** 2.0
        if not variance:
            raise StatisticsError('pdf() not defined when sigma is zero')
        return exp((x - self.mu)**2.0 / (-2.0 * variance)) / sqrt(tau * variance)

    def cdf(self, x):
        'Cumulative distribution function:  P(X <= x)'
        if not self.sigma:
            raise StatisticsError('cdf() not defined when sigma is zero')
        return 0.5 * (1.0 + erf((x - self.mu) / (self.sigma * sqrt(2.0))))

    @property
    def variance(self):
        'Square of the standard deviation'
        return self.sigma ** 2.0

    def __add__(x1, x2):
        if isinstance(x2, NormalDist):
            return NormalDist(x1.mu + x2.mu, hypot(x1.sigma, x2.sigma))
        return NormalDist(x1.mu + x2, x1.sigma)

    def __sub__(x1, x2):
        if isinstance(x2, NormalDist):
            return NormalDist(x1.mu - x2.mu, hypot(x1.sigma, x2.sigma))
        return NormalDist(x1.mu - x2, x1.sigma)

    def __mul__(x1, x2):
        return NormalDist(x1.mu * x2, x1.sigma * fabs(x2))

    def __truediv__(x1, x2):
        return NormalDist(x1.mu / x2, x1.sigma / fabs(x2))

    def __pos__(x1):
        return x1

    def __neg__(x1):
        return NormalDist(-x1.mu, x1.sigma)

    __radd__ = __add__

    def __rsub__(x1, x2):
        return -(x1 - x2)

    __rmul__ = __mul__

    def __eq__(x1, x2):
        if not isinstance(x2, NormalDist):
            return NotImplemented
        return (x1.mu, x1.sigma) == (x2.mu, x2.sigma)

    def __repr__(self):
        return f'{type(self).__name__}(mu={self.mu!r}, sigma={self.sigma!r})'


if __name__ == '__main__':

    # Show math operations computed analytically in comparison
    # to a Monte Carlo simulation of the same operations

    from math import isclose
    from operator import add, sub, mul, truediv
    from itertools import repeat

    g1 = NormalDist(10, 20)
    g2 = NormalDist(-5, 25)

    # Test scaling by a constant
    assert (g1 * 5 / 5).mu == g1.mu
    assert (g1 * 5 / 5).sigma == g1.sigma

    n = 100_000
    G1 = g1.samples(n)
    G2 = g2.samples(n)

    for func in (add, sub):
        print(f'\nTest {func.__name__} with another NormalDist:')
        print(func(g1, g2))
        print(NormalDist.from_samples(map(func, G1, G2)))

    const = 11
    for func in (add, sub, mul, truediv):
        print(f'\nTest {func.__name__} with a constant:')
        print(func(g1, const))
        print(NormalDist.from_samples(map(func, G1, repeat(const))))

    const = 19
    for func in (add, sub, mul):
        print(f'\nTest constant with {func.__name__}:')
        print(func(const, g1))
        print(NormalDist.from_samples(map(func, repeat(const), G1)))

    def assert_close(G1, G2):
        assert isclose(G1.mu, G2.mu, rel_tol=0.01), (G1, G2)
        assert isclose(G1.sigma, G2.sigma, rel_tol=0.01), (G1, G2)

    X = NormalDist(-105, 73)
    Y = NormalDist(31, 47)
    s = 32.75
    n = 100_000

    S = NormalDist.from_samples([x + s for x in X.samples(n)])
    assert_close(X + s, S)

    S = NormalDist.from_samples([x - s for x in X.samples(n)])
    assert_close(X - s, S)

    S = NormalDist.from_samples([x * s for x in X.samples(n)])
    assert_close(X * s, S)

    S = NormalDist.from_samples([x / s for x in X.samples(n)])
    assert_close(X / s, S)

    S = NormalDist.from_samples([x + y for x, y in zip(X.samples(n), Y.samples(n))])
    assert_close(X + Y, S)

    S = NormalDist.from_samples([x - y for x, y in zip(X.samples(n), Y.samples(n))])
    assert_close(X - Y, S)
Lib/test/test_statistics.py (view file @ 11c79531)
...
...
@@ -5,9 +5,11 @@ approx_equal function.
 import collections
 import collections.abc
+import copy
 import decimal
 import doctest
 import math
+import pickle
 import random
 import sys
 import unittest
...
...
@@ -2025,6 +2027,181 @@ class TestStdev(VarianceStdevMixin, NumericTestCase):
        expected = math.sqrt(statistics.variance(data))
        self.assertEqual(self.func(data), expected)


class TestNormalDist(unittest.TestCase):

    def test_slots(self):
        nd = statistics.NormalDist(300, 23)
        with self.assertRaises(TypeError):
            vars(nd)
        self.assertEqual(nd.__slots__, ('mu', 'sigma'))

    def test_instantiation_and_attributes(self):
        nd = statistics.NormalDist(500, 17)
        self.assertEqual(nd.mu, 500)
        self.assertEqual(nd.sigma, 17)
        self.assertEqual(nd.variance, 17**2)

        # default arguments
        nd = statistics.NormalDist()
        self.assertEqual(nd.mu, 0)
        self.assertEqual(nd.sigma, 1)
        self.assertEqual(nd.variance, 1**2)

        # error case: negative sigma
        with self.assertRaises(statistics.StatisticsError):
            statistics.NormalDist(500, -10)

    def test_alternative_constructor(self):
        NormalDist = statistics.NormalDist
        data = [96, 107, 90, 92, 110]
        # list input
        self.assertEqual(NormalDist.from_samples(data), NormalDist(99, 9))
        # tuple input
        self.assertEqual(NormalDist.from_samples(tuple(data)), NormalDist(99, 9))
        # iterator input
        self.assertEqual(NormalDist.from_samples(iter(data)), NormalDist(99, 9))
        # error cases
        with self.assertRaises(statistics.StatisticsError):
            NormalDist.from_samples([])                 # empty input
        with self.assertRaises(statistics.StatisticsError):
            NormalDist.from_samples([10])               # only one input

    def test_sample_generation(self):
        NormalDist = statistics.NormalDist
        mu, sigma = 10_000, 3.0
        X = NormalDist(mu, sigma)
        n = 1_000
        data = X.samples(n)
        self.assertEqual(len(data), n)
        self.assertEqual(set(map(type, data)), {float})
        # mean(data) expected to fall within 8 standard deviations
        xbar = statistics.mean(data)
        self.assertTrue(mu - sigma*8 <= xbar <= mu + sigma*8)

        # verify that seeding makes reproducible sequences
        n = 100
        data1 = X.samples(n, seed='happiness and joy')
        data2 = X.samples(n, seed='trouble and despair')
        data3 = X.samples(n, seed='happiness and joy')
        data4 = X.samples(n, seed='trouble and despair')
        self.assertEqual(data1, data3)
        self.assertEqual(data2, data4)
        self.assertNotEqual(data1, data2)

        # verify that subclass type is honored
        class NewNormalDist(NormalDist):
            pass
        nnd = NewNormalDist(200, 5)
        self.assertEqual(type(nnd), NewNormalDist)

    def test_pdf(self):
        NormalDist = statistics.NormalDist
        X = NormalDist(100, 15)
        # Verify peak around center
        self.assertLess(X.pdf(99), X.pdf(100))
        self.assertLess(X.pdf(101), X.pdf(100))
        # Test symmetry
        self.assertAlmostEqual(X.pdf(99), X.pdf(101))
        self.assertAlmostEqual(X.pdf(98), X.pdf(102))
        self.assertAlmostEqual(X.pdf(97), X.pdf(103))
        # Test vs CDF
        dx = 2.0 ** -10
        for x in range(90, 111):
            est_pdf = (X.cdf(x + dx) - X.cdf(x)) / dx
            self.assertAlmostEqual(X.pdf(x), est_pdf, places=4)
        # Error case: variance is zero
        Y = NormalDist(100, 0)
        with self.assertRaises(statistics.StatisticsError):
            Y.pdf(90)

    def test_cdf(self):
        NormalDist = statistics.NormalDist
        X = NormalDist(100, 15)
        cdfs = [X.cdf(x) for x in range(1, 200)]
        self.assertEqual(set(map(type, cdfs)), {float})
        # Verify monotonic
        self.assertEqual(cdfs, sorted(cdfs))
        # Verify center
        self.assertAlmostEqual(X.cdf(100), 0.50)
        # Error case: variance is zero
        Y = NormalDist(100, 0)
        with self.assertRaises(statistics.StatisticsError):
            Y.cdf(90)

    def test_same_type_addition_and_subtraction(self):
        NormalDist = statistics.NormalDist
        X = NormalDist(100, 12)
        Y = NormalDist(40, 5)
        self.assertEqual(X + Y, NormalDist(140, 13))    # __add__
        self.assertEqual(X - Y, NormalDist(60, 13))     # __sub__

    def test_translation_and_scaling(self):
        NormalDist = statistics.NormalDist
        X = NormalDist(100, 15)
        y = 10
        self.assertEqual(+X, NormalDist(100, 15))       # __pos__
        self.assertEqual(-X, NormalDist(-100, 15))      # __neg__
        self.assertEqual(X + y, NormalDist(110, 15))    # __add__
        self.assertEqual(y + X, NormalDist(110, 15))    # __radd__
        self.assertEqual(X - y, NormalDist(90, 15))     # __sub__
        self.assertEqual(y - X, NormalDist(-90, 15))    # __rsub__
        self.assertEqual(X * y, NormalDist(1000, 150))  # __mul__
        self.assertEqual(y * X, NormalDist(1000, 150))  # __rmul__
        self.assertEqual(X / y, NormalDist(10, 1.5))    # __truediv__
        with self.assertRaises(TypeError):
            y / X

    def test_equality(self):
        NormalDist = statistics.NormalDist
        nd1 = NormalDist()
        nd2 = NormalDist(2, 4)
        nd3 = NormalDist()
        self.assertNotEqual(nd1, nd2)
        self.assertEqual(nd1, nd3)

        # Test NotImplemented when types are different
        class A:
            def __eq__(self, other):
                return 10
        a = A()
        self.assertEqual(nd1.__eq__(a), NotImplemented)
        self.assertEqual(nd1 == a, 10)
        self.assertEqual(a == nd1, 10)

        # All subclasses compare equal, giving the same behavior
        # as list, tuple, int, float, complex, str, dict, set, etc.
        class SizedNormalDist(NormalDist):
            def __init__(self, mu, sigma, n):
                super().__init__(mu, sigma)
                self.n = n
        s = SizedNormalDist(100, 15, 57)
        nd4 = NormalDist(100, 15)
        self.assertEqual(s, nd4)

        # Don't allow duck-typed equality because we wouldn't
        # want a lognormal distribution to compare equal
        # to a normal distribution with the same parameters
        class LognormalDist:
            def __init__(self, mu, sigma):
                self.mu = mu
                self.sigma = sigma
        lnd = LognormalDist(100, 15)
        nd = NormalDist(100, 15)
        self.assertNotEqual(nd, lnd)

    def test_pickle_and_copy(self):
        nd = statistics.NormalDist(37.5, 5.625)
        nd1 = copy.copy(nd)
        self.assertEqual(nd, nd1)
        nd2 = copy.deepcopy(nd)
        self.assertEqual(nd, nd2)
        nd3 = pickle.loads(pickle.dumps(nd))
        self.assertEqual(nd, nd3)

    def test_repr(self):
        nd = statistics.NormalDist(37.5, 5.625)
        self.assertEqual(repr(nd), 'NormalDist(mu=37.5, sigma=5.625)')

# === Run tests ===
...
...
Misc/NEWS.d/next/Library/2019-02-21-15-47-00.bpo-36018.qt7QUe.rst (new file, 0 → 100644, view file @ 11c79531)
Add statistics.NormalDist, a tool for creating and manipulating normal
distributions of a random variable. Features a composite class that treats
the mean and standard deviation of measurement data as a single entity.