Mathematical Modules in Python: Statistics

Statistical analysis of data helps us make sense of the information as a whole. This has applications in a lot of fields, like biostatistics and business analytics.

Instead of going through individual data points, just one look at their collective mean value or variance can reveal trends and features that we might have missed by observing all the data in raw format. It also makes the comparison between two large data sets way easier and more meaningful.

Keeping these needs in mind, Python has provided us with the statistics module.

In this tutorial, you will learn about different ways of calculating averages and measuring the spread of a given set of data. Unless stated otherwise, all the functions in this module support int, float, Decimal, and Fraction based data sets as input.

Statistics Task	Typical Functions
Calculating the Mean	`mean()`, `fmean()`, `geometric_mean()`, `harmonic_mean()`
Calculating the Mode	`mode()`, `multimode()`
Calculating the Median	`median()`, `median_low()`, `median_high()`
Measuring the Spread of Data	`pvariance()`, `variance()`, `pstdev()`, `stdev()`

Calculating the Mean

You can use the mean(data) function to calculate the mean of some given data. It is calculated by dividing the sum of all the data points by the number of data points. If the data is empty, a StatisticsError will be raised. Here are a few examples:

import statistics
from fractions import Fraction as F
from decimal import Decimal as D

statistics.mean([11, 2, 13, 14, 44])
# returns 16.8

statistics.mean([F(8, 10), F(11, 20), F(2, 5), F(28, 5)])
# returns Fraction(147, 80)

statistics.mean([D("1.5"), D("5.75"), D("10.625"), D("2.375")])
# returns Decimal('5.0625')
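The empty-data case mentioned above can be handled with a try/except block. This is only a minimal sketch: the safe_mean() helper and its default argument are hypothetical names used for illustration and are not part of the statistics module.

```python
import statistics


def safe_mean(data, default=None):
    """Return the mean of data, or default when data is empty (hypothetical helper)."""
    try:
        return statistics.mean(data)
    except statistics.StatisticsError:
        # mean() raises StatisticsError when there are no data points
        return default


print(safe_mean([11, 2, 13, 14, 44]))
print(safe_mean([]))
```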

You learned about a lot of functions to generate random numbers in our last tutorial. Let's use them now to generate our data and see if the final mean is equal to what we expect it to be.

import random
import statistics

data_points = [ random.randint(1, 100) for x in range(1,1001) ]
statistics.mean(data_points)
# returns 50.618

data_points = [ random.triangular(1, 100, 80) for x in range(1,1001) ]
statistics.mean(data_points)
# returns 59.93292281437689

With the randint() function, the mean is expected to be close to the midpoint of the two extremes, and with the triangular distribution, it is supposed to be close to (low + high + mode) / 3. Therefore, the mean in the first and second cases should be about 50.5 and 60.33 respectively, which is close to what we actually got.
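As a rough check of these expectations, the sketch below computes both theoretical means and compares them with sample means from a larger sample. The seed value and the sample size are arbitrary choices made only so that the run is repeatable.

```python
import random
import statistics

random.seed(42)  # arbitrary seed, used only to make the run repeatable

low, high, mode = 1, 100, 80
sample_size = 100_000  # assumed sample size, larger than in the example above

expected_uniform = (low + high) / 2            # 50.5
expected_triangular = (low + high + mode) / 3  # about 60.33

uniform_sample = [random.randint(low, high) for _ in range(sample_size)]
triangular_sample = [random.triangular(low, high, mode) for _ in range(sample_size)]

uniform_mean = statistics.fmean(uniform_sample)
triangular_mean = statistics.fmean(triangular_sample)

print(expected_uniform, uniform_mean)
print(expected_triangular, triangular_mean)
```

The sample means will not match the expected values exactly, but they should get closer as the sample size grows.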

One thing that you will realize when using the mean() function in the statistics module is that it has been written to prioritize accuracy over speed. This implies that you will get much better results with wildly varying data by using the mean() function instead of doing regular average computation with a simple sum.
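Here is a small sketch of what that accuracy means in practice. The data set is contrived: two huge values that cancel each other out, with two small values in between. A plain running total loses the small values, while mean() keeps them.

```python
import statistics

data = [1e30, 1.0, 3.0, -1e30]

# Naive running total: 1.0 and 3.0 are far too small to change 1e30,
# so they vanish before the two huge values cancel out
total = 0.0
for value in data:
    total += value
naive_mean = total / len(data)

accurate_mean = statistics.mean(data)

print(naive_mean)     # 0.0
print(accurate_mean)  # 1.0
```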

You can consider using the fmean() function introduced in Python 3.8 if you prefer speed over absolute accuracy. The results will still be accurate in most situations. This function will convert all the data to floats and then return the mean as a float as well.

import random
import statistics
from fractions import Fraction as F

int_values = [random.randrange(100) for x in range(9)]
frac_values = [F(1, 2), F(1, 3), F(1, 4), F(1, 5), F(1, 6), F(1, 7), F(1, 8), F(1, 9)]

mix_values = [*int_values, *frac_values]

print(statistics.mean(mix_values))
# 929449/42840

print(statistics.fmean(mix_values))
# 21.69582166199813
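Since the values above are random, the output changes on every run. The following sketch uses a fixed pair of fractions to show the difference in return types: mean() keeps the Fraction type, while fmean() always returns a float.

```python
import statistics
from fractions import Fraction as F

values = [F(1, 2), F(1, 4)]

exact = statistics.mean(values)  # stays a Fraction
fast = statistics.fmean(values)  # always a float

print(repr(exact))  # Fraction(3, 8)
print(fast)         # 0.375
```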

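Python also supports the calculation of the geometric and harmonic means of data using the geometric_mean(data) and harmonic_mean(data, weights=None) functions. The harmonic_mean() function has been available since Python 3.6, geometric_mean() was added in Python 3.8, and the weights argument of harmonic_mean() was added in Python 3.10.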

The geometric mean is calculated by multiplying all the n values in the data and then taking the nth root of the product. The results may be slightly off in some cases due to floating-point errors.
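The definition can be checked by hand on a small data set. This sketch multiplies the values with math.prod() and takes the nth root, then compares the result with geometric_mean(). The two numbers may differ in the last few digits because of floating-point rounding.

```python
import math
import statistics

data = [2, 4, 8]

product = math.prod(data)             # 2 * 4 * 8 = 64
by_hand = product ** (1 / len(data))  # cube root of 64, which is 4 up to rounding

print(by_hand)
print(statistics.geometric_mean(data))
```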

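One application of the geometric mean is in the quick calculation of compound annual growth rates. For example, let's say the sales of a company in four years are 100, 120, 150, and 200. The percentage growth for three years will then be 20%, 25%, and 33.33%, which corresponds to growth factors of 1.20, 1.25, and 1.3333. The average growth rate of sales for the company is more accurately represented by the geometric mean of these growth factors than by the arithmetic mean of the percentages. Because the arithmetic mean is never smaller than the geometric mean, it will overstate the growth rate whenever the yearly rates differ.

import statistics

growth_rates = [20, 25, 33.33]

# Convert each percentage to a growth factor, e.g. 20% becomes 1.20
growth_factors = [1 + rate / 100 for rate in growth_rates]

print(statistics.mean(growth_rates))
# about 26.11

print(statistics.geometric_mean(growth_factors))
# about 1.2599, i.e. an average growth of roughly 25.99% per year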

The harmonic mean is simply the reciprocal of the arithmetic mean of the reciprocals of the data. Since the harmonic_mean() function works with reciprocals, the data must not contain negative values, or we'll get a StatisticsError exception. If any value in the data is 0, the function returns 0.
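The definition translates directly into code. The sketch below computes the reciprocal of the mean of the reciprocals for two sample speeds and compares it with harmonic_mean(). It also shows that a negative value raises a StatisticsError. The sample values are made up for illustration.

```python
import statistics

speeds = [40, 60]

# Reciprocal of the arithmetic mean of the reciprocals
reciprocals = [1 / s for s in speeds]
by_hand = 1 / statistics.fmean(reciprocals)

print(by_hand)
print(statistics.harmonic_mean(speeds))

# Negative values are rejected
try:
    statistics.harmonic_mean([40, -60])
except statistics.StatisticsError as error:
    print("StatisticsError:", error)
```

Both printed means should be 48, give or take floating-point rounding.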

The harmonic mean is useful for calculating the averages of ratios and rates, such as calculating the average speed, density, or resistance in parallel. Here is some code that calculates the average speed when someone covers a fixed portion of a journey (100km in this case) with specific speeds.

import statistics


speeds = [30, 40, 60]
distance = 100

total_distance = len(speeds)*distance
total_time = 0

for speed in speeds:
    total_time += distance/speed

average_speed = total_distance/total_time

print(average_speed)
# 39.99999999999999

print(statistics.harmonic_mean(speeds))
# 40.0

Two things worth noticing here are that the harmonic_mean() function reduces all the calculations to a single one-liner and, in this example, avoids the rounding error that accumulated in the manual loop.

In Python 3.10 and later, we can use the weights argument to specify the distance that was covered at each speed.

import statistics

speeds = [30, 40, 60]
distances = [100, 120, 160]

print(statistics.harmonic_mean(speeds, distances))
# 42.222222222

Calculating the Mode

The mean is a good indicator of the average, but a few extreme values can result in an average that is far from the actual central location. In some cases, it is more desirable to determine the most frequent data point in a data set. The mode() function will return the most common data point from discrete numerical or non-numerical data. Along with multimode(), it is the only function in this module that can be used with nominal (non-numeric) data.

import random
import statistics

data_points = [ random.randint(1, 100) for x in range(1,1001) ]
statistics.mode(data_points)
# returns 94

data_points = [ random.randint(1, 100) for x in range(1,1001) ]
statistics.mode(data_points)
# returns 49

data_points = [ random.randint(1, 100) for x in range(1,1001) ]
statistics.mode(data_points)
# returns 32

statistics.mode(["cat", "dog", "dog", "cat", "monkey", "monkey", "dog"])
# returns 'dog'

The mode of randomly generated integers in a given range can be any of those numbers because the frequency of occurrence of each number is unpredictable. The three examples in the above code snippet illustrate that point. The last example shows us how we can calculate the mode of non-numeric data.
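When two or more values are tied for the highest frequency, mode() in Python 3.8 and later returns the first of them that appears in the data. Earlier versions raised a StatisticsError instead. The sketch below uses a made-up list of votes and a Counter to show the frequency table next to the result.

```python
import statistics
from collections import Counter

votes = ["b", "a", "a", "b", "c"]

counts = Counter(votes)
print(counts.most_common())    # frequency table, highest counts first
print(statistics.mode(votes))  # 'b' on Python 3.8+: the first of the tied values
```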

A newer multimode() function in Python 3.8 allows us to return more than one result when there are multiple values that occur with the same top frequency.

import statistics

favorite_pet = ['cat', 'dog', 'dog', 'mouse', 'cat', 'cat', 'turtle', 'dog']

print(statistics.multimode(favorite_pet))
# ['cat', 'dog']
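A few edge cases are worth trying out. In the sketch below, the values in each result appear in the order in which they were first encountered. If every value is unique, all of them are returned, and empty data gives an empty list instead of an exception.

```python
import statistics

print(statistics.multimode([1, 1, 2, 3, 3]))  # [1, 3]
print(statistics.multimode([4, 5, 6]))        # [4, 5, 6]
print(statistics.multimode([]))               # []
```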

Calculating the Median

Relying on the mode to calculate a central value can be a bit misleading. As we just saw in the previous section, it will always be the most frequently occurring data point, irrespective of all other values in the data set. Another way of determining the central location is by using the median() function, which returns the median value of the given numeric data. If the number of data points is odd, it returns the middle point. If the number of data points is even, it returns the average of the two middle values.

The problem with the median() function is that the final value may not be an actual data point when the number of data points is even. In such cases, you can either use median_low() or median_high() to calculate the median. With an even number of data points, these functions will return the smaller and larger value of the two middle points respectively.

import random
import statistics

data_points = [ random.randint(1, 100) for x in range(1,50) ]
statistics.median(data_points)
# returns 53

data_points = [ random.randint(1, 100) for x in range(1,51) ]
statistics.median(data_points)
# returns 51.0

data_points = [ random.randint(1, 100) for x in range(1,51) ]
statistics.median(data_points)
# returns 49.0

data_points = [ random.randint(1, 100) for x in range(1,51) ]
statistics.median_low(data_points)
# returns 50

statistics.median_high(data_points)
# returns 52

statistics.median(data_points)
# returns 51.0

In the last case, the low and high medians were 50 and 52. This means that there was no data point with a value of 51 in our data set, but the median() function still calculated the median to be 51.0.
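Because the data above is random, here is the same comparison on a small, fixed data set where the result is easy to verify by hand. The values are arbitrary and deliberately unsorted.

```python
import statistics

data = [7, 1, 5, 3]  # unsorted on purpose; the median functions sort a copy internally

print(statistics.median(data))       # 4.0, the average of 3 and 5
print(statistics.median_low(data))   # 3
print(statistics.median_high(data))  # 5
```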

Measuring the Spread of Data

Determining how much the data points deviate from the typical or average value of the data set is just as important as calculating the central or average value itself. The statistics module has four different functions to help us calculate this spread of data.

You can use the pvariance(data, mu=None) function to calculate the population variance of a given data set.

The second argument in this case is optional. The value of mu, when provided, should be equal to the mean of the given data. The mean is calculated automatically if the value is missing. This function is helpful when you want to calculate the variance of an entire population. If your data is only a sample of the population, you can use the variance(data, xbar=None) function to calculate the sample variance. Here, xbar is the mean of the given sample and is calculated automatically if not provided.
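The difference between the two functions is only the divisor: the population variance divides the sum of squared deviations by n, while the sample variance divides it by n - 1. The sketch below works this out by hand and also passes a precomputed mean through the optional arguments, which are spelled mu and xbar in the standard library.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
n = len(data)
mean = statistics.mean(data)  # 5

squared_deviations = sum((x - mean) ** 2 for x in data)  # 60

population_variance = squared_deviations / n    # divide by n
sample_variance = squared_deviations / (n - 1)  # divide by n - 1

print(population_variance, statistics.pvariance(data, mu=mean))
print(sample_variance, statistics.variance(data, xbar=mean))
```

Passing the mean explicitly is only useful when you have already computed it. The functions do not check that the value you pass is really the mean of the data.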

To calculate the population standard deviation and sample standard deviation, you can use the pstdev(data, mu=None) and stdev(data, xbar=None) functions respectively.

import statistics
from fractions import Fraction as F

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

statistics.pvariance(data)     # returns 6.666666666666667
statistics.pstdev(data)        # returns 2.581988897471611
statistics.variance(data)      # returns 7.5
statistics.stdev(data)         # returns 2.7386127875258306

more_data = [3, 4, 5, 5, 5, 5, 5, 6, 6]

statistics.pvariance(more_data)   # returns 0.7654320987654322
statistics.pstdev(more_data)      # returns 0.8748897637790901

some_fractions = [F(5, 6), F(2, 3), F(11, 12)]
statistics.variance(some_fractions)
# returns Fraction(7, 432)


As evident from the above example, a smaller variance implies that more data points are closer in value to the mean. You can also calculate the standard deviation of decimals and fractions.
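Here is a short sketch with Decimal values. The readings are made up for illustration. The variance comes back as a Decimal, and the standard deviation is its square root.

```python
import math
import statistics
from decimal import Decimal as D

readings = [D("1"), D("2"), D("3"), D("4"), D("5")]

sample_variance = statistics.variance(readings)
sample_stdev = statistics.stdev(readings)

print(repr(sample_variance))  # Decimal('2.5')
print(sample_stdev)

# The standard deviation is the square root of the variance
print(math.isclose(float(sample_stdev), math.sqrt(float(sample_variance))))
```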

Final Thoughts

In this last tutorial of the series, we learned about different functions available in the statistics module. You might have observed that the data given to the functions was sorted in most cases, but it doesn't have to be. I have used sorted lists in this tutorial because they make it easier to understand how the value returned by different functions is related to the input data.