Statistical analysis of data helps us make sense of the information as a whole. This has applications in a lot of fields, like biostatistics and business analytics.
Instead of going through individual data points, just one look at their collective mean value or variance can reveal trends and features that we might have missed by observing all the data in raw format. It also makes the comparison between two large data sets way easier and more meaningful.
Keeping these needs in mind, Python has provided us with the statistics module.
In this tutorial, you will learn about different ways of calculating averages and measuring the spread of a given set of data. Unless stated otherwise, all the functions in this module support int, float, Decimal, and Fraction-based data sets as input.
Statistics Task                 Typical Functions
Calculating the Mean            mean(), fmean(), geometric_mean(), harmonic_mean()
Calculating the Mode            mode(), multimode()
Calculating the Median          median(), median_low(), median_high()
Measuring the Spread of Data    pvariance(), variance(), pstdev(), stdev()
Calculating the Mean
You can use the mean(data) function to calculate the mean of some given data. It is calculated by dividing the sum of all the data points by the number of data points. If the data is empty, a StatisticsError will be raised. Here are a few examples:
```python
import statistics
from fractions import Fraction as F
from decimal import Decimal as D

statistics.mean([11, 2, 13, 14, 44])
# returns 16.8

statistics.mean([F(8, 10), F(11, 20), F(2, 5), F(28, 5)])
# returns Fraction(147, 80)

statistics.mean([D("1.5"), D("5.75"), D("10.625"), D("2.375")])
# returns Decimal('5.0625')
```
You learned about a lot of functions to generate random numbers in our last tutorial. Let's use them now to generate our data and see if the final mean is equal to what we expect it to be.
```python
import random
import statistics

data_points = [random.randint(1, 100) for x in range(1, 1001)]
statistics.mean(data_points)
# returns 50.618

data_points = [random.triangular(1, 100, 80) for x in range(1, 1001)]
statistics.mean(data_points)
# returns 59.93292281437689
```
With the randint() function, the mean is expected to be close to the midpoint of the two extremes, and with the triangular distribution, it is supposed to be close to (low + high + mode) / 3. Therefore, the mean in the first and second cases should be close to 50.5 and 60.33 respectively, which matches what we actually got.
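As a quick sanity check, the expected values can be computed directly from these formulas:

```python
# Expected mean of uniform random integers on [1, 100]: the midpoint.
uniform_expected = (1 + 100) / 2
print(uniform_expected)  # 50.5

# Expected mean of triangular(low=1, high=100, mode=80): (low + high + mode) / 3.
triangular_expected = (1 + 100 + 80) / 3
print(triangular_expected)  # roughly 60.33
```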
One thing that you will realize when using the mean() function in the statistics module is that it has been written to prioritize accuracy over speed. This implies that you will get much better results with wildly varying data by using the mean() function instead of doing regular average computation with a simple sum.
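A small illustration of this difference, using a hypothetical data set chosen specifically to trigger floating-point cancellation in a naive sum:

```python
import statistics

# Mixing very large and very small magnitudes: a naive running sum
# loses the middle value entirely to rounding.
data = [1e16, 1, -1e16]

print(sum(data) / len(data))  # 0.0 -- the 1 is swallowed by rounding
print(statistics.mean(data))  # 0.3333333333333333 -- mean() sums exactly
```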
You can consider using the fmean() function, introduced in Python 3.8, if you prefer speed over absolute accuracy. The results will still be accurate in most situations. This function will convert all the data to floats and then return the mean as a float as well.
```python
import random
import statistics
from fractions import Fraction as F

int_values = [random.randrange(100) for x in range(9)]
frac_values = [F(1, 2), F(1, 3), F(1, 4), F(1, 5), F(1, 6), F(1, 7), F(1, 8), F(1, 9)]
mix_values = [*int_values, *frac_values]

print(statistics.mean(mix_values))
# 929449/42840

print(statistics.fmean(mix_values))
# 21.69582166199813
```
Python also supports the calculation of the geometric and harmonic means of data using the geometric_mean(data) function (added in version 3.8) and the harmonic_mean(data, weights=None) function (available since 3.6, with the weights parameter added in 3.10).
The geometric mean is calculated by multiplying all the n values in the data and then taking the nth root of the product. The results may be slightly off in some cases due to floating-point errors.
One application of the geometric mean is in the quick calculation of compound annual growth rates. For example, let's say the sales of a company over four years are 100, 120, 150, and 200. The percentage growth for the three years will then be 20%, 25%, and 33.33%. The average growth rate of sales for the company is more accurately represented by the geometric mean of the percentages; the arithmetic mean will always give us a slightly higher, less accurate growth rate.
```python
import statistics

growth_rates = [20, 25, 33.33]

print(statistics.mean(growth_rates))
# 26.11
print(statistics.geometric_mean(growth_rates))
# 25.542796263143476
```
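You can verify this result by applying the definition directly: multiply the n values together and take the nth root of the product. (math.prod requires Python 3.8, the same version that introduced geometric_mean().)

```python
import math
import statistics

growth_rates = [20, 25, 33.33]

# Geometric mean by hand: nth root of the product of n values.
manual = math.prod(growth_rates) ** (1 / len(growth_rates))

print(manual)                                   # roughly 25.5428
print(statistics.geometric_mean(growth_rates))  # roughly 25.5428
```

The two values may differ in the last few digits because geometric_mean() works with logarithms internally.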
The harmonic mean is simply the reciprocal of the arithmetic mean of the reciprocals of the data. Since the harmonic_mean() function calculates the mean of reciprocals, a value of 0 in the data creates problems, and we'll get a StatisticsError exception.
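The definition translates directly into code. This sketch computes n divided by the sum of reciprocals and compares it with the built-in function:

```python
import statistics

speeds = [30, 40, 60]

# Harmonic mean by hand: n / (sum of reciprocals).
manual = len(speeds) / sum(1 / s for s in speeds)

print(manual)                            # 40, possibly with a tiny float error
print(statistics.harmonic_mean(speeds))  # 40.0
```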
The harmonic mean is useful for calculating the averages of ratios and rates, such as calculating average speed, density, or the resistance of resistors in parallel. Here is some code that calculates the average speed when someone covers each fixed portion of a journey (100 km in this case) at a specific speed.
```python
import statistics

speeds = [30, 40, 60]
distance = 100
total_distance = len(speeds) * distance
total_time = 0

for speed in speeds:
    total_time += distance / speed

average_speed = total_distance / total_time
print(average_speed)
# 39.99999999999999

print(statistics.harmonic_mean(speeds))
# 40.0
```
Two things worth noticing here are that the harmonic_mean() function reduces all the calculations to a single one-liner and at the same time gives a more accurate result, free of floating-point error.
We can use the weights argument to specify the distance covered at each corresponding speed.
```python
import statistics

speeds = [30, 40, 60]
distances = [100, 120, 160]

print(statistics.harmonic_mean(speeds, distances))
# 42.222222222
```
Calculating the Mode
The mean is a good indicator of the average, but a few extreme values can result in an average that is far from the actual central location. In some cases, it is more desirable to determine the most frequent data point in a data set. The mode() function will return the most common data point from discrete numerical or non-numerical data. Along with multimode(), it is one of the few statistical functions that can be used with non-numeric data.
```python
import random
import statistics

data_points = [random.randint(1, 100) for x in range(1, 1001)]
statistics.mode(data_points)
# returns 94

data_points = [random.randint(1, 100) for x in range(1, 1001)]
statistics.mode(data_points)
# returns 49

data_points = [random.randint(1, 100) for x in range(1, 1001)]
statistics.mode(data_points)
# returns 32

statistics.mode(["cat", "dog", "dog", "cat", "monkey", "monkey", "dog"])
# returns 'dog'
```
The mode of randomly generated integers in a given range can be any of those numbers as the frequency of occurrence of each number is unpredictable. The three examples in the above code snippet prove that point. The last example shows us how we can calculate the mode of nonnumeric data.
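Conceptually, finding the mode is just a frequency count. As an illustration (not how the module is implemented), the same answer can be obtained with collections.Counter:

```python
import statistics
from collections import Counter

pets = ["cat", "dog", "dog", "cat", "monkey", "monkey", "dog"]

print(statistics.mode(pets))               # dog
print(Counter(pets).most_common(1)[0][0])  # dog
```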
A newer multimode() function in Python 3.8 allows us to return more than one result when there are multiple values that occur with the same top frequency.
```python
import statistics

favorite_pet = ['cat', 'dog', 'dog', 'mouse', 'cat', 'cat', 'turtle', 'dog']
print(statistics.multimode(favorite_pet))
# ['cat', 'dog']
```
Calculating the Median
Relying on the mode to calculate a central value can be a bit misleading. As we just saw in the previous section, it will always be the most frequently occurring data point, irrespective of all other values in the data set. Another way of determining the central location is by using the median() function. It will return the median value of given numeric data: if the number of data points is odd, it returns the middle point, and if the number of data points is even, it returns the average of the two middle values.
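A small deterministic example makes the odd/even behavior easier to see than random data:

```python
import statistics

odd_median = statistics.median([1, 3, 5])
print(odd_median)   # 3 -- odd count: the middle point itself

even_median = statistics.median([1, 3, 5, 7])
print(even_median)  # 4.0 -- even count: the average of 3 and 5
```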
The problem with the median() function is that the final value may not be an actual data point when the number of data points is even. In such cases, you can use either median_low() or median_high() to calculate the median. With an even number of data points, these functions will return the smaller and larger of the two middle values respectively.
```python
import random
import statistics

data_points = [random.randint(1, 100) for x in range(1, 50)]
statistics.median(data_points)
# returns 53

data_points = [random.randint(1, 100) for x in range(1, 51)]
statistics.median(data_points)
# returns 51.0

data_points = [random.randint(1, 100) for x in range(1, 51)]
statistics.median(data_points)
# returns 49.0

data_points = [random.randint(1, 100) for x in range(1, 51)]
statistics.median_low(data_points)
# returns 50
statistics.median_high(data_points)
# returns 52
statistics.median(data_points)
# returns 51.0
```
In the last case, the low and high medians were 50 and 52. This means that there was no data point with a value of 51 in our data set, but the median() function still calculated the median to be 51.0.
Measuring the Spread of Data
Determining how much the data points deviate from the typical or average value of the data set is just as important as calculating the central or average value itself. The statistics module has four different functions to help us calculate this spread of data.
You can use the pvariance(data, mu=None) function to calculate the population variance of a given data set. The second argument is optional: the value of mu, when provided, should be equal to the mean of the given data, and the mean is calculated automatically if it is missing. This function is helpful when you want to calculate the variance of an entire population. If your data is only a sample of the population, you can use the variance(data, xbar=None) function to calculate the sample variance. Here, xbar is the mean of the given sample and is calculated automatically if not provided.
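When you already have the mean at hand, passing it as the second argument avoids recomputing it. A small sketch, using a made-up sample:

```python
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]

mu = statistics.mean(sample)  # 5.0

# Passing the precomputed mean gives the same result as
# letting pvariance() calculate it internally.
print(statistics.pvariance(sample))      # 4.0
print(statistics.pvariance(sample, mu))  # 4.0
```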
To calculate the population standard deviation and the sample standard deviation, you can use the pstdev(data, mu=None) and stdev(data, xbar=None) functions respectively.
```python
import statistics
from fractions import Fraction as F

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

statistics.pvariance(data)
# returns 6.666666666666667
statistics.pstdev(data)
# returns 2.581988897471611
statistics.variance(data)
# returns 7.5
statistics.stdev(data)
# returns 2.7386127875258306

more_data = [3, 4, 5, 5, 5, 5, 5, 6, 6]
statistics.pvariance(more_data)
# returns 0.7654320987654322
statistics.pstdev(more_data)
# returns 0.8748897637790901

some_fractions = [F(5, 6), F(2, 3), F(11, 12)]
statistics.variance(some_fractions)
# returns Fraction(7, 432)
```
As evident from the above example, a smaller variance implies that more data points are closer in value to the mean. You can also calculate the standard deviation of decimals and fractions.
Final Thoughts
In this last tutorial of the series, we learned about different functions available in the statistics module. You might have observed that the data given to the functions was sorted in most cases, but it doesn't have to be. I have used sorted lists in this tutorial because they make it easier to understand how the value returned by different functions is related to the input data.