How to Use Python to Find the Zipf Distribution of a Text File

This tutorial will show how we can use Python to work with a statistical concept, namely Zipf's law. It also demonstrates Python's efficiency in reading and sorting large text files when working with this law.

You might be wondering about the term Zipf distribution. To understand what we mean by this term, we need to define Zipf's law first. Don't worry, I'll keep everything simple.

Zipf's Law

Zipf's law simply states that given a corpus (large and structured set of texts) of natural language utterances, the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, four times as often as the fourth most frequent word, and so forth.

Let's look at an example of that. If you look into the Brown Corpus of American English, you will notice that the most frequent word is the (69,971 occurrences). The second most frequent word, of, occurs 36,411 times.

The word the accounts for around 7% of the Brown Corpus words (69,971 of slightly over 1 million words). If we come to the word of, we will notice that it accounts for around 3.6% of the corpus (around half of the). Thus, we can notice that Zipf's law applies to this situation.

Thus, Zipf's law is trying to tell us that a small number of items usually account for the bulk of activities we observe. For instance, a small number of diseases (cancer, cardiovascular diseases) account for the bulk of deaths. This also applies to words that account for the bulk of all word occurrences in literature, and many other examples in our lives.

Data Preparation

Before moving forward, let me refer you to the data we will be using to experiment with in our tutorial. Our data will come from the text version of the book Dracula, hosted on the Project Gutenberg website.

Building the Program

After you have downloaded the data in the above section, let's now start building our Python script that will find the Zipf distribution of the data in dracula.txt.

The first normal step to perform is to open the file:

1	with open('dracula.txt', 'r') as content:
2	text_string = content.read()

In order to carry out the necessary operations on the text file, we need to load the file in a string variable using the read() function.

Since we will be looking for a pattern (i.e. words), regular expressions come into play. We will thus be making use of Python's re module.

At this point, we have already read the bin file and loaded its content in a string variable. Finding the Zipf distribution means finding the frequency of occurrence of words in the bin file. The regular expression will thus be used to locate the words in the file.

The method we will be using to make such a match is the findall() method. As mentioned in the re module documentation about findall(), the method will:

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.

Writing a Regular Expression for Matching Words

What we want to do is write a regular expression that will locate all the individual words in the text string variable. The regular expression that can perform this task is:

1	\b[A-Za-z][a-z]{2,9}\b

The \b in the regular expression above is an anchor for word boundaries. This is a zero-length match. A word boundary match can occur in three places in a string.

At the beginning of the string or before the first character if the first character is a word character. A word character can be a letter (a-z, a-Z), a number (0-9), or the underscore (_).
At the end of the string or after the last character if the last character is a word character.
Between any two characters of a string if one of them is a word character while the other one is not.

This means that robotoics_89, Elephant, BIGmango, and 40_pie_40 are all matches for simple word boundaries.

The use of [A-Za-z][a-z] helps us get rid of any words that are not actually words in the traditional sense. For example, it will not match robotics_89, 40_pie_40, and BIGmango. The word BIGmango did not match because it contains more than one capital letter at the beginning.

This regular expression is basically telling us to find all the words that start with a letter (upper-case or lower-case) followed by a sequence of letters which consist of at least 2 characters and no more than 9 characters. In other words, the size of the words that will be included in the output will range from 3 to 10 characters long.

In Python, this can be represented as follows:

1	words = re.findall(r'(\b[A-Za-z][a-z]{2,9}\b)', file_to_string)

We can now run a loop which aims at calculating the frequency of occurrence of each word:

1	for word in words:
2	count = frequency.get(word,0)
3	frequency[word] = count + 1

Here, if the word is not found yet in the list of words, instead of raising a KeyError, the default value 0 is returned. Otherwise, the count is incremented by 1, representing the number of times the word has occurred in the list so far.

Finally, we will print the key-value pair of the dictionary, showing the word (key) and the number of times it appeared in the list (value):

most_frequent = dict(sorted(frequency.items(), key=lambda elem: elem[1], reverse=True))

    top_count = 0
        
    for idx, (words, frequency) in enumerate(most_frequent.items()):
        if idx == 0:
            top_count = frequency
        print(words, frequency, round(top_count/frequency, 2))

The first line sorts the dictionary of words by their frequency in descending order; that is, it shows the words from the most frequent occurrence to the least frequent occurrence. Inside the for loop, we use the enumerate() function to loop over the values so that we can also keep track of the index position of different words.

The frequency of the most frequent word is then divided by the frequency of other words to calculate their ratio. This allows us to see how closely Zipf's law is being followed.

Putting It All Together

After going through the different building blocks of the program, let's see how it all looks together:

import re
 
frequency = {}

with open('dracula.txt', 'r') as content:
    text_string = content.read()

    words = re.findall(r'\b[A-Za-z][a-z]{2,9}\b', text_string)
        
    for word in words:
        count = frequency.get(word,0)
        frequency[word] = count + 1

    most_frequent = dict(sorted(frequency.items(), key=lambda elem: elem[1], reverse=True))

    top_count = 0
        
    for idx, (words, frequency) in enumerate(most_frequent.items()):
        if idx == 0:
            top_count = frequency
        print(words, frequency, round(top_count/frequency, 2))

I will show here the first ten words and their frequencies returned by the program:

1	the 7473 1.0
2	and 5796 1.29
3	that 2434 3.07
4	was 1869 4.0
5	for 1480 5.05
6	not 1399 5.34
7	his 1384 5.4
8	with 1281 5.83
9	you 1267 5.9
10	all 1118 6.68

From this Zipf distribution, we can validate Zipf's law in that some words (high-frequency words) represent the bulk of words, such as the, and, that, was, and for.

Conclusion

In this tutorial, we have seen how Python makes it easy to work with statistical concepts such as Zipf's law. Python comes in very handy in particular when working with large text files, which would require a lot of time and effort if we were to find Zipf's distribution manually. As we saw, we were able to quickly load, parse, and find the Zipf's distribution of a file of size 28 MB. And it was simple to sort the output thanks to Python's dictionaries.

This post has been updated with contributions from Monty Shokeen. Monty is a full-stack developer who also loves to write tutorials and to learn about new JavaScript libraries.