Introducing the Natural Language Toolkit (NLTK)

Natural language processing (NLP) is the automatic or semi-automatic processing of human language. NLP is closely related to linguistics and has links to research in cognitive science, psychology, physiology, and mathematics. In the computer science domain in particular, NLP is related to compiler techniques, formal language theory, human-computer interaction, machine learning, and theorem proving. This Quora question shows the different advantages of NLP.

In this tutorial I'm going to walk you through an interesting Python platform for NLP called the Natural Language Toolkit (NLTK). Before we see how to work with this platform, let me first tell you what NLTK is.

What Is NLTK?

The Natural Language Toolkit (NLTK) is a platform used for building programs for text analysis. The platform was originally released by Steven Bird and Edward Loper in conjunction with a computational linguistics course at the University of Pennsylvania in 2001. There is an accompanying book for the platform called Natural Language Processing with Python.

Installing NLTK

Let's now install NLTK to start experimenting with natural language processing. It will be fun!

Installing NLTK is very simple. I'm using Windows 10, so in my Command Prompt I type the following command:

pip install nltk

If you are using Ubuntu or macOS, you run the command from the Terminal. More information about installing NLTK on different platforms can be found in the documentation.

If you are wondering what pip is, it is a package management system used to install and manage software packages written in Python. If you are using Python 2 >=2.7.9 or Python 3 >=3.4, you already have pip installed! To check your Python version, simply type the following in your command prompt:

python --version

Let's go ahead and check if we have installed NLTK successfully. To do that, open up Python's IDLE and type the two lines shown in the figure below:

Figure: Check if we have installed NLTK successfully

If you get the version of your NLTK returned, then congratulations, you have NLTK installed successfully!

So in the above step, we used pip to install NLTK from the Python Package Index (PyPI) into our local Python environment.

Notice that you might have a different version of NLTK depending on when you installed the platform, but that shouldn't cause a problem.
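The same check can also be run as a small script. This is only a sketch: the helper name installed_nltk_version is made up for illustration, while nltk.__version__ is the attribute NLTK exposes for its version string.

```python
import importlib.util


def installed_nltk_version():
    """Return the installed NLTK version string, or None if NLTK is missing."""
    # find_spec() looks the package up without importing it
    if importlib.util.find_spec("nltk") is None:
        return None
    import nltk
    return nltk.__version__


version = installed_nltk_version()
if version is None:
    print("NLTK is not installed; run: pip install nltk")
else:
    print("NLTK version:", version)
```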

Working With NLTK

The first thing we need to do to work with NLTK is to download what are called the NLTK corpora. I'm going to download all of them. I know the download is very large (10.9 GB), but we only need to do it once. If you know which corpora you need, you don't have to download everything.

In your Python's IDLE, type the following:

import nltk
nltk.download()

In this case, you will get a GUI from which you can specify the destination and what to download, as shown in the figure below:

I'm going to download everything at this point. Click the Download button at the bottom left of the window, and wait for a while until everything gets downloaded to your destination directory.
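If you would rather fetch only what this tutorial uses, nltk.download() also accepts a package identifier. The sketch below rests on a few assumptions: the identifiers in REQUIRED are my guess at what the later examples need (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt'), the helper name download_required is hypothetical, and I'm relying on nltk.download() returning a true value on success.

```python
# Assumed package identifiers for the tokenizer, stop word, and Gutenberg examples
REQUIRED = ["punkt", "stopwords", "gutenberg"]


def download_required(packages=REQUIRED, downloader=None):
    """Download each package and return the names that could not be fetched."""
    if downloader is None:
        # Import lazily so the helper can be defined without NLTK present
        import nltk
        downloader = nltk.download
    # quiet=True suppresses the per-package progress messages
    return [name for name in packages if not downloader(name, quiet=True)]
```

Calling download_required() with no arguments uses nltk.download and needs a network connection; an empty list back means every package was fetched.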

Before moving forward, you might be wondering what a corpus (singular of corpora) is. A corpus can be defined as follows:

Corpus, plural corpora; A collection of linguistic data, either compiled as written texts or as a transcription of recorded speech. The main purpose of a corpus is to verify a hypothesis about language - for example, to determine how the usage of a particular sound, word, or syntactic construction varies. Corpus linguistics deals with the principles and practice of using corpora in language study. A computer corpus is a large body of machine-readable texts.

( Crystal, David. 1992. An Encyclopedic Dictionary of Language and Languages. Oxford: Blackwell.)

A text corpus is thus simply any large body of text.

Tokenization

Tokenization, as defined in Wikipedia, is:

The process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens.

Sentence Tokenizer

Sentence tokenization splits text into sentences, which we do with the sent_tokenize() function.

Consider the following text.

"Python is a very high-level programming language. Python is interpreted."

Let's tokenize it using the sent_tokenize() function.

from nltk.tokenize import sent_tokenize
text = "Python is a very high-level programming language. Python is interpreted."
print(sent_tokenize(text))

Here is the output, a list containing the two sentences:

['Python is a very high-level programming language.', 'Python is interpreted.']
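To get a feel for what a sentence tokenizer has to deal with, here is a rough standard-library approximation. It is not how sent_tokenize() works internally: the regular expression is a naive rule I'm assuming for illustration, and the function name naive_sent_split is made up.

```python
import re


def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows (naive rule)."""
    return re.split(r'(?<=[.!?])\s+', text.strip())


text = "Python is a very high-level programming language. Python is interpreted."
print(naive_sent_split(text))

# The naive rule also splits after abbreviations, which is usually unwanted
print(naive_sent_split("Dr. Smith uses Python. It is interpreted."))
```

The second call breaks right after "Dr.", which is the kind of case a dedicated sentence tokenizer is meant to handle better than a one-line pattern.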

Word Tokenizer

Word tokenization splits text into words, which we do with the word_tokenize() function. Let's use the same text and pass it to word_tokenize().

from nltk.tokenize import word_tokenize
text = "Python is a very high-level programming language. Python is interpreted."
print(word_tokenize(text))

Here is the output:

['Python', 'is', 'a', 'very', 'high-level', 'programming', 'language', '.', 'Python', 'is', 'interpreted', '.']

As you can see from the output, punctuation marks are also considered to be words.
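If you only want the words, one option is to filter the token list afterwards. The sketch below works on the output shown above; the helper name drop_punctuation and the rule "keep tokens that contain at least one letter or digit" are my own assumptions, not something NLTK provides.

```python
def drop_punctuation(tokens):
    """Keep only tokens that contain at least one letter or digit."""
    return [tok for tok in tokens if any(ch.isalnum() for ch in tok)]


# Token list copied from the word_tokenize() output above
tokens = ['Python', 'is', 'a', 'very', 'high-level', 'programming',
          'language', '.', 'Python', 'is', 'interpreted', '.']

print(drop_punctuation(tokens))
```

Because the rule looks for any alphanumeric character, a hyphenated token such as 'high-level' is kept while a bare '.' is dropped.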

Stop Words

Sometimes we need to filter out words that add little to the analysis, so that the data is easier for the computer to process. In natural language processing (NLP), such words are called stop words. They are very common words (such as "the", "is", and "in") that carry little meaning on their own, so we often want to remove them.

NLTK provides us with some stop words to start with. To see those words, use the following script:

from nltk.corpus import stopwords
print(set(stopwords.words('english')))

This prints a set (an unordered collection of unique items) of stop words in the English language. If you are working with another language, for example German, pass its name instead:

from nltk.corpus import stopwords
print(set(stopwords.words('german')))

How can we remove the stop words from our own text? The example below shows how we can perform this task:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = 'In this tutorial, I\'m learning NLTK. It is an interesting platform.'
stop_words = set(stopwords.words('english'))
words = word_tokenize(text)

# Collect only the tokens that are not in the stop word set
new_sentence = []

for word in words:
    if word not in stop_words:
        new_sentence.append(word)

print(new_sentence)

The output of the above script is:

So what the word_tokenize() function does is:

Tokenize a string to split off punctuation other than periods
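One thing to keep in mind is that the membership test in the script above is case-sensitive, and as far as I know the stop word lists that ship with NLTK are lowercase, so capitalized tokens such as "In" or "It" slip through. A case-insensitive variant might look like the sketch below. The helper name remove_stop_words is hypothetical, and both demo_stop_words and the token list are small hand-written stand-ins so that the example runs without NLTK.

```python
def remove_stop_words(tokens, stop_words):
    """Drop tokens whose lowercase form appears in stop_words."""
    lowered = {word.lower() for word in stop_words}
    return [tok for tok in tokens if tok.lower() not in lowered]


# Hand-written stand-ins for stopwords.words('english') and word_tokenize(text)
demo_stop_words = {"in", "this", "i", "it", "is", "an"}
tokens = ["In", "this", "tutorial", ",", "I", "'m", "learning", "NLTK", ".",
          "It", "is", "an", "interesting", "platform", "."]

print(remove_stop_words(tokens, demo_stop_words))
```

In a real script you would pass set(stopwords.words('english')) and the result of word_tokenize(text) instead of the stand-ins.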

Searching

Let's say we have the following text file (download the text file from Dropbox). We would like to look for (search) the word language. We can simply do this using the NLTK platform as follows:

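import nltk

# Read the whole file; the with block closes it automatically
with open('NLTK.txt', 'r') as file:
    read_file = file.read()

# Tokenize the contents and wrap the tokens in an nltk.Text object
text = nltk.Text(nltk.word_tokenize(read_file))

# concordance() prints its matches and returns None, so there is nothing to assign
text.concordance('language')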

Running the script prints every occurrence of the word language, along with some surrounding context. Note that concordance() prints its results rather than returning them. Before calling it, as shown in the script above, we tokenize the contents of the file and then convert the tokens into an nltk.Text object.
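To make the idea of a concordance more concrete, here is a minimal keyword-in-context sketch in plain Python. It is not NLTK's implementation: the function name keyword_in_context, the window parameter, and the case-insensitive matching are all assumptions made for illustration.

```python
def keyword_in_context(tokens, keyword, window=3):
    """Return (left, match, right) tuples for each case-insensitive hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            # Up to `window` tokens on either side of the match
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


# Hand-written token list standing in for a tokenized file
tokens = "Python is a programming language . A language can be interpreted .".split()

for left, match, right in keyword_in_context(tokens, "language"):
    print(f"{left:>25} {match} {right}")
```

Unlike concordance(), this helper returns its matches, so they can be stored or processed further before printing.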

I just want to note that the first time I ran the program, I got the following error, which seems to be related to the encoding the console uses:

File "test.py", line 7, in <module>
    match = text.concordance('language').decode('utf-8')
  File "C:\Python35\lib\site-packages\nltk\text.py", line 334, in concordance
    self._concordance_index.print_concordance(word, width, lines)
  File "C:\Python35\lib\site-packages\nltk\text.py", line 200, in print_concordance
    print(left, self._tokens[i], right)
  File "C:\Python35\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2014' in position 11: character maps to <undefined>

To solve this issue, I ran the following command in my console before running the program, which switches the console's code page to UTF-8: chcp 65001.
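On Python 3.7 and later, another option is to switch the output stream to UTF-8 from inside the script. This is an alternative sketch rather than the fix described above, and use_utf8_stdout is a made-up helper name; reconfigure() is a method of the standard text stream class, so the helper checks for it first in case stdout has been replaced by something else.

```python
import sys


def use_utf8_stdout():
    """Switch stdout to UTF-8 if the stream supports reconfigure()."""
    reconfigure = getattr(sys.stdout, "reconfigure", None)
    if reconfigure is None:
        # stdout was replaced by an object without reconfigure()
        return False
    # errors="replace" avoids UnicodeEncodeError for unencodable characters
    reconfigure(encoding="utf-8", errors="replace")
    return True


use_utf8_stdout()
print("em dash: \u2014")
```

Call the helper once at the top of the script, before anything is printed.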

The Gutenberg Corpus

As mentioned in Wikipedia:

Project Gutenberg (PG) is a volunteer effort to digitize and archive cultural works, to "encourage the creation and distribution of eBooks". It was founded in 1971 by Michael S. Hart and is the oldest digital library. Most of the items in its collection are the full texts of public domain books. The project tries to make these as free as possible, in long-lasting, open formats that can be used on almost any computer. As of 3 October 2015, Project Gutenberg reached 50,000 items in its collection.

NLTK contains a small selection of texts from Project Gutenberg. To see the included files from Project Gutenberg, we do the following:

import nltk

gutenberg_files = nltk.corpus.gutenberg.fileids()
print(gutenberg_files)

The output of the above script will be as follows:

If we want to find the number of words for the text file bryant-stories.txt for instance, we can do the following:

import nltk

bryant_words = nltk.corpus.gutenberg.words('bryant-stories.txt')
print(len(bryant_words))

The above script should return the following number of words: 55563.
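The count above includes punctuation tokens and treats every occurrence separately. If you also want the number of distinct words or the most frequent ones, something like the following sketch could help. The helper name word_stats and the sample token list are made up for illustration; the function only assumes it is given a sequence of string tokens.

```python
from collections import Counter


def word_stats(words, top_n=3):
    """Return (total tokens, distinct alphabetic words, most common words)."""
    # Lowercase the tokens and ignore anything that is not purely alphabetic
    alphabetic = [w.lower() for w in words if w.isalpha()]
    counts = Counter(alphabetic)
    return len(words), len(counts), counts.most_common(top_n)


# Hand-written sample standing in for a corpus word list
sample = ["The", "cat", "saw", "the", "dog", ".", "The", "dog", "ran", "."]

total, distinct, common = word_stats(sample, top_n=2)
print(total, distinct, common)
```

With NLTK and the Gutenberg corpus installed, you could pass nltk.corpus.gutenberg.words('bryant-stories.txt') in place of the sample list.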

Conclusion

As we have seen in this tutorial, the NLTK platform provides us with a powerful tool for working with natural language processing (NLP). I have only scratched the surface in this tutorial. If you would like to go deeper into using NLTK for different NLP tasks, you can refer to NLTK's accompanying book: Natural Language Processing with Python.

This post has been updated with contributions from Esther Vaati. Esther is a software developer and writer for Envato Tuts+.