Introduction to Natural Language Processing with NLTK

What is Natural Language Processing?

Natural Language Processing (NLP) helps computers (machines) “read and understand” text or speech by simulating human language abilities.

In recent years, NLP has grown rapidly because of the abundance of available data. As more and more unstructured data becomes available, NLP has gained immense popularity.


Prerequisites

  • Python 3.x
  • Jupyter Notebook

Natural Language Tool Kit (NLTK)

Natural Language Tool Kit (NLTK) is one of the most popular Python toolkits for NLP-related tasks. It comes with numerous examples and a clear, concise API. It also bundles numerous corpora and other tools that cater to most NLP-related tasks.

To install NLTK on Anaconda, follow the given link:

Also, we need to install the packages and corpora that come with NLTK separately. Following this link will help with that process:
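As a rough sketch of that step, the data packages can also be fetched from Python with nltk.download. The resource identifiers listed here are assumptions based on common NLTK releases; newer releases may expect different names (for example punkt_tab instead of punkt), so adjust the list to your installed version.

```python
import nltk

# Assumed resource identifiers; they can differ between NLTK releases.
RESOURCES = [
    "punkt",                       # sentence tokenizer model
    "averaged_perceptron_tagger",  # POS tagger model
    "wordnet",                     # WordNet database
    "stopwords",                   # stop word lists
    "maxent_ne_chunker",           # named entity chunker
    "words",                       # word list used by the chunker
]

# nltk.download returns True on success and False on failure.
results = {name: nltk.download(name, quiet=True) for name in RESOURCES}
print(results)
```

Any entry that maps to False was not downloaded, usually because of a network problem or an identifier that does not match the installed release.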

Basic Tasks of Natural Language Processing 

We’ll be discussing each task in detail and also demonstrating how to perform it using NLTK.

  1. Tokenization
  2. Word Stemming and Lemmatization
  3. Part-Of-Speech (POS) Tagging
  4. Chunking
  5. Stop Word Removal
  6. Named Entity Recognition


Tokenization

Tokenization (also known as word segmentation) is the process of breaking text into smaller meaningful elements called tokens. These tokens can be words, numbers, or punctuation marks.

This process is done by using a tokenization algorithm, which identifies the word or sentence boundaries and splits at the boundary. Tokenization is a crucial step in most NLP-related tasks. In most cases, it functions as a pre-processing step.

Since tokenization is relatively easy and uninteresting compared to other NLP tasks, it’s overlooked most of the time. However, errors made in this phase will propagate into later stages and cause problems.

Sentence Tokenization

Sentence tokenization is the process of tokenizing a text into sentences. To perform sentence-level tokenization, NLTK provides a function called sent_tokenize, which uses an instance of PunktSentenceTokenizer.

We import sent_tokenize from nltk.tokenize. The function takes a string as a parameter and returns a list of sentences. The tokenizer is already trained for English and a few other European languages.
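A minimal sketch of sentence tokenization follows. To avoid depending on a downloaded model, it instantiates PunktSentenceTokenizer directly with its default (untrained) parameters, which is an assumption made for this illustration; sent_tokenize is called the same way on a string but relies on the pre-trained model. The sample text is made up.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

text = "Hello world. How are you? I am fine."

# An untrained tokenizer uses default parameters and needs no model download.
# sent_tokenize(text) would use the pre-trained English model instead.
tokenizer = PunktSentenceTokenizer()
sentences = tokenizer.tokenize(text)

print(sentences)  # ['Hello world.', 'How are you?', 'I am fine.']
```

The untrained tokenizer knows no abbreviations, so text containing forms like "Dr." or "e.g." is better handled by the pre-trained model.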


Word Tokenization

Word tokenization is the process of tokenizing sentences or text into words and punctuation. NLTK provides several ways to perform word-level tokenization.

It provides a function called word_tokenize, which splits text using punctuation and non-alphabetic characters. This function is a wrapper around a Treebank-style tokenizer (TreebankWordTokenizer in the NLTK version used for this article), so for a single sentence the results from both are usually identical.

NLTK also provides other tokenizers, such as WordPunctTokenizer and WhitespaceTokenizer. WordPunctTokenizer also splits the text at punctuation but, unlike TreebankWordTokenizer, it splits all punctuation into separate tokens. WhitespaceTokenizer, as the name suggests, splits the text on whitespace. There are a few other tokenizers available as well.
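The differences between these tokenizers are easiest to see on a contraction. The sketch below runs the three tokenizer classes on one made-up sentence; none of them needs downloaded data.

```python
from nltk.tokenize import (
    TreebankWordTokenizer,
    WhitespaceTokenizer,
    WordPunctTokenizer,
)

sentence = "Can't stop now."

# Treebank rules split the contraction into "Ca" + "n't".
treebank_tokens = TreebankWordTokenizer().tokenize(sentence)
# WordPunctTokenizer separates every run of punctuation.
wordpunct_tokens = WordPunctTokenizer().tokenize(sentence)
# WhitespaceTokenizer only splits on whitespace.
whitespace_tokens = WhitespaceTokenizer().tokenize(sentence)

print(treebank_tokens)    # ['Ca', "n't", 'stop', 'now', '.']
print(wordpunct_tokens)   # ['Can', "'", 't', 'stop', 'now', '.']
print(whitespace_tokens)  # ["Can't", 'stop', 'now.']
```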


Word Stemming and Lemmatization

The goal of both stemming and lemmatization is to reduce an inflected (or derived) word’s form to its root or base form. It’s essential for many NLP-related tasks such as information retrieval, text summarization, topic extraction, and more.

am, are, is => be
car, cars, car's, cars' => car

Even though the goal is similar, the process by which it’s done is different.


Stemming

Stemming is a heuristic process in which a word’s endings are chopped off in the hope of reaching its base form. Stemming acts on words without knowing the context. Therefore, it’s faster but doesn’t always yield the desired result.

Stemming isn’t as easy as we might presume. If it were, there would be only one implementation. Stemming is an imprecise process, which leads to issues such as understemming and overstemming.

Understemming is the failure to reduce words with the same meaning to the same root: related forms of a word end up with different stems.

Overstemming is the failure to keep two words with distinct meanings separate: unrelated words are reduced to the same stem.
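Both effects can be reproduced with PorterStemmer. The word lists in this sketch are illustrative choices made for this example, not the examples from the original article.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Overstemming: three words with distinct meanings collapse to one stem.
over = {word: stemmer.stem(word) for word in ["universe", "university", "universal"]}

# Understemming: singular and plural of the same noun keep different stems.
under = {word: stemmer.stem(word) for word in ["data", "datum"]}

print(over)   # every value is 'univers'
print(under)  # 'data' and 'datum' are left unchanged
```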



NLTK provides several stemmers, the most prominent being PorterStemmer, which is based on the Porter Stemming Algorithm. It is prominent mainly because it tends to provide better results than the rest of the stemmers.

Other stemmers include SnowballStemmer and LancasterStemmer. It’s worth mentioning that SnowballStemmer supports other languages as well. The following code snippet compares the aforementioned stemmers.
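A minimal sketch of such a comparison is shown here. The word list is an arbitrary choice for illustration, and the three stemmers need no downloaded data.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

words = ["running", "generously", "maximum"]

# Map each word to the stem produced by each stemmer.
comparison = {
    word: {name: stemmer.stem(word) for name, stemmer in stemmers.items()}
    for word in words
}

for word, stems in comparison.items():
    print(word, stems)

# Languages supported by SnowballStemmer.
print(SnowballStemmer.languages)
```

LancasterStemmer is generally the most aggressive of the three, so its stems are often shorter than those from the other two.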



Lemmatization

Lemmatization is a process that uses a vocabulary and morphological analysis of words to remove inflectional endings and return the base (dictionary) form of a word, which is known as the lemma.

It’s a much more complicated and expensive process that requires an understanding of the context in which words appear in order to make decisions about what they mean. Hence, it uses a lexical vocabulary to derive the root form, is more time consuming than stemming, and is more likely to yield accurate results.

Lemmatization can be done with NLTK using WordNetLemmatizer, which uses a lexical database called WordNet. NLTK provides an interface for the WordNet database, and WordNetLemmatizer uses that interface to derive the lemma of a given word.

When using WordNetLemmatizer, we should specify which part of speech should be used in order to derive the accurate lemma. A word can be treated as a noun (n), adjective (a), verb (v), or adverb (r). The following code snippet shows lemmatization in action; lemmatize is a function that demonstrates how the lemma changes with the part of speech given.


Stemming vs Lemmatization

The choice between stemming and lemmatization mostly depends on the situation at hand. If speed is required, it’s better to resort to stemming. But if accuracy is required, it’s best to use lemmatization.

The following code snippet shows the comparison between stemming and lemmatization.


Part-Of-Speech (POS) Tagging

Part-Of-Speech tagging (or POS tagging) is also a very important component of NLP. The purpose of POS tagging is to label each token (a word, in this case) with its grammatical category, such as noun, verb, adjective, or adverb. Most parts of speech are further divided into sub-classes.

POS tagging is usually approached as a supervised machine learning problem: a tagger is trained on labeled text and takes features like the previous word, the next word, and capitalization into consideration when assigning a POS tag to a word.

The most popular tag set for POS tagging is the Penn Treebank tag set. Most of the trained POS taggers for English are trained on this tag set. The following link shows the available POS tags in the Penn Treebank tag set.

NLTK provides a function called pos_tag to perform POS tagging of sentences, but this requires the sentence to be tokenized first. The following code snippet shows how POS tagging can be performed with NLTK:



Chunking

Chunking, or shallow parsing, is a process that extracts phrases from a text sample. Here we extract chunks of sentences that constitute meaning rather than identifying the sentence’s full structure. This is different from, and more advanced than, tokenization because it extracts phrases instead of tokens.

As an example, the phrase “North America” can be extracted as a single chunk rather than as the two separate words “North” and “America” that tokenization produces.

Chunking requires POS-tagged input and provides chunks of phrases as output. As with POS tags, there is a standard set of chunk tags, such as Noun Phrase (NP) and Verb Phrase (VP).

As an example, let’s consider noun phrase chunking. In order to do this, we search for chunks corresponding to an individual noun phrase for a given rule. To create a NP chunk, we define the chunk grammar rule using POS tags. We will define this using a regular expression rule:

NP: {<DT>?<JJ>*<NN>} #NP

The rule states that whenever the chunker finds an optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN), an NP chunk should be formed.

This way we can use grammar rules to extract NPs from POS tagged sentences:
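A minimal sketch of NP chunking with this rule, using NLTK's RegexpParser, could look as follows. The POS tags are written by hand here (an assumption made to keep the sketch free of model downloads); in practice they would come from pos_tag.

```python
from nltk import RegexpParser

# Chunk grammar from the text: optional determiner, any adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"

# Hand-tagged sentence used as a stand-in for pos_tag output.
tagged = [
    ("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN"),
]

parser = RegexpParser(grammar)
tree = parser.parse(tagged)

# Collect the words of every NP subtree.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]

print(tree)
print(noun_phrases)  # ['the little yellow dog', 'the cat']
```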


Stop Word Removal 

Stop words are simply words that have very little meaning and are mostly used as part of the grammatical structure of a sentence. Words like “the”, “a”, “an”, “in”, etc. are considered stop-words.

Even though it doesn’t seem like much, stop word removal plays an important role when dealing with tasks such as sentiment analysis. Search engines also use this process when indexing entries and handling search queries.

NLTK comes with the stopwords corpus, which contained stop word lists for 16 different languages at the time of writing (newer releases cover more). NLTK provides no direct function to remove stop words, but we can use the list to remove them from sentences programmatically.

If we are dealing with many sentences, the text must first be split into sentences using sent_tokenize. Then, using word_tokenize, we can further break the sentences into words and remove the stop words using the list. The following code snippet depicts this process:


Named Entity Recognition

Named entity recognition (NER) is the process of identifying entities such as names, locations, dates, or organizations that exist in an unstructured text sample.

The purpose of NER is to be able to map the extracted entities against a knowledge base, or to extract relationships between different entities. For example: Who did what? Where did something take place? At what time did something occur?

It’s a very important task when dealing with information extraction. Other applications where NER is used:

  • Classifying content (in news, law domains)
  • For efficient search algorithms
  • In content recommendation algorithms
  • Chatbots, voice assistants, etc.

For domain-specific entities, in a field like medicine or law, we’ll need to train our own NER algorithm.

For casual use, NLTK provides a function called ne_chunk to perform NER on a given text. In order to use ne_chunk, the text first needs to be tokenized into words and then POS tagged. After NER, the tagged words depict their respective entity type. In the original example, Mark and John are of type PERSON, Google and Yahoo are of type ORGANIZATION, and New York City is of type GPE (geopolitical entity, which indicates a location).


WordNet Interface 

WordNet is a large English lexical database. Nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.

Synset or “synonym set” is a collection of synonymous words.

NLTK provides an interface for the WordNet database, and it comes with the corpus module. WordNet is composed of approximately 155,200 words and 117,600 synonym sets that are logically related to each other.

As an example, in WordNet, a word like computer has two possible senses (one being a machine for performing computation, and the other being a calculator, which is associated with computer in a lexical sense). The first sense is identified by computer.n.01, which is known as the “lemma code name”; the letter n indicates that the word is a noun.


OUTPUT: [Synset('computer.n.01'), Synset('calculator.n.01')]

We can further analyze the synset to find other words associated with it. As you can see, all the words that are closely associated (and in the same context) with the word computer are listed:

OUTPUT: ['computer',

Using WordNet, we’re able to find the definition of a particular word and also the usages of a word (the database may or may not contain usages for words):

OUTPUT: 'a machine for performing calculations automatically'

OUTPUT: ['he needs a car to get to work']

Also, we can use it to find synonyms and antonyms of words. The following snippet contains all the code mentioned here and also shows how to retrieve synonyms and antonyms for a particular word:




In this article, we discussed how to use NLTK to perform some basic but useful tasks in Natural Language Processing. We covered tokenization, stemming, lemmatization, stop word removal, POS tagging, chunking, named entity recognition, and some basics of the WordNet interface.

Hope you found the article useful!

The source code that created this post can be found below:

If you have any problems or questions regarding this article, please do not hesitate to leave a comment below or drop me an email.

Email address:


