Introduction to Natural Language Processing with NLTK

What is Natural Language Processing?

Natural Language Processing (NLP) helps computers (machines) “read and understand” text or speech by simulating human language abilities.

In recent years, NLP has grown rapidly and gained immense popularity, largely because more and more unstructured data is available.

Prerequisites 

  • Python 3.x
  • Jupyter Notebook

Natural Language Tool Kit (NLTK)

Natural Language Tool Kit (NLTK) is one of the most popular Python toolkits for NLP-related tasks. It comes with numerous examples and a clear, concise API, along with numerous corpora and other tools that cater to most NLP-related tasks.

To install NLTK on Anaconda, follow the given link:

https://anaconda.org/anaconda/nltk

We also need to install the packages and corpora that come with NLTK separately. The following link explains that process:

https://www.nltk.org/data.html

Basic Tasks of Natural Language Processing 

We’ll be discussing each task in detail and also demonstrating how to perform it using NLTK.

  1. Tokenization
  2. Word Stemming and Lemmatization
  3. Part-Of-Speech (POS) Tagging
  4. Chunking
  5. Stop Word Removal
  6. Named Entity Recognition

Tokenization

Tokenization (also known as word segmentation) is the process of breaking text into smaller meaningful elements called tokens. These so-called tokens can be words, numbers, or punctuation marks.

This process is done by using a tokenization algorithm, which identifies the word or sentence boundaries and splits at the boundary. Tokenization is a crucial step in most NLP-related tasks. In most cases, it functions as a pre-processing step.

Since tokenization is relatively easy and uninteresting compared to other NLP tasks, it’s overlooked most of the time. However, errors made in this phase will propagate into later stages and cause problems.

Sentence Tokenization

Sentence tokenization is the process of tokenizing a text into sentences. To perform sentence-level tokenization, NLTK provides a method called sent_tokenize. This method uses an instance of PunktSentenceTokenizer.

We import the sent_tokenize method as depicted in the code snippet below. It takes a string as a parameter and returns a list of sentences. The tokenizer is already trained for English and a few other European languages.

Screenshot 2018-12-25 at 1.06.40 PM

Word Tokenization

Word tokenization is the process of tokenizing sentences or text into words and punctuation. NLTK provides several ways to perform word-level tokenization.

It provides a method called word_tokenize, which splits text using punctuation and non-alphabetic characters. This method wraps a Treebank-style tokenizer (TreebankWordTokenizer), so for a single sentence the results of the two are generally the same.

NLTK also provides other tokenizers, such as WordPunctTokenizer and WhitespaceTokenizer. WordPunctTokenizer also splits the text at punctuation. But unlike TreebankWordTokenizer, this tokenizer splits the punctuation into separate tokens. WhitespaceTokenizer, as the name suggests, splits the text using white spaces. There are a few other tokenizers available as well.

Screenshot 2018-12-25 at 1.11.45 PM
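
To make the differences between these tokenizers concrete, the sketch below runs three of them on the same sentence. The sentence is an invented example; the tokenizer classes are the ones named above and need no extra data downloads.

```python
from nltk.tokenize import (
    TreebankWordTokenizer,
    WhitespaceTokenizer,
    WordPunctTokenizer,
)

# Invented sample sentence with contractions and punctuation.
sentence = "Don't panic, it's only a test."

# Treebank rules split contractions into linguistically meaningful pieces.
treebank_tokens = TreebankWordTokenizer().tokenize(sentence)
# WordPunct splits on every run of punctuation, including apostrophes.
wordpunct_tokens = WordPunctTokenizer().tokenize(sentence)
# Whitespace only splits on spaces, so punctuation stays attached to words.
whitespace_tokens = WhitespaceTokenizer().tokenize(sentence)

print(treebank_tokens)
# ['Do', "n't", 'panic', ',', 'it', "'s", 'only', 'a', 'test', '.']
print(wordpunct_tokens)
# ['Don', "'", 't', 'panic', ',', 'it', "'", 's', 'only', 'a', 'test', '.']
print(whitespace_tokens)
# ["Don't", 'panic,', "it's", 'only', 'a', 'test.']
```

Which tokenizer fits best depends on the later steps: the Treebank-style output matches what most pre-trained English taggers expect, while the whitespace version is only suitable when punctuation does not matter.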

Word Stemming and Lemmatization

The goal of both stemming and lemmatization is to reduce an inflected (or derived) word’s form to its root or base form. It’s essential for many NLP-related tasks such as information retrieval, text summarization, topic extraction, and more.

am, are, is => be
car, cars, car's, cars' => car

Even though the goal is similar, the process by which it’s done is different.

Stemming 

Stemming is a heuristic process in which a word’s endings are chopped off in the hope of achieving its base form. Stemming acts on words without knowing the context. Therefore, it’s faster but doesn’t always yield the desired result.

Stemming isn’t as easy as we might presume. If it were, there would be only one implementation. Stemming is an imprecise science, which leads to issues such as understemming and overstemming.

Understemming is the failure to reduce words with the same meaning to the same root. For example, jumped and jumps may be reduced to jump, while jumpiness may be reduced to jumpi.

Overstemming is the failure to keep two words with distinct meanings separate. For instance, general and generate may both be stemmed to gener.

NLTK provides several stemmers, the most prominent being PorterStemmer, which is based on the Porter Stemming Algorithm. Its prominence is mainly because it tends to give better results than the other stemmers.

Other stemmers include SnowballStemmer and LancasterStemmer. It’s worth mentioning that SnowballStemmer supports other languages as well. The following code snippet compares the aforementioned stemmers.

Screenshot 2018-12-25 at 1.23.14 PM
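
A rough text version of such a comparison might look like the sketch below. The word list reuses the understemming and overstemming examples from earlier in this section; the dictionary of stemmers and the variable names are illustrative choices, not taken from the original snippet.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# Words taken from the understemming/overstemming examples above.
words = ["jumped", "jumps", "jumpiness", "general", "generate"]

stemmers = {
    "Porter": PorterStemmer(),
    # SnowballStemmer needs the language name; English is used here.
    "Snowball": SnowballStemmer("english"),
    "Lancaster": LancasterStemmer(),
}

# Stem every word with every stemmer and print the results side by side.
results = {
    name: [stemmer.stem(word) for word in words]
    for name, stemmer in stemmers.items()
}
for name, stems in results.items():
    print(f"{name:10} {stems}")
```

With PorterStemmer, jumpiness comes out as jumpi (understemming), and general and generate both collapse to gener (overstemming), which matches the behaviour described above.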

Lemmatization

Lemmatization is a process that uses vocabulary and morphological analysis of words to remove inflected endings and arrive at the base (dictionary) form, which is known as the lemma.

It’s a much more complicated and expensive process that requires an understanding of the context in which words appear in order to make decisions about what they mean. Hence, it uses a lexical vocabulary to derive the root form, is more time consuming than stemming, and is more likely to yield accurate results.

Lemmatization can be done with NLTK using WordNetLemmatizer, which uses a lexical database called WordNet.

NLTK provides an interface for the WordNet database. WordNetLemmatizer uses the interface to derive the lemma of a given word.

When using the WordNetLemmatizer, we should specify which part of speech should be used in order to derive the accurate lemma. Words can be in the form of Noun (n), Adjective (a), Verb (v), or Adverb (r). The following code snippet shows lemmatization in action.

lemmatize is a function to demonstrate how the lemma changes with the part of speech given.

Screenshot 2018-12-25 at 1.26.25 PM

Stemming vs Lemmatization

The choice between stemming and lemmatization mostly depends on the situation at hand. If speed is required, it’s better to resort to stemming. But if accuracy is required, it’s best to use lemmatization.

The following code snippet shows the comparison between stemming and lemmatization.

Screenshot 2018-12-25 at 1.28.41 PM

Part-Of-Speech (POS) Tagging

Part-Of-Speech tagging (or POS tagging) is also a very important component of NLP. The purpose of POS tagging is to label each token (a word in this case) with its grammatical category, such as noun, verb, adjective, or adverb. Most parts of speech are further divided into sub-classes.

POS tagging can be framed as a supervised machine learning problem, because a tagger takes features like the previous word, the next word, and capitalization into consideration when assigning a POS tag to a word.

The most popular tag set for POS tagging is the Penn Treebank tag set. Most of the trained POS taggers for English are trained on this tag set. The following link shows the POS tags available in the Penn Treebank tag set.

https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

NLTK provides a function called pos_tag to perform POS tagging of sentences, but it requires the sentence to be tokenized first. The following code snippet shows how POS tagging can be performed with NLTK:

Screenshot 2018-12-25 at 1.32.33 PM

Chunking 

Chunking, or shallow parsing, is a process that extracts phrases from a text sample. Here we extract chunks of sentences that carry meaning rather than identifying the sentence’s full structure. This is different from, and more advanced than, tokenization because it extracts phrases instead of tokens.

As an example, “North America” can be extracted as a single phrase using chunking, rather than as the two separate words “North” and “America” that tokenization produces.

Chunking requires POS tagged input and provides chunks of phrases as output. As with POS tags, there is a standard set of chunk tags, like Noun Phrase (NP), Verb Phrase (VP), etc.

As an example, let’s consider noun phrase chunking. In order to do this, we search for chunks corresponding to an individual noun phrase for a given rule. To create a NP chunk, we define the chunk grammar rule using POS tags. We will define this using a regular expression rule:

NP: {<DT>?<JJ>*<NN>} #NP

The rule states that whenever the chunker finds an optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN), an NP chunk should be formed.

This way we can use grammar rules to extract NPs from POS tagged sentences:

Screenshot 2018-12-25 at 1.34.59 PM
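
The sketch below applies the NP rule from above with NLTK's RegexpParser. To keep it independent of any downloaded model, the POS tagged sentence is written by hand (an assumption for illustration); in practice it would come from pos_tag.

```python
from nltk import RegexpParser

# Hand-tagged sentence (illustrative); normally the output of pos_tag.
tagged_sentence = [
    ("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN"),
]

# Optional determiner, any number of adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunk_parser = RegexpParser(grammar)

# parse returns a tree whose NP subtrees are the extracted chunks.
tree = chunk_parser.parse(tagged_sentence)

noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)
# ['the little yellow dog', 'the cat']
```

Tokens that match no rule, such as barked and at, stay in the tree as plain (word, tag) leaves outside any chunk.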

Stop Word Removal 

Stop words are simply words that carry very little meaning and are mostly used as part of the grammatical structure of a sentence. Words like “the”, “a”, “an”, “in”, etc. are considered stop words.

Even though it doesn’t seem like much, stop word removal plays an important role in tasks such as sentiment analysis. Search engines also use this process when indexing content and handling search queries.

NLTK comes with the stopwords corpus, which contains stop word lists for 16 different languages. NLTK provides no direct function to remove stop words, but we can use the lists to remove them from sentences programmatically.

If we are dealing with many sentences, first the text must be split into sentences using sent_tokenize. Then using word_tokenize, we can further break the sentences into words, and then remove the stop words using the list. The following code snippet depicts this process:

Screenshot 2018-12-25 at 1.38.16 PM

Named Entity Recognition

Named entity recognition (NER) is the process of identifying entities such as names, locations, dates, or organizations that exist in an unstructured text sample.

The purpose of NER is to be able to map the extracted entities against a knowledge base, or to extract relationships between different entities. For example: Who did what? Where did something take place? At what time did something occur?

It’s a very important task when dealing with information extraction. Other applications where NER is used:

  • Classifying content (in news, law domains)
  • For efficient search algorithms
  • In content recommendation algorithms
  • Chatbots, voice assistants, etc.

For domain-specific entities, in a field like medicine or law, we’ll need to train our own NER algorithm.

For casual use, NLTK provides us with a method called ne_chunk to perform NER on a given text. In order to use ne_chunk, the text needs to first be tokenized into words and then POS tagged. After NER, the tagged words depict their respective entity type. In this case, Mark and John are of type PERSON, Google and Yahoo are of type ORGANIZATION, and New York City is of type GPE (which indicates location).

Screenshot 2018-12-25 at 2.45.07 PM
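
ne_chunk returns an nltk Tree in which each recognized entity is a subtree labelled with its entity type. The sketch below shows one way to pull (entity, type) pairs out of such a tree. The helper names named_entities and recognize are hypothetical, and the sample tree is built by hand to mirror the entity types described above, so that the example runs without the downloaded models; a real tree would come from recognize, which needs the tokenizer, tagger, NE chunker, and words resources.

```python
from nltk import Tree, ne_chunk, pos_tag, word_tokenize


def named_entities(tree):
    """Collect (entity text, entity type) pairs from an ne_chunk-style tree."""
    entities = []
    for node in tree:
        # Entities are subtrees; ordinary tokens are plain (word, tag) tuples.
        if isinstance(node, Tree):
            name = " ".join(word for word, tag in node.leaves())
            entities.append((name, node.label()))
    return entities


def recognize(text):
    """Tokenize, POS tag, and NE-chunk `text` (requires the NLTK models)."""
    return named_entities(ne_chunk(pos_tag(word_tokenize(text))))


# Hand-built tree (an assumption for illustration), shaped like the
# output of ne_chunk for a sentence about Mark, John, Google, and Yahoo.
sample_tree = Tree("S", [
    Tree("PERSON", [("Mark", "NNP")]),
    ("and", "CC"),
    Tree("PERSON", [("John", "NNP")]),
    ("work", "VBP"),
    ("at", "IN"),
    Tree("ORGANIZATION", [("Google", "NNP")]),
    ("and", "CC"),
    Tree("ORGANIZATION", [("Yahoo", "NNP")]),
    ("in", "IN"),
    Tree("GPE", [("New", "NNP"), ("York", "NNP"), ("City", "NNP")]),
    (".", "."),
])

entities = named_entities(sample_tree)
for name, entity_type in entities:
    print(f"{entity_type:13} {name}")
```

Multi-word entities such as New York City arrive as a single subtree, which is why the helper joins the leaves instead of reading tokens one by one.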

WordNet Interface 

WordNet is a large English lexical database. Nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.

A synset, or “synonym set”, is a collection of synonymous words.

NLTK provides an interface for the WordNet database, and it comes with the corpus module. WordNet is composed of approximately 155,200 words and 117,600 synonym sets that are logically related to each other.

As an example, in WordNet, a word like computer has two possible senses: one is a machine for performing computation, and the other is a calculator, which is related to computer in a lexical sense. The first sense is identified by computer.n.01, which is known as the “lemma code name”; the letter n indicates that the word is a noun.

from nltk.corpus import wordnet
wordnet.synsets("computer")

OUTPUT: [Synset('computer.n.01'), Synset('calculator.n.01')]

We can further analyze the synset to find other words associated with it. As you can see, all the words that are closely associated with the word computer (and used in the same context) are listed:

wordnet.synset('computer.n.01').lemma_names()
OUTPUT: ['computer',
 'computing_machine',
 'computing_device',
 'data_processor',
 'electronic_computer',
 'information_processing_system']

Using WordNet, we’re able to find the definition of a particular word and also the usages of a word (the database may or may not contain usages for words):

wordnet.synset('computer.n.01').definition()
OUTPUT: 'a machine for performing calculations automatically'

wordnet.synset("car.n.01").examples()
OUTPUT: ['he needs a car to get to work']

Also, we can use it to find synonyms and antonyms of words. The following snippet contains all the code mentioned here and also shows how to retrieve synonyms and antonyms for a particular word:

Screenshot 2018-12-25 at 2.52.10 PM

Conclusion

In this article, we discussed how to use NLTK to perform some basic but useful tasks in Natural Language Processing. We covered tokenization, stemming, lemmatization, stop word removal, POS tagging, chunking, named entity recognition, and some basics of the WordNet interface.

Hope you found the article useful!

The source code for this post can be found below:

https://github.com/nagilla748/Natural-Language-Processing

If you have any problems or questions regarding this article, please do not hesitate to leave a comment below or drop me an email.

Email address: venkatesh.nagilla@outlook.com

 

 
