To catch a quick idea of a long document, we will always to do a summarization when we read an article or book. In English, the first (or first two) sentence(s) of each article has a very high chance of representing the whole article. Of course, the topic sentence can be the last sentence in sometimes.
In NLP, there are two approaches to do the text summarization. The first one, extractive approach, is a simple approach which is extracting keywords or sentences from the article. There are some limitations and proved that the performance is not very good. The second one, abstractive approach, is generating a new sentences base on the given article. It needs a more advanced technique.
After reading this article:
- Understand the PageRank algorithm
- Understand the TextRank algorithm
- How can we use the TextRank algorithm to have a summarization
PageRank algorithm is developed by Google for searching the most importance of website so that Google search result is relevant to query.
In PageRank, it is a directed graph. In the beginning, all node have an equal score (1 / total number of the node).
The first formula is the simplified version of PageRank and we will use this one for demo. The second one is a little bit complicated as it involved one more parameter which is damping factor, “d”. By default d is 0.85
Let take a look in the simplified version. In iteration 1, here is how PageRank calculate:
- A: (1/4)/3. As only C is pointing to A, so we use previous C score (iteration 0) divided by number of node (i.e. 3) that C is pointing
- B: (1/4)/2 + (1/4)/3. Both A and C are pointing to B, so previous A score (iteration 0) divided by number of node (i.e. 2) that A is pointing. For C, it is same as previous one which is (1/4)/3.
For detail, you may checkout the video for full explantation.
Question: When should we stop the iteration?
According to theory, it should calculate until no big update on score.
Why we need to introduce PageRank before TextRank? Because the idea of TextRank comes from PageRank and using similar algorithm (graph concept) to calculate the importance.
- TextRank graph is undirected. Meaning that all edge are bidirectional
- The weight of edge is difference while it is 1 in PageRank. There are different way to calculate such as BM25, TF-IDF.
There are a lot of different document similarity implementation such as BM25, cosine similarity, IDF-modified-cosine. You may choose the best fit for your problem.
gensim provides a simple API to calculate TextRank by using BM25 (Best Match 25).
Step 1: Environment Setup
pip install gensim==3.4.0
Ratio parameter use for deciding the number of import sentences are returned.
For entire code, you may check out from GitHub. Let us know if you also want to understand about the abstractive approach. Will arrange an article later on
- According to the gensim source code, at least 10 sentences is recommended for the input
- No training data or model building is required.
- It fits not only English but also any other a bag of input (Symbol, Japanese etc). You may also read TextRank research paper for detail understanding.
- From my experience, the result is not good in most of the time. It may due to a variety of words and the result is only a subset of input.