Extract Keywords from Text Data Using TextRank
This example shows how to extract keywords from text data using TextRank.
The TextRank keyword extraction algorithm identifies candidate keywords using a part-of-speech tag-based approach and scores them using word co-occurrences determined by a sliding window. Keywords can contain multiple tokens. The algorithm also merges keywords when they appear consecutively in a document.
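The scoring step described above can be sketched in base MATLAB. This is a simplified illustration and not the implementation used by textrankKeywords: the word list is hypothetical, the window is applied to the candidate words only, and the damping factor of 0.85 is the value suggested in the TextRank paper [1].

```matlab
% Hypothetical candidate words in reading order (already filtered by part of speech).
words = ["useful" "tools" "engineers" "scientists" "many" "useful" "toolboxes"];
windowSize = 2;     % assumed convention: each word pairs with the next windowSize-1 words
damping = 0.85;     % damping factor suggested in the TextRank paper

% Build an undirected co-occurrence graph over the unique words.
[vocab,~,idx] = unique(words,"stable");
n = numel(vocab);
A = zeros(n);
for i = 1:numel(words)
    for j = i+1:min(i+windowSize-1,numel(words))
        if idx(i) ~= idx(j)
            A(idx(i),idx(j)) = 1;
            A(idx(j),idx(i)) = 1;
        end
    end
end

% Iterate the PageRank-style update: each word shares its score
% equally among the words it co-occurs with.
deg = sum(A,2);
deg(deg == 0) = 1;  % guard against isolated words
scores = ones(n,1);
for iter = 1:100
    scores = (1 - damping) + damping*(A*(scores./deg));
end

disp(table(vocab',scores,'VariableNames',{'Word','Score'}))
```

In this toy input, "useful" co-occurs with more distinct words than any other candidate, so it receives the highest score.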
Extract Keywords
Create an array of tokenized documents containing the text data.
textData = [
    "MATLAB provides really useful tools for engineers. Scientists use many useful MATLAB toolboxes."
    "MATLAB and Simulink have many features. MATLAB and Simulink makes it easy to develop models."
    "You can easily import data in MATLAB. In particular, you can easily import text data."];
documents = tokenizedDocument(textData);
Extract the keywords using the textrankKeywords function.
tbl = textrankKeywords(documents)
tbl=6×3 table
                   Keyword                   DocumentNumber    Score
    _____________________________________    ______________    ______
    "useful"    "MATLAB"      "toolboxes"          1           4.8695
    "useful"    ""            ""                   1           2.3612
    "MATLAB"    ""            ""                   1           1.6212
    "many"      "features"    ""                   2           4.6152
    "text"      "data"        ""                   3           3.4781
    "data"      ""            ""                   3           1.7391
If a keyword contains multiple words, then the ith element of the string array corresponds to the ith word of the keyword. If the keyword has fewer words than the longest keyword, then the remaining entries of the string array are the empty string "".
For readability, transform the multi-word keywords into a single string using the join and strip functions.
if size(tbl.Keyword,2) > 1
    tbl.Keyword = strip(join(tbl.Keyword));
end
head(tbl)
ans=6×3 table
             Keyword             DocumentNumber    Score
    _________________________    ______________    ______
    "useful MATLAB toolboxes"          1           4.8695
    "useful"                           1           2.3612
    "MATLAB"                           1           1.6212
    "many features"                    2           4.6152
    "text data"                        3           3.4781
    "data"                             3           1.7391
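To see what the join and strip calls do to the padded rows, here is a small base MATLAB illustration using two keyword rows copied from the output above. It also suggests why the example checks the number of columns first: join combines along the last dimension whose size is not 1, so with a single keyword column it would merge the rows instead.

```matlab
% Keyword rows padded with empty strings, copied from the output above.
keywords = [
    "text" "data" ""
    "data" ""     ""];

joined = join(keywords);    % joins each row with spaces; the padding leaves trailing spaces
single = strip(joined);     % 2-by-1 string array: "text data" and "data"

% With only one keyword column, join works down that column instead
% and collapses all keywords into a single string.
collapsed = join(["text"; "data"]);   % one string: "text data"
```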
Specify Maximum Number of Keywords Per Document
The textrankKeywords function, by default, returns all identified keywords. To reduce the number of keywords, use the 'MaxNumKeywords' option.
Extract the top two keywords for each document by setting the 'MaxNumKeywords' option to 2.
tbl = textrankKeywords(documents,'MaxNumKeywords',2)
tbl=5×3 table
                   Keyword                   DocumentNumber    Score
    _____________________________________    ______________    ______
    "useful"    "MATLAB"      "toolboxes"          1           4.8695
    "useful"    ""            ""                   1           2.3612
    "many"      "features"    ""                   2           4.6152
    "text"      "data"        ""                   3           3.4781
    "data"      ""            ""                   3           1.7391
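A similar reduction can also be done after extraction with ordinary table operations. The following sketch keeps only the highest-scoring keyword of each document. To stay self-contained, it rebuilds the table by hand from the values shown above, with the keywords already joined into single strings. The variable name topKeywords is chosen for illustration.

```matlab
% Keywords and scores copied from the output above, joined into single strings.
Keyword = ["useful MATLAB toolboxes"; "useful"; "many features"; "text data"; "data"];
DocumentNumber = [1; 1; 2; 3; 3];
Score = [4.8695; 2.3612; 4.6152; 3.4781; 1.7391];
tbl = table(Keyword,DocumentNumber,Score);

% Sort by document, then by descending score within each document.
tbl = sortrows(tbl,{'DocumentNumber','Score'},{'ascend','descend'});

% unique returns the index of the first row of each document,
% which after sorting is the highest-scoring keyword.
[~,firstRow] = unique(tbl.DocumentNumber);
topKeywords = tbl(firstRow,:)
```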
Specify Part-of-Speech Tags
Notice that in the extracted keywords above, the function does not consider the word "import" as a keyword. This is because the TextRank keyword extraction algorithm, by default, uses tokens with the part-of-speech tags "noun", "proper-noun" and "adjective" as candidate keywords. Because the word "import" is a verb, the algorithm does not consider this as a candidate keyword. Similarly, the algorithm does not consider the adverb "easily" as a candidate keyword.
To specify which part-of-speech tags to use for identifying candidate keywords, use the 'PartOfSpeech' option.
Extract keywords from the same text as before and also specify the part-of-speech tags "adverb" and "verb".
newTags = ["adverb" "verb"];
tags = ["noun" "proper-noun" "adjective" newTags];
tbl = textrankKeywords(documents,'PartOfSpeech',tags)
tbl=7×3 table
                      Keyword                       DocumentNumber    Score
    ____________________________________________    ______________    ______
    "use"         "many"    "useful"    "MATLAB"          1           5.8839
    "useful"      ""        ""          ""                1           2.0169
    "MATLAB"      ""        ""          ""                1           1.5478
    "Simulink"    "have"    "many"      ""                2           4.5058
    "Simulink"    ""        ""          ""                2           1.5161
    "import"      "text"    "data"      ""                3           4.7921
    "import"      "data"    ""          ""                3           3.4195
Notice here that the function treats the token "import" as a candidate keyword and merges it into the multi-word keywords "import data" and "import text data".
Specify Window Size
Notice that in the extracted keywords above, the function does not extract the adverb "easily" as a keyword, even though adverbs are now candidate keywords. This is because of the proximity of this word in the text to other candidate keywords.
The TextRank keyword extraction algorithm scores candidate keywords using the number of pairwise co-occurrences within a sliding window. To increase the window size, use the 'Window' option. Increasing the window size enables the function to find more co-occurrences between keywords, which increases the keyword importance scores. This can result in finding more relevant keywords at the cost of potentially over-scoring less relevant keywords.
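To see why a larger window yields more co-occurrences, the following base MATLAB sketch counts the token pairs that fall inside the window for two window sizes. It assumes that a window of size w pairs each token with the next w-1 tokens. That convention and the token list are assumptions for illustration, and the function's internal definition may differ.

```matlab
% Hypothetical token sequence taken from the third document.
tokens = ["you" "can" "easily" "import" "text" "data"];

windowSizes = [2 3];
numPairs = zeros(size(windowSizes));
for k = 1:numel(windowSizes)
    w = windowSizes(k);
    for i = 1:numel(tokens)
        % Last token that still shares a window with token i.
        lastPartner = min(i + w - 1,numel(tokens));
        numPairs(k) = numPairs(k) + (lastPartner - i);
    end
    fprintf("Window %d: %d co-occurring pairs\n",w,numPairs(k));
end
```

Under this convention, moving from a window of 2 to a window of 3 raises the number of pairs from 5 to 9, so tokens such as "easily" gain links to words that are two positions away.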
Extract keywords from the same text as before and also specify a window size of 3.
tbl = textrankKeywords(documents, ...
    'PartOfSpeech',tags, ...
    'Window',3)
tbl=8×3 table
                      Keyword                       DocumentNumber    Score
    ____________________________________________    ______________    ______
    "many"        "useful"    "MATLAB"    ""              1           4.2185
    "really"      "useful"    ""          ""              1           2.8851
    "MATLAB"      ""          ""          ""              1           1.3154
    "Simulink"    ""          ""          ""              2           1.4526
    "develop"     ""          ""          ""              2           1.0912
    "features"    ""          ""          ""              2           1.0794
    "easily"      "import"    "text"      "data"          3           5.2989
    "easily"      "import"    "data"      ""              3           4.0842
Notice here that the function treats the token "easily" as a keyword and merges it into the multi-word keywords "easily import text data" and "easily import data".
To learn more about the TextRank keyword extraction algorithm, see TextRank Keyword Extraction.
Alternatives
You can experiment with different keyword extraction algorithms to see what works best with your data. Because the TextRank keyword extraction algorithm uses a part-of-speech tag-based approach to extract candidate keywords, the extracted keywords can be short. Alternatively, you can try extracting keywords using the RAKE algorithm, which extracts sequences of tokens appearing between delimiters as candidate keywords. To extract keywords using RAKE, use the rakeKeywords function. To learn more, see Extract Keywords from Text Data Using RAKE.
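As a starting point for such a comparison, the following sketch runs rakeKeywords on the same text with default options. It assumes that the output table has the same Keyword, DocumentNumber, and Score layout as the textrankKeywords output, so the same join and strip step is applied for readability. No expected output is shown because the results depend on the RAKE defaults.

```matlab
textData = [
    "MATLAB provides really useful tools for engineers. Scientists use many useful MATLAB toolboxes."
    "MATLAB and Simulink have many features. MATLAB and Simulink makes it easy to develop models."
    "You can easily import data in MATLAB. In particular, you can easily import text data."];
documents = tokenizedDocument(textData);

% Extract keywords using RAKE with default options.
tblRake = rakeKeywords(documents);

% Join multi-word keywords into single strings, as for the TextRank output.
if size(tblRake.Keyword,2) > 1
    tblRake.Keyword = strip(join(tblRake.Keyword));
end
head(tblRake)
```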
References
[1] Mihalcea, Rada, and Paul Tarau. "Textrank: Bringing order into text." In Proceedings of the 2004 conference on empirical methods in natural language processing, pp. 404-411. 2004.
See Also
tokenizedDocument | rakeKeywords | textrankKeywords | extractSummary