Term Frequency–Inverse Document Frequency (tf-idf) matrix
Create a Term Frequency–Inverse Document Frequency (tf-idf) matrix from a bag-of-words model.
Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.
filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);
Create a bag-of-words model using bagOfWords.
bag = bagOfWords(documents)
bag = 
  bagOfWords with properties:

          Counts: [154x3092 double]
      Vocabulary: [1x3092 string]
        NumWords: 3092
    NumDocuments: 154
Create a tf-idf matrix. View the first 10 rows and columns.
M = tfidf(bag);
full(M(1:10,1:10))
ans = 10×10
3.6507 4.3438 2.7344 3.6507 4.3438 2.2644 3.2452 3.8918 2.4720 2.5520
0 0 0 0 0 4.5287 0 0 0 0
0 0 0 0 0 0 0 0 0 2.5520
0 0 0 0 0 2.2644 0 0 0 0
0 0 0 0 0 2.2644 0 0 0 0
0 0 0 0 0 2.2644 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 2.2644 0 0 0 2.5520
0 0 2.7344 0 0 0 0 0 0 0
Create a Term Frequency-Inverse Document Frequency (tf-idf) matrix from a bag-of-words model and an array of new documents.
Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.
filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);
Create a bag-of-words model from the documents.
bag = bagOfWords(documents)
bag = 
  bagOfWords with properties:

          Counts: [154x3092 double]
      Vocabulary: [1x3092 string]
        NumWords: 3092
    NumDocuments: 154
Create a tf-idf matrix for an array of new documents using the inverse document frequency (IDF) factor computed from bag.
newDocuments = tokenizedDocument([
    "what's in a name? a rose by any other name would smell as sweet."
    "if music be the food of love, play on."]);
M = tfidf(bag,newDocuments)
M = 
   (1,7)       3.2452
   (1,36)      1.2303
   (2,197)     3.4275
   (2,313)     3.6507
   (2,387)     0.6061
   (1,1205)    4.7958
   (1,1835)    3.6507
   (2,1917)    5.0370
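To see which vocabulary words the nonzero entries refer to, you can map the column indices of M back to bag.Vocabulary. The following is a minimal sketch on a hypothetical three-document corpus (not the sonnets data). The variable names words and weights are illustrative, and the sketch assumes that words of the new document that are absent from the vocabulary produce no entries.

```matlab
% Hypothetical toy corpus used only for illustration.
documents = tokenizedDocument([
    "a rose by any other name"
    "a rose is a rose"
    "music be the food of love"]);
bag = bagOfWords(documents);

% Score a new document with the IDF factor computed from bag.
newDocuments = tokenizedDocument("a rose and a thorn");
M = tfidf(bag,newDocuments);

% Columns of M are indexed by bag.Vocabulary, so the column index of
% each nonzero entry identifies a word.
[~,cols,weights] = find(M);
words = reshape(bag.Vocabulary(cols),[],1);
weights = reshape(weights,[],1);
table(words,weights)
```

The same lookup applies to the sonnets example: indexing bag.Vocabulary with the column numbers shown in the sparse output returns the corresponding words.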
Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.
filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);
Create a bag-of-words model using bagOfWords.
bag = bagOfWords(documents)
bag = 
  bagOfWords with properties:

          Counts: [154x3092 double]
      Vocabulary: [1x3092 string]
        NumWords: 3092
    NumDocuments: 154
Create a tf-idf matrix. View the first 10 rows and columns.
M = tfidf(bag);
full(M(1:10,1:10))
ans = 10×10
3.6507 4.3438 2.7344 3.6507 4.3438 2.2644 3.2452 3.8918 2.4720 2.5520
0 0 0 0 0 4.5287 0 0 0 0
0 0 0 0 0 0 0 0 0 2.5520
0 0 0 0 0 2.2644 0 0 0 0
0 0 0 0 0 2.2644 0 0 0 0
0 0 0 0 0 2.2644 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 2.2644 0 0 0 2.5520
0 0 2.7344 0 0 0 0 0 0 0
You can change the contributions made by the TF and IDF factors to the tf-idf matrix by specifying the TF and IDF weight formulas.
To ignore how many times a word appears in a document, use the binary option of 'TFWeight'. Create a tf-idf matrix and set 'TFWeight' to 'binary'. View the first 10 rows and columns.
M = tfidf(bag,'TFWeight','binary');
full(M(1:10,1:10))
ans = 10×10
3.6507 4.3438 2.7344 3.6507 4.3438 2.2644 3.2452 1.9459 2.4720 2.5520
0 0 0 0 0 2.2644 0 0 0 0
0 0 0 0 0 0 0 0 0 2.5520
0 0 0 0 0 2.2644 0 0 0 0
0 0 0 0 0 2.2644 0 0 0 0
0 0 0 0 0 2.2644 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 2.2644 0 0 0 2.5520
0 0 2.7344 0 0 0 0 0 0 0
bag - Input bag-of-words or bag-of-n-grams model
bagOfWords object | bagOfNgrams object
Input bag-of-words or bag-of-n-grams model, specified as a bagOfWords object or a bagOfNgrams object.
documents - Input documents
tokenizedDocument array | string array of words | cell array of character vectors
Input documents, specified as a tokenizedDocument array, a string array of words, or a cell array of character vectors. If documents is a string array or a cell array of character vectors, then it must be a row vector representing a single document, where each element is a word.
Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.
Example: 'Normalized',true specifies to normalize the frequency counts.
'TFWeight' - Method to set term frequency factor
'raw' (default) | 'binary' | 'log'
Method to set term frequency (TF) factor, specified as the comma-separated pair consisting of 'TFWeight' and one of the following:
'raw' – Set the TF factor to the unchanged term counts.
'binary' – Set the TF factor to the matrix of ones and zeros where the ones indicate whether a term is in a document.
'log' – Set the TF factor to 1 + log(bag.Counts).
Example: 'TFWeight','binary'
Data Types: char
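The three TF options can be sketched in base MATLAB on a small count matrix. This is an illustration under stated assumptions, not the function's implementation: the matrix counts and the variable names are hypothetical, and the 'log' variant here leaves zero counts at zero, because applying 1 + log(x) to a zero count would give -Inf.

```matlab
% Hypothetical count matrix: 3 documents (rows) by 4 terms (columns).
counts = [2 0 1 0;
          0 1 0 0;
          1 1 0 3];

% 'raw': the unchanged term counts.
tfRaw = counts;

% 'binary': 1 where the term occurs in the document, 0 otherwise.
tfBinary = double(counts > 0);

% 'log': 1 + log(count). Assumption of this sketch: zero counts stay
% zero, since log(0) is -Inf.
tfLog = zeros(size(counts));
nz = counts > 0;
tfLog(nz) = 1 + log(counts(nz));
```

With 'binary', a word that occurs twice in a document contributes the same TF factor as a word that occurs once, which matches the halved entries in the 'TFWeight','binary' example output.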
'IDFWeight' - Method to set inverse document frequency factor
'normal' (default) | 'unary' | 'smooth' | 'max' | 'probabilistic'
Method to set inverse document frequency (IDF) factor, specified as the comma-separated pair consisting of 'IDFWeight' and one of the following:
'normal' – Set the IDF factor to log(N/NT).
'unary' – Set the IDF factor to 1.
'smooth' – Set the IDF factor to log(1+N/NT).
'max' – Set the IDF factor to log(1+max(NT)/NT).
'probabilistic' – Set the IDF factor to log((N-NT)/NT).
where N is the number of documents in bag, and NT is the number of documents containing each term, which is equivalent to sum(bag.Counts > 0,1).
Example: 'IDFWeight','smooth'
Data Types: char
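The IDF formulas above can be evaluated by hand on a small count matrix. The sketch below uses base MATLAB only and assumes that NT counts the documents that contain each term, computed here as sum(counts > 0,1). The matrix counts and all variable names are hypothetical.

```matlab
% Hypothetical count matrix: 3 documents (rows) by 4 terms (columns).
counts = [2 0 1 0;
          0 1 0 0;
          1 1 0 3];

N = size(counts,1);       % number of documents
NT = sum(counts > 0,1);   % documents containing each term: [2 2 1 1]

idfNormal = log(N./NT);                 % 'normal'
idfUnary = ones(size(NT));              % 'unary'
idfSmooth = log(1 + N./NT);             % 'smooth'
idfMax = log(1 + max(NT)./NT);          % 'max'
% 'probabilistic' is negative for terms that occur in more than half
% of the documents, and -Inf for terms that occur in all of them.
idfProbabilistic = log((N - NT)./NT);

% Raw counts scaled by the 'normal' IDF factor, one factor per column.
M = counts .* idfNormal;
```

In this sketch a term that occurs twice in a document gets twice the weight of a single occurrence, the same pattern as the entries 3.8918 and 1.9459 in the example outputs.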
'Normalized' - Option to normalize term counts
false (default) | true
Option to normalize term counts, specified as the comma-separated pair consisting of 'Normalized' and true or false. If true, then the function normalizes each vector of term counts in the Euclidean norm.
Example: 'Normalized',true
Data Types: logical
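Normalizing in the Euclidean norm means scaling each document's vector to unit length. The following is a minimal sketch in base MATLAB on a hypothetical matrix M with documents in rows. The guard for all-zero rows is an assumption of this sketch, not documented behavior of tfidf.

```matlab
% Hypothetical weight matrix: 3 documents (rows) by 3 terms (columns).
M = [3 4 0;
     0 0 0;
     1 2 2];

% Euclidean (2-norm) length of each row: [5; 0; 3].
rowNorms = vecnorm(M,2,2);

% Assumption: leave all-zero documents unchanged to avoid dividing by zero.
rowNorms(rowNorms == 0) = 1;

% Divide each row by its length, so nonzero rows have unit length.
Mnormalized = M ./ rowNorms;
```

After this scaling, long and short documents contribute vectors of the same length, which is useful when comparing documents by dot product or cosine similarity.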
'DocumentsIn' - Orientation of output documents
'rows' (default) | 'columns'
Orientation of output documents in the frequency count matrix, specified as the comma-separated pair consisting of 'DocumentsIn' and one of the following:
'rows' – Return a matrix of frequency counts with rows corresponding to documents.
'columns' – Return a transposed matrix of frequency counts with columns corresponding to documents.
Data Types: char
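The effect of 'DocumentsIn' is easiest to see from the size of the output. This is a short sketch on a hypothetical two-document corpus; the variable names Mrows and Mcols are illustrative.

```matlab
% Hypothetical toy corpus used only for illustration.
documents = tokenizedDocument([
    "a rose by any other name"
    "music be the food of love"]);
bag = bagOfWords(documents);

Mrows = tfidf(bag);                          % documents in rows (default)
Mcols = tfidf(bag,'DocumentsIn','columns');  % documents in columns

size(Mrows)   % NumDocuments-by-NumWords
size(Mcols)   % NumWords-by-NumDocuments
```

The column orientation can be convenient when a downstream routine expects one observation per column.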
'ForceCellOutput' - Indicator for forcing output to be returned as cell array
false (default) | true
Indicator for forcing output to be returned as cell array, specified as the comma-separated pair consisting of 'ForceCellOutput' and true or false.
Data Types: logical
M - Output Term Frequency-Inverse Document Frequency matrix
Output Term Frequency-Inverse Document Frequency matrix, returned as a sparse matrix or a cell array of sparse matrices.
If bag is a non-scalar array or 'ForceCellOutput' is true, then the function returns the outputs as a cell array of sparse matrices. Each element in the cell array is the tf-idf matrix calculated from the corresponding element of bag.
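The two output forms can be compared with a short sketch on a hypothetical two-document corpus. The variable names M and C are illustrative.

```matlab
% Hypothetical toy corpus used only for illustration.
documents = tokenizedDocument([
    "a rose by any other name"
    "music be the food of love"]);
bag = bagOfWords(documents);

% A scalar bag returns a sparse matrix by default.
M = tfidf(bag);
issparse(M)

% With 'ForceCellOutput' set to true, the same bag returns a cell array
% whose element holds the sparse tf-idf matrix.
C = tfidf(bag,'ForceCellOutput',true);
iscell(C)
issparse(C{1})
```

Forcing cell output lets calling code handle scalar and non-scalar bag inputs through the same indexing path.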
See also: bagOfNgrams | bagOfWords | encode | tokenizedDocument | topkngrams | topkwords