encode
Encode documents as matrix of word or n-gram counts
Description
Use encode to encode an array of tokenized documents as a matrix of word or n-gram counts according to a bag-of-words or bag-of-n-grams model. To encode documents as vectors or word indices, use a wordEncoding object.
counts = encode(___,Name,Value) specifies additional options using one or more name-value pair arguments.
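The description above points to wordEncoding for index-based encodings. As a rough sketch of that alternative, assuming the Text Analytics Toolbox functions wordEncoding, doc2sequence, and word2ind are available, the following maps words to indices instead of counting them. The variable names are illustrative only.

```matlab
% Sketch: encode documents as sequences of word indices rather than counts.
documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);

% Build a word-to-index mapping from the documents.
enc = wordEncoding(documents);

% Convert each document to a sequence of word indices (one cell per document).
sequences = doc2sequence(enc,documents);

% Look up the indices of individual words in the encoding vocabulary.
idx = word2ind(enc,["short" "sentence"]);
```

Use encode with a bag-of-words or bag-of-n-grams model when you need frequency counts, and a wordEncoding object when word order or word indices matter.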
Examples
Encode Documents as Word Count Matrix
Encode an array of documents as a matrix of word counts.
documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);
bag = bagOfWords(documents)
bag =
  bagOfWords with properties:
          Counts: [2x7 double]
      Vocabulary: ["an"    "example"    "of"    "a"    "short"    "sentence"    "second"]
        NumWords: 7
    NumDocuments: 2
documents = tokenizedDocument([
    "a new sentence"
    "a second new sentence"])
documents =
  2x1 tokenizedDocument:
    3 tokens: a new sentence
    4 tokens: a second new sentence
View the documents encoded as a matrix of word counts. The word "new" does not appear in bag, so it is not counted.
counts = encode(bag,documents);
full(counts)
ans = 2×7
0 0 0 1 0 1 0
0 0 0 1 0 1 1
The columns correspond to the vocabulary of the bag-of-words model.
bag.Vocabulary
ans = 1x7 string
"an" "example" "of" "a" "short" "sentence" "second"
Encode Words as Word Count Vector
Encode an array of words as a vector of word counts.
documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);
bag = bagOfWords(documents)
bag =
  bagOfWords with properties:
          Counts: [2x7 double]
      Vocabulary: ["an"    "example"    "of"    "a"    "short"    "sentence"    "second"]
        NumWords: 7
    NumDocuments: 2
words = ["another" "example" "of" "a" "short" "example" "sentence"];
counts = encode(bag,words)
counts =
   (1,2)        2
   (1,3)        1
   (1,4)        1
   (1,5)        1
   (1,6)        1
Output Document Word Counts in Columns
Encode an array of documents as a matrix of word counts with documents in columns.
documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);
bag = bagOfWords(documents)
bag =
  bagOfWords with properties:
          Counts: [2x7 double]
      Vocabulary: ["an"    "example"    "of"    "a"    "short"    "sentence"    "second"]
        NumWords: 7
    NumDocuments: 2
documents = tokenizedDocument([
    "a new sentence"
    "a second new sentence"])
documents =
  2x1 tokenizedDocument:
    3 tokens: a new sentence
    4 tokens: a second new sentence
View the documents encoded as a matrix of word counts with documents in columns. The word "new" does not appear in bag, so it is not counted.
counts = encode(bag,documents,'DocumentsIn','columns');
full(counts)
ans = 7×2
0 0
0 0
0 0
1 1
0 0
1 1
0 1
Input Arguments
bag
— Input bag-of-words or bag-of-n-grams model
bagOfWords object | bagOfNgrams object
Input bag-of-words or bag-of-n-grams model, specified as a bagOfWords object or a bagOfNgrams object.
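The examples on this page use bagOfWords models only. As a minimal sketch of the bag-of-n-grams case, assuming bagOfNgrams is called with its default settings (which count bigrams), encode can be called the same way. The variable newDocuments is an illustrative name.

```matlab
% Sketch: encode a document against a bag-of-n-grams model.
documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);

% Create a bag-of-n-grams model; by default it counts bigrams.
bag = bagOfNgrams(documents);

% Encode a new document. Each column corresponds to one n-gram of the model.
newDocuments = tokenizedDocument("a short sentence");
counts = encode(bag,newDocuments);

% The result has one row per document and one column per n-gram in bag.
size(counts)
```

N-grams in the new document that do not appear in the model are not counted, in the same way that unseen words are ignored for a bag-of-words model.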
documents
— Input documents
tokenizedDocument array | string array of words | cell array of character vectors
Input documents, specified as a tokenizedDocument array, a string array of words, or a cell array of character vectors. If documents is a string array or a cell array of character vectors, then it must be a row vector representing a single document, where each element is a word.
Tip
To ensure that the documents are encoded correctly, you must preprocess the input documents using the same steps as the documents used to create the input model. For an example showing how to create a function to preprocess text data, see Prepare Text Data for Analysis.
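One way to follow this tip is to route both the training documents and the documents to encode through a single preprocessing routine. The sketch below assumes a simple pipeline of lowercasing and punctuation removal; preprocess is a hypothetical helper defined here, not a toolbox function, and your own pipeline may need different steps.

```matlab
% Hypothetical helper (not part of the toolbox): tokenize, lowercase, and
% erase punctuation, so that every document is prepared the same way.
preprocess = @(str) erasePunctuation(lower(tokenizedDocument(str)));

% Build the model from preprocessed training documents.
trainingDocuments = preprocess([
    "An example of a short sentence."
    "A second short sentence."]);
bag = bagOfWords(trainingDocuments);

% Apply the same helper to the new text before encoding it.
newDocuments = preprocess("A SHORT sentence!");
counts = encode(bag,newDocuments);
full(counts)
```

Without the shared preprocessing step, tokens such as "SHORT" or "A" would not match the lowercase vocabulary of the model and would not be counted.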
words
— Input words
string vector | character vector | cell array of character vectors
Input words, specified as a string vector, character vector, or cell array of character vectors. If you specify words as a character vector, then the function treats the argument as a single word.
Data Types: string | char | cell
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: 'DocumentsIn','rows' specifies the orientation of the output documents as rows.
DocumentsIn
— Orientation of output documents
'rows' (default) | 'columns'
Orientation of output documents in the frequency count matrix, specified as the comma-separated pair consisting of 'DocumentsIn' and one of the following:
'rows' – Return a matrix of frequency counts with rows corresponding to documents.
'columns' – Return a transposed matrix of frequency counts with columns corresponding to documents.
Data Types: char
ForceCellOutput
— Indicator for forcing output to be returned as cell array
false (default) | true
Indicator for forcing output to be returned as a cell array, specified as the comma-separated pair consisting of 'ForceCellOutput' and true or false.
Data Types: logical
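As a brief sketch of this option, the following call requests cell output for a scalar bag-of-words model. It reuses the documents from the examples above.

```matlab
% Sketch: force encode to return a cell array even for a scalar model.
documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);
bag = bagOfWords(documents);

counts = encode(bag,documents,'ForceCellOutput',true);

% counts is a cell array; its first element holds the sparse count matrix.
class(counts)
full(counts{1})
```

Forcing cell output can be convenient when the same code must handle both scalar and non-scalar model inputs, because the result then always has the same type.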
Output Arguments
counts
— Word or n-gram counts
sparse matrix | cell array of sparse matrices
Word or n-gram counts, returned as a sparse matrix of nonnegative integers or a cell array of sparse matrices.
If bag is a non-scalar array or 'ForceCellOutput' is true, then the function returns the outputs as a cell array of sparse matrices. Each element in the cell array is a matrix of word or n-gram counts for the corresponding element of bag.
Version History
Introduced in R2017b