bagOfWords

Bag-of-words model

Description

A bag-of-words model (also known as a term-frequency counter) records the number of times that words appear in each document of a collection.

bagOfWords does not split text into words. To create an array of tokenized documents, see tokenizedDocument.
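
For example, this minimal sketch (the two sentences are illustrative) tokenizes the text first and then counts the words:

% Tokenize the text first; bagOfWords does not split text itself.
str = [
    "an example of a short sentence"
    "a second short sentence"];
documents = tokenizedDocument(str);
bag = bagOfWords(documents);   % 2 documents, 7 unique words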

Creation

Description

bag = bagOfWords creates an empty bag-of-words model.

bag = bagOfWords(documents) counts the words appearing in documents and returns a bag-of-words model.

bag = bagOfWords(uniqueWords,counts) creates a bag-of-words model using the words in uniqueWords and the corresponding frequency counts in counts.

Input Arguments

documents - Input documents
Input documents, specified as a tokenizedDocument array, a string array of words, or a cell array of character vectors. If documents is not a tokenizedDocument array, then it must be a row vector representing a single document, where each element is a word. To specify multiple documents, use a tokenizedDocument array.
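
For example, a string row vector is treated as the words of a single document. A minimal sketch, using illustrative words:

% Each element of the row vector is one word of the same document.
bag = bagOfWords(["an" "example" "of" "a" "short" "sentence"]);
% bag.NumDocuments is 1 and bag.NumWords is 6.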

uniqueWords - Unique word list
Unique word list, specified as a string vector or a cell array of character vectors. If uniqueWords contains <missing>, then the function ignores the missing values. The size of uniqueWords must be 1-by-V, where V is the number of columns of counts.

Example: ["an" "example" "list"]

Data Types: string | cell

counts - Frequency counts of words
Frequency counts of words corresponding to uniqueWords, specified as a matrix of nonnegative integers. The value counts(i,j) corresponds to the number of times the word uniqueWords(j) appears in the ith document. counts must have numel(uniqueWords) columns.

Properties

Counts - Word counts per document
Word counts per document, specified as a sparse matrix.

NumDocuments - Number of documents seen
Number of documents seen, specified as a nonnegative integer.

NumWords - Number of unique words
Number of unique words in the model, specified as a nonnegative integer.

Vocabulary - Unique words in model
Unique words in the model, specified as a string vector.

Data Types: string
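
You can read these properties directly from a model. A minimal sketch, reusing the illustrative sentences from the Description:

documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);
bag = bagOfWords(documents);
bag.NumDocuments      % 2
bag.NumWords          % 7
full(bag.Counts)      % convert the sparse count matrix to full for display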

Object Functions

encode - Encode documents as matrix of word or n-gram counts
tfidf - Term Frequency–Inverse Document Frequency (tf-idf) matrix
topkwords - Most important words in bag-of-words model or LDA topic
addDocument - Add documents to bag-of-words or bag-of-n-grams model
removeDocument - Remove documents from bag-of-words or bag-of-n-grams model
removeEmptyDocuments - Remove empty documents from tokenized document array, bag-of-words model, or bag-of-n-grams model
removeWords - Remove selected words from documents or bag-of-words model
removeInfrequentWords - Remove words with low counts from bag-of-words model
join - Combine multiple bag-of-words or bag-of-n-grams models
wordcloud - Create word cloud chart from text, bag-of-words model, bag-of-n-grams model, or LDA model
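
For example, encode produces counts for new documents over the vocabulary of an existing model. A minimal sketch (the new sentence is illustrative):

documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);
bag = bagOfWords(documents);

% Count the words of a new document over bag.Vocabulary. Words that are
% not in the vocabulary, such as "another", are ignored.
newDocument = tokenizedDocument("another short sentence");
M = encode(bag,newDocument);   % 1-by-NumWords sparse matrix of counts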

Examples

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-words model using bagOfWords.

bag = bagOfWords(documents)
bag = 
  bagOfWords with properties:

          Counts: [154x3092 double]
      Vocabulary: ["fairest"    "creatures"    "desire"    "increase"    "thereby"    "beautys"    "rose"    "might"    "never"    "die"    "riper"    "time"    "decease"    "tender"    "heir"    "bear"    "memory"    "thou"    ...    ] (1x3092 string)
        NumWords: 3092
    NumDocuments: 154

View the top 10 words and their total counts.

tbl = topkwords(bag,10)
tbl=10×2 table
     Word      Count
    _______    _____

    "thy"       281 
    "thou"      234 
    "love"      162 
    "thee"      161 
    "doth"       88 
    "mine"       63 
    "shall"      59 
    "eyes"       56 
    "sweet"      55 
    "time"       53 

Create a bag-of-words model using a string array of unique words and a matrix of word counts.

uniqueWords = ["a" "an" "another" "example" "final" "sentence" "third"];
counts = [ ...
    1 2 0 1 0 1 0;
    0 0 3 1 0 4 0;
    1 0 0 5 0 3 1;
    1 0 0 1 7 0 0];
bag = bagOfWords(uniqueWords,counts)
bag = 
  bagOfWords with properties:

          Counts: [4x7 double]
      Vocabulary: ["a"    "an"    "another"    "example"    "final"    "sentence"    "third"]
        NumWords: 7
    NumDocuments: 4

If your text data is contained in multiple files in a folder, then you can import the text data into MATLAB using a file datastore.

Create a file datastore for the example sonnet text files. The example sonnets have file names "exampleSonnetN.txt", where N is the number of the sonnet. Specify the read function to be extractFileText.

readFcn = @extractFileText;
fds = fileDatastore('exampleSonnet*.txt','ReadFcn',readFcn);

Create an empty bag-of-words model.

bag = bagOfWords
bag = 
  bagOfWords with properties:

          Counts: []
      Vocabulary: [1x0 string]
        NumWords: 0
    NumDocuments: 0

Loop over the files in the datastore and read each file. Tokenize the text in each file and add the document to bag.

while hasdata(fds)
    str = read(fds);
    document = tokenizedDocument(str);
    bag = addDocument(bag,document);
end

View the updated bag-of-words model.

bag
bag = 
  bagOfWords with properties:

          Counts: [4x276 double]
      Vocabulary: ["From"    "fairest"    "creatures"    "we"    "desire"    "increase"    ","    "That"    "thereby"    "beauty's"    "rose"    "might"    "never"    "die"    "But"    "as"    "the"    "riper"    "should"    ...    ] (1x276 string)
        NumWords: 276
    NumDocuments: 4

Remove the stop words from a bag-of-words model by inputting a list of stop words to removeWords. Stop words are words such as "a", "the", and "in", which are commonly removed from text before analysis.

documents = tokenizedDocument([
    "an example of a short sentence" 
    "a second short sentence"]);
bag = bagOfWords(documents);
newBag = removeWords(bag,stopWords)
newBag = 
  bagOfWords with properties:

          Counts: [2x4 double]
      Vocabulary: ["example"    "short"    "sentence"    "second"]
        NumWords: 4
    NumDocuments: 2

Create a table of the most frequent words of a bag-of-words model.

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-words model using bagOfWords.

bag = bagOfWords(documents) 
bag = 
  bagOfWords with properties:

          Counts: [154x3092 double]
      Vocabulary: ["fairest"    "creatures"    "desire"    "increase"    "thereby"    "beautys"    "rose"    "might"    "never"    "die"    "riper"    "time"    "decease"    "tender"    "heir"    "bear"    "memory"    "thou"    ...    ] (1x3092 string)
        NumWords: 3092
    NumDocuments: 154

Find the top five words. By default, topkwords returns the top five words of the bag-of-words model.

T = topkwords(bag);

Find the top 20 words in the model.

k = 20;
T = topkwords(bag,k)
T=20×2 table
      Word      Count
    ________    _____

    "thy"        281 
    "thou"       234 
    "love"       162 
    "thee"       161 
    "doth"        88 
    "mine"        63 
    "shall"       59 
    "eyes"        56 
    "sweet"       55 
    "time"        53 
    "beauty"      52 
    "nor"         52 
    "art"         51 
    "yet"         51 
    "o"           50 
    "heart"       50 
      ⋮

Create a Term Frequency–Inverse Document Frequency (tf-idf) matrix from a bag-of-words model.

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-words model using bagOfWords.

bag = bagOfWords(documents)
bag = 
  bagOfWords with properties:

          Counts: [154x3092 double]
      Vocabulary: ["fairest"    "creatures"    "desire"    "increase"    "thereby"    "beautys"    "rose"    "might"    "never"    "die"    "riper"    "time"    "decease"    "tender"    "heir"    "bear"    "memory"    "thou"    ...    ] (1x3092 string)
        NumWords: 3092
    NumDocuments: 154

Create a tf-idf matrix. View the first 10 rows and columns.

M = tfidf(bag);
full(M(1:10,1:10))
ans = 10×10

    3.6507    4.3438    2.7344    3.6507    4.3438    2.2644    3.2452    3.8918    2.4720    2.5520
         0         0         0         0         0    4.5287         0         0         0         0
         0         0         0         0         0         0         0         0         0    2.5520
         0         0         0         0         0    2.2644         0         0         0         0
         0         0         0         0         0    2.2644         0         0         0         0
         0         0         0         0         0    2.2644         0         0         0         0
         0         0         0         0         0         0         0         0         0         0
         0         0         0         0         0         0         0         0         0         0
         0         0         0         0         0    2.2644         0         0         0    2.5520
         0         0    2.7344         0         0         0         0         0         0         0

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-words model using bagOfWords.

bag = bagOfWords(documents)
bag = 
  bagOfWords with properties:

          Counts: [154×3092 double]
      Vocabulary: ["fairest"    "creatures"    "desire"    "increase"    "thereby"    "beautys"    "rose"    "might"    "never"    "die"    "riper"    "time"    "decease"    "tender"    "heir"    "bear"    "memory"    "thou"    "contracted"    …    ]
        NumWords: 3092
    NumDocuments: 154

Visualize the bag-of-words model using a word cloud.

figure
wordcloud(bag);

If your text data is contained in multiple files in a folder, then you can import the text data and create a bag-of-words model in parallel using parfor. If you have Parallel Computing Toolbox™ installed, then the parfor loop runs in parallel; otherwise, it runs in serial. Use join to combine an array of bag-of-words models into one model.

Create a list of filenames. The example sonnets have file names "exampleSonnetN.txt", where N is the number of the sonnet.

filenames = [
    "exampleSonnet1.txt"
    "exampleSonnet2.txt"
    "exampleSonnet3.txt"
    "exampleSonnet4.txt"];

Create a bag-of-words model from a collection of files. Initialize an empty bag-of-words model, and then loop over the files, creating a bag-of-words model for each file.

bag = bagOfWords;

numFiles = numel(filenames);
parfor i = 1:numFiles
    filename = filenames(i);
    
    textData = extractFileText(filename);
    document = tokenizedDocument(textData);
    bag(i) = bagOfWords(document);
end
Starting parallel pool (parpool) using the 'Processes' profile ...
Connected to parallel pool with 4 workers.

Combine the bag-of-words models using join.

bag = join(bag)
bag = 
  bagOfWords with properties:

          Counts: [4x276 double]
      Vocabulary: ["From"    "fairest"    "creatures"    "we"    "desire"    "increase"    ","    "That"    "thereby"    "beauty's"    "rose"    "might"    "never"    "die"    "But"    "as"    "the"    "riper"    "should"    ...    ] (1x276 string)
        NumWords: 276
    NumDocuments: 4

Tips

  • If you intend to use a held-out test set for your work, then partition your text data before using bagOfWords, as in the sketch below. Otherwise, the bag-of-words model may bias your analysis.
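
For example, this minimal sketch holds out a random 10% of a tokenizedDocument array documents before creating the model. The split ratio and the randperm-based indexing are illustrative choices:

% Randomly partition the documents into training and test sets.
numDocuments = numel(documents);
idx = randperm(numDocuments);
numTest = ceil(0.1*numDocuments);
testDocuments = documents(idx(1:numTest));        % held out; not passed to bagOfWords
trainDocuments = documents(idx(numTest+1:end));
bag = bagOfWords(trainDocuments);                 % model built from training data only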

Version History

Introduced in R2017b