join

Combine multiple bag-of-words or bag-of-n-grams models

Description

newBag = join(bag) combines the elements in the array bag by merging the frequency counts. The function combines the elements along the first dimension whose size is not equal to 1.

newBag = join(bag,dim) combines the elements in the array bag along the dimension dim.
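
As a minimal sketch of the two syntaxes, assuming Text Analytics Toolbox is available: for a 1-by-2 array, the first dimension with a size not equal to 1 is dimension 2, so both calls below should combine the same elements. The variable names (models, bagDefault, bagExplicit) are illustrative only.

```matlab
% Two single-document models stored in a 1-by-2 array.
documents = tokenizedDocument([ ...
    "an example of a short sentence"
    "a second short sentence"]);
models(1) = bagOfWords(documents(1));
models(2) = bagOfWords(documents(2));

% Default syntax: join along the first dimension whose size is not 1.
bagDefault = join(models);

% Explicit syntax: name the dimension to join along.
bagExplicit = join(models,2);

% Compare the merged vocabularies of the two results.
sameVocabulary = isequal(bagDefault.Vocabulary,bagExplicit.Vocabulary)
```

Specifying dim explicitly can make the intent clearer when the shape of the input array is not known in advance.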

Examples

Create an array of two bags-of-words models from tokenized documents.

str = [ ...
    "an example of a short sentence"
    "a second short sentence"];
documents = tokenizedDocument(str);
bag(1) = bagOfWords(documents(1));
bag(2) = bagOfWords(documents(2))
bag = 
  1×2 bagOfWords array with properties:
    Counts
    Vocabulary
    NumWords
    NumDocuments

Combine the bag-of-words models using join.

bag = join(bag)
bag = 
  bagOfWords with properties:

          Counts: [2x7 double]
      Vocabulary: ["an"    "example"    "of"    "a"    "short"    "sentence"    "second"]
        NumWords: 7
    NumDocuments: 2
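
To see what the merged counts look like, you can inspect the Counts and Vocabulary properties of the joined model. The following is a sketch that rebuilds the same two models so that it runs on its own; the names models, joined, counts, and totals are illustrative, and the call to full is only a display convenience in case Counts is stored as a sparse matrix.

```matlab
% Rebuild the two models from the example and join them.
documents = tokenizedDocument([ ...
    "an example of a short sentence"
    "a second short sentence"]);
models(1) = bagOfWords(documents(1));
models(2) = bagOfWords(documents(2));
joined = join(models);

% Rows of Counts correspond to documents, columns to Vocabulary entries.
counts = full(joined.Counts);

% Total frequency of each word across both source models.
totals = sum(counts,1);

% Pair each word with its total for display.
table(joined.Vocabulary',totals','VariableNames',{'Word','Count'})
```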

If your text data is contained in multiple files in a folder, then you can import the text data and create a bag-of-words model in parallel using parfor. If you have Parallel Computing Toolbox™ installed, then the parfor loop runs in parallel; otherwise, it runs in serial. Use join to combine the resulting array of bag-of-words models into one model.

Create a list of filenames. The example sonnets have file names "exampleSonnetN.txt", where N is the number of the sonnet.

filenames = [
    "exampleSonnet1.txt"
    "exampleSonnet2.txt"
    "exampleSonnet3.txt"
    "exampleSonnet4.txt"];

Create a bag-of-words model from a collection of files. Initialize an empty bag-of-words model and then loop over the files and create a bag-of-words model for each file.

bag = bagOfWords;

numFiles = numel(filenames);
parfor i = 1:numFiles
    filename = filenames(i);
    
    textData = extractFileText(filename);
    document = tokenizedDocument(textData);
    bag(i) = bagOfWords(document);
end
Starting parallel pool (parpool) using the 'Processes' profile ...
Connected to parallel pool with 4 workers.

Combine the bag-of-words models using join.

bag = join(bag)
bag = 
  bagOfWords with properties:

          Counts: [4x276 double]
      Vocabulary: ["From"    "fairest"    "creatures"    "we"    "desire"    "increase"    ","    "That"    "thereby"    "beauty's"    "rose"    "might"    "never"    "die"    "But"    "as"    "the"    "riper"    "should"    ...    ] (1x276 string)
        NumWords: 276
    NumDocuments: 4

Input Arguments

bag is the array of bag-of-words or bag-of-n-grams models to combine, specified as a bagOfWords array or a bagOfNgrams array. If bag is a bagOfNgrams array, then each element to be joined must have the same value for the NgramLengths property.
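
The NgramLengths requirement can be illustrated with a short sketch, assuming Text Analytics Toolbox is available. Each model here is created with the same 'NgramLengths' value (2, for bigrams) so that the elements can be joined; the variable names ngramModels and newBag are illustrative.

```matlab
documents = tokenizedDocument([ ...
    "an example of a short sentence"
    "a second short sentence"]);

% Use the same n-gram length for every model so the elements can be joined.
ngramModels(1) = bagOfNgrams(documents(1),'NgramLengths',2);
ngramModels(2) = bagOfNgrams(documents(2),'NgramLengths',2);

% Confirm that the NgramLengths values match before joining.
lengthsMatch = isequal(ngramModels(1).NgramLengths,ngramModels(2).NgramLengths)

% The joined model is expected to be a bagOfNgrams object, like its input.
newBag = join(ngramModels);
class(newBag)
```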

dim is the dimension along which to join models, specified as a positive integer. If you do not specify dim, then the default is the first dimension with a size that does not equal 1.
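
For arrays with more than one non-singleton dimension, dim selects which elements are merged. The sketch below arranges four single-document models in a hypothetical 2-by-2 array named bagGrid and joins along each dimension in turn. It assumes that growing the array by indexed assignment is acceptable for bagOfWords objects. Based on the output description, each result should have a size of 1 along the joined dimension.

```matlab
documents = tokenizedDocument([ ...
    "an example of a short sentence"
    "a second short sentence"
    "a third short sentence"
    "one more example"]);

% Arrange four single-document models in a 2-by-2 array.
bagGrid(1,1) = bagOfWords(documents(1));
bagGrid(1,2) = bagOfWords(documents(2));
bagGrid(2,1) = bagOfWords(documents(3));
bagGrid(2,2) = bagOfWords(documents(4));

% Join down the rows (dimension 1): one merged model per column.
joinedRows = join(bagGrid,1);

% Join across the columns (dimension 2): one merged model per row.
joinedCols = join(bagGrid,2);

size(joinedRows)
size(joinedCols)
```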

Output Arguments

newBag is the output model, returned as a bagOfWords object or a bagOfNgrams object. newBag has the same type as bag and has a size of 1 along the dimension being joined.

Version History

Introduced in R2018a