removeInfrequentNgrams

Remove infrequently seen n-grams from bag-of-n-grams model

Description

newBag = removeInfrequentNgrams(bag,count) removes from the bag-of-n-grams model bag the n-grams that appear at most count times in total. By default, the function is case sensitive.

newBag = removeInfrequentNgrams(bag,count,'NgramLengths',lengths) removes only the infrequent n-grams whose lengths are specified by lengths. N-grams of other lengths are kept regardless of how often they appear. By default, the function is case sensitive.

newBag = removeInfrequentNgrams(___,'IgnoreCase',true) ignores case when counting, and removes the n-grams that appear at most count times. If n-grams differ only by case, then the corresponding counts are merged before the threshold is applied.
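The effect of 'IgnoreCase' is easiest to see on a small input. The following is a minimal sketch, not taken from the original page: the two documents are made-up data, chosen so that some bigrams differ only by case. It assumes Text Analytics Toolbox is installed.

```matlab
% Illustrative data (an assumption, not from the shipped examples):
% "An example"/"an example" and "of text"/"of Text" differ only by case.
documents = tokenizedDocument([
    "An example of text"
    "an example of Text"]);
bag = bagOfNgrams(documents,'NgramLengths',2);

% Default, case-sensitive behavior: each case variant is counted separately.
caseSensitive = removeInfrequentNgrams(bag,1)

% With 'IgnoreCase', counts of n-grams that differ only by case are merged
% before the threshold is applied.
caseInsensitive = removeInfrequentNgrams(bag,1,'IgnoreCase',true)
```

Comparing the NumNgrams property of the two results shows how many n-grams survive only because their case variants were counted together.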

Examples

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-n-grams model that counts bigrams (pairs of words) and trigrams (triples of words).

bag = bagOfNgrams(documents,'NgramLengths',[2 3])
bag = 
  bagOfNgrams with properties:

          Counts: [154x18022 double]
      Vocabulary: ["fairest"    "creatures"    "desire"    "increase"    "thereby"    "beautys"    "rose"    "might"    "never"    "die"    "riper"    "time"    "decease"    "tender"    "heir"    "bear"    "memory"    "thou"    ...    ] (1x3092 string)
          Ngrams: [18022x3 string]
    NgramLengths: [2 3]
       NumNgrams: 18022
    NumDocuments: 154

Remove n-grams of any length that appear two or fewer times in total.

bag = removeInfrequentNgrams(bag,2)
bag = 
  bagOfNgrams with properties:

          Counts: [154x103 double]
      Vocabulary: ["thine"    "thy"    "self"    "sweet"    "thou"    "time"    "why"    "dost"    "upon"    "eye"    "thee"    "ten"    "beauty"    "love"    "wilt"    "dear"    "truth"    "own"    "yet"    "hast"    "mens"    ...    ] (1x73 string)
          Ngrams: [103x3 string]
    NgramLengths: [2 3]
       NumNgrams: 103
    NumDocuments: 154

Remove bigrams that appear four or fewer times in total.

bag = removeInfrequentNgrams(bag,4,'NgramLengths',2)
bag = 
  bagOfNgrams with properties:

          Counts: [154x41 double]
      Vocabulary: ["thine"    "thy"    "sweet"    "thou"    "dost"    "upon"    "why"    "thee"    "ten"    "love"    "dear"    "hast"    "true"    "mine"    "beauty"    "fair"    "own"    "self"    "art"    "times"    "shouldst"    ...    ] (1x30 string)
          Ngrams: [41x3 string]
    NgramLengths: [2 3]
       NumNgrams: 41
    NumDocuments: 154

Input Arguments

bag

Input bag-of-n-grams model, specified as a bagOfNgrams object.

count

Count threshold, specified as a positive integer. The function removes the n-grams that appear count times in total or fewer.
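The threshold is inclusive: an n-gram that appears exactly count times is also removed. As a rough sketch of this rule, the following code predicts which n-grams will be removed by summing the columns of the Counts property, and then compares the prediction with the function result. The documents are made-up data, and the variable names totals and expectedRemoved are introduced here for illustration only.

```matlab
% Illustrative data (an assumption, not from the shipped examples).
documents = tokenizedDocument([
    "a b a b a b"
    "a b c d"]);
bag = bagOfNgrams(documents,'NgramLengths',2);

% Total number of occurrences of each n-gram across all documents.
totals = full(sum(bag.Counts,1));

% N-grams at or below the threshold are the ones expected to be removed.
count = 2;
expectedRemoved = bag.Ngrams(totals <= count,:)

% Apply the function and inspect what remains.
newBag = removeInfrequentNgrams(bag,count)
```

Here the bigram "b a" appears exactly twice in total, so it falls at the threshold and is listed in expectedRemoved.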

lengths

N-gram lengths, specified as a positive integer or a vector of positive integers.

If you specify lengths, the function removes infrequent n-grams of the specified lengths only. If you do not specify lengths, then the function removes infrequent n-grams regardless of length.

Example: [1 2 3]
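To see how restricting the lengths changes the result, the following minimal sketch applies the same threshold twice to one model: once to all n-grams and once to bigrams only. The documents are made-up data, and the variable names allLengths and onlyBigrams are introduced here for illustration.

```matlab
% Illustrative data (an assumption, not from the shipped examples).
documents = tokenizedDocument([
    "a b c a b c"
    "a b d"]);

% Count both unigrams and bigrams.
bag = bagOfNgrams(documents,'NgramLengths',[1 2]);

% Remove infrequent n-grams regardless of length.
allLengths = removeInfrequentNgrams(bag,2)

% Remove infrequent bigrams only. Infrequent unigrams are kept.
onlyBigrams = removeInfrequentNgrams(bag,2,'NgramLengths',2)
```

Because the second call leaves unigrams untouched, onlyBigrams retains at least as many n-grams as allLengths.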

Output Arguments

newBag

Output bag-of-n-grams model, returned as a bagOfNgrams object.

Version History

Introduced in R2018a