# Analyze Text Data Using Multiword Phrases

This example shows how to analyze text using n-gram frequency counts.

An n-gram is a tuple of $\mathit{n}$ consecutive words. For example, a bigram (the case when $\mathit{n}=2$) is a pair of consecutive words such as "heavy rainfall". A unigram (the case when $\mathit{n}=1$) is a single word. A bag-of-n-grams model records the number of times that different n-grams appear in document collections.
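To make the definition concrete, the following minimal sketch forms the bigrams of one sentence using only base MATLAB string functions. The sentence and the variable names are illustrative assumptions and are not taken from the example data.

```matlab
% Illustrative sentence (an assumption, not from the example data).
str = "heavy rainfall caused local flooding";

% Split on whitespace to get a 5-by-1 string array of unigrams.
words = split(str);

% Pair each word with the word that follows it. Each row is one bigram.
bigrams = [words(1:end-1) words(2:end)];
disp(bigrams)
```

A sequence of $\mathit{m}$ words contains $\mathit{m}-\mathit{n}+1$ n-grams, so the five words here give four bigrams.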

Using a bag-of-n-grams model, you can retain more information on word ordering in the original text data than with a bag-of-words model. For example, a bag-of-n-grams model is better suited for capturing short phrases that appear in the text, such as "heavy rainfall" and "thunderstorm winds".

To create a bag-of-n-grams model, use `bagOfNgrams`. You can input `bagOfNgrams` objects into other Text Analytics Toolbox functions such as `wordcloud` and `fitlda`.

### Load and Extract Text Data

Load the example data. The file `factoryReports.csv` contains factory reports, including a text description and categorical labels for each event. Remove the rows with empty reports.

```matlab
filename = "factoryReports.csv";
data = readtable(filename,TextType="string");
```
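The listing above loads the table but leaves any empty descriptions in place. One possible way to drop such rows is sketched below on a small stand-in table, so that the snippet runs without the CSV file. The stand-in rows, the `ID` variable, and the name `isEmptyReport` are hypothetical. The check covers both zero-length and missing strings, because the form that empty fields take can depend on the import options.

```matlab
% Hypothetical stand-in for the table returned by readtable.
Description = ["Mixer tripped the fuses."; ""; string(missing)];
ID = [1; 2; 3];
data = table(Description,ID);

% Flag reports that are missing or have zero length.
isEmptyReport = ismissing(data.Description) | strlength(data.Description) == 0;

% Delete the flagged rows, keeping only reports that contain text.
data(isEmptyReport,:) = [];
```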

Extract the text data from the table and view the first few reports.

```matlab
textData = data.Description;
textData(1:5)
```
```
ans = 5×1 string
    "Items are occasionally getting stuck in the scanner spools."
    "Loud rattling and banging sounds are coming from assembler pistons."
    "There are cuts to the power when starting the plant."
    "Fried capacitors in the assembler."
    "Mixer tripped the fuses."
```

### Prepare Text Data for Analysis

Create a function that tokenizes and preprocesses the text data so it can be used for analysis. The function `preprocessText`, listed at the end of the example, performs the following steps:

1. Convert the text data to lowercase using `lower`.

2. Tokenize the text using `tokenizedDocument`.

3. Erase punctuation using `erasePunctuation`.

4. Remove a list of stop words (such as "and", "of", and "the") using `removeStopWords`.

5. Remove words with 2 or fewer characters using `removeShortWords`.

6. Remove words with 15 or more characters using `removeLongWords`.

7. Lemmatize the words using `normalizeWords`.

Use the example preprocessing function `preprocessText` to prepare the text data.

```matlab
documents = preprocessText(textData);
documents(1:5)
```
```
ans = 
  5×1 tokenizedDocument:

    6 tokens: item occasionally get stuck scanner spool
    7 tokens: loud rattling bang sound come assembler piston
    4 tokens: cut power start plant
    3 tokens: fry capacitor assembler
    3 tokens: mixer trip fuse
```

### Create Word Cloud of Bigrams

Create a word cloud of bigrams by first creating a bag-of-n-grams model using `bagOfNgrams`, and then inputting the model to `wordcloud`.

To count the n-grams of length 2 (bigrams), use `bagOfNgrams` with the default options.

```matlab
bag = bagOfNgrams(documents)
```
```
bag = 
  bagOfNgrams with properties:

          Counts: [480×921 double]
      Vocabulary: ["item"    "occasionally"    "get"    "stuck"    "scanner"    "loud"    "rattling"    "bang"    "sound"    "come"    "assembler"    "cut"    "power"    "start"    "fry"    "capacitor"    "mixer"    "trip"    "burst"    "pipe"    …    ]
          Ngrams: [921×2 string]
    NgramLengths: 2
       NumNgrams: 921
    NumDocuments: 480
```
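To see what the `Counts` and `Ngrams` properties hold, the sketch below builds a bag-of-n-grams model from three short made-up documents and looks up the total count of one bigram. The documents and the names `toyDocuments`, `toyBag`, `isTarget`, and `totalCount` are assumptions for illustration only.

```matlab
% Three made-up documents (assumptions, not from factoryReports.csv).
toyDocuments = tokenizedDocument([
    "heavy rainfall today"
    "heavy rainfall yesterday"
    "thunderstorm winds today"]);

% Count bigrams (the default n-gram length is 2).
toyBag = bagOfNgrams(toyDocuments);

% Each row of Ngrams is one bigram. Each column of Counts corresponds to
% one row of Ngrams, and each row of Counts corresponds to one document.
isTarget = all(toyBag.Ngrams == ["heavy" "rainfall"],2);

% Total number of occurrences of "heavy rainfall" across all documents.
totalCount = full(sum(toyBag.Counts(:,isTarget)))
```

Here the three documents contain five distinct bigrams, and "heavy rainfall" occurs once in each of the first two documents.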

Visualize the bag-of-n-grams model using a word cloud.

```matlab
figure
wordcloud(bag);
title("Text Data: Preprocessed Bigrams")
```

### Fit Topic Model to Bag-of-N-Grams

A Latent Dirichlet Allocation (LDA) model is a topic model that discovers underlying topics in a collection of documents and infers the word probabilities in topics.

Create an LDA topic model with 10 topics using `fitlda`. The function fits an LDA model by treating the n-grams as single words.

`mdl = fitlda(bag,10,Verbose=0);`

Visualize the first four topics as word clouds.

```matlab
figure
tiledlayout("flow");
for i = 1:4
    nexttile
    wordcloud(mdl,i);
    title("LDA Topic " + i)
end
```

The word clouds highlight commonly co-occurring bigrams in the LDA topics. The `wordcloud` function plots the bigrams with sizes according to their probabilities for the specified LDA topics.
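The per-topic probabilities are also available numerically. The following sketch uses a small made-up corpus so that it runs quickly, and reads the probabilities from the `TopicWordProbabilities` property of the fitted model. Seeding the random number generator with `rng` is an addition here, made on the assumption that you want repeated fits to start from the same random state. The corpus and the names `toyDocuments`, `toyBag`, `toyMdl`, `probs`, and `columnSums` are hypothetical.

```matlab
% Made-up corpus (an assumption, not from factoryReports.csv).
toyDocuments = tokenizedDocument([
    "heavy rainfall caused flooding"
    "thunderstorm winds damaged roof"
    "heavy rainfall closed road"
    "thunderstorm winds downed tree"]);
toyBag = bagOfNgrams(toyDocuments);

% Seed the generator so repeated runs start from the same random state.
rng("default")

% Fit a two-topic LDA model. Each bigram is treated as a single word.
toyMdl = fitlda(toyBag,2,Verbose=0);

% One column per topic. Each column is a probability distribution.
probs = toyMdl.TopicWordProbabilities;
columnSums = sum(probs,1)
```

Each column of `probs` sums to 1, up to floating-point rounding.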

### Analyze Text Using Longer Phrases

To analyze text using longer phrases, specify a larger value for the `NgramLengths` option of `bagOfNgrams`.

When working with longer phrases, it can be useful to keep stop words in the model. For example, to detect the phrase "is not happy", keep the stop words "is" and "not" in the model.
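The effect can be checked on a single sentence. In this sketch, the sentence and the variable names are assumptions. It compares the words left after `removeStopWords` with the trigrams obtained when the stop words stay in place.

```matlab
% Illustrative sentence (an assumption, not from the example data).
doc = tokenizedDocument("the customer is not happy");

% With stop words removed, the words of the phrase are no longer all present.
remainingWords = string(removeStopWords(doc))

% With stop words kept, the phrase survives as one of the trigrams.
toyBag = bagOfNgrams(doc,NgramLengths=3);
hasPhrase = any(all(toyBag.Ngrams == ["is" "not" "happy"],2))
```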

Preprocess the text. Erase the punctuation using `erasePunctuation`, and tokenize using `tokenizedDocument`.

```matlab
cleanTextData = erasePunctuation(textData);
documents = tokenizedDocument(cleanTextData);
```

To count the n-grams of length 3 (trigrams), use `bagOfNgrams` and specify `NgramLengths` to be 3.

`bag = bagOfNgrams(documents,NgramLengths=3);`
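The n-gram length option also accepts a vector, which counts n-grams of several lengths in one model. A minimal sketch on a made-up document follows; the text and the names `toyDocuments`, `toyBag`, and `numNgrams` are assumptions for illustration.

```matlab
% Made-up document with six words (an assumption, not from the data).
toyDocuments = tokenizedDocument("coolant is spraying from the mixer");

% Count bigrams and trigrams in the same model.
toyBag = bagOfNgrams(toyDocuments,NgramLengths=[2 3]);

% Six words give five bigrams and four trigrams, all distinct here.
numNgrams = toyBag.NumNgrams
```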

Visualize the bag-of-n-grams model using a word cloud. The word cloud of trigrams better shows the context of the individual words.

```matlab
figure
wordcloud(bag);
title("Text Data: Trigrams")
```

View the top 10 trigrams and their frequency counts using `topkngrams`.

```matlab
tbl = topkngrams(bag,10)
```
```
tbl=10×3 table
                  Ngram                   Count    NgramLength
    __________________________________    _____    ___________

    "in"       "the"         "mixer"       14           3     
    "in"       "the"         "scanner"     13           3     
    "blown"    "in"          "the"          9           3     
    "the"      "robot"       "arm"          7           3     
    "stuck"    "in"          "the"          6           3     
    "is"       "spraying"    "coolant"      6           3     
    "from"     "time"        "to"           6           3     
    "time"     "to"          "time"         6           3     
    "heard"    "in"          "the"          6           3     
    "on"       "the"         "floor"        6           3     
```
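When a model holds n-grams of more than one length, `topkngrams` can restrict the ranking to a single length through its `NgramLengths` option. The sketch below shows this on three made-up documents; the documents and the names `toyDocuments`, `toyBag`, and `toyTbl` are assumptions for illustration.

```matlab
% Three made-up documents (assumptions, not from factoryReports.csv).
toyDocuments = tokenizedDocument([
    "stuck in the scanner"
    "stuck in the mixer"
    "noise in the mixer"]);

% Count both bigrams and trigrams.
toyBag = bagOfNgrams(toyDocuments,NgramLengths=[2 3]);

% Rank only the trigrams and return the two most frequent.
toyTbl = topkngrams(toyBag,2,NgramLengths=3)
```

In this toy corpus, "stuck in the" and "in the mixer" each occur twice, and the other trigrams occur once.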

### Example Preprocessing Function

The function `preprocessText` performs the following steps in order:

1. Convert the text data to lowercase using `lower`.

2. Tokenize the text using `tokenizedDocument`.

3. Erase punctuation using `erasePunctuation`.

4. Remove a list of stop words (such as "and", "of", and "the") using `removeStopWords`.

5. Remove words with 2 or fewer characters using `removeShortWords`.

6. Remove words with 15 or more characters using `removeLongWords`.

7. Lemmatize the words using `normalizeWords`.

```matlab
function documents = preprocessText(textData)

% Convert the text data to lowercase.
cleanTextData = lower(textData);

% Tokenize the text.
documents = tokenizedDocument(cleanTextData);

% Erase punctuation.
documents = erasePunctuation(documents);

% Remove a list of stop words.
documents = removeStopWords(documents);

% Remove words with 2 or fewer characters, and words with 15 or more
% characters.
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);

% Lemmatize the words.
documents = addPartOfSpeechDetails(documents);
documents = normalizeWords(documents,Style="lemma");

end
```