trainWordEmbedding
Train word embedding
Syntax
emb = trainWordEmbedding(filename)
emb = trainWordEmbedding(documents)
emb = trainWordEmbedding(___,Name,Value)
Description
emb = trainWordEmbedding(filename) trains a word embedding using the training data stored in the text file filename.
emb = trainWordEmbedding(documents) trains a word embedding using documents.
emb = trainWordEmbedding(___,Name,Value) specifies additional options using one or more name-value pair arguments. For example, 'Dimension',50 specifies the word embedding dimension to be 50.
Examples
Train Word Embedding from File
Train a word embedding of dimension 100 using the example text file sonnetsPreprocessed.txt. This file contains preprocessed versions of Shakespeare's sonnets, with one sonnet per line and words separated by a space.
filename = "sonnetsPreprocessed.txt";
emb = trainWordEmbedding(filename)
Training: 100% Loss: 3.23452 Remaining time: 0 hours 0 minutes.
emb = 
  wordEmbedding with properties:

     Dimension: 100
    Vocabulary: ["thy" "thou" "love" "thee" "doth" "mine" "shall" "eyes" "sweet" "time" "nor" "beauty" "yet" "art" "heart" "o" "thine" "hath" "fair" "make" "still" ...] (1x401 string)
View the word embedding in a text scatter plot using tsne.
words = emb.Vocabulary;
V = word2vec(emb,words);
XY = tsne(V);
textscatter(XY,words)
Train Word Embedding from Documents
Train a word embedding using the example data sonnetsPreprocessed.txt. This file contains preprocessed versions of Shakespeare's sonnets, with one sonnet per line and words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.
filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);
Train a word embedding using trainWordEmbedding.
emb = trainWordEmbedding(documents)
Training: 100% Loss: 2.95291 Remaining time: 0 hours 0 minutes.
emb = 
  wordEmbedding with properties:

     Dimension: 100
    Vocabulary: ["thy" "thou" "love" "thee" "doth" "mine" "shall" "eyes" "sweet" "time" "nor" "beauty" "yet" "art" "heart" "o" "thine" "hath" "fair" "make" "still" ...] (1x401 string)
Visualize the word embedding in a text scatter plot using tsne.
words = emb.Vocabulary;
V = word2vec(emb,words);
XY = tsne(V);
textscatter(XY,words)
Specify Word Embedding Options
Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets, with one sonnet per line and words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.
filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);
Specify the word embedding dimension to be 50. To reduce the number of words discarded by the model, set 'MinCount' to 3. To train for longer, set the number of epochs to 10.
emb = trainWordEmbedding(documents, ...
    'Dimension',50, ...
    'MinCount',3, ...
    'NumEpochs',10)
Training: 100% Loss: 3.13646 Remaining time: 0 hours 0 minutes.
emb = 
  wordEmbedding with properties:

     Dimension: 50
    Vocabulary: ["thy" "thou" "love" "thee" "doth" "mine" "shall" "eyes" "sweet" "time" "nor" "beauty" "yet" "art" "heart" "o" "thine" "hath" "fair" "make" "still" ...] (1x750 string)
View the word embedding in a text scatter plot using tsne.
words = emb.Vocabulary;
V = word2vec(emb,words);
XY = tsne(V);
textscatter(XY,words)
Input Arguments
filename — Name of file
string scalar | character vector | 1-by-1 cell array containing a character vector
Name of the file, specified as a string scalar, character vector, or a 1-by-1 cell array containing a character vector.
Data Types: string | char | cell
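As a minimal sketch using the example file from above, the file name accepts any of these forms interchangeably:
emb = trainWordEmbedding("sonnetsPreprocessed.txt");     % string scalar
emb = trainWordEmbedding('sonnetsPreprocessed.txt');     % character vector
emb = trainWordEmbedding({'sonnetsPreprocessed.txt'});   % 1-by-1 cell array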
documents — Input documents
tokenizedDocument array
Input documents, specified as a tokenizedDocument array.
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: 'Dimension',50 specifies the word embedding dimension to be 50.
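As a sketch, assuming the documents array from the examples above, these two calls are equivalent; the name=value form requires R2021a or later:
emb = trainWordEmbedding(documents,Dimension=50);    % R2021a and later
emb = trainWordEmbedding(documents,'Dimension',50);  % all releases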
Dimension — Dimension of word embedding
100 (default) | positive integer
Dimension of the word embedding, specified as the comma-separated pair consisting of 'Dimension' and a positive integer.
Example: 300
Window — Size of context window
5 (default) | nonnegative integer
Size of the context window, specified as the comma-separated pair consisting of 'Window' and a nonnegative integer.
Example: 10
Model — Model
'skipgram' (default) | 'cbow'
Model, specified as the comma-separated pair consisting of 'Model' and 'skipgram' (skip-gram) or 'cbow' (continuous bag-of-words).
Example: 'cbow'
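A minimal sketch, assuming the documents array from the examples above, that trains a continuous bag-of-words model instead of the default skip-gram model:
emb = trainWordEmbedding(documents,'Model','cbow');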
DiscardFactor — Factor to determine word discard rate
1e-4 (default) | positive scalar
Factor to determine the word discard rate, specified as the comma-separated pair consisting of 'DiscardFactor' and a positive scalar. The function discards a word from the input window with probability 1 - sqrt(t/f) - t/f, where f is the unigram probability of the word and t is DiscardFactor. Usually, DiscardFactor is in the range 1e-3 through 1e-5.
Example: 0.005
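As an illustrative calculation (not a call into the function), a word with a hypothetical unigram probability of 0.01 is discarded with probability 0.89 under the default DiscardFactor:
t = 1e-4;                         % default DiscardFactor
f = 0.01;                         % hypothetical unigram probability of a frequent word
pDiscard = 1 - sqrt(t/f) - t/f    % 1 - 0.1 - 0.01 = 0.89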
LossFunction — Loss function
'ns' (default) | 'hs' | 'softmax'
Loss function, specified as the comma-separated pair consisting of 'LossFunction' and 'ns' (negative sampling), 'hs' (hierarchical softmax), or 'softmax' (softmax).
Example: 'hs'
NumNegativeSamples — Number of negative samples
5 (default) | positive integer
Number of negative samples for the negative sampling loss function, specified as the comma-separated pair consisting of 'NumNegativeSamples' and a positive integer. This option is valid only when LossFunction is 'ns'.
Example: 10
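A minimal sketch, assuming the documents array from the examples above, that pairs the negative sampling loss with 10 negative samples:
emb = trainWordEmbedding(documents,'LossFunction','ns','NumNegativeSamples',10);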
NumEpochs — Number of epochs
5 (default) | positive integer
Number of epochs for training, specified as the comma-separated pair consisting of 'NumEpochs' and a positive integer.
Example: 10
MinCount — Minimum count of words
5 (default) | positive integer
Minimum count of words to include in the embedding, specified as the comma-separated pair consisting of 'MinCount' and a positive integer. The function discards from the vocabulary any word that appears fewer than MinCount times in the training data.
Example: 10
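As a sketch of the effect, assuming the documents array from the examples above, a lower MinCount retains rarer words (compare the 1x401 and 1x750 vocabularies in the examples):
embStrict = trainWordEmbedding(documents);               % MinCount is 5 by default
embLoose = trainWordEmbedding(documents,'MinCount',3);   % keeps rarer words
numel(embStrict.Vocabulary)   % smaller vocabulary
numel(embLoose.Vocabulary)    % larger vocabulary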
NGramRange — Inclusive range for subword n-grams
[3 6] (default) | vector of two nonnegative integers
Inclusive range for subword n-grams, specified as the comma-separated pair consisting of 'NGramRange' and a vector of two nonnegative integers [min max]. If you do not want to use n-grams, then set 'NGramRange' to [0 0].
Example: [5 10]
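A minimal sketch, assuming the documents array from the examples above, that trains on whole words only by disabling subword n-grams:
emb = trainWordEmbedding(documents,'NGramRange',[0 0]);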
InitialLearnRate — Initial learn rate
0.05 (default) | positive scalar
Initial learn rate, specified as the comma-separated pair consisting of 'InitialLearnRate' and a positive scalar.
Example: 0.01
UpdateRate — Rate for updating learn rate
100 (default) | positive integer
Rate for updating the learn rate, specified as the comma-separated pair consisting of 'UpdateRate' and a positive integer. The learn rate decreases to zero linearly in steps every N words, where N is the UpdateRate.
Example: 50
Verbose — Verbosity level
1 (default) | 0
Verbosity level, specified as the comma-separated pair consisting of 'Verbose' and one of the following:
0 – Do not display verbose output.
1 – Display progress information.
Example: 'Verbose',0
Output Arguments
emb — Output word embedding
word embedding
Output word embedding, returned as a wordEmbedding object.
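As a sketch of downstream use, assuming a trained embedding emb whose vocabulary contains "love" (as in the examples above), you can map words to vectors and back with the word2vec and vec2word functions:
v = word2vec(emb,"love");   % 1-by-emb.Dimension row vector
w = vec2word(emb,v)         % word in the embedding closest to v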
More About
Language Considerations
File input to the trainWordEmbedding function requires words separated by whitespace.
For files containing non-English text, you might need to input a tokenizedDocument array to trainWordEmbedding.
To create a tokenizedDocument array from pretokenized text, use the tokenizedDocument function and set the 'TokenizeMethod' option to 'none'.
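A minimal sketch with hypothetical pretokenized French text, where each cell element is a string array containing the tokens of one document:
str = {["la" "nuit" "tombe"]
       ["le" "jour" "se" "lève"]};
documents = tokenizedDocument(str,'TokenizeMethod','none');
% documents can now be passed to trainWordEmbedding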
Tips
The training algorithm uses the number of threads given by the function maxNumCompThreads. To learn how to change the number of threads used by MATLAB®, see maxNumCompThreads.
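A minimal sketch, assuming the documents array from the examples above, that caps training at four computational threads and then restores the previous setting:
nPrev = maxNumCompThreads(4);   % set the thread count; returns the previous value
emb = trainWordEmbedding(documents);
maxNumCompThreads(nPrev);       % restore the previous thread count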
Version History
Introduced in R2017b