This example shows how to define a text decoder model function.

In the context of deep learning, a decoder is the part of a deep learning network that maps a latent vector to some sample space. You can decode the vectors for various tasks. For example:

- Text generation by initializing a recurrent network with the encoded vector.
- Sequence-to-sequence translation by using the encoded vector as a context vector.
- Image captioning by using the encoded vector as a context vector.

Load the encoded data from `sonnetsEncoded.mat`. This MAT file contains the word encoding, a mini-batch of sequences `dlX`, and the corresponding encoded data `dlZ` output by the encoder used in the example Define Text Encoder Model Function (Deep Learning Toolbox).

```
s = load("sonnetsEncoded.mat");
enc = s.enc;
dlX = s.dlX;
dlZ = s.dlZ;
[latentDimension,miniBatchSize] = size(dlZ,1:2);
```

The goal of the decoder is to generate sequences given some initial input data and network state.

Initialize the parameters for the following model.

The decoder reconstructs the input using an LSTM initialized with the encoder output. For each time step, the decoder predicts the next time step and uses the output as the input for the next time-step prediction. Both the encoder and the decoder use the same embedding.

This model uses three operations:

- The embedding maps word indices in the range 1 through `vocabularySize` to vectors of dimension `embeddingDimension`, where `vocabularySize` is the number of words in the encoding vocabulary and `embeddingDimension` is the number of components learned by the embedding.
- The LSTM operation takes as input a single word vector and outputs a 1-by-`numHiddenUnits` vector, where `numHiddenUnits` is the number of hidden units in the LSTM operation. The initial state of the LSTM network (the state at the first time step) is the encoded vector, so the number of hidden units must match the latent dimension of the encoder.
- The fully connected operation multiplies the input by a weight matrix, adds a bias, and outputs vectors of size `vocabularySize`.
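To make the sizes concrete, the following sketch traces a single word index through the three operations using plain numeric arrays. It is only an illustration under assumed toy sizes: it does not call `embed`, `lstm`, or `fullyconnect`, the names `embWeights`, `fcWeights`, `fcBias`, and `h` are hypothetical, and a random column vector stands in for the LSTM output.

```matlab
% Toy sizes (assumptions for illustration only).
vocabularySize = 10;
embeddingDimension = 4;
numHiddenUnits = 3;

% Embedding: each column holds the vector for one word index.
embWeights = randn(embeddingDimension,vocabularySize);
wordIndex = 7;
wordVector = embWeights(:,wordIndex);    % embeddingDimension-by-1

% Stand-in for the LSTM output at one time step (stored as a column here).
h = randn(numHiddenUnits,1);             % numHiddenUnits-by-1

% Fully connected: multiply by the weights and add the bias.
fcWeights = randn(vocabularySize,numHiddenUnits);
fcBias = zeros(vocabularySize,1);
scores = fcWeights*h + fcBias;           % vocabularySize-by-1

disp(size(wordVector))
disp(size(scores))
```

Each entry of `scores` corresponds to one word in the vocabulary, which is why the fully connected output size must equal `vocabularySize`.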

Specify the dimensions of the parameters. The embedding sizes must match the encoder.

```
embeddingDimension = 100;
vocabularySize = enc.NumWords;
numHiddenUnits = latentDimension;
```

Create a struct for the parameters.

```
parameters = struct;
```

Initialize the weights of the embedding with the Gaussian initializer using the `initializeGaussian` function, which is attached to this example as a supporting file. Specify a mean of 0 and a standard deviation of 0.01. To learn more, see Gaussian Initialization (Deep Learning Toolbox).

```
mu = 0;
sigma = 0.01;
parameters.emb.Weights = initializeGaussian([embeddingDimension vocabularySize],mu,sigma);
```
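As a rough illustration of what Gaussian initialization does, the sketch below samples an array with the same mean and standard deviation using only `randn`. It is not the supporting file itself, which may differ (for example, by wrapping the result in a `dlarray`). The size `[100 3595]` assumes the embedding dimension and vocabulary size used in this example.

```matlab
% Sketch of Gaussian initialization with plain arrays (not the supporting file).
mu = 0;
sigma = 0.01;
sz = [100 3595];    % assumed [embeddingDimension vocabularySize]

% Scale and shift standard normal samples.
weights = randn(sz,'single')*sigma + mu;

% The sample statistics should be close to the requested values.
sampleMean = mean(weights(:));
sampleStd = std(weights(:));
disp([sampleMean sampleStd])
```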

Initialize the learnable parameters for the decoder LSTM operation:

- Initialize the input weights with the Glorot initializer using the `initializeGlorot` function, which is attached to this example as a supporting file. To learn more, see Glorot Initialization (Deep Learning Toolbox).
- Initialize the recurrent weights with the orthogonal initializer using the `initializeOrthogonal` function, which is attached to this example as a supporting file. To learn more, see Orthogonal Initialization (Deep Learning Toolbox).
- Initialize the bias with the unit forget gate initializer using the `initializeUnitForgetGate` function, which is attached to this example as a supporting file. To learn more, see Unit Forget Gate Initialization (Deep Learning Toolbox).

The sizes of the learnable parameters depend on the size of the input. Because the inputs to the LSTM operation are sequences of word vectors from the embedding operation, the number of input channels is `embeddingDimension`.

- The input weight matrix has size `4*numHiddenUnits`-by-`inputSize`, where `inputSize` is the dimension of the input data.
- The recurrent weight matrix has size `4*numHiddenUnits`-by-`numHiddenUnits`.
- The bias vector has size `4*numHiddenUnits`-by-1.

```
sz = [4*numHiddenUnits embeddingDimension];
numOut = 4*numHiddenUnits;
numIn = embeddingDimension;

parameters.lstmDecoder.InputWeights = initializeGlorot(sz,numOut,numIn);
parameters.lstmDecoder.RecurrentWeights = initializeOrthogonal([4*numHiddenUnits numHiddenUnits]);
parameters.lstmDecoder.Bias = initializeUnitForgetGate(numHiddenUnits);
```
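The three LSTM initializers can be sketched with plain numeric arrays as follows. These are textbook versions written for illustration, not the supporting files, which may differ in detail. The toy sizes are assumptions, and the unit forget gate sketch assumes the gate order input, forget, cell candidate, output.

```matlab
numHiddenUnits = 8;    % toy value (assumption)
inputSize = 5;         % toy value (assumption)

% Glorot (uniform): sample from the interval [-bound, bound].
numOut = 4*numHiddenUnits;
numIn = inputSize;
bound = sqrt(6/(numIn + numOut));
inputWeights = bound*(2*rand(numOut,numIn) - 1);

% Orthogonal: use the Q factor of a random Gaussian matrix.
Z = randn(4*numHiddenUnits,numHiddenUnits);
[Q,R] = qr(Z,0);                  % economy-size decomposition
d = sign(diag(R));
d(d == 0) = 1;
recurrentWeights = Q*diag(d);     % columns are orthonormal

% Unit forget gate: zeros everywhere except ones for the forget gate rows.
bias = zeros(4*numHiddenUnits,1);
bias(numHiddenUnits+1:2*numHiddenUnits) = 1;

% The columns of the recurrent weights satisfy Q'*Q = I up to rounding.
disp(norm(recurrentWeights'*recurrentWeights - eye(numHiddenUnits)))
```

Starting the forget gate bias at 1 biases the gate toward keeping the cell state early in training, which is the usual motivation for this initializer.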

Initialize the learnable parameters for the decoder fully connected operation:

- Initialize the weights with the Glorot initializer.
- Initialize the bias with zeros using the `initializeZeros` function, which is attached to this example as a supporting file. To learn more, see Zeros Initialization (Deep Learning Toolbox).

The sizes of the learnable parameters depend on the size of the input. Because the inputs to the fully connected operation are the outputs of the LSTM operation, the number of input channels is `numHiddenUnits`.

- The weights matrix has size `outputSize`-by-`inputSize`, where `outputSize` and `inputSize` correspond to the output and input dimensions, respectively.
- The bias vector has size `outputSize`-by-1.

To make the fully connected operation output vectors with size `vocabularySize`, specify an output size of `vocabularySize`.

```
inputSize = numHiddenUnits;
outputSize = vocabularySize;

parameters.fcDecoder.Weights = initializeGlorot([outputSize inputSize],outputSize,inputSize);
parameters.fcDecoder.Bias = initializeZeros([outputSize 1]);
```

Create the function `modelDecoder`, listed in the Decoder Model Function section of the example, that computes the output of the decoder model. The `modelDecoder` function takes as input the model parameters, sequences of word indices, and the network state, and returns the decoded sequences and the updated network state.

When training a deep learning model with a custom training loop, you must calculate the gradients of the loss with respect to the learnable parameters. This calculation depends on the output of a forward pass of the model function.

There are two common approaches to generating text data with a decoder:

- Closed loop: For each time step, make predictions using the previous prediction as input.
- Open loop: For each time step, make predictions using inputs from an external source (for example, training targets).

Closed loop generation is when the model generates data one time step at a time and uses the previous prediction as input for the next prediction. Unlike open loop generation, this process does not require any input between predictions and is best suited for scenarios without supervision, for example, a language translation model that generates output text in one go.
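The difference between the two approaches comes down to where the input for each time step comes from. The toy sketch below makes this explicit. The function `predictNext` is a hypothetical stand-in for a decoder (it simply maps a token index to the next index), and the external input sequence is made up.

```matlab
% Hypothetical stand-in for a decoder: maps a token index to the next one.
vocabularySize = 5;
predictNext = @(token) mod(token,vocabularySize) + 1;

startToken = 1;
sequenceLength = 4;

% Closed loop: feed each prediction back in as the next input.
closedLoop = zeros(1,sequenceLength);
currentInput = startToken;
for t = 1:sequenceLength
    closedLoop(t) = predictNext(currentInput);
    currentInput = closedLoop(t);
end

% Open loop: take the input for every step from an external sequence.
externalInputs = [1 3 3 2];
openLoop = zeros(1,sequenceLength);
for t = 1:sequenceLength
    openLoop(t) = predictNext(externalInputs(t));
end

disp(closedLoop)    % 2 3 4 5
disp(openLoop)      % 2 4 4 3
```

In the closed loop case, only the start token comes from outside the loop. In the open loop case, every input does.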

To use closed loop generation, first initialize the hidden state of the LSTM network with the encoder output `dlZ`.

```
state = struct;
state.HiddenState = dlZ;
state.CellState = zeros(size(dlZ),'like',dlZ);
```

For the first time step, use an array of start tokens as input for the decoder. For simplicity, extract an array of start tokens from the first time-step of the training data.

```
decoderInput = dlX(:,:,1);
```

Preallocate the decoder output to have size `vocabularySize`-by-`miniBatchSize`-by-`sequenceLength` with the same data type as `dlX`, where `sequenceLength` is the desired length of the generation, for example, the length of the training targets. For this example, specify a sequence length of 16.

```
sequenceLength = 16;
dlY = zeros(vocabularySize,miniBatchSize,sequenceLength,'like',dlX);
dlY = dlarray(dlY,'CBT');
```

For each time step, predict the next time step of the sequence using the `modelDecoder` function. After each prediction, find the indices corresponding to the maximum values of the decoder output and use these indices as the decoder input for the next time step.

```
for t = 1:sequenceLength
    [dlY(:,:,t), state] = modelDecoder(parameters,decoderInput,state);
    [~,idx] = max(dlY(:,:,t));
    decoderInput = idx;
end
```
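The index selection in the loop relies on `max` returning, for each column, the row index of the largest value. A minimal sketch with a plain numeric matrix shows the idea. The scores are made up for illustration and are not output from the decoder.

```matlab
% Toy scores: 4 words (rows) by 3 observations (columns).
scores = [0.1 0.7 0.2
          0.6 0.1 0.1
          0.2 0.1 0.3
          0.1 0.1 0.4];

% Row index of the highest-scoring word for each observation.
[~,idx] = max(scores,[],1);
disp(idx)    % 2 1 4
```

Each entry of `idx` is a word index, so it can be fed back into the embedding as the input for the next time step.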

The output is a `vocabularySize`-by-`miniBatchSize`-by-`sequenceLength` array.

```
size(dlY)
```

```
ans = 1×3

        3595          32          16
```

This code snippet shows an example of performing closed loop generation in a model gradients function.

```
function gradients = modelGradients(parameters,dlX,sequenceLengths)

    % Encode input.
    dlZ = modelEncoder(parameters,dlX,sequenceLengths);

    % Initialize LSTM state.
    state = struct;
    state.HiddenState = dlZ;
    state.CellState = zeros(size(dlZ),'like',dlZ);

    % Initialize decoder input.
    decoderInput = dlX(:,:,1);

    % Closed loop prediction.
    vocabularySize = size(parameters.fcDecoder.Weights,1);
    miniBatchSize = size(dlX,2);
    sequenceLength = size(dlX,3);
    dlY = zeros(vocabularySize,miniBatchSize,sequenceLength,'like',dlX);
    for t = 1:sequenceLength
        [dlY(:,:,t), state] = modelDecoder(parameters,decoderInput,state);
        [~,idx] = max(dlY(:,:,t));
        decoderInput = idx;
    end

    % Calculate loss.
    % ...

    % Calculate gradients.
    % ...

end
```

When training with closed loop generation, predicting the most likely word for each step in the sequence can lead to suboptimal results. For example, in an image captioning workflow, if the decoder predicts that the first word of a caption is "a" when given an image of an elephant, then the probability of predicting "elephant" for the next word drops sharply because the phrase "a elephant" is extremely unlikely to appear in English text.

To help the network converge faster, you can use *teacher forcing:* use the target values as input to the decoder instead of the previous predictions. Using teacher forcing helps the network to learn characteristics from the later time steps of the sequences without having to wait for the network to correctly generate the earlier time steps of the sequences.

To perform teacher forcing, use the `modelDecoder` function directly with the target sequence as input.

Initialize the hidden state of the LSTM network with the encoder output `dlZ`.

```
state = struct;
state.HiddenState = dlZ;
state.CellState = zeros(size(dlZ),'like',dlZ);
```

Make predictions using the target sequence as input.

```
dlY = modelDecoder(parameters,dlX,state);
```

The output is a `vocabularySize`-by-`miniBatchSize`-by-`sequenceLength` array, where `sequenceLength` is the length of the input sequences.

```
size(dlY)
```

```
ans = 1×3

        3595          32          14
```

This code snippet shows an example of performing teacher forcing in a model gradients function.

```
function gradients = modelGradients(parameters,dlX,sequenceLengths)

    % Encode input.
    dlZ = modelEncoder(parameters,dlX,sequenceLengths);

    % Initialize LSTM state.
    state = struct;
    state.HiddenState = dlZ;
    state.CellState = zeros(size(dlZ),'like',dlZ);

    % Teacher forcing.
    dlY = modelDecoder(parameters,dlX,state);

    % Calculate loss.
    % ...

    % Calculate gradients.
    % ...

end
```

The `modelDecoder` function takes as input the model parameters, sequences of word indices, and the network state, and returns the decoded sequences.

Because the `lstm` function is *stateful* (when given a time series as input, the function propagates and updates the state between each time step) and the `embed` and `fullyconnect` functions are time-distributed by default (when given a time series as input, the functions operate on each time step independently), the `modelDecoder` function supports both sequence and single time-step inputs.

```
function [dlY,state] = modelDecoder(parameters,dlX,state)

    % Embedding.
    weights = parameters.emb.Weights;
    dlX = embed(dlX,weights);

    % LSTM.
    inputWeights = parameters.lstmDecoder.InputWeights;
    recurrentWeights = parameters.lstmDecoder.RecurrentWeights;
    bias = parameters.lstmDecoder.Bias;

    hiddenState = state.HiddenState;
    cellState = state.CellState;

    [dlY,hiddenState,cellState] = lstm(dlX,hiddenState,cellState, ...
        inputWeights,recurrentWeights,bias);

    state.HiddenState = hiddenState;
    state.CellState = cellState;

    % Fully connect.
    weights = parameters.fcDecoder.Weights;
    bias = parameters.fcDecoder.Bias;
    dlY = fullyconnect(dlY,weights,bias);

end
```
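To illustrate what it means for the LSTM operation to carry state between time steps, the sketch below applies a textbook LSTM update to one toy sequence using plain numeric arrays. It is not the implementation of the `lstm` function. The sizes and variable names are assumptions, and the weights are assumed to be stacked in the gate order input, forget, cell candidate, output.

```matlab
numHiddenUnits = 3;    % toy value (assumption)
inputSize = 2;         % toy value (assumption)

W = randn(4*numHiddenUnits,inputSize);         % input weights
R = randn(4*numHiddenUnits,numHiddenUnits);    % recurrent weights
b = zeros(4*numHiddenUnits,1);                 % bias

sigmoid = @(z) 1./(1 + exp(-z));

% Row indices of each gate in the stacked weights (assumed order).
iIdx = 1:numHiddenUnits;           % input gate
fIdx = numHiddenUnits + iIdx;      % forget gate
gIdx = 2*numHiddenUnits + iIdx;    % cell candidate
oIdx = 3*numHiddenUnits + iIdx;    % output gate

h = zeros(numHiddenUnits,1);    % hidden state
c = zeros(numHiddenUnits,1);    % cell state
X = randn(inputSize,5);         % one sequence with 5 time steps

for t = 1:size(X,2)
    z = W*X(:,t) + R*h + b;
    c = sigmoid(z(fIdx)).*c + sigmoid(z(iIdx)).*tanh(z(gIdx));
    h = sigmoid(z(oIdx)).*tanh(c);    % state carried into the next step
end

disp(h')
```

Processing the five time steps in one pass or one step at a time, passing `h` and `c` along, gives the same final state, which is why a decoder built this way can handle both sequence and single time-step inputs.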

See also: `doc2sequence` | `tokenizedDocument` | `word2ind` | `wordEncoding`