Specify Training Options in Reinforcement Learning Designer

To configure the training of an agent in the Reinforcement Learning Designer app, specify training options on the Train tab.

The Train tab, showing example training options.

Specify Basic Options

On the Train tab, you can specify these basic training options.

Max Episodes

Maximum number of episodes to train the agent, specified as a positive integer.

Max Episode Length

Maximum number of steps to run per episode, specified as a positive integer.

Stopping Criteria

Training termination condition, specified as one of the following values:

  • AverageSteps — Stop training when the running average number of steps per episode equals or exceeds the critical value specified by Stopping Value.

  • AverageReward — Stop training when the running average reward equals or exceeds the critical value.

  • EpisodeReward — Stop training when the reward in the current episode equals or exceeds the critical value.

  • GlobalStepCount — Stop training when the total number of steps in all episodes, that is, the total number of times the agent is invoked, equals or exceeds the critical value.

  • EpisodeCount — Stop training when the number of training episodes equals or exceeds the critical value.

Stopping Value

Critical value of the training termination condition in Stopping Criteria, specified as a scalar.

Average Window Length

Window length for averaging the scores, rewards, and number of steps for the agent when either Stopping Criteria or Save agent criteria specifies an averaging condition.
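
You can also set the corresponding properties of an rlTrainingOptions object at the command line. The following is a minimal sketch; the numeric values are example placeholders, not recommended settings.

% Sketch only; the numeric values are example placeholders.
trainOpts = rlTrainingOptions( ...
    MaxEpisodes=500, ...                        % Max Episodes
    MaxStepsPerEpisode=1000, ...                % Max Episode Length
    StopTrainingCriteria="AverageReward", ...   % Stopping Criteria
    StopTrainingValue=480, ...                  % Stopping Value
    ScoreAveragingWindowLength=20);             % Average Window Length

% Train an agent with these options (agent and env must already exist).
% trainResults = train(agent, env, trainOpts);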

Specify Parallel Training Options

To enable the use of multiple processes for training, on the Train tab, click the Use Parallel button. Training agents using parallel computing requires Parallel Computing Toolbox™ software. For more information, see Train Agents Using Parallel Computing and GPUs.

To specify options for parallel training, select Use Parallel > Parallel training options.

Parallel training options dialog box

In the Parallel Training Options dialog box, you can specify these training options.

Enable parallel training

Enables using multiple processes to perform environment simulations during training. You can select this option by clicking the Use Parallel button on the Train tab.

Parallel computing mode

Parallel computing mode, specified as one of these values:

  • sync — Use parpool to run synchronous training on the available workers. The parallel pool client (the process that starts the training) updates the parameters of its actor and critic based on the results from all the workers, and then sends the updated parameters to all workers. With this option, each worker must pause execution until all workers have finished their episodes, so training advances only as fast as the slowest worker allows.

  • async — Use parpool to run asynchronous training on the available workers. In this case, workers send their data back to the client as soon as they finish and receive updated parameters from the client. The workers then continue with their task.

Transfer workspace variables to workers

Select this option to send model and workspace variables to parallel workers. When you select this option, the parallel pool client sends variables used in models and defined in the MATLAB® workspace to the workers.

Random seed for workers

Randomizer initialization for workers, specified as one of these values:

  • –1 — Assign a unique random seed to each worker. The value of the seed is the worker ID.

  • –2 — Do not assign a random seed to the workers.

  • Vector — Manually specify the random seed for each worker. The number of elements in the vector must match the number of workers.

Files to attach to parallel pool

Additional files to attach to the parallel pool. Specify names of files in the current working directory, with one name on each line.

Worker setup function

Function to run before training starts, specified as a handle to a function having no input arguments. This function runs once for each worker before training begins. Write this function to perform any processing that you need prior to training.

Worker cleanup function

Function to run after training ends, specified as a handle to a function having no input arguments. You can write this function to clean up the workspace or perform other processing after training terminates.

This figure shows an example parallel training configuration for these files and functions:

  • Data file attached to the parallel pool — workerData.mat

  • Worker setup function — mySetup.m

  • Worker cleanup function — myCleanup.m

Parallel training options dialog showing file and function information

For more information on parallel training options, see the UseParallel and ParallelizationOptions properties in rlTrainingOptions. For more information on parallel training, see Train Agents Using Parallel Computing and GPUs.
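
At the command line, the same settings map to the UseParallel property and the fields of the ParallelizationOptions property of an rlTrainingOptions object. The sketch below mirrors the example configuration above; the file and function names (workerData.mat, mySetup, myCleanup) are the placeholders from that example, not required names.

% Sketch of a parallel training configuration using the example names above.
trainOpts = rlTrainingOptions(UseParallel=true);

trainOpts.ParallelizationOptions.Mode = "async";                        % Parallel computing mode
trainOpts.ParallelizationOptions.WorkerRandomSeeds = -1;                % Unique seed per worker
trainOpts.ParallelizationOptions.TransferBaseWorkspaceVariables = "on"; % Send workspace variables
trainOpts.ParallelizationOptions.AttachedFiles = "workerData.mat";      % File attached to the pool
trainOpts.ParallelizationOptions.SetupFcn = @mySetup;                   % Runs on each worker before training
trainOpts.ParallelizationOptions.CleanupFcn = @myCleanup;               % Runs on each worker after training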

Specify Agent Evaluation Options

To enable agent evaluation at regular intervals during training, on the Train tab, click the Evaluate Agent button.

To specify agent evaluation options, select Evaluate Agent > Agent evaluation options.

Agent evaluation options dialog box

In the Agent Evaluation Options dialog box, you can specify these training options.

Enable agent evaluation

Enables periodic agent evaluation during training. You can select this option by clicking the Evaluate Agent button on the Train tab.

Number of evaluation episodes

Number of consecutive evaluation episodes, specified as a positive integer. After running the number of consecutive training episodes specified in Evaluation frequency, the software runs the number of evaluation episodes specified in this field, consecutively.

For example, if you specify 100 in the Evaluation frequency field and 3 in this field, then three evaluation episodes run, consecutively, after the first 100 training episodes. These three evaluation episodes are used to calculate a single statistic, specified by the Evaluation statistic type field. The statistic is returned as the 100th element of the EvaluationStatistics vector in the training result object created after training. After 200 training episodes, three new evaluation episodes run, and their statistic is returned as the 200th element of the vector, and so on.

The default value is 3.

Evaluation frequency

Evaluation period, specified as a positive integer. This value is the number of consecutive training episodes after which the evaluation episodes specified in the Number of evaluation episodes field run consecutively. For example, if you specify 100 in this field and 3 in the Number of evaluation episodes field, three evaluation episodes run, consecutively, after every 100 training episodes. The default value is 100.

Max evaluation episode length

Maximum number of steps to run per evaluation episode if other termination conditions are not met, specified as a positive integer. To accurately assess the stability and performance of the agent, it is often useful to specify a larger number of steps for an evaluation episode than for a training episode.

If you leave this field empty (default), the maximum number of steps per episode specified in the Max Episode Length field is used.

Evaluation random seeds

Random seeds used for evaluation episodes, specified as follows:

  • Empty field — The random seed is not initialized before an evaluation episode.

  • Nonnegative integer — The random seed is reinitialized to the specified value before the first episode of each evaluation sequence (that is, after the number of consecutive training episodes specified in the Evaluation frequency field). This is the default behavior, with the seed initialized to 1.

  • Vector of nonnegative integers with the same number of elements as the number of evaluation episodes specified in the Number of evaluation episodes field — Before each episode of an evaluation sequence, the random seed is reinitialized to the corresponding element of the specified vector. This reinitialization guarantees that the ith episode of each evaluation sequence always runs with the same random seed, which helps when comparing evaluation episodes that occur at different stages of training.

The current random seed used for training is stored before the first episode of an evaluation sequence and restored as the current seed after the evaluation sequence. This behavior ensures that running evaluation episodes does not alter the sequence of random numbers used for training, so the training results with evaluation are the same as the results without evaluation.

Evaluation statistic type

Type of evaluation statistic for each group of consecutive evaluation episodes, specified as one of these strings:

  • "MeanEpisodeReward" — Mean value of the evaluation episodes rewards. This setting is the default.

  • "MedianEpisodeReward" — Median value of the evaluation episodes rewards.

  • "MaxEpisodeReward" — Maximum value of the evaluation episodes rewards.

  • "MinEpisodeReward" — Minimum value of the evaluation episodes rewards.

This value is returned, in the training result object, as the element of the EvaluationStatistics vector corresponding to the training episode that precedes the group of consecutive evaluation episodes.

Use exploration policy

Option to use the exploration policy during evaluation episodes. When this option is disabled (default), the agent uses its base greedy policy when selecting actions during an evaluation episode. When you enable this option, the agent uses its exploration policy when selecting actions during an evaluation episode.

For more information on evaluation options, see rlEvaluator.
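
At the command line, the equivalent settings are properties of an rlEvaluator object, which you pass to the train function. A minimal sketch, using values similar to the defaults described above:

% Sketch of an evaluator configuration; values mirror the defaults above.
evl = rlEvaluator( ...
    NumEpisodes=3, ...                               % Number of evaluation episodes
    EvaluationFrequency=100, ...                     % Evaluation frequency
    MaxStepsPerEpisode=1000, ...                     % Max evaluation episode length
    RandomSeeds=[1 2 3], ...                         % One seed per evaluation episode
    EvaluationStatisticType="MeanEpisodeReward", ... % Evaluation statistic type
    UseExplorationPolicy=false);                     % Use the base greedy policy

% Pass the evaluator to train (agent, env, and trainOpts must already exist).
% trainResults = train(agent, env, trainOpts, Evaluator=evl);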

Specify Hyperparameter Tuning Options

Hyperparameter tuning consists of a series of training experiments, also referred to as trials, designed to select the best combination of hyperparameters. To enable a hyperparameter tuning session, on the Train tab, click the Tune Hyperparameters button.

To specify hyperparameter tuning options, select Tune Hyperparameters > Hyperparameter tuning options.

Hyperparameter tuning options dialog box

In the Tuning Options dialog box, you can specify these training options:

Enable hyperparameter tuning

Enables hyperparameter tuning. This option is also selected when you click the Tune Hyperparameters button on the Train tab.

Tuning algorithm

Tuning algorithm, specified as one of these values:

  • Bayesian optimization — Use Bayesian optimization. This option requires Statistics and Machine Learning Toolbox™. Bayesian optimization performs an adaptive search on the hyperparameter space. For more information, see Bayesian Optimization Algorithm (Statistics and Machine Learning Toolbox).

  • Grid search — Use grid search. Grid search performs an exhaustive sweep over the hyperparameter space.

Hyperparameter selection

The Hyperparameter selection section allows you to select up to six parameters to be tuned, depending on the agent type.

You can tune the learning rates used by the critic and, if available, the actor, as well as the mini-batch size and the discount factor. You can also tune exploration parameters, such as the action noise standard deviation and its decay rate, or the epsilon value and its decay rate.

You can enable the tuning of a hyperparameter by selecting the corresponding Optimize check box. You can then adjust the hyperparameter search range using the corresponding Min and Max fields. Check Log Scaling to space the grid points logarithmically for the search dimension corresponding to the hyperparameter.

For more information about the hyperparameters that you can tune in the app and their default ranges, see Hyperparameter Selection.

Restore Defaults

Click this button to reset the hyperparameter configuration.

Tuning goal

Tuning goal, specified as one of these values:

  • Learning performance — When you specify this option, the score computed for each trial (that is, for each training) is the final average episodic reward for that trial.

  • Grid search — Specify this option to perform an exhaustive sweep over all the points in the hyperparameter grid.

Random seed

Initial random seed used for each hyperparameter tuning training run, specified as a nonnegative integer.

Max trial evaluations

Maximum number of trials (training experiments).

When the number of completed trials reaches this value, the software selects the combination of hyperparameters that yields the best score.

Max time (hours)

Maximum time in hours to perform tuning.

When the time elapsed from the beginning of the hyperparameter tuning session reaches this value, the software does not perform additional training experiments. The software then selects the combination of hyperparameters that yields the best score.

Save artifacts to disk

Enable or disable saving artifacts (that is, the tuned agent and the training result) to disk.

Save directory

Directory in which the software saves the artifacts.

Hyperparameter Selection

The Hyperparameter selection section of the Tuning Options dialog box allows you to tune these hyperparameters. For each hyperparameter, the default state of the Optimize and Log Scaling check boxes and the default Min and Max values of the search range are also listed.

ActorLearnRate

Actor learning rate, available for all agents except DQN and TRPO. For more information, see the optimizer options object of the actor used by the agent. Defaults: Optimize selected, Min 1e-6, Max 0.1, Log Scaling selected.

CriticLearnRate

Critic learning rate, available for all agents. For more information, see the optimizer options object of the critic used by the agent. Defaults: Optimize selected, Min 1e-6, Max 0.1, Log Scaling selected.

ExperienceHorizon

Number of steps used to calculate the advantage value, available for PPO and TRPO agents. For more information, see ExperienceHorizon. Defaults: Optimize selected, Min 50, Max 1000, Log Scaling not selected.

MiniBatchSize

Size of the mini-batch, available for all agents. For more information, see the agent options object. Defaults: Optimize selected, Min 64, Max 1024, Log Scaling not selected.

EntropyLossWeight

Weight of the entropy loss, available for PPO and TRPO agents. For more information, see EntropyLossWeight. Defaults: Optimize selected, Min 0.001, Max 0.1, Log Scaling selected.

DiscountFactor

Discount factor, available for all agents. For more information, see the agent options. Defaults: Optimize not selected, Min 0.1, Max 1, Log Scaling selected.

Epsilon

Value of the epsilon variable for epsilon-greedy exploration, available for DQN agents. For more information, see EpsilonGreedyExploration. Defaults: Optimize selected, Min 0.5, Max 1, Log Scaling not selected.

EpsilonDecay

Decay rate of the epsilon variable for epsilon-greedy exploration, available for DQN agents. For more information, see EpsilonGreedyExploration. Defaults: Optimize not selected, Min 1e-6, Max 0.01, Log Scaling selected.

MeanAttractionConstant

Mean attraction constant of the agent noise model, available for DDPG agents. For more information, see NoiseOptions. Defaults: Optimize not selected, Min 0.01, Max 1, Log Scaling selected.

StandardDeviation

Value of the standard deviation for the agent noise model, available for DDPG and TD3 agents. For more information, see NoiseOptions. Defaults: Optimize selected, Min 0.1, Max 0.5, Log Scaling not selected.

StandardDeviationDecayRate

Decay rate of the standard deviation for the agent noise model, available for DDPG and TD3 agents. For more information, see NoiseOptions. Defaults: Optimize not selected, Min 1e-6, Max 0.01, Log Scaling selected.

Note

Using the MATLAB code generated by the app, you can tune a set of hyperparameters that is different from the set available in the Hyperparameter tuning options dialog box. To do so, generate code by selecting Train > Generate MATLAB function for hyperparameter tuning. Then, customize the generated code to include the hyperparameters that you want to tune.

For an example of how to tune hyperparameters in Reinforcement Learning Designer, see Tune Hyperparameters Using Reinforcement Learning Designer. For an example of how to perform parameter sweeping using the Experiment Manager app, see Train Agent or Tune Environment Parameters Using Parameter Sweeping. For a general example of how to tune a custom set of hyperparameters programmatically, using Bayesian optimization, see Tune Hyperparameters Using Bayesian Optimization.
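
As a rough sketch of what such custom tuning code can look like, the example below uses bayesopt from Statistics and Machine Learning Toolbox to tune a single hyperparameter, the critic learning rate. The helper createAgentFcn, the search range, and the training settings are assumptions for illustration; the function generated by the app is more complete.

% Hypothetical sketch: tune the critic learning rate with Bayesian optimization.
% Assumes env is an existing environment and createAgentFcn is a placeholder
% helper that builds an agent from a candidate learning rate.
lrVar = optimizableVariable("CriticLearnRate", [1e-6 0.1], Transform="log");

results = bayesopt(@(params) tuningObjective(params, env), lrVar, ...
    MaxObjectiveEvaluations=10);

function score = tuningObjective(params, env)
    % Train an agent with the candidate learning rate and return the
    % negative final average reward, so that bayesopt (a minimizer)
    % maximizes learning performance.
    agent = createAgentFcn(params.CriticLearnRate);   % placeholder helper
    opts = rlTrainingOptions(MaxEpisodes=200, Verbose=false, Plots="none");
    res = train(agent, env, opts);
    score = -res.AverageReward(end);
end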

Specify Additional Options

To specify additional training options, on the Train tab, click More Options.

In the More Training Options dialog box, you can specify these options.

Save agent criteria

Condition for saving agents during training, specified as one of these values:

  • none — Do not save any agents during training.

  • AverageSteps — Save the agent when the running average number of steps per episode equals or exceeds the critical value specified by Save agent value.

  • AverageReward — Save the agent when the running average reward equals or exceeds the critical value.

  • EpisodeReward — Save the agent when the reward in the current episode equals or exceeds the critical value.

  • GlobalStepCount — Save the agent when the total number of steps in all episodes, that is, the total number of times the agent is invoked, equals or exceeds the critical value.

  • EpisodeCount — Save the agent when the number of training episodes equals or exceeds the critical value.

Save agent value

Critical value of the save agent condition in Save agent criteria, specified as a scalar or "none".

Save directory

Folder for saved agents. If you specify a folder name and that folder does not exist, the app creates the folder in the current working directory.

To interactively select a folder, click Browse.

Show verbose output

Select this option to display the training progress at the command line.

Stop on error

Select this option to stop training when an error occurs during an episode.

For more information on training options, see rlTrainingOptions.
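
These options also correspond to rlTrainingOptions properties. A brief sketch with placeholder values:

% Example values only; choose a criterion, threshold, and folder for your task.
trainOpts = rlTrainingOptions( ...
    SaveAgentCriteria="EpisodeReward", ...   % Save agent criteria
    SaveAgentValue=500, ...                  % Save agent value
    SaveAgentDirectory="savedAgents", ...    % Save directory
    Verbose=true, ...                        % Show verbose output
    StopOnError="on");                       % Stop on error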
