runEpisode

Simulate reinforcement learning environment against policy or agent

Since R2022a

    Description

    Use runEpisode to simulate an environment with a policy or agent for a whole episode. The function can execute a callback to process (and, if needed, learn from) the experience at every step of the episode.

    output = runEpisode(env,policy) runs a single simulation of the environment env against the policy policy.

    output = runEpisode(env,agent) runs a single simulation of the environment env against the agent agent. During the simulation, the policy of the agent is evaluated to produce actions but, by default, learnable parameters are not updated. However, you can use a callback to process the experience, and, if needed, update parameters, at every step of the episode.

    output = runEpisode(___,Name=Value) specifies nondefault simulation options using one or more name-value arguments.

    Examples


    Create a reinforcement learning environment and extract its observation and action specifications.

    env = rlPredefinedEnv("CartPole-Discrete");
    obsInfo = getObservationInfo(env);
    actInfo = getActionInfo(env);

    To approximate the policy within the actor, use a neural network. Create the network as an array of layer objects.

    net = [...
        featureInputLayer(obsInfo.Dimension(1))
        fullyConnectedLayer(24)
        reluLayer
        fullyConnectedLayer(24)
        reluLayer
        fullyConnectedLayer(2)
        softmaxLayer];

    Convert the network to a dlnetwork object and display the number of learnable parameters (weights).

    net = dlnetwork(net);
    summary(net)
       Initialized: true
    
       Number of learnables: 770
    
       Inputs:
          1   'input'   4 features
    

    Create a discrete categorical actor using the network.

    actor = rlDiscreteCategoricalActor(net,obsInfo,actInfo);

    Check your actor with a random observation.

    act = getAction(actor,{rand(obsInfo.Dimension)})
    act = 1×1 cell array
        {[-10]}
    
    

    Create a policy object from the actor.

    policy = rlStochasticActorPolicy(actor);

    Create an experience buffer.

    buffer = rlReplayMemory(obsInfo,actInfo);

    Set up the environment for running multiple simulations. For this example, configure the environment to log any errors rather than send them to the command window.

    setup(env,StopOnError="off")

    Simulate multiple episodes using the environment and policy. After each episode, append the experiences to the buffer. For this example, run 100 episodes.

    for i = 1:100
        output = runEpisode(env,policy,MaxSteps=300);
        append(buffer,output.AgentData.Experiences)
    end

    Clean up the environment.

    cleanup(env)

    Sample a mini-batch of experiences from the buffer. For this example, sample 10 experiences.

    batch = sample(buffer,10);

    You can then learn from the sampled experiences and update the policy and actor.
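    Instead of appending experiences after each episode, you can also store them as they occur by supplying a process-experience callback. The following sketch assumes a helper function named appendExperience (a hypothetical name for illustration, saved in its own file) that matches the required ProcessExperienceFcn signature and receives the replay memory through ProcessExperienceData.

    function [policy,data] = appendExperience(exp,episodeInfo,policy,data)
        % data is the rlReplayMemory object passed via ProcessExperienceData.
        append(data,exp);   % store the single experience in the buffer
        % policy is returned unchanged; update it here to learn online.
    end

    Then run the episode, processing each experience as it occurs:

    output = runEpisode(env,policy, ...
        MaxSteps=300, ...
        ProcessExperienceFcn=@appendExperience, ...
        ProcessExperienceData=buffer);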

    Input Arguments


    Environment, specified as follows:

    • MATLAB® environment, represented by one of the following objects.

      Among the MATLAB environments, only rlMultiAgentFunctionEnv and rlTurnBasedFunctionEnv support training multiple agents at the same time.

    • Simulink® environment, represented by a SimulinkEnvWithAgent object, and created using:

      • rlSimulinkEnv — This environment is created from a model that already contains one or more agent blocks, and supports training multiple agents at the same time.

      • createIntegratedEnv — This environment is created from a model that does not already contain an agent block, and does not support training multiple agents at the same time.

      A Simulink-based environment object acts as an interface so that the reinforcement learning simulation or training function calls the (compiled) Simulink model to generate experiences for the agents. Such an environment does not support using the reset and step functions.

    Note

    env is a handle object, so a function that does not return it as an output argument, such as train, can still update its internal state. For more information about handle objects, see Handle Object Behavior.

    For more information on reinforcement learning environments, see Reinforcement Learning Environments and Create Custom Simulink Environments.

    Example: env = rlPredefinedEnv("DoubleIntegrator-Continuous") creates a predefined environment that implements a continuous-action double-integrator system and assigns it to the variable env.

    For more information on reinforcement learning policies, see Create Actors, Critics, and Policy Objects.

    Example: policy = getExplorationPolicy(rlPPOAgent(rlNumericSpec([2 1]),rlNumericSpec([1 1]))) extracts the object that implements the exploration policy from a default PPO agent and assigns it to the variable policy.

    Agent, specified as one of the following reinforcement learning agent objects:

    Note

    agent is a handle object, so a function that does not return it as an output argument, such as train, can still update it. For more information about handle objects, see Handle Object Behavior.

    For more information on reinforcement learning agents, see Reinforcement Learning Agents.

    Example: agent = rlPPOAgent(rlNumericSpec([2 1]),rlNumericSpec([1 1])) creates the default rlPPOAgent object agent for an environment with an observation channel carrying a continuous two-element vector and an action channel carrying a continuous scalar.

    Name-Value Arguments


    Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

    Example: MaxSteps=1000 sets the maximum number of simulation steps to 1000.

    Maximum number of simulation steps, specified as a positive integer.

    Example: MaxSteps=200.

    Function for processing experiences and updating the policy or agent based on each experience as it occurs during the simulation, specified as a function handle with the following signature.

    [updatedPolicy,updatedData] = myFcn(experience,episodeInfo,policy,data)

    Here:

    • experience is a structure that contains a single experience. For more information on the structure fields, see output.Experiences.

    • episodeInfo contains data about the current episode and corresponds to output.EpisodeInfo.

    • policy is the policy or agent object being simulated.

    • data contains experience processing data. For more information, see ProcessExperienceData.

    • updatedPolicy is the updated policy or agent.

    • updatedData is the updated experience processing data, which is used as the data input when processing the next experience.

    If env is a Simulink environment configured for multiagent training, specify ProcessExperienceFcn as a cell array of function handles. The order of the function handles in the array must match the agent order used to create env.

    Example: ProcessExperienceFcn=@myPolicyUpdateFcn
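    For instance, a minimal callback that only threads a counter through the episode (a hypothetical function for illustration) could look like the following. The data argument receives the value you pass as ProcessExperienceData for the first experience, and the returned updatedData is passed back in for the next one.

    function [policy,data] = countExperiences(exp,episodeInfo,policy,data)
        % data is a scalar counter initialized via ProcessExperienceData=0.
        data = data + 1;
        % policy is returned unchanged; modify it here to update the
        % policy or agent as each experience arrives.
    end

    You would then call runEpisode(env,policy,ProcessExperienceFcn=@countExperiences,ProcessExperienceData=0).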

    Experience processing data, specified as any MATLAB data, such as an array or structure. Use this data to pass additional parameters or information to the experience processing function.

    You can also update this data within the experience processing function to use different parameters when processing the next experience. The data values that you specify when you call runEpisode are used to process the first experience in the simulation.

    If env is a Simulink environment configured for multiagent training, specify ProcessExperienceData as a cell array. The order of the array elements must match the agent order used to create env.

    Example: ProcessExperienceData=struct("Counter",0)

    Option to clean up the environment after the simulation, specified as true or false. When CleanupPostSim is true, runEpisode calls cleanup(env) when the simulation ends.

    To run multiple episodes without cleaning up the environment, set CleanupPostSim to false. You can then call cleanup(env) after running your simulations.

    If env is a SimulinkEnvWithAgent object and the associated Simulink model is configured to use fast restart, then the model remains in a compiled state between simulations when CleanUpPostSim is false.

    Example: CleanupPostSim=false

    Option to log experiences for each policy or agent, specified as true or false. When LogExperiences is true, the experiences of the policy or agent are logged in output.Experiences.

    Example: LogExperiences=false

    Output Arguments


    Simulation output, returned as a structure with the fields AgentData and SimulationInfo.

    The AgentData field is a structure array containing data for each agent or policy. Each AgentData structure has the following fields.

    • Experiences — Logged experiences of the policy or agent, returned as a structure array. Each experience contains the following fields.

      • Observation — Observation

      • Action — Action taken

      • NextObservation — Resulting next observation

      • Reward — Corresponding reward

      • IsDone — Termination signal

    • Time — Simulation times of the experiences, returned as a vector.

    • EpisodeInfo — Episode information, returned as a structure with the following fields.

      • CumulativeReward — Total reward for all experiences

      • StepsTaken — Number of simulation steps taken

      • InitialObservation — Initial observation at the start of the simulation

    • ProcessExperienceData — Experience processing data.

    • Agent — Policy or agent used in the simulation.

    The SimulationInfo field is one of the following:

    • For MATLAB environments — Structure containing the field SimulationError. This structure contains any errors that occurred during simulation.

    • For Simulink environments — Simulink.SimulationOutput object containing simulation data. Recorded data includes any signals and states that the model is configured to log, simulation metadata, and any errors that occurred.

    If env is configured to run simulations on parallel workers, then output is a Future object, which supports deferred outputs for environment simulations that run on workers.

    Tips

    • You can speed up episode simulation by using parallel computing. To do so, use the setup function and set the UseParallel argument to true.

      setup(env,UseParallel=true)
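      When simulations run on workers, each runEpisode call returns a Future object immediately. The following sketch assumes the deferred results are collected with fetchOutputs, as with other Future objects in the toolbox; adapt it to your environment setup.

      % Configure the environment to run simulations on parallel workers
      % (requires Parallel Computing Toolbox).
      setup(env,UseParallel=true)

      % Each call returns immediately with a Future object.
      for i = 1:4
          futures(i) = runEpisode(env,policy,MaxSteps=300,CleanupPostSim=false);
      end

      % Block until the worker simulations finish and collect the outputs.
      outputs = fetchOutputs(futures);

      cleanup(env)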

    Version History

    Introduced in R2022a