runEpisode
Syntax
Description
Use runEpisode to simulate an environment with a policy or
agent for a whole episode. The function can execute a callback to process (and, if needed,
learn from) the experience at every step of the episode.
output = runEpisode(env,agent) runs a single simulation of the environment
env against the agent agent. During the simulation, the policy of the agent is evaluated
to produce actions but, by default, learnable parameters are not updated. However, you can
use a callback to process the experience, and, if needed, update parameters, at every step
of the episode.
output = runEpisode(___,Name=Value) specifies nondefault simulation options using one or more name-value arguments.
Examples
Create a reinforcement learning environment and extract its observation and action specifications.
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
To approximate the policy within the actor, use a neural network. Create the network as an array of layer objects.
net = [...
featureInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(24)
reluLayer
fullyConnectedLayer(24)
reluLayer
fullyConnectedLayer(2)
softmaxLayer];
Convert the network to a dlnetwork object and display the number of learnable parameters (weights).
net = dlnetwork(net);
summary(net)
Initialized: true
Number of learnables: 770
Inputs:
1 'input' 4 features
Create a discrete categorical actor using the network.
actor = rlDiscreteCategoricalActor(net,obsInfo,actInfo);
Check your actor with a random observation.
act = getAction(actor,{rand(obsInfo.Dimension)})
act = 1×1 cell array
    {[-10]}
Create a policy object from the actor.
policy = rlStochasticActorPolicy(actor);
Create an experience buffer.
buffer = rlReplayMemory(obsInfo,actInfo);
Set up the environment for running multiple simulations. For this example, configure the training to log any errors rather than send them to the command window.
setup(env,StopOnError="off")
Simulate multiple episodes using the environment and policy. After each episode, append the experiences to the buffer. For this example, run 100 episodes.
for i = 1:100
    output = runEpisode(env,policy,MaxSteps=300);
    append(buffer,output.AgentData.Experiences)
end
Clean up the environment.
cleanup(env)
Sample a mini-batch of experiences from the buffer. For this example, sample 10 experiences.
batch = sample(buffer,10);
You can then learn from the sampled experiences and update the policy and actor.
Input Arguments
Environment, specified as follows:
MATLAB® environment, represented by one of the following objects.
- Predefined environment created using rlPredefinedEnv.
- rlMDPEnv — Markov decision process environment.
- rlFunctionEnv — Environment defined using custom functions.
- rlMultiAgentFunctionEnv — Multiagent environment in which all agents execute in the same step.
- rlTurnBasedFunctionEnv — Turn-based multiagent environment in which agents execute in turns.
- Custom environment created from a template, using rlCreateEnvTemplate.
- rlNeuralNetworkEnvironment — Environment with neural network transition models.
Among the MATLAB environments, only rlMultiAgentFunctionEnv and rlTurnBasedFunctionEnv support training multiple agents at the same time.
Simulink® environment, represented by a SimulinkEnvWithAgent object, and created using:
- rlSimulinkEnv — This environment is created from a model that already contains one or more agent blocks, and supports training multiple agents at the same time.
- createIntegratedEnv — This environment is created from a model that does not already contain an agent block, and does not support training multiple agents at the same time.
A Simulink-based environment object acts as an interface so that the reinforcement learning simulation or training function calls the (compiled) Simulink model to generate experiences for the agents. Such an environment does not support using the
reset and step functions.
Note
env is a handle object, so a function that does not return it
as an output argument, such as train,
can still update its internal states. For more information about handle objects, see
Handle Object Behavior.
For more information on reinforcement learning environments, see Reinforcement Learning Environments and Create Custom Simulink Environments.
Example: env = rlPredefinedEnv("DoubleIntegrator-Continuous")
creates a predefined environment that implements a continuous-action double-integrator
system and assigns it to the variable env.
Reinforcement learning policy, specified as one of the following objects:
For more information on reinforcement learning policies, see Create Actors, Critics, and Policy Objects.
Example: policy = getExplorationPolicy(rlPPOAgent(rlNumericSpec([2 1]),rlNumericSpec([1 1]))) extracts the object that implements the exploration policy from a default PPO agent and assigns it to the variable policy.
Agent, specified as one of the following reinforcement learning agent objects:
Custom agent — For more information on custom agents, see Create Custom Reinforcement Learning Agents.
Note
agent is a handle object, so a function that does not return
it as an output argument, such as train,
can still update it. For more information about handle objects, see Handle Object Behavior.
For more information on reinforcement learning agents, see Reinforcement Learning Agents.
Example: agent = rlPPOAgent(rlNumericSpec([2 1]),rlNumericSpec([1 1])) creates the default rlPPOAgent object agent for an environment with an observation channel carrying a continuous two-element vector and an action channel carrying a continuous scalar.
Name-Value Arguments
Specify optional pairs of arguments as
Name1=Value1,...,NameN=ValueN, where Name is
the argument name and Value is the corresponding value.
Name-value arguments must appear after other arguments, but the order of the
pairs does not matter.
Example: MaxSteps=1000 assigns the value 1000 to
the MaxSteps argument.
Maximum simulation steps, specified as a positive integer.
Example: MaxSteps=200.
Function for processing experiences and updating the policy or agent based on each experience as it occurs during the simulation, specified as a function handle with the following signature.
[updatedPolicy,updatedData] = myFcn(experience,episodeInfo,policy,data)
Here:
- experience is a structure that contains a single experience. For more information on the structure fields, see output.Experiences.
- episodeInfo contains data about the current episode and corresponds to output.EpisodeInfo.
- policy is the policy or agent object being simulated.
- data contains experience processing data. For more information, see ProcessExperienceData.
- updatedPolicy is the updated policy or agent.
- updatedData is the updated experience processing data, which is used as the data input when processing the next experience.
If env is a Simulink environment configured for multiagent training, specify
ProcessExperienceFcn as a cell array of function handles. The
order of the function handles in the array must match the agent order used to create
env.
Example: ProcessExperienceFcn=@myPolicyUpdateFcn
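As an illustration, a process-experience callback that stores each experience in a replay buffer might look like the following sketch. The function name and the data.Buffer field are hypothetical; only the four-argument signature is defined by runEpisode.

```matlab
% Sketch of a process-experience callback (hypothetical names).
% Assumes a structure with a Buffer field holding an rlReplayMemory
% object was passed in through the ProcessExperienceData argument.
function [policy,data] = myPolicyUpdateFcn(experience,episodeInfo,policy,data)
    append(data.Buffer,experience);   % store the new experience
    % The policy is returned unchanged here; a learning callback could
    % instead sample from the buffer and update the policy parameters.
end
```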
Experience processing data, specified as any MATLAB data, such as an array or structure. Use this data to pass additional parameters or information to the experience processing function.
You can also update this data within the experience processing function to use
different parameters when processing the next experience. The data values that you
specify when you call runEpisode are used to process the first
experience in the simulation.
If env is a Simulink environment configured for multiagent training, specify
ProcessExperienceData as a cell array. The order of the array
elements must match the agent order used to create env.
Example: ProcessExperienceData=struct("Counter",0)
Option to clean up the environment after the simulation, specified as
true or false. When
CleanupPostSim is true,
runEpisode calls cleanup(env) when the
simulation ends.
To run multiple episodes without cleaning up the environment, set
CleanupPostSim to false. You can then call
cleanup(env) after running your simulations.
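For instance, the loop below keeps the environment set up across episodes and cleans up once at the end. This is a minimal sketch that assumes env and policy already exist in the workspace.

```matlab
% Run several episodes against one environment setup (sketch).
setup(env)
for i = 1:10
    out = runEpisode(env,policy,MaxSteps=500,CleanupPostSim=false);
end
cleanup(env)   % clean up once, after all simulations
```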
If env is a SimulinkEnvWithAgent object and
the associated Simulink model is configured to use fast restart, then the model remains in a
compiled state between simulations when CleanupPostSim is
false.
Example: CleanupPostSim=false
Option to log experiences for each policy or agent, specified as
true or false. When
LogExperiences is true, the experiences of
the policy or agent are logged in output.Experiences.
Example: LogExperiences=false
Output Arguments
Simulation output, returned as a structure with the fields
AgentData and SimulationInfo.
The AgentData field is a structure array containing data for each
agent or policy. Each AgentData structure has the following
fields.
| Field | Description |
|---|---|
| Experiences | Logged experiences of the policy or agent, returned as a structure array. |
| Time | Simulation times of the experiences, returned as a vector. |
| EpisodeInfo | Episode information, returned as a structure. |
| ProcessExperienceData | Experience processing data. |
| Agent | Policy or agent used in the simulation. |
The SimulationInfo field is one of the following:
- For MATLAB environments — Structure containing the field SimulationError. This structure contains any errors that occurred during simulation.
- For Simulink environments — Simulink.SimulationOutput object containing simulation data. Recorded data includes any signals and states that the model is configured to log, simulation metadata, and any errors that occurred.
If env is configured to run simulations on parallel workers,
then output is a Future object,
which supports deferred outputs for environment simulations that run on workers.
Tips
You can speed up episode simulation by using parallel computing. To do so, use the setup function and set the UseParallel argument to true.

setup(env,UseParallel=true)
Version History
Introduced in R2022a
See Also
Objects
Functions