runEpisode

Simulate reinforcement learning environment against policy or agent

Since R2022a

    Description

    Use runEpisode to simulate an environment with a policy or agent for a whole episode. The function can execute a callback to process (and, if needed, learn from) the experience at every step of the episode.

    output = runEpisode(env,policy) runs a single simulation of the environment env against the policy policy.

    output = runEpisode(env,agent) runs a single simulation of the environment env against the agent agent. During the simulation, the policy of the agent is evaluated to produce actions but, by default, learnable parameters are not updated. However, you can use a callback to process the experience, and, if needed, update parameters, at every step of the episode.

    output = runEpisode(___,Name=Value) specifies nondefault simulation options using one or more name-value arguments.

    Examples


    Create a reinforcement learning environment and extract its observation and action specifications.

    env = rlPredefinedEnv("CartPole-Discrete");
    obsInfo = getObservationInfo(env);
    actInfo = getActionInfo(env);

    To approximate the policy within the actor, use a neural network. Create the network as an array of layer objects.

    net = [...
        featureInputLayer(obsInfo.Dimension(1))
        fullyConnectedLayer(24)
        reluLayer
        fullyConnectedLayer(24)
        reluLayer
        fullyConnectedLayer(2)
        softmaxLayer];

    Convert the network to a dlnetwork object and display the number of learnable parameters (weights).

    net = dlnetwork(net);
    summary(net)
       Initialized: true
    
       Number of learnables: 770
    
       Inputs:
          1   'input'   4 features
    

    Create a discrete categorical actor using the network.

    actor = rlDiscreteCategoricalActor(net,obsInfo,actInfo);

    Check your actor with a random observation.

    act = getAction(actor,{rand(obsInfo.Dimension)})
    act = 1×1 cell array
        {[-10]}
    
    

    Create a policy object from the actor.

    policy = rlStochasticActorPolicy(actor);

    Create an experience buffer.

    buffer = rlReplayMemory(obsInfo,actInfo);

    Set up the environment for running multiple simulations. For this example, configure the environment to log any errors rather than send them to the command window.

    setup(env,StopOnError="off")

    Simulate multiple episodes using the environment and policy. After each episode, append the experiences to the buffer. For this example, run 100 episodes.

    for i = 1:100
        output = runEpisode(env,policy,MaxSteps=300);
        append(buffer,output.AgentData.Experiences)
    end

    Clean up the environment.

    cleanup(env)

    Sample a mini-batch of experiences from the buffer. For this example, sample 10 experiences.

    batch = sample(buffer,10);

    You can then learn from the sampled experiences and update the policy and actor.
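    Instead of appending experiences after each episode, you can also store them as they occur by supplying a process-experience callback. The following sketch assumes a helper function named appendExperience (a hypothetical name for illustration, saved in its own file) that matches the required ProcessExperienceFcn signature and receives the replay memory through ProcessExperienceData.

    function [policy,data] = appendExperience(exp,episodeInfo,policy,data)
        % data is the rlReplayMemory object passed via ProcessExperienceData.
        append(data,exp);   % store the single experience in the buffer
        % policy is returned unchanged; update it here to learn online.
    end

    Then run the episode, processing each experience as it occurs:

    output = runEpisode(env,policy, ...
        MaxSteps=300, ...
        ProcessExperienceFcn=@appendExperience, ...
        ProcessExperienceData=buffer);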

    Input Arguments


    Environment, specified as follows:

    • MATLAB® environment, represented by one of the following objects.

      Among the MATLAB environments, only rlMultiAgentFunctionEnv and rlTurnBasedFunctionEnv support training multiple agents at the same time.

    • Simulink® environment, represented by a SimulinkEnvWithAgent object, and created using:

      • rlSimulinkEnv — This environment is created from a model that already contains one or more agent blocks, and supports training multiple agents at the same time.

      • createIntegratedEnv — This environment is created from a model that does not already contain an agent block, and does not support training multiple agents at the same time.

      A Simulink-based environment object acts as an interface so that the reinforcement learning simulation or training function calls the (compiled) Simulink model to generate experiences for the agents. Such an environment does not support using the reset and step functions.

    Note

    env is a handle object, so a function that does not return it as an output argument, such as train, can still update its internal state. For more information about handle objects, see Handle Object Behavior.

    For more information on reinforcement learning environments, see Reinforcement Learning Environments and Create Custom Simulink Environments.

    Example: env = rlPredefinedEnv("DoubleIntegrator-Continuous") creates a predefined environment that implements a continuous-action double-integrator system and assigns it to the variable env.

    For more information on reinforcement learning policies, see Create Actors, Critics, and Policy Objects.

    Example: policy = getExplorationPolicy(rlPPOAgent(rlNumericSpec([2 1]),rlNumericSpec([1 1]))) extracts the object that implements the exploration policy from a default PPO agent and assigns it to the variable policy.

    Agent, specified as one of the following reinforcement learning agent objects:

    Note

    agent is a handle object, so a function that does not return it as an output argument, such as train, can still update it. For more information about handle objects, see Handle Object Behavior.

    For more information on reinforcement learning agents, see Reinforcement Learning Agents.

    Example: agent = rlPPOAgent(rlNumericSpec([2 1]),rlNumericSpec([1 1])) creates the default rlPPOAgent object agent for an environment with an observation channel carrying a continuous two-element vector and an action channel carrying a continuous scalar.

    Name-Value Arguments


    Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

    Example: MaxSteps=1000 sets the maximum number of simulation steps to 1000.

    Maximum number of simulation steps, specified as a positive integer.

    Example: MaxSteps=200.

    Function for processing experiences and updating the policy or agent based on each experience as it occurs during the simulation, specified as a function handle with the following signature.

    [updatedPolicy,updatedData] = myFcn(experience,episodeInfo,policy,data)

    Here:

    • experience is a structure that contains a single experience. For more information on the structure fields, see output.Experiences.

    • episodeInfo contains data about the current episode and corresponds to output.EpisodeInfo.

    • policy is the policy or agent object being simulated.

    • data contains experience processing data. For more information, see ProcessExperienceData.

    • updatedPolicy is the updated policy or agent.

    • updatedData is the updated experience processing data, which is used as the data input when processing the next experience.

    If env is a Simulink environment configured for multiagent training, specify ProcessExperienceFcn as a cell array of function handles. The order of the function handles in the array must match the agent order used to create env.

    Example: ProcessExperienceFcn=@myPolicyUpdateFcn
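    For instance, a minimal callback that only threads a counter through the episode (a hypothetical function for illustration) could look like the following. The data argument receives the value you pass as ProcessExperienceData for the first experience, and the returned updatedData is passed back in for the next one.

    function [policy,data] = countExperiences(exp,episodeInfo,policy,data)
        % data is a scalar counter initialized via ProcessExperienceData=0.
        data = data + 1;
        % policy is returned unchanged; modify it here to update the
        % policy or agent as each experience arrives.
    end

    You would then call runEpisode(env,policy,ProcessExperienceFcn=@countExperiences,ProcessExperienceData=0).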

    Experience processing data, specified as any MATLAB data, such as an array or structure. Use this data to pass additional parameters or information to the experience processing function.

    You can also update this data within the experience processing function to use different parameters when processing the next experience. The data values that you specify when you call runEpisode are used to process the first experience in the simulation.

    If env is a Simulink environment configured for multiagent training, specify ProcessExperienceData as a cell array. The order of the array elements must match the agent order used to create env.

    Example: ProcessExperienceData=struct("Counter",0)

    Option to clean up the environment after the simulation, specified as true or false. When CleanupPostSim is true, runEpisode calls cleanup(env) when the simulation ends.

    To run multiple episodes without cleaning up the environment, set CleanupPostSim to false. You can then call cleanup(env) after running your simulations.

    If env is a SimulinkEnvWithAgent object and the associated Simulink model is configured to use fast restart, then the model remains in a compiled state between simulations when CleanUpPostSim is false.

    Example: CleanupPostSim=false

    Option to log experiences for each policy or agent, specified as true or false. When LogExperiences is true, the experiences of the policy or agent are logged in output.Experiences.

    Example: LogExperiences=false

    Output Arguments


    Simulation output, returned as a structure with the fields AgentData and SimulationInfo.

    The AgentData field is a structure array containing data for each agent or policy. Each AgentData structure has the following fields.

    • Experiences — Logged experiences of the policy or agent, returned as a structure array. Each experience contains the following fields.

      • Observation — Observation

      • Action — Action taken

      • NextObservation — Resulting next observation

      • Reward — Corresponding reward

      • IsDone — Termination signal

    • Time — Simulation times of the experiences, returned as a vector.

    • EpisodeInfo — Episode information, returned as a structure with the following fields.

      • CumulativeReward — Total reward for all experiences

      • StepsTaken — Number of simulation steps taken

      • InitialObservation — Initial observation at the start of the simulation

    • ProcessExperienceData — Experience processing data.

    • Agent — Policy or agent used in the simulation.

    The SimulationInfo field is one of the following:

    • For MATLAB environments — Structure containing the field SimulationError. This structure contains any errors that occurred during simulation.

    • For Simulink environments — Simulink.SimulationOutput object containing simulation data. Recorded data includes any signals and states that the model is configured to log, simulation metadata, and any errors that occurred.

    If env is configured to run simulations on parallel workers, then output is a Future object, which supports deferred outputs for environment simulations that run on workers.

    Tips

    • You can speed up episode simulation by using parallel computing. To do so, use the setup function and set the UseParallel argument to true.

      setup(env,UseParallel=true)
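      When simulations run on workers, each runEpisode call returns a Future object immediately. The following sketch assumes the deferred results are collected with fetchOutputs, as with other Future objects in the toolbox; adapt it to your environment setup.

      % Configure the environment to run simulations on parallel workers
      % (requires Parallel Computing Toolbox).
      setup(env,UseParallel=true)

      % Each call returns immediately with a Future object.
      for i = 1:4
          futures(i) = runEpisode(env,policy,MaxSteps=300,CleanupPostSim=false);
      end

      % Block until the worker simulations finish and collect the outputs.
      outputs = fetchOutputs(futures);

      cleanup(env)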

    Version History

    Introduced in R2022a