Compare Agents on the Continuous Cart Pole Swing-Up Environment
This example shows how to create and train frequently used default agents on a continuous action space cart-pole environment. This environment is modeled using Simscape™ Multibody™ and represents a pole attached to an unactuated joint on a cart, which moves along a frictionless track. The agent can apply a force to the cart, and its training goal is to swing up and balance the pole upright using minimal control effort. The example plots performance metrics such as the total training time and the total reward for each trained agent. The results that the agents obtain in this environment, with the selected initial conditions and random number generator seed, do not necessarily imply that specific agents are better than others. Also, note that the training times depend on the computer and operating system you use to run the example, and on other processes running in the background. Your training times might differ substantially from the training times shown in the example.
Fix Random Number Stream for Reproducibility
The example code might involve computation of random numbers at various stages. Fixing the random number stream at the beginning of various sections in the example code preserves the random number sequence in the section every time you run it, and increases the likelihood of reproducing the results. For more information, see Results Reproducibility.
Fix the random number stream with seed 0 and random number algorithm Mersenne Twister. For more information on controlling the seed used for random number generation, see rng.
previousRngState = rng(0,"twister");
The output previousRngState is a structure that contains information about the previous state of the stream. You will restore the state at the end of the example.
Continuous Action Space Simscape Cart-Pole Model Simulink Environment
The reinforcement learning environment for this example is a pole attached to an unactuated joint on a cart, which moves along a frictionless track. The training goal is to make the pole stand upright using minimal control effort.
Open the model.
mdl = "rlCartPoleSimscapeModel";
open_system(mdl)
The cart-pole system is modeled using Simscape™ Multibody™.
In this model:
The upright pole angle is zero radians. Initially, the pole hangs downward (with an angle of pi radians) without moving.
The force action signal from the agent to the environment is from –15 to 15 N.
The observations from the environment are the position and velocity of the cart, and the sine, cosine, and derivative of the pole angle.
The episode terminates if the cart moves more than 3.5 m from the original position.
The reward r_t, provided at every time step, is

r_t = -0.1*(5*θ_t^2 + x_t^2 + 0.05*u_{t-1}^2) - 100*B

Here:
θ_t is the angle of displacement from the upright position of the pole.
x_t is the position displacement from the center position of the cart.
u_{t-1} is the control effort from the previous time step.
B is a flag (1 or 0) that indicates whether the cart is out of bounds.
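As an illustration only, the following sketch evaluates this reward in MATLAB as an anonymous function. The actual reward is computed inside the Simulink model; the function and variable names here are assumptions made for clarity.
% Illustration only: the reward as an anonymous function, following the
% formula above. theta is the pole angle from upright (rad), x is the cart
% position (m), uPrev is the previous control effort (N), and outOfBounds
% is a 0/1 flag indicating that the cart left the track limits.
cartPoleReward = @(theta,x,uPrev,outOfBounds) ...
    -0.1*(5*theta^2 + x^2 + 0.05*uPrev^2) - 100*outOfBounds;
% Reward at the upright equilibrium with zero effort:
cartPoleReward(0,0,0,0)   % returns 0
% Reward when hanging down at the track center with zero effort:
cartPoleReward(pi,0,0,0)  % returns approximately -4.93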
For more information on this model, see Load Predefined Control System Environments.
Create Environment Object
Create a predefined environment object for the continuous cart-pole environment.
env = rlPredefinedEnv("CartPoleSimscapeModel-Continuous")
env = 
SimulinkEnvWithAgent with properties:
             Model : rlCartPoleSimscapeModel
        AgentBlock : rlCartPoleSimscapeModel/RL Agent
          ResetFcn : []
    UseFastRestart : on
Obtain the observation and action information for later use when creating agents.
obsInfo = getObservationInfo(env)
obsInfo = 
  rlNumericSpec with properties:
     LowerLimit: -Inf
     UpperLimit: Inf
           Name: "observations"
    Description: [0×0 string]
      Dimension: [5 1]
       DataType: "double"
actInfo = getActionInfo(env)
actInfo = 
  rlNumericSpec with properties:
     LowerLimit: -15
     UpperLimit: 15
           Name: "force"
    Description: [0×0 string]
      Dimension: [1 1]
       DataType: "double"
The environment has a continuous action space in which the agent can apply force values between –15 and 15 N to the cart.
Specify the agent sample time Ts and the simulation time Tf in seconds.
Ts = 0.02; Tf = 25;
Configure Training and Simulation Options for All Agents
Set up an evaluator object that runs 10 evaluation episodes, without exploration, every 100 training episodes.
evl = rlEvaluator(NumEpisodes=10,EvaluationFrequency=100);
Create a training options object. For this example, use the following options.
Run the training for a maximum of 5000 episodes, with each episode lasting at most ceil(Tf/Ts) (that is, 1250) time steps.
To gain better insight into the agent's behavior during training, plot the training progress (default option). If you want to achieve faster training times, set the Plots option to none, as shown in the snippet after the next code block.
Stop training when the average reward in the evaluation episodes is greater than -400. At this point, the agent can quickly swing up and balance the pole in the upright position using minimal control effort.
trainOpts = rlTrainingOptions(...
    MaxEpisodes=5000, ...
    MaxStepsPerEpisode=ceil(Tf/Ts), ...
    StopTrainingCriteria="EvaluationStatistic",...
    StopTrainingValue=-400);
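If you prefer to turn off the training progress plot for faster training, you can change the Plots option after creating the training options object. This optional snippet is not used in the rest of the example.
% Uncomment to disable the training progress plot and speed up training.
% trainOpts.Plots = "none";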
For more information on training options, see rlTrainingOptions.
To simulate the trained agent, create a simulation options object and configure it to simulate for ceil(Tf/Ts) steps.
simOptions = rlSimulationOptions(MaxSteps=ceil(Tf/Ts));
For more information on simulation options, see rlSimulationOptions.
Create, Train, and Simulate a PG Agent
The actor and critic networks are initialized randomly. Ensure reproducibility of the section by fixing the seed used for random number generation.
rngState = rng(0,"twister")
rngState = struct with fields:
Type: 'twister'
Seed: 0
State: [625×1 uint32]
First, create a default rlPGAgent object using the environment specification objects.
pgAgent = rlPGAgent(obsInfo,actInfo);
To ensure that the RL Agent block in the environment executes every Ts seconds instead of the default setting of one second, set the SampleTime property of pgAgent.
pgAgent.AgentOptions.SampleTime = Ts;
Set a lower learning rate and a lower gradient threshold to promote a smoother (though possibly slower) training.
pgAgent.AgentOptions.CriticOptimizerOptions.LearnRate = 1e-3;
pgAgent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-3;
pgAgent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
pgAgent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;
Set the entropy loss weight to increase exploration.
pgAgent.AgentOptions.EntropyLossWeight = 0.005;
Train the agent, passing the agent, the environment, and the previously defined training options and evaluator objects to train. Training is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
    % Train the agent. Save the final agent and training results.
    tic
    pgTngRes = train(pgAgent,env,trainOpts,Evaluator=evl);
    pgTngTime = toc;
    % Extract the number of training episodes and the number of total steps.
    pgTngEps = pgTngRes.EpisodeIndex(end);
    pgTngSteps = sum(pgTngRes.TotalAgentSteps);
    % Uncomment to save the trained agent and the training metrics.
    % save("ccpsuBchPGAgent.mat", ...
    %     "pgAgent","pgTngEps","pgTngSteps","pgTngTime")
else
    % Load the pretrained agent and results for the example.
    load("ccpsuBchPGAgent.mat", ...
        "pgAgent","pgTngEps","pgTngSteps","pgTngTime")
end
For the PG agent, the training does not converge to a solution. You can check the trained agent within the cart-pole swing-up environment.
Ensure reproducibility of the simulation by fixing the seed used for random number generation.
rng(0,"twister")
Configure the agent to use a greedy policy (no exploration) in simulation.
pgAgent.UseExplorationPolicy = false;
Simulate the environment with the trained agent for ceil(Tf/Ts) steps and display the total reward. For more information on agent simulation, see sim.
experience = sim(env,pgAgent,simOptions);
pgTotalRwd = sum(experience.Reward)
pgTotalRwd = -862.4660
The trained PG agent is not able to swing up the pole.
Create, Train, and Simulate an AC Agent
The actor and critic networks are initialized randomly. Ensure reproducibility of the section by fixing the seed used for random number generation.
rngState = rng(0,"twister")
rngState = struct with fields:
Type: 'twister'
Seed: 0
State: [625×1 uint32]
First, create a default rlACAgent object using the environment specification objects.
acAgent = rlACAgent(obsInfo,actInfo);
To ensure that the RL Agent block in the environment executes every Ts seconds instead of the default setting of one second, set the SampleTime property of acAgent.
acAgent.AgentOptions.SampleTime = Ts;
Set a lower learning rate and a lower gradient threshold to promote a smoother (though possibly slower) training.
acAgent.AgentOptions.CriticOptimizerOptions.LearnRate = 1e-3;
acAgent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-3;
acAgent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
acAgent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;
Set the entropy loss weight to increase exploration.
acAgent.AgentOptions.EntropyLossWeight = 0.005;
Train the agent, passing the agent, the environment, and the previously defined training options and evaluator objects to train. Training is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
    % Train the agent. Save the final agent and training results.
    tic
    acTngRes = train(acAgent,env,trainOpts,Evaluator=evl);
    acTngTime = toc;
    % Extract the number of training episodes and the number of total steps.
    acTngEps = acTngRes.EpisodeIndex(end);
    acTngSteps = sum(acTngRes.TotalAgentSteps);
    % Uncomment to save the trained agent and the training metrics.
    % save("ccpsuBchACAgent.mat", ...
    %     "acAgent","acTngEps","acTngSteps","acTngTime")
else
    % Load the pretrained agent and results for the example.
    load("ccpsuBchACAgent.mat", ...
        "acAgent","acTngEps","acTngSteps","acTngTime")
end
For the AC agent, the training does not converge to a solution. You can check the trained agent within the cart-pole swing-up environment.
Ensure reproducibility of the simulation by fixing the seed used for random number generation.
rng(0,"twister")
Configure the agent to use a greedy policy (no exploration) in simulation.
acAgent.UseExplorationPolicy = false;
Simulate the environment with the trained agent for ceil(Tf/Ts) steps. For more information on agent simulation, see sim.
experience = sim(env,acAgent,simOptions);
acTotalRwd = sum(experience.Reward)
acTotalRwd = -862.5542
The trained AC agent is not able to swing up the pole.
Create, Train, and Simulate a PPO Agent
For the PPO Agent, the system goes unstable during training, which causes an error. Therefore, for this example, do not train a PPO agent.
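If you want to experiment with PPO on your own, you can create and configure a default agent in the same way as the other agents in this example. The following commented-out sketch shows one way to do so; it is not executed here, and the variable name ppoAgent is only illustrative.
% Uncomment to create a default PPO agent and set its sample time,
% following the same pattern as the other agents. This agent is not
% trained in this example because its training goes unstable.
% ppoAgent = rlPPOAgent(obsInfo,actInfo);
% ppoAgent.AgentOptions.SampleTime = Ts;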
Create, Train, and Simulate a DDPG Agent
The actor and critic networks are initialized randomly. Ensure reproducibility of the section by fixing the seed used for random number generation.
rngState = rng(0,"twister")
rngState = struct with fields:
Type: 'twister'
Seed: 0
State: [625×1 uint32]
First, create a default rlDDPGAgent object using the environment specification objects.
ddpgAgent = rlDDPGAgent(obsInfo,actInfo);
To ensure that the RL Agent block in the environment executes every Ts seconds instead of the default setting of one second, set the SampleTime property of ddpgAgent.
ddpgAgent.AgentOptions.SampleTime = Ts;
Set a lower learning rate and a lower gradient threshold to promote a smoother (though possibly slower) training.
ddpgAgent.AgentOptions.CriticOptimizerOptions.LearnRate = 1e-3;
ddpgAgent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-3;
ddpgAgent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
ddpgAgent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;
Use a larger experience buffer to store more experiences, thereby decreasing the likelihood of catastrophic forgetting.
ddpgAgent.AgentOptions.ExperienceBufferLength = 1e6;
Train the agent, passing the agent, the environment, and the previously defined training options and evaluator objects to train. Training is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
    % Train the agent. Save the final agent and training results.
    tic
    ddpgTngRes = train(ddpgAgent,env,trainOpts,Evaluator=evl);
    ddpgTngTime = toc;
    % Extract the number of training episodes and the number of total steps.
    ddpgTngEps = ddpgTngRes.EpisodeIndex(end);
    ddpgTngSteps = sum(ddpgTngRes.TotalAgentSteps);
    % Uncomment to save the trained agent and the training metrics.
    % save("ccpsuBchDDPGAgent.mat", ...
    %     "ddpgAgent","ddpgTngEps","ddpgTngSteps","ddpgTngTime")
else
    % Load the pretrained agent and results for the example.
    load("ccpsuBchDDPGAgent.mat", ...
        "ddpgAgent","ddpgTngEps","ddpgTngSteps","ddpgTngTime")
end
For the DDPG Agent, the training converges to a solution. You can check the trained agent within the cart-pole swing-up environment.
Ensure reproducibility of the simulation by fixing the seed used for random number generation.
rng(0,"twister")
Configure the agent to use a greedy policy (no exploration) in simulation.
ddpgAgent.UseExplorationPolicy = false;
Simulate the environment with the trained agent for ceil(Tf/Ts) steps. For more information on agent simulation, see sim.
experience = sim(env,ddpgAgent,simOptions);
ddpgTotalRwd = sum(experience.Reward)
ddpgTotalRwd = -377.8038
The trained DDPG agent swings up the pole and maintains it upright.
Create, Train, and Simulate a TD3 Agent
The actor and critic networks are initialized randomly. Ensure reproducibility of the section by fixing the seed used for random number generation.
rngState = rng(0,"twister")
rngState = struct with fields:
Type: 'twister'
Seed: 0
State: [625×1 uint32]
First, create a default rlTD3Agent object using the environment specification objects.
td3Agent = rlTD3Agent(obsInfo,actInfo);
To ensure that the RL Agent block in the environment executes every Ts seconds instead of the default setting of one second, set the SampleTime property of td3Agent.
td3Agent.AgentOptions.SampleTime = Ts;
Set a lower learning rate and a lower gradient threshold to promote a smoother (though possibly slower) training.
td3Agent.AgentOptions.CriticOptimizerOptions(1).LearnRate = 1e-3;
td3Agent.AgentOptions.CriticOptimizerOptions(2).LearnRate = 1e-3;
td3Agent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-3;
td3Agent.AgentOptions.CriticOptimizerOptions(1).GradientThreshold = 1;
td3Agent.AgentOptions.CriticOptimizerOptions(2).GradientThreshold = 1;
td3Agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;
Use a larger experience buffer to store more experiences, thereby decreasing the likelihood of catastrophic forgetting.
td3Agent.AgentOptions.ExperienceBufferLength = 1e6;
Train the agent, passing the agent, the environment, and the previously defined training options and evaluator objects to train. Training is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
    % Train the agent. Save the final agent and training results.
    tic
    td3TngRes = train(td3Agent,env,trainOpts,Evaluator=evl);
    td3TngTime = toc;
    % Extract the number of training episodes and the number of total steps.
    td3TngEps = td3TngRes.EpisodeIndex(end);
    td3TngSteps = sum(td3TngRes.TotalAgentSteps);
    % Uncomment to save the trained agent and the training metrics.
    % save("ccpsuBchTD3Agent.mat", ...
    %     "td3Agent","td3TngEps","td3TngSteps","td3TngTime")
else
    % Load the pretrained agent and results for the example.
    load("ccpsuBchTD3Agent.mat", ...
        "td3Agent","td3TngEps","td3TngSteps","td3TngTime")
end
For the TD3 agent, the training does not converge to a solution. You can check the trained agent within the cart-pole swing-up environment.
Ensure reproducibility of the simulation by fixing the seed used for random number generation.
rng(0,"twister")
Configure the agent to use a greedy policy (no exploration) in simulation.
td3Agent.UseExplorationPolicy = false;
Simulate the environment with the trained agent for ceil(Tf/Ts) steps. For more information on agent simulation, see sim.
experience = sim(env,td3Agent,simOptions);
td3TotalRwd = sum(experience.Reward)
td3TotalRwd = -6.4102e+03
The trained TD3 agent is not able to swing up the pole.
Create, Train, and Simulate a SAC Agent
The actor and critic networks are initialized randomly. Ensure reproducibility of the section by fixing the seed used for random number generation.
rngState = rng(0,"twister")
rngState = struct with fields:
Type: 'twister'
Seed: 0
State: [625×1 uint32]
First, create a default rlSACAgent object using the environment specification objects.
sacAgent = rlSACAgent(obsInfo,actInfo);
To ensure that the RL Agent block in the environment executes every Ts seconds instead of the default setting of one second, set the SampleTime property of sacAgent.
sacAgent.AgentOptions.SampleTime = Ts;
Set a lower learning rate and a lower gradient threshold to promote a smoother (though possibly slower) training.
sacAgent.AgentOptions.CriticOptimizerOptions(1).LearnRate = 1e-3;
sacAgent.AgentOptions.CriticOptimizerOptions(2).LearnRate = 1e-3;
sacAgent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-3;
sacAgent.AgentOptions.CriticOptimizerOptions(1).GradientThreshold = 1;
sacAgent.AgentOptions.CriticOptimizerOptions(2).GradientThreshold = 1;
sacAgent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;
Use a larger experience buffer to store more experiences, thereby decreasing the likelihood of catastrophic forgetting.
sacAgent.AgentOptions.ExperienceBufferLength = 1e6;
Train the agent, passing the agent, the environment, and the previously defined training options and evaluator objects to train. Training is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
    % Train the agent. Save the final agent and training results.
    tic
    sacTngRes = train(sacAgent,env,trainOpts,Evaluator=evl);
    sacTngTime = toc;
    % Extract the number of training episodes and the number of total steps.
    sacTngEps = sacTngRes.EpisodeIndex(end);
    sacTngSteps = sum(sacTngRes.TotalAgentSteps);
    % Uncomment to save the trained agent and the training metrics.
    % save("ccpsuBchSACAgent.mat", ...
    %     "sacAgent","sacTngEps","sacTngSteps","sacTngTime")
else
    % Load the pretrained agent and results for the example.
    load("ccpsuBchSACAgent.mat", ...
        "sacAgent","sacTngEps","sacTngSteps","sacTngTime")
end
For the SAC Agent, the training converges to a solution. You can check the trained agent within the cart-pole swing-up environment.
Ensure reproducibility of the simulation by fixing the seed used for random number generation.
rng(0,"twister")
Configure the agent to use a greedy policy (no exploration) in simulation.
sacAgent.UseExplorationPolicy = false;
Simulate the environment with the trained agent for ceil(Tf/Ts) steps. For more information on agent simulation, see sim.
experience = sim(env,sacAgent,simOptions);
sacTotalRwd = sum(experience.Reward)
sacTotalRwd = -357.7562
The trained SAC agent swings up the pole and maintains it upright.
Plot Training and Simulation Metrics
For each agent, collect the total reward from the final simulation episode, the number of training episodes, the total number of agent steps, and the total training time as shown in the Reinforcement Learning Training Monitor. Since the PPO agent training goes unstable, set its metrics to NaN.
% Collect the metrics as column vectors (one row per agent).
simReward = [pgTotalRwd; acTotalRwd; NaN; ddpgTotalRwd; td3TotalRwd; sacTotalRwd];
tngEpisodes = [pgTngEps; acTngEps; NaN; ddpgTngEps; td3TngEps; sacTngEps];
tngSteps = [pgTngSteps; acTngSteps; NaN; ddpgTngSteps; td3TngSteps; sacTngSteps];
tngTime = [pgTngTime; acTngTime; NaN; ddpgTngTime; td3TngTime; sacTngTime];
Since the training for the PG, AC, and TD3 agents did not converge, also set their metrics to NaN.
simReward([1 2 5]) = NaN;
tngEpisodes([1 2 5]) = NaN;
tngSteps([1 2 5]) = NaN;
tngTime([1 2 5]) = NaN;
Plot the simulation reward, number of training episodes, number of training steps, and training time. Scale the data by the factor [1 20 5e6 20] for better visualization.
bar([simReward,tngEpisodes,tngSteps,tngTime]./[1 20 5e6 20])
xticklabels(["PG" "AC" "PPO" "DDPG" "TD3" "SAC"])
legend(["Simulation Reward","Training Episodes","Training Steps","Training Time"], ...
    "Location","northwest")
The plot shows that, for this environment, with the random number generator seed and initial conditions used here, only the DDPG and SAC agents stabilize the pole on the cart, with SAC taking a longer training time (due to its more complex structure and the consequent need to calculate more gradients). With a different random seed, the initial agent networks would be different, and therefore convergence results might be different. For more information on the relative strengths and weaknesses of each agent, see Reinforcement Learning Agents.
Save all the variables created in this example, including the training results, for later use.
% Uncomment to save all the workspace variables
% save ccpsuAllVars.mat
Restore the random number stream using the information stored in previousRngState.
rng(previousRngState);