Documentation

## Train Biped Robot to Walk Using DDPG Agent

This example shows how to train a biped robot, modeled in Simscape™ Multibody™, to walk using a deep deterministic policy gradient (DDPG) agent. For more information on DDPG agents, see Deep Deterministic Policy Gradient Agents.

### Biped Robot Model

The reinforcement learning environment for this example is a biped robot. The training goal is to make the robot walk in a straight line using minimal control effort.

Load parameters of the model to the base MATLAB® workspace.

robotParametersRL

Open the model.

mdl = 'rlWalkingBipedRobot';
open_system(mdl)

The robot is modeled using Simscape Multibody and the Simscape Multibody Contact Forces Library.

For this model:

• The neutral 0 radian position is with both legs straight and the ankles flat.

• The foot contact is modeled using the Simscape Multibody Contact Forces Library.

• The agent can control 3 individual joints, the ankle, knee, and hip, on both legs of the robot by applying torque signals from -3 to 3 Nm. The actual computed action signals are normalized between -1 and 1.

The environment provides 29 observations to the agent. The observations are:

• Y (lateral) and Z (vertical) translations of the torso center of mass. The translation in the Z direction is normalized to a similar range as the other observations.

• X (forward), Y (lateral), and Z (vertical) translation velocities.

• Yaw, pitch, and roll angles of the torso.

• Yaw, pitch, and roll angular velocities of the torso.

• Angular positions and velocities of the 3 joints (ankle, knee, hip) on both legs.

• Action values from the previous time step.

The episode terminates if:

• The robot torso center of mass is less than 0.1 m in Z direction (fallen) or more than 1 m in Y direction (lateral motion).

• The absolute value of either the roll, pitch, or yaw is greater than 0.7854 radians.

The reward function ${\mathit{r}}_{\mathit{t}}$ provided at every time step, is inspired by [1]. This reward function encourages the agent to move forward by providing a positive reward for positive forward velocity. It also encourages the agent to avoid episode termination by providing a constant reward ($25\frac{\mathrm{Ts}}{\mathrm{Tf}}$) at every time step. The other terms in the reward function are penalties for substantial changes in lateral and vertical translations, and for the use of excess control effort.

${\mathbit{r}}_{\mathbit{t}}={\mathit{v}}_{\mathit{x}}-{3\mathit{y}}^{2}-50{\stackrel{ˆ}{\mathit{z}}}^{2}+25\frac{\mathrm{Ts}}{\mathrm{Tf}}-0.02\sum _{\mathit{i}}{{\mathit{u}}_{\mathit{t}-1}^{\mathit{i}}}^{2}$

where:

• ${\mathit{v}}_{\mathit{x}}$ is the translation velocity in X direction (forward toward goal) of the robot.

• $\mathit{y}$ is the lateral translation displacement of the robot from the target straight line trajectory.

• $\stackrel{ˆ}{\mathit{z}}$ is the normalized vertical translation displacement of the robot center of mass.

• ${\mathit{u}}_{\mathit{t}-1}^{\mathit{i}}$ is the torque from joint $\mathit{i}$ from the previous time step.

• $\mathrm{Ts}$ is the sample time of the environment.

• $\mathrm{Tf}$ is the final simulation time of the environment.

### Create Environment Interface

Create the observation specification.

numObs = 29;
obsInfo = rlNumericSpec([numObs 1]);
obsInfo.Name = 'observations';

Create the action specification.

numAct = 6;
actInfo = rlNumericSpec([numAct 1],'LowerLimit',-1,'UpperLimit',1);
actInfo.Name = 'foot_torque';

Create the environment interface for the walking robot model.

blk = [mdl,'/RL Agent'];
env.ResetFcn = @(in) walkerResetFcn(in,upper_leg_length/100,lower_leg_length/100,h/100);

### Create DDPG agent

A DDPG agent approximates the long-term reward given observations and actions using a critic value function representation. A DDPG agent decides which action to take given observations using an actor representation. The actor and critic networks for this example are inspired by [2].

For more information on creating a deep neural network value function representation, see Create Policy and Value Function Representations. For an example that creates neural networks for DDPG agents, see Train DDPG Agent to Control Double Integrator System.

Create the actor and critic networks using the createNetworks helper function.

[criticNetwork,actorNetwork] = createNetworks(numObs,numAct);

View the critic network configuration.

figure
plot(criticNetwork)

Specify options for the critic and actor representations using rlRepresentationOptions.

Create the critic and actor representations using the specified deep neural networks and options. You must also specify the action and observation information for each representation, which you already obtained from the environment interface. For more information, see rlRepresentation.

critic = rlRepresentation(criticNetwork,obsInfo,actInfo,'Observation',{'observation'},'Action',{'action'},criticOptions);
actor  = rlRepresentation(actorNetwork,obsInfo,actInfo,'Observation',{'observation'},'Action',{'ActorTanh1'},actorOptions);

To create the DDPG agent, first specify the DDPG agent options using rlDDPGAgentOptions.

agentOptions = rlDDPGAgentOptions;
agentOptions.SampleTime = Ts;
agentOptions.DiscountFactor = 0.99;
agentOptions.MiniBatchSize = 250;
agentOptions.ExperienceBufferLength = 1e6;
agentOptions.TargetSmoothFactor = 1e-3;
agentOptions.NoiseOptions.MeanAttractionConstant = 1;
agentOptions.NoiseOptions.Variance = 0.1;

Then, create the DDPG agent using the specified actor representation, critic representation, and agent options. For more information, see rlDDPGAgent.

agent = rlDDPGAgent(actor,critic,agentOptions);

### Training Options

To train the agent, first specify the training options. For this example, use the following options:

• Run each training session for at most 10000 episodes, with each episode lasting at most maxSteps time steps.

• Display the training progress in the Episode Manager dialog box (set the Plots option) and disable the command line display (set the Verbose option).

• Stop training when the agent receives an average cumulative reward greater than 100 over 250 consecutive episodes.

• Save a copy of the agent for each episode where the cumulative reward is greater than 150.

maxEpisodes = 10000;
maxSteps = floor(Tf/Ts);
trainOpts = rlTrainingOptions(...
'MaxEpisodes',maxEpisodes,...
'MaxStepsPerEpisode',maxSteps,...
'ScoreAveragingWindowLength',250,...
'Verbose',false,...
'Plots','training-progress',...
'StopTrainingCriteria','AverageReward',...
'StopTrainingValue',100,...
'SaveAgentCriteria','EpisodeReward',...
'SaveAgentValue',150);

Specify the following training options to train the agent in parallel training mode. If you do not have Parallel Computing Toolbox™ software installed, set UseParallel to false.

• Set the UseParallel option to true.

• Train the agent in parallel asynchronously.

• After every 32 steps, each worker sends experiences to the host.

• DDPG agents require workers to send 'Experiences' to the host.

trainOpts.UseParallel = true;
trainOpts.ParallelizationOptions.Mode = 'async';
trainOpts.ParallelizationOptions.StepsUntilDataIsSent = 32;
trainOpts.ParallelizationOptions.DataToSendFromWorkers = 'Experiences';

### Train Agent

Train the agent using the train function. This process is computationally intensive and takes several hours to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true. Due to randomness of the parallel training, you can expect different training results from the plot below. The pretrained agent was trained in parallel using six workers.

doTraining = false;
if doTraining
% Train the agent.
trainingStats = train(agent,env,trainOpts);
else
% Load pretrained agent for the example.
end

### Simulate DDPG Agent

Fix the random generator seed for reproducibility.

rng(0)

To validate the performance of the trained agent, simulate it within the biped robot environment. For more information on agent simulation, see rlSimulationOptions and sim.

simOptions = rlSimulationOptions('MaxSteps',maxSteps);
experience = sim(env,agent,simOptions);

### References

[1] N. Heess et al, "Emergence of Locomotion Behaviours in Rich Environments," Technical Report, ArXiv, 2017.

[2] T.P. Lillicrap et al, "Continuous Control with Deep Reinforcement Learning," International Conference on Learning Representations, 2016.