Training PPO agent to track 4 reference signals

3 views (last 30 days)
Eddie on 18 Mar 2025
Answered: Aastha on 7 May 2025
Hi, I am training a PPO agent to track the reference signals below, but I don't know how to constrain the action/control signals (the output of the RL Agent block). I set the action-info lower and upper limits to -1 and 1, and I added a tanh activation followed by a scaling layer (Scale = 1) in my actor network, but the control signals still come out larger than 1 and smaller than -1, and they keep growing.
here are my reference signals:
x1_desired = sin(0.5 * t);
x2_desired = 0.5*cos(0.5*t); % derivative of x1_desired
x3_desired = exp(-0.5*t) * sin(0.5*t);
x4_desired = 0.5 * exp(-0.5*t) * (cos(0.5*t)-sin(0.5*t)); % derivative of x3_desired
I have attached the Simulink model and the code. Please help.
clc;
clear;
close all;
Ts = 0.01;
Tf = 10;
mdl = 'ppo_simulation_3';
open_system(mdl) % open simulink model
agentblock = 'ppo_simulation_3/RL Agent';
%% Define observation and action space
obsInfo = rlNumericSpec([4, 1], "LowerLimit", -inf, "UpperLimit", inf, "Name", "Observations"); % tracking errors in x1, x2, x3, x4
actInfo = rlNumericSpec([2, 1], "LowerLimit", -1, "UpperLimit", 1, "Name", "Actions"); % control inputs u1, u2
% rlNumericSpec defines a continuous (numeric) observation/action space
%% Set up environment
env = rlSimulinkEnv(mdl, agentblock, obsInfo, actInfo);
rng(0);
env.ResetFcn = @(in) resetEnvironment(in, mdl);
% resetEnvironment is a local function; MATLAB requires local functions in a
% script to appear at the end of the file, so it is defined after the plotting code.
%% Define actor network
commonPath = [
featureInputLayer(prod(obsInfo.Dimension), Name= 'comPathIn')
fullyConnectedLayer(256)
% batchNormalizationLayer(Name= 'bn1')
reluLayer
% dropoutLayer(0.2, Name= 'dropout1')
fullyConnectedLayer(256)
% batchNormalizationLayer(Name= 'bn2')
reluLayer
fullyConnectedLayer(256)
% batchNormalizationLayer(Name= 'bn3')
reluLayer
fullyConnectedLayer(192)
% batchNormalizationLayer(Name= 'bn4')
reluLayer(Name ='comPathOut')];
meanPath = [
fullyConnectedLayer(2,Name="meanPathIn")
tanhLayer
scalingLayer(Scale=1, Name="meanPathOut")]; %mean value
stdPath = [
fullyConnectedLayer(2,"Name","stdPathIn")
softplusLayer(Name="stdPathOut")]; % std deviation value
actorNet = dlnetwork;
actorNet = addLayers(actorNet,commonPath);
actorNet = addLayers(actorNet,meanPath);
actorNet = addLayers(actorNet,stdPath);
% connect the common path to the mean and std paths
actorNet = connectLayers(actorNet,"comPathOut","meanPathIn/in");
actorNet = connectLayers(actorNet,"comPathOut","stdPathIn/in");
actorNet = initialize(actorNet);
actor = rlContinuousGaussianActor(actorNet, obsInfo, actInfo, ...
"ActionMeanOutputNames","meanPathOut",...
"ActionStandardDeviationOutputNames","stdPathOut",...
ObservationInputNames="comPathIn");
%% Define critic network
criticNet = [
featureInputLayer(prod(obsInfo.Dimension), Name='observationinp')
fullyConnectedLayer(256)
reluLayer
fullyConnectedLayer(256)
reluLayer
fullyConnectedLayer(256)
reluLayer
fullyConnectedLayer(192)
reluLayer
fullyConnectedLayer(1)];
criticNet = dlnetwork(criticNet);
criticNet = initialize(criticNet);
critic = rlValueFunction(criticNet, obsInfo);
%% Create Agent
agent = rlPPOAgent(actor, critic);
agent.AgentOptions.SampleTime = 0.01; % matches Ts; a value of -1 would mean event-based execution
agent.AgentOptions.DiscountFactor = 0.99;
agent.AgentOptions.EntropyLossWeight = 0.01;
agent.AgentOptions.ExperienceHorizon = 256;
agent.AgentOptions.MiniBatchSize = 64; % 256 experiences split into 4 mini-batches of 64
agent.AgentOptions.NumEpoch = 10;
agent.AgentOptions.MaxMiniBatchPerEpoch = 100;
agent.AgentOptions.LearningFrequency = -1;
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 5e-3;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 2;
agent.AgentOptions.CriticOptimizerOptions.LearnRate = 5e-3;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 2;
agent.AgentOptions.ClipFactor = 0.1;
agent.AgentOptions.AdvantageEstimateMethod="finite-horizon";
agent.AgentOptions.GAEFactor = 0.9; % scalar between 0 and 1 (only used when AdvantageEstimateMethod is "gae")
agent.AgentOptions.NormalizedAdvantageMethod = "none";
agent.AgentOptions.AdvantageNormalizingWindow = 1e6;
maxepisodes = 1000;
maxsteps = Tf / Ts;
trainOpts = rlTrainingOptions( ...
MaxEpisodes = maxepisodes, MaxStepsPerEpisode = maxsteps, ...
ScoreAveragingWindowLength = 5, Verbose=false, ...
Plots = "training-progress", ...
StopTrainingCriteria = "EvaluationStatistic",...
StopTrainingValue = -1e-3 );
doTraining = true;
if doTraining
evl = rlEvaluator(NumEpisodes=1, EvaluationFrequency=10);
trainingStats = train(agent, env, trainOpts, Evaluator=evl);
save("trainedPPOAgent_Ft.mat","agent");
else
load("trainedPPOAgent_Ft.mat", "agent");
end
%%
figure(1)
plot(actorNet);
figure(2)
plot(criticNet);
%% Local function (must appear at the end of the script)
function in = resetEnvironment(in, mdl)
in = setVariable(in, 'X1', 0, 'Workspace', mdl);
in = setVariable(in, 'X2', 1, 'Workspace', mdl);
in = setVariable(in, 'X3', 0, 'Workspace', mdl);
in = setVariable(in, 'X4', 1, 'Workspace', mdl);
end

Answers (1)

Aastha on 7 May 2025
Hi @Eddie,
I understand that you are encountering an issue where the control signal exceeds the upper and lower limits defined by the action specification.
The Tips section of the “rlPPOAgent” documentation mentions that for continuous action spaces, this agent does not enforce the constraints set by the action specification. In this case, you must enforce action space constraints within the environment.
For more details on action space constraints, you may refer to the “rlPPOAgent” page in the MathWorks documentation.
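One common way to enforce the limits inside the Simulink environment is to pass the agent output through a Saturation block before it reaches the plant. A minimal sketch (the block name "Action Saturation" is hypothetical; adapt the paths to your model):
% Illustrative only: clamp the agent output with a Saturation block placed
% between the RL Agent block and the plant inputs.
mdl = 'ppo_simulation_3';
open_system(mdl)
load_system('simulink')                                   % make the Simulink library available
satBlk = [mdl '/Action Saturation'];                      % hypothetical block name
add_block('simulink/Discontinuities/Saturation', satBlk);
set_param(satBlk, 'UpperLimit', '1', 'LowerLimit', '-1'); % match the action-spec limits
% Reroute the action signal (manually or with add_line/delete_line) so it flows
% RL Agent -> Saturation -> plant.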
Additionally, the “rlContinuousGaussianActor” generates actions by sampling from a Gaussian distribution, where the mean and standard deviation are determined by the actor network.
In the implementation, the output layer of the “meanPath” in the actor network uses a tanh activation followed by a scaling operation. This ensures that the mean is always bounded between -1 and 1. However, if the standard deviation is large, the sampled action may still exceed these bounds, since it is drawn from a Gaussian distribution.
For more details, you may refer to the “rlContinuousGaussianActor” page in the MathWorks documentation.
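As a quick illustration of why this happens, the snippet below (not part of the original code) samples from a Gaussian whose mean is inside [-1, 1] but whose standard deviation is moderately large; a noticeable fraction of samples still lands outside the limits:
% Illustrative only: a bounded mean does not imply bounded samples.
rng(0)
mu    = 0.9;                       % mean, already squashed to [-1, 1] by tanh/scaling
sigma = 0.5;                       % std from the softplus layer; can be large early in training
a     = mu + sigma*randn(1, 1e4);  % Gaussian samples, i.e. what the agent actually outputs
fractionOutside = mean(a > 1 | a < -1)  % nonzero, so some actions violate the limits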
To ensure that the agent produces bounded actions, you can apply a tanh activation followed by a scalingLayer, as illustrated in the Continuous Action Generation section of the Soft Actor-Critic (SAC) agent documentation.
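For reference, a generic version of that mean path, with the scaling derived from the action specification rather than hard-coded, could look like the sketch below (assuming the actInfo defined in the question). Note that for a PPO agent this bounds only the mean of the Gaussian; the sampled action still needs to be saturated in the environment, as discussed above.
% Sketch: scale the tanh output to the action limits from actInfo.
actRange = (actInfo.UpperLimit - actInfo.LowerLimit)/2;  % = 1 for limits of [-1, 1]
actBias  = (actInfo.UpperLimit + actInfo.LowerLimit)/2;  % = 0 for symmetric limits
meanPath = [
    fullyConnectedLayer(prod(actInfo.Dimension), Name="meanPathIn")
    tanhLayer
    scalingLayer(Scale=actRange, Bias=actBias, Name="meanPathOut")]; % bounds the mean only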
I hope this helps!
