Training PPO agent to track 4 reference signals

3 views (last 30 days)
Eddie on 18 Mar 2025
Answered: Aastha on 7 May 2025
Hi, I am training a PPO agent to track the reference signals below, but I don't know how to constrain the action/control signals (the output of the RL Agent block). I set the action-info lower and upper limits to -1 and 1, and I added a tanh activation followed by a scaling layer (Scale = 1) in my actor network, but the control signals still come out larger than 1 and smaller than -1, and they keep growing.
here are my reference signals:
x1_desired = sin(0.5 * t);
x2_desired = 0.5*cos(0.5*t); % derivative of x1_desired
x3_desired = exp(-0.5*t) * sin(0.5*t);
x4_desired = 0.5 * exp(-0.5*t) * (cos(0.5*t)-sin(0.5*t)); % derivative of x3_desired
I have attached the Simulink model and the code. Please help.
clc;
clear;
close all;
Ts = 0.01;
Tf = 10;
mdl = 'ppo_simulation_3';
open_system(mdl) % open simulink model
agentblock = 'ppo_simulation_3/RL Agent';
%% Define observation and action space
obsInfo = rlNumericSpec([4, 1], "LowerLimit", -inf, "UpperLimit", inf, "Name", "Observations"); % tracking errors in x1, x2, x3, x4
actInfo = rlNumericSpec([2, 1], "LowerLimit", -1, "UpperLimit", 1, "Name", "Actions"); % control inputs u1, u2
% rlNumericSpec defines a continuous (numeric) observation/action space
%% Set up environment
env = rlSimulinkEnv(mdl, agentblock, obsInfo, actInfo);
rng(0);
env.ResetFcn = @(in) resetEnvironment(in, mdl);
% resetEnvironment is a local function; MATLAB requires local functions in a
% script to appear at the end of the file, so it is defined after the plotting code.
%% Define actor network
commonPath = [
featureInputLayer(prod(obsInfo.Dimension), Name= 'comPathIn')
fullyConnectedLayer(256)
% batchNormalizationLayer(Name= 'bn1')
reluLayer
% dropoutLayer(0.2, Name= 'dropout1')
fullyConnectedLayer(256)
% batchNormalizationLayer(Name= 'bn2')
reluLayer
fullyConnectedLayer(256)
% batchNormalizationLayer(Name= 'bn3')
reluLayer
fullyConnectedLayer(192)
% batchNormalizationLayer(Name= 'bn4')
reluLayer(Name ='comPathOut')];
meanPath = [
fullyConnectedLayer(2,Name="meanPathIn")
tanhLayer
scalingLayer(Scale=1, Name="meanPathOut")]; %mean value
stdPath = [
fullyConnectedLayer(2,"Name","stdPathIn")
softplusLayer(Name="stdPathOut")]; % std deviation value
actorNet = dlnetwork;
actorNet = addLayers(actorNet,commonPath);
actorNet = addLayers(actorNet,meanPath);
actorNet = addLayers(actorNet,stdPath);
% connect the common path to the mean and std paths
actorNet = connectLayers(actorNet,"comPathOut","meanPathIn/in");
actorNet = connectLayers(actorNet,"comPathOut","stdPathIn/in");
actorNet = initialize(actorNet);
actor = rlContinuousGaussianActor(actorNet, obsInfo, actInfo, ...
"ActionMeanOutputNames","meanPathOut",...
"ActionStandardDeviationOutputNames","stdPathOut",...
ObservationInputNames="comPathIn");
%% Define critic network
criticNet = [
featureInputLayer(prod(obsInfo.Dimension), Name='observationinp')
fullyConnectedLayer(256)
reluLayer
fullyConnectedLayer(256)
reluLayer
fullyConnectedLayer(256)
reluLayer
fullyConnectedLayer(192)
reluLayer
fullyConnectedLayer(1)];
criticNet = dlnetwork(criticNet);
criticNet = initialize(criticNet);
critic = rlValueFunction(criticNet, obsInfo);
%% Create Agent
agent = rlPPOAgent(actor, critic);
agent.AgentOptions.SampleTime = 0.01; % matches Ts; a value of -1 would mean event-based execution
agent.AgentOptions.DiscountFactor = 0.99;
agent.AgentOptions.EntropyLossWeight = 0.01;
agent.AgentOptions.ExperienceHorizon = 256;
agent.AgentOptions.MiniBatchSize = 64; % 256 experiences split into 4 mini-batches of 64
agent.AgentOptions.NumEpoch = 10;
agent.AgentOptions.MaxMiniBatchPerEpoch = 100;
agent.AgentOptions.LearningFrequency = -1;
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 5e-3;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 2;
agent.AgentOptions.CriticOptimizerOptions.LearnRate = 5e-3;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 2;
agent.AgentOptions.ClipFactor = 0.1;
agent.AgentOptions.AdvantageEstimateMethod="finite-horizon";
agent.AgentOptions.GAEFactor = 0.9; % scalar between 0 and 1 (only used when AdvantageEstimateMethod is "gae")
agent.AgentOptions.NormalizedAdvantageMethod = "none";
agent.AgentOptions.AdvantageNormalizingWindow = 1e6;
maxepisodes = 1000;
maxsteps = Tf / Ts;
trainOpts = rlTrainingOptions( ...
MaxEpisodes = maxepisodes, MaxStepsPerEpisode = maxsteps, ...
ScoreAveragingWindowLength = 5, Verbose=false, ...
Plots = "training-progress", ...
StopTrainingCriteria = "EvaluationStatistic",...
StopTrainingValue = -1e-3 );
doTraining = true;
if doTraining
evl = rlEvaluator(NumEpisodes=1, EvaluationFrequency=10);
trainingStats = train(agent, env, trainOpts, Evaluator=evl);
save("trainedPPOAgent_Ft.mat","agent");
else
load("trainedPPOAgent_Ft.mat", "agent");
end
%%
figure(1)
plot(actorNet);
figure(2)
plot(criticNet);
%% Local function (must appear at the end of the script)
function in = resetEnvironment(in, mdl)
in = setVariable(in, 'X1', 0, 'Workspace', mdl);
in = setVariable(in, 'X2', 1, 'Workspace', mdl);
in = setVariable(in, 'X3', 0, 'Workspace', mdl);
in = setVariable(in, 'X4', 1, 'Workspace', mdl);
end

Answers (1)

Aastha on 7 May 2025
Hi @Eddie,
I understand that you are encountering an issue where the control signal exceeds the upper and lower limits defined by the action specification.
The Tips section of the “rlPPOAgent” documentation mentions that for continuous action spaces, this agent does not enforce the constraints set by the action specification. In this case, you must enforce action space constraints within the environment.
For more details on action space constraints, you may refer to the “rlPPOAgent” page in the MathWorks documentation.
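One common way to enforce the limits inside the Simulink environment is to pass the agent output through a Saturation block before it reaches the plant. A minimal sketch (the block name "Action Saturation" is hypothetical; adapt the paths to your model):
% Illustrative only: clamp the agent output with a Saturation block placed
% between the RL Agent block and the plant inputs.
mdl = 'ppo_simulation_3';
open_system(mdl)
load_system('simulink')                                   % make the Simulink library available
satBlk = [mdl '/Action Saturation'];                      % hypothetical block name
add_block('simulink/Discontinuities/Saturation', satBlk);
set_param(satBlk, 'UpperLimit', '1', 'LowerLimit', '-1'); % match the action-spec limits
% Reroute the action signal (manually or with add_line/delete_line) so it flows
% RL Agent -> Saturation -> plant.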
Additionally, the “rlContinuousGaussianActor” generates actions by sampling from a Gaussian distribution, where the mean and standard deviation are determined by the actor network.
In the implementation, the output layer of the “meanPath” in the actor network uses a tanh activation followed by a scaling operation. This ensures that the mean is always bounded between -1 and 1. However, if the standard deviation is large, the sampled action may still exceed these bounds, since it is drawn from a Gaussian distribution.
For more details, you may refer to the “rlContinuousGaussianActor” page in the MathWorks documentation.
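As a quick illustration of why this happens, the snippet below (not part of the original code) samples from a Gaussian whose mean is inside [-1, 1] but whose standard deviation is moderately large; a noticeable fraction of samples still lands outside the limits:
% Illustrative only: a bounded mean does not imply bounded samples.
rng(0)
mu    = 0.9;                       % mean, already squashed to [-1, 1] by tanh/scaling
sigma = 0.5;                       % std from the softplus layer; can be large early in training
a     = mu + sigma*randn(1, 1e4);  % Gaussian samples, i.e. what the agent actually outputs
fractionOutside = mean(a > 1 | a < -1)  % nonzero, so some actions violate the limits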
To ensure that the agent produces bounded actions, you can apply a tanh activation followed by a scalingLayer, as illustrated in the Continuous Action Generation section of the Soft Actor-Critic (SAC) agent documentation.
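For reference, a generic version of that mean path, with the scaling derived from the action specification rather than hard-coded, could look like the sketch below (assuming the actInfo defined in the question). Note that for a PPO agent this bounds only the mean of the Gaussian; the sampled action still needs to be saturated in the environment, as discussed above.
% Sketch: scale the tanh output to the action limits from actInfo.
actRange = (actInfo.UpperLimit - actInfo.LowerLimit)/2;  % = 1 for limits of [-1, 1]
actBias  = (actInfo.UpperLimit + actInfo.LowerLimit)/2;  % = 0 for symmetric limits
meanPath = [
    fullyConnectedLayer(prod(actInfo.Dimension), Name="meanPathIn")
    tanhLayer
    scalingLayer(Scale=actRange, Bias=actBias, Name="meanPathOut")]; % bounds the mean only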
I hope this helps!
