Reinforcement learning algorithm keeps using boundary values, doesn't learn

23 views (last 30 days)
Hello
I am using the Reinforcement Learning Toolbox to train an agent to control a vehicle suspension system. The environment is a Simulink model and the agent is a TD3 agent. My setup looks very similar to the examples in the Reinforcement Learning Toolbox: I use 4 observations and 1 action, where the action is the desired link angle used to control the suspension. I have copied the code I am using at the bottom of the question.
The problem I am facing is that the agent keeps producing actions that are saturated at the lower and upper bounds. The scope below shows what I mean: the link angle (converted to degrees) should generally oscillate around 90 degrees (equilibrium). The agent starts off close to 90 degrees, but then it begins jumping between 20 degrees (the lower bound) and 160 degrees (the upper bound). This happens even though, the way the reward function is structured, the penalty incurred at the boundary values is much higher than if the agent simply held a static 90 degree angle: the reward heavily penalises the square of the vertical displacement, which is large at boundary values such as the 20 degrees shown in the scope.
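To make the reward structure concrete, here is a minimal sketch of the penalty I described (the function name, signal name and weight below are placeholders; the actual reward is computed inside the Simulink model):
% Sketch of the reward evaluated at each sample time, e.g. in a MATLAB
% Function block inside the Simulink model. w_z and z_s are placeholder names.
function r = suspensionReward(z_s)
    w_z = 100;          % heavy weight on vertical displacement (placeholder value)
    r = -w_z * z_s^2;   % squared-displacement penalty: a saturated link angle produces
                        % a large |z_s| and therefore a much worse reward than simply
                        % holding the 90 degree equilibrium angle
end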
This constant saturation of the outputs means the agent does not learn anything over the course of the run. I am not sure what is causing it, since very similar structures were used in the examples and they seemed to work fine. No matter what I change about the algorithm, it sooner or later saturates the actions at one of the boundaries and makes very little progress. Any help would be greatly appreciated.
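The saturation can also be checked by probing the policy directly in MATLAB, outside the Simulink simulation (a sketch; the random observation is only a placeholder, and getAction is the standard toolbox call):
% Probe the current policy with a placeholder observation. The returned action
% is the commanded link angle in radians; a saturated policy returns values
% pinned near 0.35 rad (20 deg) or 2.79 rad (160 deg) instead of ~1.57 rad (90 deg).
probeAct = getAction(agent, {rand(4,1)})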
Code:
% Run modelling and parameter files
run('AVGS_Equivalent_Modelling')
run('parameters_series.m')
% Link the simulink model with the MATLAB file
model='NEW_NN_v2';
open_system(model);
%% Define observations, actions, and the environment
Ts=0.05;
Tf=60;
numObs=4;
obsInfo = rlNumericSpec([numObs 1]);
obsInfo.Name = 'observations';
numAct = 1;
actInfo = rlNumericSpec([numAct 1],'LowerLimit',(20*(pi/180)),'UpperLimit',(160*(pi/180)));
actInfo.Name = 'Desired link angle';
blk = [model,'/RL Agent'];
env = rlSimulinkEnv(model,blk,obsInfo,actInfo);
%% Generate the RL Agent
agent = TD3create(numObs, obsInfo, numAct, actInfo, Ts);
maxEpisodes = 500;
maxSteps = floor(Tf/Ts);
trainOpts = rlTrainingOptions(...
'MaxEpisodes',maxEpisodes,...
'MaxStepsPerEpisode',maxSteps,...
'ScoreAveragingWindowLength',50,...
'Verbose',false,...
'Plots','training-progress',...
'StopTrainingCriteria','EpisodeCount',...
'StopTrainingValue',maxEpisodes,...
'SaveAgentCriteria','EpisodeCount',...
'SaveAgentValue',maxEpisodes);
trainOpts.UseParallel = false;
trainOpts.ParallelizationOptions.Mode = 'async';
trainOpts.ParallelizationOptions.StepsUntilDataIsSent = 32;
trainOpts.ParallelizationOptions.DataToSendFromWorkers = 'Experiences';
trainingStats = train(agent,env,trainOpts);
%% Helper functions
function agent = TD3create(numObs, obsInfo, numAct, actInfo, Ts)
% Walking Robot -- TD3 Agent Setup Script
% Copyright 2020 The MathWorks, Inc.
%% Create the actor and critic networks using the createNetworks helper function
[criticNetwork1,criticNetwork2,actorNetwork] = createNetworks(numObs,numAct); % Use of 2 Critic networks
%% Specify options for the critic and actor representations using rlRepresentationOptions
criticOptions = rlRepresentationOptions('Optimizer','adam','LearnRate',1e-1,...
'GradientThreshold',1,'L2RegularizationFactor',2e-4);
actorOptions = rlRepresentationOptions('Optimizer','adam','LearnRate',1e-1,...
'GradientThreshold',1,'L2RegularizationFactor',1e-5);
%% Create critic and actor representations using specified networks and
% options
critic1 = rlQValueRepresentation(criticNetwork1,obsInfo,actInfo,'Observation',{'observation'},'Action',{'action'},criticOptions);
critic2 = rlQValueRepresentation(criticNetwork2,obsInfo,actInfo,'Observation',{'observation'},'Action',{'action'},criticOptions);
actor = rlDeterministicActorRepresentation(actorNetwork,obsInfo,actInfo,'Observation',{'observation'},'Action',{'ActorScaling'},actorOptions);
%% Specify TD3 agent options
agentOptions = rlTD3AgentOptions;
agentOptions.SampleTime = Ts;
agentOptions.DiscountFactor = 0.99;
agentOptions.MiniBatchSize = 64;
agentOptions.ExperienceBufferLength = 1e6;
agentOptions.TargetSmoothFactor = 5e-3;
agentOptions.TargetPolicySmoothModel.Variance = 0.2; % target policy noise
agentOptions.TargetPolicySmoothModel.LowerLimit = -0.5;
agentOptions.TargetPolicySmoothModel.UpperLimit = 0.5;
agentOptions.ExplorationModel = rl.option.OrnsteinUhlenbeckActionNoise; % set up OU noise as exploration noise (default is Gaussian for rlTD3AgentOptions)
agentOptions.ExplorationModel.MeanAttractionConstant = 1;
agentOptions.ExplorationModel.Variance = 0.1;
%% Create agent using specified actor representation, critic representations and agent options
agent = rlTD3Agent(actor, [critic1,critic2], agentOptions);
end
function [criticNetwork1, criticNetwork2, actorNetwork] = createNetworks(numObs, numAct)
%% CRITICS
% Create the critic network layers
criticLayerSizes = [400 300];
%% First Critic network
statePath1 = [
featureInputLayer(numObs,'Normalization','none','Name', 'observation')
fullyConnectedLayer(criticLayerSizes(1), 'Name', 'CriticStateFC1', ...
'Weights',2/sqrt(numObs)*(rand(criticLayerSizes(1),numObs)-0.5), ...
'Bias',2/sqrt(numObs)*(rand(criticLayerSizes(1),1)-0.5))
reluLayer('Name','CriticStateRelu1')
fullyConnectedLayer(criticLayerSizes(2), 'Name', 'CriticStateFC2', ...
'Weights',2/sqrt(criticLayerSizes(1))*(rand(criticLayerSizes(2),criticLayerSizes(1))-0.5), ...
'Bias',2/sqrt(criticLayerSizes(1))*(rand(criticLayerSizes(2),1)-0.5))
];
actionPath1 = [
featureInputLayer(numAct,'Normalization','none', 'Name', 'action')
fullyConnectedLayer(criticLayerSizes(2), 'Name', 'CriticActionFC1', ...
'Weights',2/sqrt(numAct)*(rand(criticLayerSizes(2),numAct)-0.5), ...
'Bias',2/sqrt(numAct)*(rand(criticLayerSizes(2),1)-0.5))
];
commonPath1 = [
additionLayer(2,'Name','add')
reluLayer('Name','CriticCommonRelu1')
fullyConnectedLayer(1, 'Name', 'CriticOutput',...
'Weights',2*5e-3*(rand(1,criticLayerSizes(2))-0.5), ...
'Bias',2*5e-3*(rand(1,1)-0.5))
];
% Connect the layer graph
criticNetwork1 = layerGraph(statePath1);
criticNetwork1 = addLayers(criticNetwork1, actionPath1);
criticNetwork1 = addLayers(criticNetwork1, commonPath1);
criticNetwork1 = connectLayers(criticNetwork1,'CriticStateFC2','add/in1');
criticNetwork1 = connectLayers(criticNetwork1,'CriticActionFC1','add/in2');
%% Second Critic network
statePath2 = [
featureInputLayer(numObs,'Normalization','none','Name', 'observation')
fullyConnectedLayer(criticLayerSizes(1), 'Name', 'CriticStateFC1', ...
'Weights',2/sqrt(numObs)*(rand(criticLayerSizes(1),numObs)-0.5), ...
'Bias',2/sqrt(numObs)*(rand(criticLayerSizes(1),1)-0.5))
reluLayer('Name','CriticStateRelu1')
fullyConnectedLayer(criticLayerSizes(2), 'Name', 'CriticStateFC2', ...
'Weights',2/sqrt(criticLayerSizes(1))*(rand(criticLayerSizes(2),criticLayerSizes(1))-0.5), ...
'Bias',2/sqrt(criticLayerSizes(1))*(rand(criticLayerSizes(2),1)-0.5))
];
actionPath2 = [
featureInputLayer(numAct,'Normalization','none', 'Name', 'action')
fullyConnectedLayer(criticLayerSizes(2), 'Name', 'CriticActionFC1', ...
'Weights',2/sqrt(numAct)*(rand(criticLayerSizes(2),numAct)-0.5), ...
'Bias',2/sqrt(numAct)*(rand(criticLayerSizes(2),1)-0.5))
];
commonPath2 = [
additionLayer(2,'Name','add')
reluLayer('Name','CriticCommonRelu1')
fullyConnectedLayer(1, 'Name', 'CriticOutput',...
'Weights',2*5e-3*(rand(1,criticLayerSizes(2))-0.5), ...
'Bias',2*5e-3*(rand(1,1)-0.5))
];
% Connect the layer graph
criticNetwork2 = layerGraph(statePath2);
criticNetwork2 = addLayers(criticNetwork2, actionPath2);
criticNetwork2 = addLayers(criticNetwork2, commonPath2);
criticNetwork2 = connectLayers(criticNetwork2,'CriticStateFC2','add/in1');
criticNetwork2 = connectLayers(criticNetwork2,'CriticActionFC1','add/in2');
%% ACTOR
% Create the actor network layers
actorLayerSizes = [400 300];
actorNetwork = [
featureInputLayer(numObs,'Normalization','none','Name','observation')
fullyConnectedLayer(actorLayerSizes(1), 'Name', 'ActorFC1', ...
'Weights',2/sqrt(numObs)*(rand(actorLayerSizes(1),numObs)-0.5), ...
'Bias',2/sqrt(numObs)*(rand(actorLayerSizes(1),1)-0.5))
reluLayer('Name', 'ActorRelu1')
fullyConnectedLayer(actorLayerSizes(2), 'Name', 'ActorFC2', ...
'Weights',2/sqrt(actorLayerSizes(1))*(rand(actorLayerSizes(2),actorLayerSizes(1))-0.5), ...
'Bias',2/sqrt(actorLayerSizes(1))*(rand(actorLayerSizes(2),1)-0.5))
reluLayer('Name', 'ActorRelu2')
fullyConnectedLayer(numAct, 'Name', 'ActorFC3', ...
'Weights',2*5e-3*(rand(numAct,actorLayerSizes(2))-0.5), ...
'Bias',2*5e-3*(rand(numAct,1)-0.5))
tanhLayer('Name','ActorTanh1')
scalingLayer('Name','ActorScaling','Scale',(7*pi)/18,'Bias',pi/2) % maps the tanh output [-1,1] onto [20*(pi/180), 160*(pi/180)] rad
];
end
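As a sanity check on the action range produced by the final tanh and scaling layers (plain arithmetic, no toolbox calls), the scaling layer applies y = Scale*x + Bias to the tanh output in [-1, 1], so:
% Verify that the tanh + scaling layers span exactly the specified action limits.
scale = (7*pi)/18;
bias = pi/2;
lowerDeg = (bias - scale)*(180/pi)   % 20 degrees, matches actInfo.LowerLimit
upperDeg = (bias + scale)*(180/pi)   % 160 degrees, matches actInfo.UpperLimit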
2 Comments
Mirjan Heubaum on 19 Nov 2021
Did you find a solution? Can the tanh or scaling layer cause such a problem?
Nico Hartmann on 4 Jul 2023
That is exactly the problem I'm having right now. Disappointing that no one has answered :(


Answers (0)
