RL DDPG agent not converging
Hi,
I am training a DDPG agent to control a single cart, given an initial speed, moving along a horizontal axis. The RL agent acts as a controller that provides a force along the axis to drive the cart to the origin. It should not be a difficult task; however, after training for many steps, the control performance is still far from optimal.
These are my configurations for the agent and the environment. At the origin, the optimal policy should output zero force, meaning that the cart should no longer be moving once it reaches the origin.
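(For context: the matrices S.A1d and S.B1d used by the environment further down are not shown in this post. A minimal sketch of the kind of discretized cart model they could represent, assuming a double integrator with an illustrative sample time Ts, is:)
Ts = 0.1;                                 % illustrative sample time (assumption)
A  = [0 1; 0 0];                          % continuous-time double integrator
B  = [0; 1];                              % force enters through the velocity state
sysd  = c2d(ss(A, B, eye(2), zeros(2,1)), Ts);
S.A1d = sysd.A;                           % discrete state-transition matrix
S.B1d = sysd.B;                           % discrete input matrix
S.N   = 1;                                % single force input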
The agent is built with an actor and a critic:
function [agents] = createDDPGAgents(N)
% Function to create two DDPG agents with the same observation and action info.
obsInfo = rlNumericSpec([2 1],'LowerLimit',-100*ones(2,1),'UpperLimit',100*ones(2,1));
actInfo = rlNumericSpec([N 1],'LowerLimit',-100*ones(N,1),'UpperLimit',100*ones(N,1));
% Define observation and action paths for critic
obsPath = featureInputLayer(prod(obsInfo.Dimension), Name="obsInLyr");
actPath = featureInputLayer(prod(actInfo.Dimension), Name="actInLyr");
% Define common path: concatenate along first dimension
commonPath = [
concatenationLayer(1, 2, Name="concat")
fullyConnectedLayer(30)
reluLayer
fullyConnectedLayer(1)
];
% Add paths to layerGraph network
criticNet = layerGraph(obsPath);
criticNet = addLayers(criticNet, actPath);
criticNet = addLayers(criticNet, commonPath);
% Connect paths
criticNet = connectLayers(criticNet, "obsInLyr", "concat/in1");
criticNet = connectLayers(criticNet, "actInLyr", "concat/in2");
% Plot the network
plot(criticNet)
% Convert to dlnetwork object
criticNet = dlnetwork(criticNet);
% Display the number of weights
summary(criticNet)
% Create the critic approximator object
critic = rlQValueFunction(criticNet, obsInfo, actInfo, ...
ObservationInputNames="obsInLyr", ...
ActionInputNames="actInLyr");
% Check the critic with random observation and action inputs
getValue(critic, {rand(obsInfo.Dimension)}, {rand(actInfo.Dimension)})
% Create a network to be used as underlying actor approximator
actorNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(30)
tanhLayer
fullyConnectedLayer(30)
tanhLayer
fullyConnectedLayer(prod(actInfo.Dimension))
];
% Convert to dlnetwork object
actorNet = dlnetwork(actorNet);
% Display the number of weights
summary(actorNet)
% Create the actor
actor = rlContinuousDeterministicActor(actorNet, obsInfo, actInfo);
%% DDPG Agent Options
agentOptions = rlDDPGAgentOptions(...
'DiscountFactor', 0.98, ...
'MiniBatchSize', 128, ...
'TargetSmoothFactor', 1e-3, ...
'ExperienceBufferLength', 1e6, ...
'SampleTime', -1);
% Exploration (OU) noise options (set before constructing the agents so they take effect)
agentOptions.NoiseOptions.MeanAttractionConstant = 0.1;
agentOptions.NoiseOptions.StandardDeviation = 0.3;
agentOptions.NoiseOptions.StandardDeviationDecayRate = 8e-4;
agentOptions.NoiseOptions
%% Create Two DDPG Agents
agent1 = rlDDPGAgent(actor, critic, agentOptions);
agent2 = rlDDPGAgent(actor, critic, agentOptions);
% Return agents as an array
agents = [agent1, agent2];
end
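A quick way to sanity-check the returned (untrained) agents, for example (illustrative call, not part of the script above):
agents = createDDPGAgents(1);
act = getAction(agents(1), {[1; 0]});   % observation passed as a cell array
disp(act{1})                            % force the untrained policy outputs at x = [1; 0]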
The environment:
function [nextObs, reward, isDone, loggedSignals] = myStepFunction(action, loggedSignals, S)
% Discrete-time cart dynamics: x(k+1) = A1d*x(k) + B1d*u(k)
nextObs1 = S.A1d*loggedSignals.State + S.B1d*action(1);
nextObs = nextObs1;
loggedSignals.State = nextObs1;
% Reward: +10 bonus near the origin, otherwise a quadratic penalty saturated at -1000
if abs(loggedSignals.State(1)) <= 0.05 && abs(loggedSignals.State(2)) <= 0.05
    reward1 = 10;
else
    reward1 = -1*(1.01*nextObs1(1)^2 + 1.01*nextObs1(2)^2 + action(1)^2);
    if reward1 <= -1000
        reward1 = -1000;
    end
end
reward = reward1;
% Terminate the episode once the state is inside a tighter band around the origin
if abs(loggedSignals.State(1)) <= 0.02 && abs(loggedSignals.State(2)) <= 0.02
    isDone = true;
else
    isDone = false;
end
end
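For reference, the per-step reward defined above can be evaluated directly for a few states (rewardAt is just a hypothetical helper that mirrors the logic above):
rewardAt = @(x, u) all(abs(x) <= 0.05)*10 + ...
    (~all(abs(x) <= 0.05))*max(-1000, -(1.01*x(1)^2 + 1.01*x(2)^2 + u^2));
rewardAt([0; 0], 0)     % +10 per step at the origin with zero force
rewardAt([-1; 0], 0)    % about -1.01 per step when resting near -1 with zero force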
And this is the simulation setup (I omitted the reset function here, and S.N = 1):
obsInfo1 = rlNumericSpec([2 1],'LowerLimit',-100*ones(2,1),'UpperLimit',100*ones(2,1)) ;
actInfo1 = rlNumericSpec([N 1],'LowerLimit',-100*ones(N,1),'UpperLimit',100*ones(N,1));
stepFn1 = @(action, loggedSignals) myStepFunction(action, loggedSignals, S);
resetFn1 = @() myResetFunction(pos1);
env = rlFunctionEnv(obsInfo1, actInfo1, stepFn1, resetFn1);
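% (Illustrative addition, not in the original script) validateEnvironment runs the
% reset and step functions once and checks the signals against obsInfo1/actInfo1:
validateEnvironment(env)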
%% Specify agent initialization
agent = createDDPGAgents(S.N);
loggedSignals = [];
trainOpts = rlTrainingOptions(...
StopOnError="on",...
MaxEpisodes=1000,... %1100 for fully trained
MaxStepsPerEpisode=1000,...
StopTrainingCriteria="AverageReward",...
StopTrainingValue=480,...
Plots="training-progress");
%"training-progress"
train(agent, env, trainOpts);
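For reference, a minimal way to inspect the resulting policy against this environment afterwards (assuming the first agent in the returned array is the one paired with this environment; this is not the exact plotting code behind the figures mentioned below):
simOpts = rlSimulationOptions(MaxSteps=1000);
exp = sim(env, agent(1), simOpts);
% exp.Observation and exp.Action hold timeseries of the states and the applied force.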
This is the reward plot, where each episode takes a very long time, but there are still no signs of reaching the positive reward for this simple system.
And this is the control effect on both states, which shows that the RL agent is driving the cart to the wrong position near -1 while its velocity is 0.
It is very weird that the reward does not converge to the positive reward, but to another point. Can I ask where the problem could be? Thanks.
Haochen