PPO training stopped learning

20 views (last 30 days)
Lloyd on 21 Aug 2024
Answered: Kaustab Pal on 22 Aug 2024
I am trying to train a PPO agent on the rotary inverted pendulum environment. It's working, but it reaches a limit and doesn't learn past it, and I'm not sure why. Newbie to RL here, so go easy on me :). I think it's something to do with the yellow line, Q0. It could also be stuck in a local optimum, but I don't think that's the problem. I think the problem is that Q0 is not getting past 100, so the agent can't extract more useful information. Hopefully someone with a little more experience has something to say!
mdl = "rlQubeServoModel";
open_system(mdl)
theta_limit = 5*pi/8;
dtheta_limit = 30;
volt_limit = 12;
Ts = 0.005;
rng(22)
obsInfo = rlNumericSpec([7 1]);
actInfo = rlNumericSpec([1 1],UpperLimit=1,LowerLimit=-1);
agentBlk = mdl + "/RL Agent";
simEnv = rlSimulinkEnv(mdl,agentBlk,obsInfo,actInfo);
numObs = prod(obsInfo.Dimension);
criticLayerSizes = [400 300];
actorLayerSizes = [400 300];
% critic:
criticNetwork = [
    featureInputLayer(numObs)
    fullyConnectedLayer(criticLayerSizes(1), ...
        Weights=sqrt(2/numObs)* ...
        (rand(criticLayerSizes(1),numObs)-0.5), ...
        Bias=1e-3*ones(criticLayerSizes(1),1))
    reluLayer
    fullyConnectedLayer(criticLayerSizes(2), ...
        Weights=sqrt(2/criticLayerSizes(1))* ...
        (rand(criticLayerSizes(2),criticLayerSizes(1))-0.5), ...
        Bias=1e-3*ones(criticLayerSizes(2),1))
    reluLayer
    fullyConnectedLayer(1, ...
        Weights=sqrt(2/criticLayerSizes(2))* ...
        (rand(1,criticLayerSizes(2))-0.5), ...
        Bias=1e-3)
    ];
criticNetwork = dlnetwork(criticNetwork);
summary(criticNetwork)
critic = rlValueFunction(criticNetwork,obsInfo);
% actor:
% Input path layers
inPath = [
    featureInputLayer( ...
        prod(obsInfo.Dimension), ...
        Name="netOin")
    fullyConnectedLayer( ...
        prod(actInfo.Dimension), ...
        Name="infc")
    ];
% Path layers for mean value
meanPath = [
    tanhLayer(Name="tanhMean");
    fullyConnectedLayer(prod(actInfo.Dimension));
    scalingLayer(Name="scale", ...
        Scale=actInfo.UpperLimit)
    ];
% Path layers for standard deviations
% Using a softplus layer to keep them non-negative
sdevPath = [
    tanhLayer(Name="tanhStdv");
    fullyConnectedLayer(prod(actInfo.Dimension));
    softplusLayer(Name="splus")
    ];
net = dlnetwork();
net = addLayers(net,inPath);
net = addLayers(net,meanPath);
net = addLayers(net,sdevPath);
net = connectLayers(net,"infc","tanhMean/in");
net = connectLayers(net,"infc","tanhStdv/in");
plot(net)
net = initialize(net);
summary(net)
actor = rlContinuousGaussianActor(net, obsInfo, actInfo, ...
    ActionMeanOutputNames="scale", ...
    ActionStandardDeviationOutputNames="splus", ...
    ObservationInputNames="netOin");
actorOpts = rlOptimizerOptions(LearnRate=1e-4);
criticOpts = rlOptimizerOptions(LearnRate=1e-4);
agentOpts = rlPPOAgentOptions( ...
    ExperienceHorizon=600, ...
    ClipFactor=0.02, ...
    EntropyLossWeight=0.01, ...
    ActorOptimizerOptions=actorOpts, ...
    CriticOptimizerOptions=criticOpts, ...
    NumEpoch=3, ...
    AdvantageEstimateMethod="gae", ...
    GAEFactor=0.95, ...
    SampleTime=0.1, ...
    DiscountFactor=0.997);
agent = rlPPOAgent(actor,critic,agentOpts);
trainOpts = rlTrainingOptions( ...
    MaxEpisodes=20000, ...
    MaxStepsPerEpisode=600, ...
    Plots="training-progress", ...
    StopTrainingCriteria="AverageReward", ...
    StopTrainingValue=430, ...
    ScoreAveragingWindowLength=100);
trainingStats = train(agent, simEnv, trainOpts);
Thanks in advance!

Answers (2)

arushi on 22 Aug 2024
Edited: arushi on 22 Aug 2024
Hi Lloyd,
Some potential reasons why your training might be hitting a plateau and not improving further:
Q0 and Learning Plateau:
  • Q0 is the critic's estimate of the discounted long-term reward computed from the initial observation of each episode. If it is not progressing past a certain point, that usually points to insufficient exploration or suboptimal hyperparameters.
Exploration vs. Exploitation:
  • Ensure your agent is exploring adequately. The entropy loss weight (EntropyLossWeight) in PPO helps encourage exploration by adding randomness to the policy. You might try increasing this value slightly to see if it helps the agent explore more diverse actions.
Learning Rates:
  • The learning rates for both the actor and critic (LearnRate=1e-4) might be too low or too high. Experiment with different learning rates, such as 1e-3 or 5e-5, to see if the agent's performance improves.
Clip Factor:
  • The clip factor (ClipFactor=0.02) controls how much the policy is allowed to change at each update. If it is too restrictive, the agent might not learn effectively. Try increasing it to 0.1 or 0.2; a rough sketch of these hyperparameter tweaks follows this list.
Reward Function:
  • Ensure your reward function is well-designed and provides sufficient feedback for the agent to learn effectively. If the reward is sparse or doesn't align well with the task objectives, the agent may struggle to learn.
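For illustration, here is a minimal sketch of your agent options with these kinds of adjustments applied. The specific numbers are just starting points to experiment with, not verified settings for your model; everything else is kept as in your script.
% Hypothetical re-tuned hyperparameters -- treat these values as starting points
actorOpts = rlOptimizerOptions(LearnRate=3e-4);   % try values between 1e-3 and 5e-5
criticOpts = rlOptimizerOptions(LearnRate=1e-3);  % e.g. a somewhat higher rate for the critic
agentOpts = rlPPOAgentOptions( ...
    ExperienceHorizon=600, ...
    ClipFactor=0.1, ...              % less restrictive policy updates than 0.02
    EntropyLossWeight=0.02, ...      % slightly more exploration than 0.01
    ActorOptimizerOptions=actorOpts, ...
    CriticOptimizerOptions=criticOpts, ...
    NumEpoch=3, ...
    AdvantageEstimateMethod="gae", ...
    GAEFactor=0.95, ...
    SampleTime=0.1, ...
    DiscountFactor=0.997);
agent = rlPPOAgent(actor,critic,agentOpts);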
Hope this helps.

Kaustab Pal on 22 Aug 2024
Hi @Lloyd,
The yellow line, Q0, in the plot represents the estimate of the discounted long-term reward at the start of each episode, based on the initial observation of the environment. Ideally, as training progresses and if the critic is well-designed and learning effectively, the average Q0 should converge towards the actual discounted long-term reward (depicted by the dark-blue line).
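If you want to inspect that estimate yourself outside the Episode Manager, here is a minimal sketch, assuming the trained agent is still in the workspace and using a placeholder 7x1 initial observation obs0:
critic = getCritic(agent);    % extract the critic from the trained agent
obs0 = zeros(7,1);            % placeholder -- use your environment's actual initial observation
q0 = getValue(critic,{obs0})  % critic's estimate of the discounted return from obs0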
In your case, it seems that around episode 2000, Q0 ceases to improve, indicating that the critic may have stopped learning. This is a common challenge in reinforcement learning. Here are a few suggestions to address this:
  1. Reward function: Ensure that your reward function effectively guides the agent towards the desired behavior. Consider normalizing the rewards before training your agent.
  2. Hyperparameter tuning: Experiment with different values for hyperparameters such as the learning rate, clip factor, and entropy loss weight.
  3. You might want to add more layers to your critic network to enhance its capacity to learn complex information (a rough sketch follows this list). However, be cautious of overfitting when adding too many layers.
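As a rough illustration of point 3, a deeper critic might look like the sketch below. The layer sizes are arbitrary examples and the default weight initialization is used for brevity; it is not a verified architecture for this model.
% Hypothetical deeper critic -- layer sizes are arbitrary examples
criticLayerSizes = [400 300 200];
criticNetwork = [
    featureInputLayer(prod(obsInfo.Dimension))
    fullyConnectedLayer(criticLayerSizes(1))
    reluLayer
    fullyConnectedLayer(criticLayerSizes(2))
    reluLayer
    fullyConnectedLayer(criticLayerSizes(3))
    reluLayer
    fullyConnectedLayer(1)
    ];
criticNetwork = dlnetwork(criticNetwork);
critic = rlValueFunction(criticNetwork,obsInfo);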
For more information, you can refer to the Reinforcement Learning Toolbox documentation.
Hope this is helpful.
