How can I interpret an oscillating average reward plot in the RL training process?

Question

awcii on 2 Jul 2023

Direct link to this question

https://it.mathworks.com/matlabcentral/answers/1990818-how-can-i-interpret-an-oscillating-average-reward-graphic-in-rl-trainig-process

Answered: awcii on 23 Jul 2023

Hi all,

I have tried to train a DDPG agent for a reference-tracking problem whose environment is in Simulink, but the agent cannot track the reference. I have changed the hyperparameters and tried many times; however, in most attempts the average reward curve oscillates around or below the Episode Q0 curve, as in the following plots (the first one is for NoiseOptions.Variance = 0.1, the second for 0.05, and the last for 0.01).

Meanwhile, I have tried many different reward functions and observations. Almost all of them have the same problem.

I am sharing the most important parts of my code.

For a well-trained agent, is it required that the average reward curve follows the Episode Q0?

How can the given training plots be interpreted? What should I change in my RL algorithm?

Thanks for any help.

obsInfo = rlNumericSpec([5 1],...
    'LowerLimit',[-inf -inf 0 -inf 0]',...
    'UpperLimit',[ inf  inf inf inf inf]');
obsInfo.Name = 'observations';
obsInfo.Description = 'integrated error, error, and measured height';
numObservations = obsInfo.Dimension(1);
actInfo = rlNumericSpec([1 1]);
actInfo.Name = 'flow';
numActions = actInfo.Dimension(1);
env = rlSimulinkEnv('sz_rlforward','sz_rlforward/RL Agent',...
    obsInfo,actInfo);
statePath = [
    featureInputLayer(numObservations,Normalization='none', Name='State') % 'rescale-symmetric' = range [-1, 1] or 'rescale-zero-one'
    fullyConnectedLayer(50,Name='CriticStateFC1')
    reluLayer %('Name','CriticRelu1')
    fullyConnectedLayer(25, Name='CriticStateFC2')];
actionPath = [
    featureInputLayer(numActions,Normalization='none', Name='Action1')
    fullyConnectedLayer(25,Name='CriticActionFC1')];
commonPath = [
    additionLayer(2,Name='add')
    reluLayer %('Name','CriticCommonRelu')
    fullyConnectedLayer(1,Name='CriticOutput')];
criticNetwork = layerGraph();
criticNetwork = addLayers(criticNetwork,statePath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);
criticNetwork = connectLayers(criticNetwork,'CriticStateFC2','add/in1');
criticNetwork = connectLayers(criticNetwork,'CriticActionFC1','add/in2');
criticNetwork = dlnetwork(criticNetwork);
critic = rlQValueFunction(criticNetwork,obsInfo,actInfo,...
    ObservationInputNames="State",ActionInputNames="Action1");
actorNetwork = [
    featureInputLayer(numObservations,Normalization='none',Name='State')
    fullyConnectedLayer(5, Name='actorFC')
    reluLayer 
    fullyConnectedLayer(50)
    reluLayer
    fullyConnectedLayer(numActions)
    sigmoidLayer
    scalingLayer(Scale=0.5,Bias=0.5) % I need an action value in the range 0-1
    ];
actorNetwork = dlnetwork(actorNetwork);
actor = rlContinuousDeterministicActor(actorNetwork, ...
    obsInfo,actInfo);
criticOptions = rlOptimizerOptions( ...
    LearnRate=1e-3, ...
    GradientThreshold=1, ...
    L2RegularizationFactor=1e-4);
actorOptions = rlOptimizerOptions( ...
    LearnRate=1e-4, ...
    GradientThreshold=1, ...
    L2RegularizationFactor=1e-4);
agentOptions = rlDDPGAgentOptions(...
    SampleTime=Ts,...
    ActorOptimizerOptions=actorOptions,...
    CriticOptimizerOptions=criticOptions,...
    MiniBatchSize=128, ...
    DiscountFactor=0.95, ...
    ExperienceBufferLength=1e6);
agentOptions.NoiseOptions.Variance = 0.05;
agentOptions.NoiseOptions.VarianceDecayRate = 1e-5;
agent = rlDDPGAgent(actor,critic,agentOptions)
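Since the plots compare several noise variances, it may help to see how slowly the exploration noise shrinks with the decay rate used above. The sketch below assumes that the variance is multiplied by (1 - VarianceDecayRate) at every agent step, which is how this option is usually described; the training budget of 1e5 steps is a hypothetical number chosen only for illustration.

```matlab
% Values taken from the question's agent options.
variance0 = 0.05;    % NoiseOptions.Variance
decayRate = 1e-5;    % NoiseOptions.VarianceDecayRate

% Assumed decay rule: variance after k agent steps = variance0 * (1 - decayRate)^k.
% Number of steps until the variance has dropped to half its initial value.
halfLifeSteps = log(0.5) / log(1 - decayRate);
varianceAtHalfLife = variance0 * (1 - decayRate)^halfLifeSteps;

% Variance remaining after a hypothetical training budget (not from the thread).
budgetSteps = 1e5;
varianceAtBudget = variance0 * (1 - decayRate)^budgetSteps;

fprintf('Half-life: about %.0f agent steps\n', halfLifeSteps);
fprintf('Variance after %d steps: %.4f\n', budgetSteps, varianceAtBudget);
```

Under this assumption the half-life is on the order of tens of thousands of agent steps, so whether the noise has decayed noticeably by the end of training depends on how many steps the training run actually contains.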


Answer 1

awcii on 18 Jul 2023

Direct link to this answer

https://it.mathworks.com/matlabcentral/answers/1990818-how-can-i-interpret-an-oscillating-average-reward-graphic-in-rl-trainig-process#answer_1274823

@Emmanouil Tzorakoleftherakis, can you help me with this problem? I would like to hear your opinion.

Thanks a lot.

2 Comments

Emmanouil Tzorakoleftherakis on 18 Jul 2023

A trained agent does not necessarily need to have overlapping Q0 and reward values. It could be the case that the actor converges faster than the critic in which case it's totally ok to stop the training process early.

The last graph you shared seems promising. How does the trained agent perform in that case?

awcii on 19 Jul 2023

Yes, the shape of the last one is very similar to those in the literature. Moreover, I got a better one, shown below.

I got it by changing only the critic and actor learning rates to 1e-2 and 1e-3, respectively. I have tried many times, changing other hyperparameters while keeping these learning rates. However, the reward curve never got close to the Q0 value. There is always a negative offset of about -400 relative to Q0.
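One possible contributor to a persistent gap is that the two curves may not measure the same quantity. The sketch below assumes that Episode Q0 is the critic's estimate of the discounted long-term reward from the initial observation, while the episode reward is the plain, undiscounted sum of step rewards. The reward sequence is hypothetical (a constant -1 per step for 100 steps) and is not taken from this thread; only the discount factor 0.95 comes from the posted agent options.

```matlab
% Discount factor from the question's agent options.
gamma = 0.95;

% Hypothetical reward sequence: -1 per step for 100 steps (illustrative only).
rewards = -1 * ones(1,100);

% Undiscounted episode reward: the plain sum of step rewards.
episodeReward = sum(rewards);

% Discounted return from the first step: the quantity a well-fitted critic
% would be expected to report as Q0 under the assumption stated above.
discounts = gamma .^ (0:numel(rewards)-1);
discountedReturn = sum(discounts .* rewards);

fprintf('Undiscounted episode reward: %.2f\n', episodeReward);
fprintf('Discounted return:           %.2f\n', discountedReturn);
fprintf('Gap between the two:         %.2f\n', episodeReward - discountedReturn);
```

With a discount factor below 1 and negative step rewards, the undiscounted sum is more negative than the discounted return, so in this toy setting the reward curve sits below Q0 even when the critic is accurate. Whether this accounts for the offset observed here cannot be determined from the thread alone.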

Answer 2

awcii on 23 Jul 2023

Direct link to this answer

https://it.mathworks.com/matlabcentral/answers/1990818-how-can-i-interpret-an-oscillating-average-reward-graphic-in-rl-trainig-process#answer_1277528

@Emmanouil Tzorakoleftherakis, can this offset problem arise from the scaling of the action output?

I need an action that varies between 0 and 1. To do this, I used a sigmoidLayer as the final layer of the actor. But somehow, without scaling this layer, the action exceeds the 0-1 boundary.
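A quick numeric check of the output range may be useful here. The sketch below assumes that scalingLayer computes Scale.*x + Bias element-wise and reproduces that arithmetic in plain MATLAB without any toolbox calls; the variable names and the noise sample are illustrative. Under that assumption, a sigmoid followed by Scale=0.5, Bias=0.5 maps into (0.5, 1) rather than (0, 1), whereas the same scaling applied after tanh covers (0, 1).

```matlab
% Pre-activation values spanning a wide range.
x = linspace(-10,10,1001);

% Sigmoid alone already lies in (0, 1).
sig = 1 ./ (1 + exp(-x));

% Sigmoid followed by the assumed scaling 0.5*x + 0.5: range shrinks to (0.5, 1).
sigScaled = 0.5*sig + 0.5;

% tanh lies in (-1, 1); the same scaling maps it onto (0, 1).
tanhScaled = 0.5*tanh(x) + 0.5;

fprintf('sigmoid:          [%.4f, %.4f]\n', min(sig), max(sig));
fprintf('sigmoid + scale:  [%.4f, %.4f]\n', min(sigScaled), max(sigScaled));
fprintf('tanh + scale:     [%.4f, %.4f]\n', min(tanhScaled), max(tanhScaled));

% Hypothetical exploration noise added to an in-range action can leave [0, 1];
% clipping brings it back.
noisyAction = sig(end) + 0.3;              % 0.3 is an illustrative noise sample
clippedAction = min(max(noisyAction,0),1);
```

If the action still leaves [0, 1] during training even though the actor output is bounded, one possible cause is exploration noise added on top of the actor output. Clipping the action inside the environment, for example with a Saturation block in the Simulink model, would be one way to guard against that.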
