DDPG: Actor clips outputs to zero, thus keeping exploration minimal
I'm training a DDPG agent from the Reinforcement Learning Toolbox to adjust a PI controller, so the agent should output the gains P and I. After some initial learning episodes (roughly 10 to 50) with high values for both P and I, both outputs decrease to zero.
This is followed by either of two cases, switching from time to time:
- Both output values stay at zero (marked green in the following picture).
- Output I stays at zero while P takes a very low value (marked purple in the following picture).

The actor is structured as follows:
actorNetwork = [
    featureInputLayer(20, 'Normalization', 'none', 'Name', 'state vector')
    fullyConnectedLayer(20, 'Name', 'fc1')
    reluLayer('Name', 'relu1')
    fullyConnectedLayer(256, 'Name', 'fc2')
    reluLayer('Name', 'relu2')
    fullyConnectedLayer(2, 'Name', 'fc3')
    tanhLayer('Name', 'output')];
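Since the tanh output layer bounds the raw action to [-1, 1], the network output has to be mapped onto the actual gain range at some point. Below is a minimal sketch of one way this could look, assuming the limits implied further down (roughly 0 to 89.43 per gain) and a 20-element observation; obsInfo, actInfo and the scaling values are assumptions, not taken from discreteSys_Script_05.m:
% Observation and action specifications (limits are an assumption derived
% from "1% of the action range may be 0.8943"):
obsInfo = rlNumericSpec([20 1]);
actInfo = rlNumericSpec([2 1], 'LowerLimit', [0; 0], 'UpperLimit', [89.43; 89.43]);

% Stretch tanh's [-1, 1] output onto the assumed gain range [0, 89.43]:
actorNetwork = [
    actorNetwork
    scalingLayer('Name', 'action', 'Scale', 44.715, 'Bias', 44.715)];

actor = rlDeterministicActorRepresentation(actorNetwork, obsInfo, actInfo, ...
    'Observation', {'state vector'}, 'Action', {'action'});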
The PI controller is used to control a transfer function while a timed disturbance occurs. The disturbance is always identical.
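For context, a rough stand-in for that closed loop is sketched below; the plant G, the reference, the disturbance shape and its timing are all placeholders (the real ones are in discreteSys_Script_05.m), as are the example gains:
s  = tf('s');
G  = 1/(0.5*s^2 + s + 1);              % assumed plant transfer function
Kp = 10;  Ki = 5;                      % example gains chosen by the agent
C  = pid(Kp, Ki);                      % PI controller

t     = (0:0.001:5)';
n_ref = ones(size(t));                 % assumed constant speed reference
d     = -0.2*(t >= 2);                 % assumed timed, always-identical disturbance

% Actual speed: reference tracking plus the disturbance entering at the plant input
n_act = lsim(feedback(C*G, 1), n_ref, t) + lsim(feedback(G, C), d, t);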
The fitness function used is the IAE value of the speed error:
IAE = ∫ |n_ref - n_act| dt
The reward is then calculated with this formula:
r = r1*(2*exp(r2*I/In)-r3) + p;
where r1, r2, r3 are constants, I is the DDPG agent's IAE value, In is the IAE value of the reference system, and p is a punishment that is capped to [-15, 0]:
p = -max(|n_ref-n_act|²) * p1;
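Put into code, the two formulas read roughly as follows; the constants and the reference IAE In are placeholders, and t, n_ref, n_act are assumed to hold the time vector, the speed reference and the simulated actual speed of one episode:
r1 = 10;  r2 = -1;  r3 = 1;  p1 = 50;   % assumed constants
In = 0.8;                               % assumed IAE of the reference system

e = n_ref - n_act;                      % speed error
I = trapz(t, abs(e));                   % IAE of the agent-tuned loop

p = -max(e.^2) * p1;                    % punishment ...
p = min(max(p, -15), 0);                % ... capped to [-15, 0]

r = r1*(2*exp(r2*I/In)-r3) + p;         % episode reward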
What I have done so far:
- tried to recreate a paper's solution
  - the agent takes one action per episode, when the disturbance is detected
  - copied the transfer function, network sizes, observation and all options (critic, actor, DDPG agent, training)
  - added a flexible punishment (so the system does not oscillate)
- adjusted the range of the punishment to the range of the reward
- changed the gradient threshold from Inf to 1
- set lower and upper limits within actionInfo
- set the noise StandardDeviation to different values (see the sketch after this list)
  - currently 0.1
  - 1% of the action range would be 0.8943 and 10% would correspond to 8.943
  - with StandardDeviation = 0.8943, I stays at zero; P explores a bit and then stays at its maximum value
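For reference, here is a minimal sketch of where the last three settings would go in the toolbox API; apart from the values quoted above, everything here (including the omitted critic, actor and environment) is an assumption:
% Gradient threshold 1 instead of Inf for the actor and critic representations:
repOpts = rlRepresentationOptions('GradientThreshold', 1);

% Exploration noise of the DDPG agent (0.8943 would be 1% of the action
% range, 8.943 would be 10%):
agentOpts = rlDDPGAgentOptions();
agentOpts.NoiseOptions.StandardDeviation = 0.1;   % current value

% The lower/upper action limits belong in the rlNumericSpec used as actInfo.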
Many thanks in advance!
discreteSys_Script_05.m is the main script.