I want my agent to output a target value, but in certain situations (for example, when the reward drops dramatically) I want the agent to look for a better solution by letting it change the target value. I tried using an initial condition block so that the target value is used at the start. However, my agent (PPO) always outputs an average value after some training episodes.

Is it possible to change RL action values under certain conditions?

Emmanouil Tzorakoleftherakis on 18 May 2021

Can you provide some more information? What do you mean by letting the agent change the target value? Isn't that what happens by default every time the agent takes an action? What is the environment architecture?

black_cat on 18 May 2021

My target value is, for example, 1. The agent can output values between 1 and 6. If it outputs a value other than 1, it is punished with the difference multiplied by a penalty term, so the further the actual value is from the target value, the greater the punishment. As expected, this leads the agent to always output the target value (i.e. 1) after training. However, there are some time steps within an episode where it would make sense to change that value (to 6, for example) in order to maximize the overall reward. When I allow for this, the agent ends up outputting 3, since it averages the two values during training. I'm wondering whether the agent could output 1, then 6 for some time steps, and then 1 again.
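The penalty described above can be sketched as follows. This is only an illustration in Python, not the poster's actual reward block: the function name `shaped_reward`, the parameter `penalty_weight`, and the use of the absolute difference are all assumptions made for the example.

```python
def shaped_reward(action: float, target: float, penalty_weight: float = 1.0) -> float:
    """Hypothetical reward: zero at the target, more negative the further away.

    Assumes the "difference" in the description is the absolute difference
    between the agent's output and the target value.
    """
    return -penalty_weight * abs(action - target)


# With target 1 and outputs in the range 1..6, the penalty grows linearly.
rewards = {action: shaped_reward(action, target=1, penalty_weight=2.0)
           for action in range(1, 7)}
```

Under these assumptions, an output of 1 receives no penalty and an output of 6 receives the largest one, which matches the behavior described: the agent settles on the target value unless something else in the reward makes a different output worthwhile.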

Emmanouil Tzorakoleftherakis on 19 May 2021

Thanks. It's still not clear to me what you mean by "However, this results in having an output of 3 since the agent is averaging it during training". If it's best to output a 6, the agent should do so. Why would it average the output? Or are you talking about the average episode reward that you see in the episode manager?

black_cat on 20 May 2021

Edited: black_cat on 20 May 2021

I've tried to create a minimal version that illustrates my problem. Here, I'm outputting numbers from 1 to 3. I hope it's more understandable that way.

black_cat on 20 May 2021

Edited: black_cat on 20 May 2021

Okay, even though the attached example is supposed to be easy to understand, I think I'm able to put my problem in simple terms now:

I'm training my agent to output 3 discrete values (1, 2, 3).
I punish it for not outputting my target value.
My target value is 1 for 50% of the time and 3 for the other 50% of the time.

Once training is done (regardless of the agent type; they all behave the same in this case), the agent outputs either 1 or 3, and it does so 100% of the time. It never switches between output values; it settles on a single one. This is my problem.
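One possible reading of this behavior, offered only as a sketch and not as a diagnosis of the attached model, is that the agent's observation carries no signal about which target is currently active. In that case the best it can do is pick one fixed output. The Python snippet below works through the arithmetic for the setup above; the unit penalty for a mismatch and all names (`TARGETS`, `mismatch_penalty`, and so on) are assumptions made for illustration.

```python
# Fraction of time steps on which each target value is active (from the post).
TARGETS = {1: 0.5, 3: 0.5}
ACTIONS = (1, 2, 3)


def mismatch_penalty(action: int, target: int) -> float:
    """Assumed penalty: 1 for any output that is not the target, else 0."""
    return 0.0 if action == target else 1.0


def expected_penalty_constant(action: int) -> float:
    """Expected penalty of a policy that always outputs the same action."""
    return sum(prob * mismatch_penalty(action, target)
               for target, prob in TARGETS.items())


def expected_penalty_target_aware() -> float:
    """Expected penalty of a policy that can see the active target."""
    return sum(prob * min(mismatch_penalty(a, target) for a in ACTIONS)
               for target, prob in TARGETS.items())


constant_policies = {a: expected_penalty_constant(a) for a in ACTIONS}
target_aware = expected_penalty_target_aware()
```

Under these assumptions, always outputting 1 or always outputting 3 each costs 0.5 per step on average, always outputting 2 costs 1.0, and only a policy that can tell which target is active reaches 0. A trained agent that sticks to 1 or 3 would be consistent with the first case, so it may be worth checking whether the observation includes something that distinguishes the two situations.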
