Reinforcement Learning agent converges to a suboptimal policy

Hello
I am trying to train an agent on a multi-period optimal capacity planning problem. The system has two stochastic but Markovian uncertainties, plus a third state, which is the capacity. The benchmark is a single-period planning problem, which I have already solved with MINLP optimization.
I have spent many weeks trying different agents, but so far I have not succeeded in getting the agent to learn correctly.
In the graph below (actor-critic agent) you can see that, although learning appears to take place, the converged value is suboptimal (lower than the single-period optimization value).
One of the uncertainties is demand. In theory, the agent should increase the capacity based on the demand it observes in the state. However, at convergence it does not do this properly.
Note that although I have defined the actions as discrete, not all actions are feasible. To compensate for this, I have clipped the actions as follows:
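if TIME_P < DEPLOY_T
    % Cap any action that would exceed the remaining capacity headroom.
    % If OPTION TO ABANDON is added, then [-CAP_UPPER+INS_CAP:5:CAP_UPPER-INS_CAP]
    Action(Action>1-INS_CAP) = 1-INS_CAP;
else
    % TIME_P >= DEPLOY_T: capacity can no longer be changed
    Action = 0;
end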
Here, DEPLOY_T is the number of years the capacity planning actions can be exercised. The time-step continues to TERMINAL_P to account for more future cash flows.
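A side effect of this kind of clipping is that several distinct discrete actions can end up having the same effect on the environment. The sketch below counts how many actions collapse together. All numeric values and the action set are hypothetical (they are not from my actual model), and it assumes capacity is normalized so that 1-INS_CAP is the remaining headroom.

```matlab
% Hypothetical values, for illustration only
INS_CAP   = 0.5;          % installed capacity (assumed normalized to [0,1])
TIME_P    = 3;            % current period
DEPLOY_T  = 10;           % number of years in which capacity can be changed
actionSet = 0:0.25:1;     % hypothetical discrete action set

% Apply the same clipping rule to every action in the set
if TIME_P < DEPLOY_T
    maxAdd    = 1 - INS_CAP;              % same bound as in the rule above
    effective = min(actionSet, maxAdd);   % actions above the bound are capped
else
    effective = zeros(size(actionSet));   % no capacity change allowed
end

% Count how many different effects the agent can actually produce
nDistinct = numel(unique(effective));
fprintf('%d raw actions map to %d distinct effective actions\n', ...
    numel(actionSet), nDistinct);
```

When the two counts differ, the agent receives identical transitions and rewards for different action indices, which may be worth keeping in mind when interpreting the learned policy.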
I was wondering if anyone has any tips (@Emmanouil Tzorakoleftherakis's answers on this forum have been particularly helpful, but no luck for me so far) or could possibly look at the code for me.

Answers (1)

Emmanouil Tzorakoleftherakis
Hello,
In your question you mention a graph, but it does not seem to have been attached.
It sounds like the agent you trained has converged to a suboptimal solution. If that's the case, you probably need to tweak your reward a bit (make sure it is equivalent to your benchmark problem) and make sure the agent keeps exploring throughout training. Starting simple with a DQN agent would help. The EpsilonDecay and EpsilonMin values are important for exploration (see here). You may also want to randomize the initial condition of your environment. That could help bypass the local solution you converged to.
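To get a feel for how quickly exploration dies out, here is a minimal sketch. It assumes the epsilon-greedy update is Epsilon = Epsilon*(1-EpsilonDecay), applied once per training step while Epsilon is above EpsilonMin. The numeric values and the variable names are only for illustration, so substitute the ones from your own agent options.

```matlab
% Assumed values for illustration; replace with your own agent options
epsilon0     = 1;       % initial exploration probability
epsilonMin   = 0.01;    % exploration floor
epsilonDecay = 0.005;   % decay rate per training step

% Number of steps until epsilon reaches the floor, assuming the update
% epsilon <- epsilon*(1 - epsilonDecay) at every step
nSteps = ceil(log(epsilonMin/epsilon0) / log(1 - epsilonDecay));

% Decay rate that would keep the agent exploring for a chosen horizon
targetSteps = 50000;    % hypothetical number of steps to keep exploring
neededDecay = 1 - (epsilonMin/epsilon0)^(1/targetSteps);

fprintf('Epsilon reaches its minimum after about %d steps\n', nSteps);
fprintf('Decay rate for %d steps of exploration: %.3g\n', ...
    targetSteps, neededDecay);
```

If nSteps is small compared to the total number of steps in your training run, the agent is acting almost greedily for most of training, which makes it easier to get stuck in a local solution.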


Release

R2022b