TD3 algorithm: actions always output boundary values during training

泽宇 on 29 Feb 2024
Commented: 泽宇 on 23 Apr 2024
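After training with the TD3 algorithm, the action output is always a boundary value or completely wrong, regardless of whether the reward curve converged during training. My state values range from 0 to 20000 and the action bounds are 0 to 15000. What is going wrong? Is my custom environment set up incorrectly, or is it something else? Do I need to normalize the inputs and outputs?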

Answers (1)

UDAYA PEDDIRAJU on 14 Mar 2024
Hi 泽宇,
Regarding your issue with the TD3 algorithm, where the actions always sit at the boundary values regardless of whether the reward curve converges, there are a few potential factors worth investigating:
  1. Action Bounds: Ensure that the action bounds are correctly defined. If the boundaries are too restrictive, the agent might struggle to learn effective actions.
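  2. Normalization: Normalizing the inputs and outputs can significantly improve training stability. With states up to 20000 and actions up to 15000, consider scaling both state and action values to a common range (e.g., [0, 1]) so the networks do not work with very large raw values.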
  3. Custom Environment: Verify that your custom environment is correctly implemented. Double-check the reward function, state representation, and action space.
  4. Exploration Noise: TD3 adds noise to the selected actions in order to explore. Ensure that the noise level during training is appropriate relative to the size of the action range.
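
As a rough illustration of the normalization point above, here is a minimal MATLAB sketch. It assumes the ranges quoted in the question (state 0 to 20000, action 0 to 15000) and a setup in which the agent works in [0, 1] while the environment converts back to physical units. The names `normalizeState` and `denormalizeAction` are hypothetical helpers, not toolbox functions.

```matlab
% Ranges taken from the question; adjust to the actual environment.
stateMin = 0;  stateMax = 20000;
actMin   = 0;  actMax   = 15000;

% Hypothetical helper: map a raw state into [0, 1] before it reaches the agent.
normalizeState = @(s) (s - stateMin) ./ (stateMax - stateMin);

% Hypothetical helper: clip a normalized action to [0, 1], then map it
% back to physical units inside the step function.
denormalizeAction = @(a) actMin + min(max(a, 0), 1) .* (actMax - actMin);

sNorm = normalizeState(5000);      % 0.25
aPhys = denormalizeAction(0.5);    % 7500
```

With this arrangement the observation and action specifications given to the agent would use limits of 0 and 1, and only the environment would see the raw values.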
  1 Comment
泽宇 on 23 Apr 2024
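Thank you very much for your answer. My problem is still unsolved. I am using the MATLAB Reinforcement Learning Toolbox to train an agent in a custom environment. In the first training round (before any reward has been received), the agent outputs an action inside the action bounds. In the second round (after receiving the reward from the first), however, the agent outputs a boundary value of the action range, and the same happens from the second round through the n-th. Why is this? Here is my simplified code; could you help me find where the problem is?
function [Observation,Reward,IsDone,NextState] = newgoushi(Action,State)
    E = State;                  % current state
    %% Reward
    GT = 1000*Action(1);        % action scaled by 1000
    NextState = E - GT;
    if GT - E < 0.1
        Reward = 0;
    else
        Reward = -1;
    end
    IsDone = Reward >= 0;       % episode ends whenever the reward is 0
    Observation = NextState;
end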
My action is continuous, bounded between 0 and 12000, and my state is also continuous, bounded between 5000 and 10000.
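
One way to check the reward logic in the posted function is to sweep a few actions across the stated range and look at which ones earn a reward of 0. The sketch below copies the reward rule from the step function as posted. The chosen state (8000) and the sample actions are arbitrary illustrative values, not taken from the original post.

```matlab
state   = 8000;                          % any value in the stated 5000-10000 range
actions = [0 1 5 8 9 100 6000 12000];    % sample points across the stated 0-12000 range

rewards = zeros(size(actions));
for k = 1:numel(actions)
    GT = 1000 * actions(k);              % same scaling as the posted code
    if GT - state < 0.1                  % same one-sided test as the posted code
        rewards(k) = 0;
    else
        rewards(k) = -1;
    end
end

disp([actions; rewards]);                % row 1: actions, row 2: rewards
```

For this state, only actions of 8 or below receive a reward of 0 (and end the episode). Every larger action receives -1. Because the test is one-sided and the action is multiplied by 1000, the rewarded region lies at the very bottom of the 0 to 12000 range, so an agent that settles on the lower bound would be consistent with the reward as written. If the intent was to reward `GT` being close to `E`, which is only an assumption here, the condition would need something like `abs(GT - E) < 0.1`, and the factor of 1000 would need to match the action range.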


Release

R2023a