This example shows how to train a custom linear quadratic regulation (LQR) agent to control a discrete-time linear system modeled in MATLAB®.

The reinforcement learning environment for this example is a discrete-time linear system. The dynamics for the system are given by

$${x}_{t+1}=A{x}_{t}+B{u}_{t}$$

The feedback control law is

$${u}_{t}=-K{x}_{t}$$

The control objective is to minimize the quadratic cost: $\mathit{J}={\sum}_{\mathit{t}=0}^{\infty}({{\mathit{x}}_{\mathit{t}}}^{\prime}{\mathit{Qx}}_{\mathit{t}}+{{\mathit{u}}_{\mathit{t}}}^{\prime}{\mathit{Ru}}_{\mathit{t}})$.

In this example, the system matrices are

$$\begin{array}{l}\mathit{A}=\left[\begin{array}{ccc}1.05& 0.05& 0.05\\ 0.05& 1.05& 0.05\\ 0& 0.05& 1.05\end{array}\right]\\ \mathit{B}=\left[\begin{array}{ccc}0.1& 0& 0.2\\ 0.1& 0.5& 0\\ 0& 0& 0.5\end{array}\right]\end{array}$$

A = [1.05,0.05,0.05;0.05,1.05,0.05;0,0.05,1.05]; B = [0.1,0,0.2;0.1,0.5,0;0,0,0.5];

The quadratic cost matrices are:

$$\begin{array}{l}\mathit{Q}=\left[\begin{array}{ccc}10& 3& 1\\ 3& 5& 4\\ 1& 4& 9\end{array}\right]\\ \mathit{R}=\left[\begin{array}{ccc}0.5& 0& 0\\ 0& 0.5& 0\\ 0& 0& 0.5\end{array}\right]\end{array}$$

Q = [10,3,1;3,5,4;1,4,9]; R = 0.5*eye(3);

For this environment, the reward at time $\mathit{t}$ is given by $${r}_{t}=-{x}_{t}^{\prime}Q{x}_{t}-{u}_{t}^{\prime}R{u}_{t}$$, which is the negative of the quadratic cost. Therefore, maximizing the reward minimizes the cost. The initial conditions are set randomly by the reset function.

Create the MATLAB environment interface for this linear system and reward. The `myDiscreteEnv`

function creates an environment by defining custom `step`

and `reset`

functions. For more information on creating such a custom environment, see Create MATLAB Environment Using Custom Functions.

env = myDiscreteEnv(A,B,Q,R);

Fix the random generator seed for reproducibility.

rng(0)

For the LQR problem, the Q-function for a given control gain $\mathit{K}$ can be defined as ${\mathit{Q}}_{\mathit{K}}\left(\mathit{x},\mathit{u}\right)={\left[\begin{array}{c}\mathit{x}\\ \mathit{u}\end{array}\right]}^{\prime}{\mathit{H}}_{\mathit{K}}\left[\begin{array}{c}\mathit{x}\\ \mathit{u}\end{array}\right]$, where ${\mathit{H}}_{\mathit{K}}=\text{\hspace{0.17em}}\left[\begin{array}{cc}{\mathit{H}}_{\mathrm{xx}}& {\mathit{H}}_{\mathrm{xu}}\\ {\mathit{H}}_{\mathrm{ux}}& {\mathit{H}}_{\mathrm{uu}}\end{array}\right]$ is a symmetric, positive definite matrix.

The control law to maximize ${\mathit{Q}}_{\mathit{K}}$ is $\mathit{u}=-{\left({\mathit{H}}_{\mathrm{uu}}\right)}^{-1}{\mathit{H}}_{\mathrm{ux}}\text{\hspace{0.17em}}\mathit{x}$, and the feedback gain is ${\mathit{K}=\left({\mathit{H}}_{\mathrm{uu}}\right)}^{-1}{\mathit{H}}_{\mathrm{ux}}\text{\hspace{0.17em}}$.

The matrix ${\mathit{H}}_{\mathit{K}}$ contains $\mathit{m}=\frac{1}{2}\mathit{n}\left(\mathit{n}+1\right)$ distinct element values, where $\mathit{n}$ is the sum of the number of states and number of inputs. Denote $\theta $ as the vector corresponding to these $\mathit{m}$ elements, where the off-diagonal elements in ${\mathit{H}}_{\mathit{K}}$ are multiplied by two.

Represent the Q-function by $\theta $, where $\theta $ contains the parameters to be learned.

${\mathit{Q}}_{\mathit{K}}\left(\mathit{x},\mathit{u}\right)={\theta}^{\prime}\left(\mathit{K}\right)\varphi \left(\mathit{x},\mathit{u}\right)$, where $\varphi \left(\mathit{x},\mathit{u}\right)$ is the quadratic basis function in terms of $\mathit{x}$ and $\mathit{u}$.

The LQR agent starts with a stabilizing controller ${\mathit{K}}_{0}$. To get an initial stabilizing controller, place the poles of the closed-loop system $\mathit{A}-{\mathit{BK}}_{0}$ inside the unit circle.

K0 = place(A,B,[0.4,0.8,0.5]);

To create a custom agent, you must create a subclass of the `rl.agent.CustomAgent`

abstract class. For the custom LQR agent, the defined custom subclass is `LQRCustomAgent`

. For more information, see Custom Agents. Create the custom LQR agent using $\mathit{Q}$, $\mathit{R}$, and ${\mathit{K}}_{0}$. The agent does not require information on the system matrices $\mathit{A}$ and $\mathit{B}$.

agent = LQRCustomAgent(Q,R,K0);

For this example, set the agent discount factor to one. To use a discounted future reward, set the discount factor to a value less than one.

agent.Gamma = 1;

Because the linear system has three states and three inputs, the total number of learnable parameters is $\mathit{m}=21$. To ensure satisfactory performance of the agent, set the number of parameter estimates ${\mathit{N}}_{\mathit{p}}$ to be greater than twice the number of learnable parameters. In this example, the value is ${\mathit{N}}_{\mathit{p}}=45$.

agent.EstimateNum = 45;

To get good estimation results for $\theta $, you must apply a persistently excited exploration model to the system. In this example, envourage model exploration by adding white noise to the controller output: ${\mathit{u}}_{\mathit{t}}=-{\mathit{Kx}}_{\mathit{t}}+{\mathit{e}}_{\mathit{t}}$. In general, the exploration model depends on the system models.

To train the agent, first specify the training options. For this example, use the following options.

Run each training episode for at most

`10`

episodes, with each episode lasting at most`50`

time steps.Display the command line display (set the

`Verbose`

option) and disable the training progress in the Episode Manager dialog box (set the`Plots`

option).

For more information, see `rlTrainingOptions`

.

trainingOpts = rlTrainingOptions(... 'MaxEpisodes',10, ... 'MaxStepsPerEpisode',50, ... 'Verbose',true, ... 'Plots','none');

Train the agent using the `train`

function.

trainingStats = train(agent,env,trainingOpts);

Episode: 1/ 10 | Episode Reward : -55.16 | Episode Steps: 50 | Avg Reward : -55.16 | Step Count : 50 Episode: 2/ 10 | Episode Reward : -12.52 | Episode Steps: 50 | Avg Reward : -33.84 | Step Count : 100 Episode: 3/ 10 | Episode Reward : -15.59 | Episode Steps: 50 | Avg Reward : -27.76 | Step Count : 150 Episode: 4/ 10 | Episode Reward : -22.22 | Episode Steps: 50 | Avg Reward : -26.37 | Step Count : 200 Episode: 5/ 10 | Episode Reward : -14.32 | Episode Steps: 50 | Avg Reward : -23.96 | Step Count : 250 Episode: 6/ 10 | Episode Reward : -19.23 | Episode Steps: 50 | Avg Reward : -16.78 | Step Count : 300 Episode: 7/ 10 | Episode Reward : -34.14 | Episode Steps: 50 | Avg Reward : -21.10 | Step Count : 350 Episode: 8/ 10 | Episode Reward : -13.95 | Episode Steps: 50 | Avg Reward : -20.77 | Step Count : 400 Episode: 9/ 10 | Episode Reward : -36.01 | Episode Steps: 50 | Avg Reward : -23.53 | Step Count : 450 Episode: 10/ 10 | Episode Reward : -12.43 | Episode Steps: 50 | Avg Reward : -23.15 | Step Count : 500

To validate the performance of the trained agent, simulate it within the MATLAB environment. For more information on agent simulation, see `rlSimulationOptions`

and `sim`

.

```
simOptions = rlSimulationOptions('MaxSteps',20);
experience = sim(env,agent,simOptions);
totalReward = sum(experience.Reward)
```

totalReward = -20.1306

You can compute the optimal solution for the LQR problem using the `dlqr`

function.

[Koptimal,P] = dlqr(A,B,Q,R);

The optimal reward is given by ${\mathit{J}}_{\mathrm{optimal}}={{-\mathit{x}}_{0}}^{\prime}{\mathrm{Px}}_{0}$.

x0 = experience.Observation.obs1.getdatasamples(1); Joptimal = -x0'*P*x0;

Compute the error in the reward between the trained LQR agent and the optimal LQR solution.

rewardError = totalReward - Joptimal

rewardError = 1.5270e-06

View the history of the 2-norm of error in the gains between the trained LQR agent and the optimal LQR solution.

% number of gain updates len = agent.KUpdate; err = zeros(len,1); for i = 1:len % norm of error in the gain err(i) = norm(agent.KBuffer{i}-Koptimal); end plot(err,'b*-')

Compute the norm of final error for the feedback gain.

gainError = norm(agent.K - Koptimal)

gainError = 2.2460e-11

Overall, the trained agent finds an LQR solution that is close to the true optimal LQR solution.