
rlQAgent

Q-learning reinforcement learning agent

Description

The Q-learning algorithm is an off-policy reinforcement learning method for environments with a discrete action space. A Q-learning agent trains a Q-value function critic to estimate the value of the optimal policy, while following an epsilon-greedy policy based on the value estimated by the critic.
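For reference, the core update behind tabular Q-learning can be sketched in a few lines of plain MATLAB. This sketch is a simplified illustration, not the toolbox implementation; the variables Q, s, a, r, sNext, alpha, gamma, epsilon, and nActions, as well as the step call, are placeholders.

% One step of tabular Q-learning with epsilon-greedy action selection.
% Q is an nStates-by-nActions array of Q-value estimates.
if rand < epsilon
    a = randi(nActions);                    % explore: pick a random action
else
    [~,a] = max(Q(s,:));                    % exploit: pick the greedy action
end
[sNext,r] = step(env,a);                    % placeholder environment step
target = r + gamma*max(Q(sNext,:));         % bootstrapped target value
Q(s,a) = Q(s,a) + alpha*(target - Q(s,a));  % move the estimate toward the target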

Note

Q-learning agents do not support recurrent networks.

For more information on Q-learning agents, see Q-Learning Agent.

For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.

Creation

Description

agent = rlQAgent(critic,agentOptions) creates a Q-learning agent with the specified critic network and sets the AgentOptions property.

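For example, the following sketch creates an agent with a nondefault sample time and discount factor, assuming that a critic object has already been created as described under Input Arguments. As the Examples section shows, you can also call rlQAgent(critic) without an options argument to use default agent options.

opt = rlQAgentOptions(SampleTime=1,DiscountFactor=0.99);  % illustrative option values
agent = rlQAgent(critic,opt);                             % critic created beforehand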

Input Arguments


Critic, specified as an rlQValueFunction object. For more information on creating critics, see Create Policies and Value Functions.

Properties


Agent options, specified as an rlQAgentOptions object.

Option to use an exploration policy when selecting actions during simulation or after deployment, specified as a logical value.

  • true — Specify this value to use the base agent exploration policy when you use the agent with the sim and generatePolicyFunction functions. Specifically, in this case, the agent uses the rlEpsilonGreedyPolicy object. The action selection has a random component, so the agent explores its action and observation spaces.

  • false — Specify this value to force the agent to use the base agent greedy policy (the action that maximizes the Q-value estimated by the critic) when you use the agent with the sim and generatePolicyFunction functions. Specifically, in this case, the agent uses the rlMaxQPolicy policy. The action selection is greedy, so the policy behaves deterministically and the agent does not explore its action and observation spaces.

Note

This option affects only simulation and deployment and does not affect training. When you train an agent using the train function, the agent always uses its exploration policy independently of the value of this property.
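For example, the following sketch simulates the agent with its exploration policy enabled, assuming that agent and env objects already exist. The simulation option value is illustrative.

agent.UseExplorationPolicy = true;            % explore during simulation
simOpts = rlSimulationOptions(MaxSteps=50);   % illustrative option value
experience = sim(env,agent,simOpts);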

Observation specifications, specified as an rlFiniteSetSpec or rlNumericSpec object or an array containing a mix of such objects. Each element in the array defines the properties of an environment observation channel, such as its dimensions, data type, and name.

If you create the agent by specifying a critic object, the value of ObservationInfo matches the value specified in critic. If you create a default agent, the agent constructor function sets the ObservationInfo property to the input argument observationInfo.

You can extract observationInfo from an existing environment, function approximator, or agent using getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec or rlNumericSpec.

Example: [rlNumericSpec([2 1]) rlFiniteSetSpec([3,5,7])]

Action specifications, specified as an rlFiniteSetSpec object. This object defines the properties of the environment action channel, such as its dimensions, data type, and name.

Note

For this agent, only one action channel is allowed.

If you create the agent by specifying a critic object, the value of ActionInfo matches the value specified in critic. If you create a default agent, the agent constructor function sets the ActionInfo property to the input argument actionInfo.

You can extract actionInfo from an existing environment, function approximator, or agent using getActionInfo. You can also construct the specification manually using rlFiniteSetSpec.

Example: rlFiniteSetSpec([3,-5,7])

Sample time of the agent, specified as a positive scalar or as -1.

Within a MATLAB® environment, the agent is executed every time the environment advances, so SampleTime does not affect the timing of the agent execution. In this case, if SampleTime is set to -1, the time interval between consecutive elements in the returned output experience is considered equal to 1.

Within a Simulink® environment, the RL Agent block that uses the agent object executes every SampleTime seconds of simulation time. If SampleTime is set to -1, the block inherits the sample time from its input signals. Set SampleTime to -1 when the block is a child of an event-driven subsystem.

Set SampleTime to a positive scalar when the block is not a child of an event-driven subsystem. Doing so ensures that the block executes at appropriate intervals when input signal sample times change due to model variations. If SampleTime is a positive scalar, this value is also the time interval between consecutive elements in the output experience returned by sim or train, regardless of the type of environment.

If SampleTime is set to -1, in Simulink environments, the time interval between consecutive elements in the returned output experience reflects the timing of the events that trigger the RL Agent block execution.

This property is shared between the agent and the agent options object within the agent. If you change this property in the agent options object, it also changes in the agent, and vice versa.

Example: SampleTime=-1
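For example, the following sketch, which assumes an agent object already exists, sets the sample time through the options object and reads it back from the agent property.

agent.AgentOptions.SampleTime = 0.1;   % set through the options object
agent.SampleTime                       % the agent property returns 0.1 as well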

Object Functions

train - Train reinforcement learning agents within a specified environment
sim - Simulate trained reinforcement learning agents within a specified environment
getAction - Obtain action from agent, actor, or policy object given environment observations
getCritic - Extract critic from reinforcement learning agent
setCritic - Set critic of reinforcement learning agent
generatePolicyFunction - Generate MATLAB function that evaluates policy of an agent or policy object
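For example, the following sketch, which assumes an agent object already exists, extracts the critic from the agent and then assigns it back.

critic = getCritic(agent);          % extract the critic
agent = setCritic(agent,critic);    % set the (possibly modified) critic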

Examples


Create an environment object. For this example, use the same environment as in the example Train Reinforcement Learning Agent in Basic Grid World.

env = rlPredefinedEnv("BasicGridWorld");

Get observation and action specifications.

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

Q-learning agents use a parametrized Q-value function to estimate the value of the policy. A Q-value function takes the current observation and an action as inputs and returns a single scalar as output (the estimated discounted cumulative long-term reward for taking the action from the state corresponding to the current observation, and following the policy thereafter).

Since both the observation and action spaces are discrete and low-dimensional, use a table to model the Q-value function within the critic. rlTable creates a value table object from the observation and action specification objects.

Create a table approximation model derived from the environment observation and action specifications.

qTable = rlTable(obsInfo,actInfo);

Create the Q-value function approximator object using qTable and the environment specification objects. For more information, see rlQValueFunction.

critic = rlQValueFunction(qTable,obsInfo,actInfo);

Create a Q-learning agent using the approximator object.

agent = rlQAgent(critic)
agent = 
  rlQAgent with properties:

            AgentOptions: [1×1 rl.option.rlQAgentOptions]
    UseExplorationPolicy: 0
         ObservationInfo: [1×1 rl.util.rlFiniteSetSpec]
              ActionInfo: [1×1 rl.util.rlFiniteSetSpec]
              SampleTime: 1

Specify an Epsilon value of 0.2.

agent.AgentOptions.EpsilonGreedyExploration.Epsilon = 0.2;

To check your agent, use getAction to return the action from a random observation.

act = getAction(agent,{randi(numel(obsInfo.Elements))});
act{1}
ans = 
1

You can now test and train the agent against the environment.
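For example, you could train the agent with a sketch like the following. The training option values are illustrative only; see rlTrainingOptions and train for details.

trainOpts = rlTrainingOptions( ...
    MaxEpisodes=200, ...
    MaxStepsPerEpisode=50, ...
    StopTrainingCriteria="AverageReward", ...
    StopTrainingValue=10);
trainResults = train(agent,env,trainOpts);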

Version History

Introduced in R2019a