Reinforcement learning is a class of problems in machine learning that postulate an
agent exploring an environment, in which the agent perceives its current state and takes
actions. The environment, in return, provides a reward (which can be positive or negative).
Reinforcement learning algorithms attempt to find a policy that maximizes the agent's cumulative reward over
the course of the problem.
The environment is typically formulated as a finite-state Markov decision
process (MDP), and reinforcement learning algorithms for this setting are closely related to dynamic programming techniques. State transitions and
rewards in the MDP are typically stochastic, but their probabilities are stationary over the course of the problem.
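To illustrate the link to dynamic programming: when the transition and reward probabilities of a finite MDP are fully known, optimal state values can be computed directly by value iteration. The following is a minimal sketch under that assumption. The function name, the dictionary layout of `transitions`, the discount factor `gamma`, and the two-state example are all hypothetical choices made for illustration.

```python
def value_iteration(transitions, gamma=0.9, tol=1e-8):
    """Compute optimal state values for a known finite MDP.

    transitions[s][a] is a list of (probability, next_state, reward) triples.
    A state with no actions is treated as terminal and keeps value 0.0.
    """
    values = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s, actions in transitions.items():
            if not actions:
                continue  # terminal state
            # Bellman optimality backup: best expected one-step return
            best = max(
                sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            return values


# Hypothetical example: from state "A", action "go" usually reaches the goal.
mdp = {
    "A": {
        "stay": [(1.0, "A", 0.0)],
        "go": [(0.8, "goal", 1.0), (0.2, "A", 0.0)],
    },
    "goal": {},
}
optimal_values = value_iteration(mdp)
```

A reinforcement learning agent usually does not have access to `transitions` and has to estimate values from sampled interaction instead.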
Reinforcement learning differs from the supervised
learning problem in that correct input/output pairs are never presented, nor are sub-optimal actions explicitly corrected.
Further, there is a focus on on-line performance, which involves finding a balance between exploration (of uncharted territory)
and exploitation (of current knowledge).
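One common way to strike this balance is epsilon-greedy action selection, shown here as a hedged sketch rather than as the method any particular algorithm must use. The multi-armed bandit setting, the Gaussian rewards, and the names `epsilon_greedy` and `run_bandit` are assumptions introduced for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick an action index: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform random action
    best = max(q_values)
    # exploit: break ties among the best-valued actions at random
    return rng.choice([a for a, q in enumerate(q_values) if q == best])


def run_bandit(true_means, steps=2000, epsilon=0.1, seed=0):
    """Estimate action values on a hypothetical Gaussian multi-armed bandit."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_means)
    counts = [0] * len(true_means)
    for _ in range(steps):
        action = epsilon_greedy(estimates, epsilon, rng)
        reward = rng.gauss(true_means[action], 1.0)
        counts[action] += 1
        # incremental sample average of the rewards seen for this action
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts


estimates, counts = run_bandit([0.0, 0.5, 2.0])
```

With `epsilon` set to 0 the agent only exploits its current estimates, and with `epsilon` set to 1 it only explores. Values in between trade one off against the other.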
Formally, the basic reinforcement learning model consists of:
- a set of environment states S;
- a set of actions A; and
- a set of scalar "rewards" in ℜ.
At each time t, the agent perceives its state s_t ∈ S and the set of
possible actions A(s_t). It chooses an action
a ∈ A(s_t) and receives from the environment the new state
s_{t+1} and a reward r_{t+1}. Based on these interactions, the reinforcement
learning agent must develop a policy π: S → A which maximizes the quantity
r_0 + r_1 + ... + r_n for MDPs which have a terminal state, or the
quantity Σ_t γ^t r_t for MDPs without terminal states
(where γ is some "future reward" discounting factor between 0.0 and 1.0).
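The interaction loop and the two return definitions above can be sketched as follows. This is an illustrative sketch only: `CorridorEnv`, `run_episode`, `discounted_return`, and the reward values are hypothetical and not part of any particular library.

```python
def discounted_return(rewards, gamma=1.0):
    """Sum of gamma**t * r_t over a reward sequence; gamma=1.0 gives the plain sum."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


class CorridorEnv:
    """Hypothetical episodic MDP: states 0..length, start at 0, terminal at `length`."""

    def __init__(self, length=4):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def actions(self, state):
        return ["left", "right"]  # A(s): the same two actions in every state

    def step(self, action):
        move = 1 if action == "right" else -1
        self.state = min(self.length, max(0, self.state + move))
        done = self.state == self.length
        reward = 1.0 if done else -0.1  # small cost per step, bonus at the goal
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the list of rewards received."""
    state = env.reset()
    rewards = []
    for _ in range(max_steps):
        action = policy(state, env.actions(state))  # a in A(s_t)
        state, reward, done = env.step(action)      # new state and reward
        rewards.append(reward)
        if done:
            break
    return rewards


rewards = run_episode(CorridorEnv(), lambda state, actions: "right")
episode_return = discounted_return(rewards)
```

Here the policy is a plain function from a state (and its available actions) to an action. A policy that always moves right reaches the terminal state in four steps.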
Reinforcement learning applies particularly well to problems where long-term reward can be had at the expense of short-term
reward. This class of problems is normally handled using a reinforcement learning technique known as temporal difference (TD) learning, which
estimates an optimal value function (V) that indicates the desirability of a state. This estimate is based on the recursive
Bellman equation.
Reinforcement learning has been applied successfully to various problems, including robot control, elevator scheduling, and backgammon.
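As a hedged illustration of the temporal difference idea, the sketch below runs tabular TD(0) on a small random walk. It estimates V for a fixed uniform-random policy (the prediction setting), not an optimal value function. After each step it nudges V(s) toward the one-step Bellman target r + γ V(s'). The environment, the step size `alpha`, and the function name are assumptions made for illustration.

```python
import random


def td0_random_walk(n_states=5, episodes=5000, alpha=0.05, gamma=1.0, seed=0):
    """Tabular TD(0) value estimates for a random walk under a uniform random policy.

    States 1..n_states are non-terminal; 0 and n_states + 1 are terminal.
    Reward is 1.0 on reaching the right terminal state and 0.0 otherwise.
    """
    rng = random.Random(seed)
    values = [0.0] * (n_states + 2)  # terminal values stay at 0.0
    for _ in range(episodes):
        state = (n_states + 1) // 2  # start in the middle
        while 0 < state < n_states + 1:
            next_state = state + rng.choice((-1, 1))
            reward = 1.0 if next_state == n_states + 1 else 0.0
            # TD error: bootstrapped target minus current estimate
            td_error = reward + gamma * values[next_state] - values[state]
            values[state] += alpha * td_error
            state = next_state
    return values[1:-1]


estimates = td0_random_walk()
```

For this walk with `gamma=1.0`, the true value of state i is i / (n_states + 1), so the estimates should increase from left to right and land near those values, up to sampling noise.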
References
Leslie Kaelbling, Michael Littman, Andrew Moore. Reinforcement Learning: A Survey. Journal of Artificial Intelligence
Research 4 (1996), pp. 237–285.
Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.