# AIMA3e
    function Q-LEARNING-AGENT(percept) returns an action
      inputs: percept, a percept indicating the current state s' and reward signal r'
      persistent: Q, a table of action values indexed by state and action, initially zero
                  N_sa, a table of frequencies for state-action pairs, initially zero
                  s, a, r, the previous state, action, and reward, initially null

      if TERMINAL?(s) then Q[s, None] ← r'
      if s is not null then
          increment N_sa[s, a]
          Q[s, a] ← Q[s, a] + α(N_sa[s, a]) (r + γ max_a' Q[s', a'] − Q[s, a])
      s, a, r ← s', argmax_a' f(Q[s', a'], N_sa[s', a']), r'
      return a
Figure ?? An exploratory Q-learning agent. It is an active learner that learns the value Q(s, a) of each action in each situation. It uses the same exploration function f as the exploratory ADP agent, but avoids having to learn the transition model because the Q-value of a state can be related directly to those of its neighbors.
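The pseudocode above can be sketched in Python as follows. This is an illustrative implementation, not a definitive one: the class name, the learning-rate schedule α(n) = 60/(59 + n), the optimistic exploration function f with parameters R+ and Ne, and the assumption of a fixed action set in every state are all choices made for this sketch. The terminal check is applied to the incoming state s', which is a common reading of the pseudocode's Terminal?(s) test.

```python
import random
from collections import defaultdict


class QLearningAgent:
    """Sketch of the exploratory Q-learning agent.

    Hypothetical parameters (not fixed by the pseudocode):
    gamma (discount), Rplus (optimistic reward for f), Ne (trial
    threshold for f), and a single action set shared by all states.
    """

    def __init__(self, actions, gamma=0.9, Rplus=1.0, Ne=5):
        self.actions = actions           # actions available in every state (assumed)
        self.gamma = gamma               # discount factor gamma
        self.Rplus = Rplus               # optimistic estimate used by f
        self.Ne = Ne                     # try each (s, a) at least Ne times
        self.Q = defaultdict(float)      # Q[s, a], initially zero
        self.Nsa = defaultdict(int)      # visit counts N_sa, initially zero
        self.s = self.a = self.r = None  # previous state, action, reward

    def alpha(self, n):
        # One learning-rate schedule that decays with the visit count.
        return 60.0 / (59.0 + n)

    def f(self, q, n):
        # Exploration function: optimistic until (s, a) is tried Ne times.
        return self.Rplus if n < self.Ne else q

    def __call__(self, s1, r1, terminal=False):
        # The percept supplies the current state s' (s1) and reward r' (r1).
        if terminal:
            self.Q[s1, None] = r1
        if self.s is not None:
            s, a, r = self.s, self.a, self.r
            self.Nsa[s, a] += 1
            # TD target uses max over a' of Q[s', a']; in a terminal
            # state the only "action" is None.
            best = self.Q[s1, None] if terminal else max(
                self.Q[s1, a1] for a1 in self.actions)
            self.Q[s, a] += self.alpha(self.Nsa[s, a]) * (
                r + self.gamma * best - self.Q[s, a])
        if terminal:
            self.s = self.a = self.r = None
        else:
            self.s = s1
            self.a = max(self.actions,
                         key=lambda a1: self.f(self.Q[s1, a1], self.Nsa[s1, a1]))
            self.r = r1
        return self.a
```

One design point worth noting: because the agent only remembers the previous (s, a, r), each percept triggers exactly one backup, so no transition model P(s' | s, a) is ever stored, matching the caption's claim.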