An implementation of the Q-learning algorithm with adaptive learning.
In Q-learning the agent stores Q-values (quality values) for each state-action pair that it encounters. Q-values are determined by the following update rule:
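    Q(s, a) ← Q(s, a) + α * (r + γ * max_a' Q(s', a') - Q(s, a))

where α is the learning rate, γ is the discount factor, r is the reward received for taking action a in state s, and s' is the resulting state.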
Q-learning is an off-policy form of temporal-difference learning. The agent learns by storing a quality value for each state it encounters, combining the reward it received for the action taken with the discounted future reward. Because the environment is treated as a Markov decision process, repeating this value iteration converges toward the Q-values that yield the highest expected reward.
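As a rough illustration of that update, here is a minimal tabular Q-learning sketch in Go; the qTable type and update function are hypothetical and are not part of this package's API:

```go
package main

import "fmt"

// qTable is a hypothetical table mapping a discrete state to the Q-value of
// each available action.
type qTable map[int][]float64

// update applies the Q-learning rule:
//   Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
func (q qTable) update(state, action int, reward float64, nextState int, alpha, gamma float64) {
	// Find the highest Q-value reachable from the next state.
	best := 0.0
	for i, v := range q[nextState] {
		if i == 0 || v > best {
			best = v
		}
	}
	q[state][action] += alpha * (reward + gamma*best - q[state][action])
}

func main() {
	q := qTable{0: {0, 0}, 1: {0, 0}}
	q.update(0, 1, 1.0, 1, 0.1, 0.99)
	fmt.Println(q[0]) // [0 0.1]
}
```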
An agent will explore or exploit the Q-values based on the epsilon hyperparameter.
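A minimal sketch of epsilon-greedy selection, assuming a hypothetical selectAction helper over the Q-values of the current state (not this package's actual API):

```go
package main

import (
	"fmt"
	"math/rand"
)

// selectAction returns a random action with probability epsilon (explore) and
// the action with the highest Q-value otherwise (exploit).
func selectAction(qValues []float64, epsilon float64) int {
	if rand.Float64() < epsilon {
		return rand.Intn(len(qValues))
	}
	best := 0
	for i, v := range qValues {
		if v > qValues[best] {
			best = i
		}
	}
	return best
}

func main() {
	qValues := []float64{0.2, 0.8, 0.5}
	fmt.Println(selectAction(qValues, 0.1)) // usually 1, occasionally a random action
}
```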
The implemented agent also employs adaptive learning, by which the alpha and epsilon hyperparameters are dynamically tuned based on the timestep and an ada divisor parameter.
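One common way such a schedule works is a logarithmic decay scaled by the ada divisor; the adapt function below is only an assumption for illustration and may not match this agent's exact formula:

```go
package main

import (
	"fmt"
	"math"
)

// adapt shrinks a hyperparameter from 1.0 toward min as the timestep grows,
// with adaDivisor controlling how quickly the decay kicks in.
func adapt(timestep int, adaDivisor, min float64) float64 {
	v := 1.0 - math.Log10(float64(timestep+1)/adaDivisor)
	return math.Max(min, math.Min(1.0, v))
}

func main() {
	for _, t := range []int{0, 10, 100, 1000} {
		fmt.Printf("t=%d -> %.2f\n", t, adapt(t, 25, 0.1))
	}
}
```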
Q-learning doesn't work well in continuous environments, so the pkg/v1/env package provides normalization adapters. One of these adapters performs discretization and can be used to make continuous states discrete.
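To illustrate what discretization does, here is a hypothetical discretize function that buckets a continuous value into bins; it is not the actual pkg/v1/env adapter API:

```go
package main

import "fmt"

// discretize maps a continuous value in [low, high] into one of n discrete bins.
func discretize(value, low, high float64, n int) int {
	if value <= low {
		return 0
	}
	if value >= high {
		return n - 1
	}
	return int(float64(n) * (value - low) / (high - low))
}

func main() {
	// e.g. a cart position in [-2.4, 2.4] mapped into 10 bins
	fmt.Println(discretize(0.3, -2.4, 2.4, 10)) // 5
}
```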
See the experiments folder for example implementations.