This is a repository of DQN and its variants implemented in PyTorch, based on the original papers. The following algorithms are implemented in this repository:
- Deep Q-Network (DQN): Playing Atari with Deep Reinforcement Learning, Mnih et al., 2013. [arxiv] [summary]
- Double DQN: Deep Reinforcement Learning with Double Q-learning, van Hasselt et al., 2015. [arxiv] [summary]
- Dueling DQN: Dueling Network Architectures for Deep Reinforcement Learning, Wang et al., 2015. [arxiv] [summary]
- Prioritized Experience Replay (PER): Prioritized Experience Replay, Schaul et al., 2015. [arxiv] [summary]
The Q-network I use here is a multilayer perceptron (MLP) with two hidden layers of size 32 (`hidden_size = 32`). A dueling-network option is included in the model; double Q-learning and PER are implemented in the agent code.
The policy action is generated by the ε-greedy method, where ε is exponentially decayed according to a designated decay rate.
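As a rough illustration (not necessarily the repository's exact code), ε-greedy selection with exponential decay might look like the sketch below; the helper name `select_action` and the decay formula are assumptions, while the default values mirror `MAX_EPS`, `MIN_EPS`, and `DECAY_RATE` from the config further down.

```python
import random
import torch

# Hypothetical helper: epsilon-greedy action selection with exponential decay.
def select_action(policy_net, state, episode, num_actions,
                  max_eps=1.0, min_eps=0.01, decay_rate=0.99):
    # Exponentially decay epsilon towards min_eps as training proceeds
    # (the exact schedule used in the repo may differ).
    eps = max(min_eps, max_eps * decay_rate ** episode)
    if random.random() < eps:
        # Explore: pick a random action.
        return torch.tensor([[random.randrange(num_actions)]], dtype=torch.long)
    # Exploit: pick the greedy action from the policy network.
    with torch.no_grad():
        return policy_net(state).argmax(dim=1, keepdim=True)
```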
To evaluate the performance of the model, I wrote a class method called `demo`. `demo` plays the game 100 times using only the actions produced by the policy network (i.e., acting greedily, with no exploration), and takes the average score over those games as the score of the current policy network. The policy network scores, the average scores of the past 10 versions of the policy network, and the current episode durations are plotted in `result.png`.
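A hedged sketch of what `demo` does is shown below; the exact signature, the attribute names `self.env` and `self.policy_net`, and the old-style gym `step` API returning four values are assumptions.

```python
def demo(self, num_games=100):
    """Play `num_games` games greedily and return the average score."""
    scores = []
    for _ in range(num_games):
        state = self.env.reset()
        done, total_reward = False, 0.0
        while not done:
            state_t = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
            with torch.no_grad():
                # Always exploit: take the greedy action from the policy net.
                action = self.policy_net(state_t).argmax(dim=1).item()
            state, reward, done, _ = self.env.step(action)
            total_reward += reward
        scores.append(total_reward)
    return sum(scores) / len(scores)
```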
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DQN(nn.Module):
    def __init__(self, num_actions, input_size, hidden_size, dueling=False):
        super(DQN, self).__init__()
        self.num_actions = num_actions
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.dueling = dueling
        if dueling:
            # Separate value and advantage heads for the dueling architecture.
            self.fc_value = nn.Linear(hidden_size, 1)
            self.fc_actions = nn.Linear(hidden_size, num_actions)
        else:
            self.fc3 = nn.Linear(hidden_size, self.num_actions)

    # Called with either one element to determine the next action, or a batch
    # during optimization. Returns tensor([[left0exp, right0exp]...]).
    def forward(self, x):
        x = x.view(x.size(0), -1)
        if not self.dueling:
            x = F.relu(self.fc1(x))
            x = F.relu(self.fc2(x))
            x = self.fc3(x)
        else:
            # Dueling network: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
            x = F.relu(self.fc1(x))
            x = F.relu(self.fc2(x))
            v = self.fc_value(x)
            a = self.fc_actions(x)
            x = a.add(v - a.mean(dim=-1).unsqueeze(-1))
        return x
```
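For CartPole, the networks above would be constructed roughly as follows (a usage sketch; the 4-dimensional observation and 2 actions come from the environment, and the variable names are illustrative):

```python
# CartPole: 4-dimensional observation, 2 actions, hidden_size = 32.
policy_net = DQN(num_actions=2, input_size=4, hidden_size=32, dueling=True)
target_net = DQN(num_actions=2, input_size=4, hidden_size=32, dueling=True)
target_net.load_state_dict(policy_net.state_dict())
target_net.eval()

q_values = policy_net(torch.zeros(1, 4))  # shape: (1, 2), one Q-value per action
```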
Inside the agent's optimization step, the replay memory is sampled (with importance-sampling weights when PER is enabled) and the priorities are updated with the absolute TD errors:

```python
if self.PER:
    batch_idx, transitions, glNorm_ISWeights = self.memory.sample(self.BATCH_SIZE)
else:
    transitions = self.memory.sample(self.BATCH_SIZE)

# ... compute the per-sample TD errors `t` and per-sample losses `losses` ...

if self.PER:
    # Compute the absolute TD error.
    abs_errors = t.detach()
    abs_errors_ = abs_errors.numpy()
    # Update the priority levels of the sampled transitions.
    self.memory.batch_update(batch_idx, abs_errors_)
    # Apply the (globally normalized) importance-sampling weights to the losses.
    losses = losses * torch.from_numpy(glNorm_ISWeights).reshape(self.BATCH_SIZE, -1)  # * abs_errors (extra scaling used by the augmented agent below)
```
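The double Q-learning branch is not shown in the snippet above. A minimal sketch of how the bootstrap target differs under the `DOUBLE` flag is given below (terminal-state masking is omitted; `next_states`, `rewards`, and the attribute names `self.policy_net`/`self.target_net` are assumptions):

```python
with torch.no_grad():
    if self.DOUBLE:
        # Double DQN: the policy net selects the greedy action,
        # the target net evaluates it.
        next_actions = self.policy_net(next_states).argmax(dim=1, keepdim=True)
        next_q = self.target_net(next_states).gather(1, next_actions).squeeze(1)
    else:
        # Vanilla DQN: the target net both selects and evaluates.
        next_q = self.target_net(next_states).max(dim=1)[0]
    targets = rewards + self.GAMMA * next_q
```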
Typical hyperparameters:

```python
class AgentConfig:
    EXPERIMENT_NO = 99
    START_EPISODE = 0
    NUM_EPISODES = 500
    MEMORY_CAPA = 50000     # replay memory capacity
    MAX_EPS = 1.0           # initial epsilon
    MIN_EPS = 0.01          # final epsilon
    UPDATE_FREQ = 10        # target network update frequency
    DEMO_NUM = 100          # number of games played per demo evaluation
    LR = 5e-4               # learning rate
    LR_STEP_SIZE = 9999     # learning rate scheduler step size
    DECAY_RATE = 0.99       # decay rate
    BATCH_SIZE = 32         # batch size
    GAMMA = 0.99            # discount factor
    ALPHA = 0.6             # alpha for PER
    BETA = 0.4              # beta for PER
    DOUBLE = False          # double Q-learning
    DUELING = False         # dueling network
    PER = False             # prioritized experience replay
    RES_PATH = './experiments/'


class EnvConfig:
    ENV = "CartPole-v0"
```
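One way such configs are typically consumed is by mixing them into the agent; the sketch below is only illustrative (the `Agent` class and its constructor are assumptions, not necessarily how this repository wires things up).

```python
import gym

class Agent(AgentConfig, EnvConfig):
    def __init__(self):
        self.env = gym.make(self.ENV)
        input_size = self.env.observation_space.shape[0]
        num_actions = self.env.action_space.n
        self.policy_net = DQN(num_actions, input_size,
                              hidden_size=32, dueling=self.DUELING)
        self.target_net = DQN(num_actions, input_size,
                              hidden_size=32, dueling=self.DUELING)
        self.target_net.load_state_dict(self.policy_net.state_dict())
```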
Below are the experiments testing DQN under different circumstances. In the plots:
- Blue line: policy network scores
- Orange line: the current episode score under the greedy policy
- Green line: average scores of the past 10 versions of the policy network
- In addition to the settings above, the augmented integrated agent also has the following settings (see the sketch after this list):
  - Environment: "CartPole-v1"
  - The loss (after being weighted by the importance-sampling weights) is further multiplied by `abs_errors` in order to scale up the gradients of the prioritized transitions.
  - The greedy actions are sampled from the target net, with the hope of stabilizing the training.
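A hedged sketch of these two tweaks in the optimization step (variable names mirror the PER snippet earlier; the exact loss form is an assumption):

```python
with torch.no_grad():
    # Greedy next actions come from the target net (rather than the policy
    # net) in the hope of stabilizing training.
    next_actions = self.target_net(next_states).argmax(dim=1, keepdim=True)
    next_q = self.target_net(next_states).gather(1, next_actions)
    targets = rewards.unsqueeze(1) + self.GAMMA * next_q

td_errors = self.policy_net(states).gather(1, actions) - targets
losses = td_errors.pow(2)
# Importance-sampling weighting from PER, then a further multiplication by
# |TD error| to scale up the gradients of the prioritized transitions.
losses = losses * torch.from_numpy(glNorm_ISWeights).reshape(self.BATCH_SIZE, -1)
losses = losses * td_errors.abs().detach()
loss = losses.mean()
```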
No. | learning rate | double | dueling | PER | result.png | Comments |
---|---|---|---|---|---|---|
91 | 5e-4 (step: 400) | True | True | True | | Solved the game after 1050 episodes. |
- Vanilla DQN is not a robust method. It may suffer severe performance degradation if the hyperparameters are not right.
- According to Rainbow (summary), the largest improvement over DQN comes from prioritized experience replay.