This project tackles the LunarLander-v2 problem using Deep Reinforcement Learning (DRL) techniques. The primary focus is on evaluating and improving the performance of the Deep Q-Network (DQN) by using a Dueling Double DQN (D3QN) architecture. This work was completed as part of a reinforcement learning course assignment.
LunarLanderV2.mp4
DQN is a popular reinforcement learning algorithm that combines Q-learning with deep neural networks. In the Lunar Lander environment, DQN uses a deep network to approximate the Q-value function, which estimates the expected return of taking each action in a given state.
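As a rough illustration (not the project's exact code), the sketch below shows a Q-network mapping a LunarLander-v2 observation (8 dimensions) to Q-values for its 4 discrete actions, and how a greedy action would be selected; the layer sizes here are placeholders.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps an 8-dimensional LunarLander-v2 state to Q-values for its 4 actions."""
    def __init__(self, state_dim=8, n_actions=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)  # shape: (batch, n_actions)

# Greedy action selection: pick the action with the highest estimated Q-value.
q_net = QNetwork()
state = torch.randn(1, 8)              # stand-in for an environment observation
action = q_net(state).argmax(dim=1)    # the action the agent would exploit
```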
However, DQN suffers from:
- Overestimation Bias: It tends to overestimate action values (illustrated numerically below).
- Instability: Training can become unstable due to correlated updates.
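The overestimation effect is easy to see numerically: when action values are estimated with independent noise, taking a max over the noisy estimates is biased upward even if the true values are identical. A small, self-contained illustration (not taken from the project):

```python
import numpy as np

rng = np.random.default_rng(0)
true_q = np.zeros(4)                                         # all four actions are equally good (true Q = 0)
noisy_q = true_q + rng.normal(scale=1.0, size=(10_000, 4))   # noisy Q-value estimates

print(noisy_q.max(axis=1).mean())   # roughly +1.03: max over noisy estimates is biased upward
print(true_q.max())                 # 0.0: the true maximum
```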
D3QN further enhances performance by incorporating the dueling architecture, which splits the Q-value function into two streams (a code sketch follows this list):
- State Value Function (V(s)): How good it is to be in a state, regardless of action.
- Advantage Function (A(s, a)): The benefit of taking a specific action compared to others.
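In code, the dueling split is typically implemented as two heads on a shared feature extractor, recombined with a mean-subtracted advantage so that V and A remain identifiable. A minimal sketch, with illustrative layer sizes rather than the project's exact configuration:

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    def __init__(self, state_dim=8, n_actions=4, hidden=64):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # V(s): value of the state
        self.advantage = nn.Linear(hidden, n_actions)  # A(s, a): per-action advantage

    def forward(self, state):
        h = self.features(state)
        v = self.value(h)                              # shape: (batch, 1)
        a = self.advantage(h)                          # shape: (batch, n_actions)
        # Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a') keeps V and A identifiable.
        return v + a - a.mean(dim=1, keepdim=True)
```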
This helps the agent learn more efficiently in the Lunar Lander environment by better differentiating between valuable states and actions, improving both stability and performance.
By combining Double DQN and Dueling Networks, D3QN offers significant improvements in solving the Lunar Lander problem (a sketch of the Double DQN target follows this list). It results in:
- Reduced overestimation of Q-values.
- Improved state value approximation, making the agent more adept at landing in difficult scenarios.
- Faster convergence and better reward maximization than standard DQN.
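The Double DQN part of D3QN changes only the target computation: the online network chooses the next action, while the target network evaluates it, which removes the max-based source of overestimation. A hedged sketch of that target (variable names are illustrative, not the project's code; `dones` is assumed to be a 0/1 float tensor):

```python
import torch

def double_dqn_target(online_net, target_net, rewards, next_states, dones, gamma):
    """TD target where action selection and evaluation use different networks."""
    with torch.no_grad():
        # The online network selects the greedy next action...
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        # ...but the target network evaluates its value.
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
    return rewards + gamma * (1.0 - dones) * next_q
```

For comparison, the vanilla DQN target takes the maximum of the target network's own estimates, `target_net(next_states).max(dim=1).values`, which is what introduces the upward bias.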
- DQN: The baseline algorithm, implemented with a simple feedforward network and epsilon-greedy exploration.
- D3QN: An advanced version incorporating Double Q-learning and Dueling Network architectures, which leads to better stability and faster convergence.
- Network architecture:
  - 3 fully connected layers with 64 neurons each
  - Activation: ReLU
- Loss Function: Mean Squared Error (MSE)
- Exploration Strategy: Epsilon-greedy with early stopping
- Optimizer: Adam
- Discount Factor: Varying γ over time as described in the report
(For a more comprehensive list of hyperparameters, refer to the report; a rough sketch of a training step using these settings follows.)
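Putting the listed settings together, a single training step looks roughly like the sketch below. It reuses the networks and the `double_dqn_target` function sketched above; replay-buffer handling and the exact γ schedule from the report are omitted, and values such as the learning rate and the epsilon handling are illustrative assumptions, not the project's exact choices.

```python
import random
import torch
import torch.nn as nn

# q_net / target_net would be instances of the networks sketched earlier
# (QNetwork for the DQN baseline, DuelingQNetwork for D3QN).
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)   # learning rate is an assumed value
loss_fn = nn.MSELoss()

def epsilon_greedy(state, epsilon, n_actions=4):
    """Epsilon-greedy exploration: random action with probability epsilon."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return q_net(state.unsqueeze(0)).argmax(dim=1).item()

def train_step(states, actions, rewards, next_states, dones, gamma):
    """One gradient step on the MSE between predicted Q-values and TD targets."""
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    targets = double_dqn_target(q_net, target_net, rewards, next_states, dones, gamma)
    loss = loss_fn(q_pred, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```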
The D3QN model significantly outperformed the DQN in terms of stability and reward maximization. By introducing a dynamic gamma strategy and optimizing the training process, we achieved consistent improvements in solving the Lunar Lander environment.
DQN: episode reward 180.14

DQN.mp4

D3QN: episode reward 249.5

D3QN.mp4
This project demonstrated the effectiveness of advanced DRL techniques in solving the Lunar Lander problem. The D3QN model, in particular, offered substantial improvements over the baseline DQN, and future work may explore further enhancements such as prioritized experience replay or multi-agent setups.