This is the final project for the course AIT-3007 - Reinforcement Learning and Scheduling. The assignment given to us is as follows:
In this final project, you will develop and train a reinforcement learning (RL) agent using the MAgent2 platform. The task is to solve the specified MAgent2 environment `battle`, and your trained agent will be evaluated against all three of the following opponent types:
1. Random Agents: Agents that take random actions in the environment.
2. A Pretrained Agent: A pretrained agent provided in the repository.
3. A Final Agent: A stronger pretrained agent, released in the final week of the course, before the deadline.
Your agent's performance should be evaluated based on reward and win rate against each of these models. You should control *blue* agents when evaluating.
To address the problem, we implemented and experimented with the following three algorithms:
- Deep Q-Learning: We experimented with training against random opponents and with self-play (which gave our best results). Since each agent observes only a 13x13x5 tensor, we used a single Q-network that learns from the data of both the red and blue agents. The algorithm converged quite quickly: after 70 training episodes it could completely defeat all three evaluation opponents. Moreover, the training results with random opponents and with self-play were nearly identical. (A sketch of such a shared Q-network follows this list.)
- QMix: We experimented with training against a random opponent. The algorithm did not perform well and failed to improve after 30 episodes. We used a single per-agent Q-network (Q_a) for all agents, learning from the data of both the blue and red agents. We also tried grouping the agents into 9, 27, and 81 clusters instead of feeding the observations of all 81 agents into the QMix mixing network, but the results remained poor and showed no improvement.
- VDN: Similar to QMix, except that we used 81 separate Q-networks sharing the same architecture, trained only against random opponents. Like QMix, the algorithm performed poorly and failed to improve. (A sketch of the VDN and QMix value combinations also follows this list.)
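As a rough illustration of the shared Q-network mentioned in the Deep Q-Learning item, here is a minimal sketch. The layer sizes are assumptions made for illustration (the actual architecture lives in `model/networks.py`), and `num_actions` is just a parameter that must match the environment's action space:

```python
import torch
import torch.nn as nn

class SharedQNetwork(nn.Module):
    """Illustrative Q-network shared by all agents (not the exact model in
    model/networks.py). Input: one (13, 13, 5) local observation;
    output: one Q-value per discrete action."""

    def __init__(self, obs_shape=(13, 13, 5), num_actions=21):
        super().__init__()
        in_channels = obs_shape[-1]
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():  # infer the flattened feature size once
            n_flat = self.conv(torch.zeros(1, in_channels, *obs_shape[:2])).shape[1]
        self.head = nn.Sequential(
            nn.Linear(n_flat, 128), nn.ReLU(),
            nn.Linear(128, num_actions),
        )

    def forward(self, obs):
        # obs arrives as (batch, 13, 13, 5); Conv2d expects channels first
        x = obs.permute(0, 3, 1, 2).float()
        return self.head(self.conv(x))
```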
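The VDN and QMix items above differ only in how per-agent Q-values are combined into a team value: VDN sums them, while QMix mixes them with state-conditioned, monotonic weights. A minimal sketch of both combinations (the embedding size and layer shapes are illustrative assumptions, not the project's exact configuration):

```python
import torch
import torch.nn as nn

def vdn_team_q(per_agent_q: torch.Tensor) -> torch.Tensor:
    """VDN: the joint Q-value is the plain sum of per-agent Q-values.
    per_agent_q: (batch, n_agents) Q-values of the actions each agent took."""
    return per_agent_q.sum(dim=1)

class QMixMixer(nn.Module):
    """QMix: mix per-agent Q-values with weights generated from the global
    state; abs() keeps the mixing monotonic in each agent's Q-value."""

    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(), nn.Linear(embed_dim, 1))

    def forward(self, per_agent_q, state):
        # per_agent_q: (batch, n_agents), state: (batch, state_dim)
        b = per_agent_q.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, self.embed_dim)
        hidden = torch.relu(per_agent_q.view(b, 1, self.n_agents) @ w1 + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        return (hidden @ w2 + b2).view(b)  # (batch,) team Q-value
```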
In all experiments (our agents were trained as blue), we used 70 training episodes with an epsilon decay of 0.04 (initial value 1.0, minimum value 0.01). The batch size was fixed at 64, and the environment was configured with an
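A minimal sketch of the epsilon schedule just described, assuming a linear per-episode decay (one plausible reading of "decay of 0.04"):

```python
EPS_START, EPS_MIN, EPS_DECAY = 1.0, 0.01, 0.04  # values quoted above

def epsilon(episode: int) -> float:
    """Exploration rate for a given episode under an assumed linear decay."""
    return max(EPS_MIN, EPS_START - EPS_DECAY * episode)
```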
During the training process, we logged several metrics to evaluate the effectiveness of each method (details provided at):
From the results, the Q-Learning algorithm proved to be significantly more effective than the other two methods. We also experimented with this algorithm under three settings: a network with the same architecture as the published pretrained model, a custom-designed network, and self-play. The results were nearly identical across all settings. Below are some experimental results when our trained agent was tested against the three agents used for evaluation:
- The DQN agents (checkpoint) trained against a random opponent achieved the following results against the Random, Pretrained, and Final agents (shown from left to right):
- The DQN agents (checkpoint) trained with self-play achieved the following results against the Random, Pretrained, and Final agents (shown from left to right):
We ran the provided eval.py script using the DQN agents trained against a random opponent and obtained the following results:
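For context, the kind of loop such an evaluation runs can be sketched as below. This is not the provided eval.py: `blue_policy`/`red_policy` are placeholders for a trained or random policy, `max_cycles=300` is an assumed setting, and counting a win by comparing deaths on each side is an assumption about how the win rate is defined:

```python
from magent2.environments import battle_v4

def evaluate(blue_policy, red_policy, n_episodes=10):
    """Rough estimate of blue-side reward and win rate against a fixed red policy."""
    env = battle_v4.env(max_cycles=300)  # assumed setting
    wins, total_blue_reward = 0, 0.0
    for _ in range(n_episodes):
        env.reset()
        deaths = {"blue": 0, "red": 0}
        for agent in env.agent_iter():
            obs, reward, termination, truncation, _ = env.last()
            side = "blue" if agent.startswith("blue") else "red"
            if side == "blue":
                total_blue_reward += reward
            if termination or truncation:
                if termination:          # the agent died during the episode
                    deaths[side] += 1
                env.step(None)            # finished agents must step with None
            else:
                policy = blue_policy if side == "blue" else red_policy
                env.step(policy(obs))
        if deaths["blue"] < deaths["red"]:
            wins += 1
    return total_blue_reward / n_episodes, wins / n_episodes
```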
First, clone this repo and install the dependencies with:

```bash
pip install -r requirements.txt
```

Alternatively, you only need to download the notebook at `notebooks/rl-training.ipynb` and run all cells.
Second, if you want to retrain the Deep Q-Learning agents for both the random and self-play settings, use the following command:

```bash
python train.py -mode=<self-play or random> -save_dir=<path to save model cpt>
```
If you want to run the evaluation code using our pretrained model checkpoints, execute the following command (we provide pretrained checkpoints for both random and self-play):

```bash
python eval.py -model_path=<path to model>
```
If you want to view demo videos of matches between agents, run the following command:

```bash
python main.py -blue_agent=<blue name> -red_agent=<red_name> -save_path=<path to save video>
```
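For reference, a minimal sketch of how such a demo video can be recorded from the MAgent2 battle environment; main.py is the authoritative script, and the random actions, `max_cycles=300`, the output filename, and the use of imageio are illustrative assumptions:

```python
import imageio
from magent2.environments import battle_v4

env = battle_v4.env(render_mode="rgb_array", max_cycles=300)  # assumed settings
env.reset()
frames = []
for idx, agent in enumerate(env.agent_iter()):
    obs, reward, termination, truncation, _ = env.last()
    if termination or truncation:
        action = None
    else:
        action = env.action_space(agent).sample()  # stand-in for a trained policy
    env.step(action)
    if idx % len(env.possible_agents) == 0:
        frames.append(env.render())  # roughly one frame per pass over the agents
imageio.mimsave("battle_demo.mp4", frames, fps=30)  # requires imageio-ffmpeg
```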
```
MAgent2 RL Final Project
├── utils
│   └── memory.py          # replay buffer and state memory
├── agent                  # agent implementations
│   ├── DQl_agent.py
│   ├── QMix_agent.py
│   └── base_agent.py
├── assets
│   ├── doc
│   └── video
├── notebooks
│   ├── rl-training.ipynb  # training and evaluation in a notebook
│   └── VDN.ipynb
├── main.py                # run inference / render demo videos
├── model                  # Q-network implementations
│   ├── networks.py
│   └── state_dict         # saved checkpoints
├── README.md
├── requirements.txt
├── train.py               # training
├── Report.pdf
└── eval.py                # evaluation against the 3 opponent models
```
- Nguyễn Ngô Việt Trung
- Vũ Minh Tiến
- Phạm Quang Vinh