This repo provides a simple, distributed and asynchronous multi-agent reinforcement learning framework for the Google Research Football environment, along with research tools and results for benchmarking. In particular, it includes:
- A distributed and asynchronous MARL framework
- Implementations of the IPPO, MAPPO, HAPPO, A2PO, and MAT algorithms
- Ready-to-run experiment configurations
- Population-based training pipelines, such as PSRO and League Training
- Pre-trained GRF policies in both 5-vs-5 and 11-vs-11 full-game scenarios
- Single-step match replay debugger
- Tutorial for GRF online ranking
Documentation: grf-marl.readthedocs.io/
Implementation for CDS_QMIX and CDS_QPLEX: MyCDS benchmark
Check out the paper at Boosting Studies of Multi-Agent Reinforcement Learning on Google Research Football Environment: the Past, Present, and Future
- Install
- Execution
- Cooperative MARL benchmark
- Population-based self-play training
- Framework architecture
- Google Research Football Toolkit
- Pre-trained policies
- Online Ranking
- Contribution
- Tensorboard tags explained
- Contact
You can use any tool to manage your Python environment. Here, we use conda as an example.
- Install conda/miniconda.
- Create a new conda env:
conda create -n light-malib python==3.9
- Activate the env by running
conda activate light-malib
whenever you want to use it, or add this line to your .bashrc file so it is enabled every time you log into the shell.
- Clone the repository and install the required dependencies:
git clone https://github.com/jidiai/GRF_MARL.git
cd GRF_MARL
pip install -r requirements.txt
- Follow the instructions on the official website https://pytorch.org/get-started/locally/ to install PyTorch (for example, version 1.13.0+cu116).
- Follow the instructions in the official repo https://github.com/google-research/football to install the Google Research Football environment. A quick sanity check is sketched below.
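Optionally, verify the installation with a short script. This is a minimal sketch assuming a standard PyTorch and GRF install; academy_empty_goal_close is one of the scenarios that ships with gfootball:

```python
# Sanity check: confirm PyTorch sees a GPU (if any) and that a GRF environment
# can be created and stepped with a random action.
import torch
import gfootball.env as football_env

print("torch", torch.__version__, "CUDA available:", torch.cuda.is_available())

env = football_env.create_environment(env_name="academy_empty_goal_close")
env.reset()
obs, reward, done, info = env.step(env.action_space.sample())
print("GRF step OK, reward:", reward)
```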
After installation, run an example experiment by executing the following command from the home folder:
python3 light_malib/main_pbt.py --config PATH_TO_CONFIG
where PATH_TO_CONFIG is the relative path of the experiment configuration file.
To run experiments on a small cluster, please follow Ray's official instructions to start a cluster. For example, use ray start --head on the master node, then connect the other machines to the master following the hints printed in the command-line output.
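As a quick connectivity check before launching experiments, a driver process can attach to the running cluster with ray.init(address="auto"). The snippet below is only a sketch to confirm that remote workers on other machines are reachable; it is not part of Light-MALib itself:

```python
# Sketch: attach to an existing Ray cluster (started via `ray start --head`)
# and confirm on which nodes tasks are scheduled.
import ray

ray.init(address="auto")  # connect to the running cluster instead of starting a local one

@ray.remote
def node_ip():
    import socket
    return socket.gethostbyname(socket.gethostname())

# The printed set of IPs should include the worker machines connected to the head node.
print(set(ray.get([node_ip.remote() for _ in range(20)])))
ray.shutdown()
```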
We support multiple algorithms on benchmark scenarios.
- Pass and shoot with keeper (2v1): A 3 vs 2 academy game. Two left-team players start in the right half, competing against one right-team defender and the goalkeeper. The episode terminates when (a) it reaches the maximum duration (400 steps), (b) the ball goes out of bounds, (c) one team scores, or (d) ball possession changes.
- 3 vs 1 with keeper (3v1): A 4 vs 2 academy game. Three left-team players start in the right half, competing against one right-team defender and the goalkeeper. The same termination conditions apply as in the pass and shoot with keeper scenario.
- Corner: An 11 vs 11 academy game. The left team starts with a corner kick from the right team's corner. The same termination conditions apply as in the pass and shoot with keeper scenario.
- Counterattack (CT): An 11 vs 11 academy game. Four left-team players start with the ball at midfield in the right team's half, with only two right-team players defending in their own half; the remaining players are in the left team's half. The same termination conditions apply as in the pass and shoot with keeper scenario.
- 5-vs-5 full-game (5v5): A 5 vs 5 full game. Four players from each team gather at the center of the field, and the left team kicks off. The game terminates when the episode reaches the maximum duration (3,000 steps). The second half begins at the 1,501st step, when the two teams swap sides.
- 11-vs-11 full-game (11v11): An 11 vs 11 full game. The left team kicks off. The game terminates when the episode reaches the maximum duration (3,000 steps). The second half begins at the 1,501st step, when the two teams swap sides.
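For reference, the scenarios above correspond to standard gfootball scenario names. The snippet below sketches how one of them can be created directly with gfootball; the exact settings used by the benchmark (observation representation, number of controlled players, rewards) are defined in the experiment configuration files, so treat these arguments as illustrative:

```python
# Illustrative only: scenario names come from gfootball; the benchmark's actual
# environment settings live in the YAML files under expr_configs.
import gfootball.env as football_env

env = football_env.create_environment(
    env_name="academy_3_vs_1_with_keeper",    # also: "academy_pass_and_shoot_with_keeper",
                                              # "academy_corner", "academy_counterattack_hard",
                                              # "5_vs_5", "11_vs_11_kaggle"
    representation="raw",                     # full game state; feature encoding is done separately
    number_of_left_players_agent_controls=3,  # three controlled left-team players in 3v1
)
obs = env.reset()
```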
- Independent PPO (IPPO)
- Multi-Agent PPO (MAPPO)
- Heterogeneous-Agent PPO (HAPPO)
- Agent-by-agent Policy Optimization (A2PO)
- Multi-Agent Transformer (MAT)
The experiment configurations are listed under this folder. You can run an experiment, for example, by
python3 light_malib/main_pbt.py --config expr_configs/cooperative_MARL_benchmark/academy/pass_and_shoot_with_keeper/ippo.yaml
- 5-vs-5 full-game (5v5)
- 11-vs-11 full-game (11v11)
- Policy Space Response Oracle (PSRO)
- League Training
The experiment configurations are listed under this folder. You can run an experiment, for example, by
python3 light_malib/main_pbt.py --config expr_configs/population_based_self_play/ippo_5v5_hard_psro.yaml
We offer some pre-trained policies for study in both 5-vs-5 and 11-vs-11 full-game scenarios. You may want to use them as opponents or for initialization. Please refer to this section.
Our framework design draws great inspiration from MALib and RLlib. It has five major components, each serving a specific role:
- Rollout Manager: The Rollout Manager establishes multiple parallel rollout workers and delegates rollout tasks to each worker. Each rollout task includes environment settings, policy distributions for simulation, and information pertaining to the Episode Server.
- Training Manager: The Training Manager sets up multiple distributed trainers and assigns training tasks to each trainer. Training task descriptions consist of training configurations and details regarding the Policy and Episode buffers.
- Data Buffer: The Data Buffer serves as a repository for episodes and policies. The Episode Server saves new episodes submitted by the rollout workers, while trainers retrieve sampled episodes from the Episode Server for training. The Policy Server, on the other hand, stores updated policies submitted by the Training Manager. Rollout workers subsequently fetch these updated policies from the Policy Server for simulation.
- Agent Manager: The Agent Manager manages a population of policies and their associated data, which includes pairwise match results and individual rankings.
- Task Scheduler: The Task Scheduler is responsible for scheduling and assigning tasks to the Training Manager and Rollout Manager. In each training generation, it selects an opponent distribution based on computed statistics retrieved from the Agent Manager.
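To make the data flow concrete, below is a toy sketch of the rollout-worker / episode-buffer / trainer pattern using Ray actors. The class and function names are invented for illustration and do not correspond to Light-MALib's actual classes:

```python
# Toy illustration of the rollout/training data flow (hypothetical names, not
# Light-MALib's API): rollout workers push episodes into an episode buffer and a
# trainer samples batches from it.
import random
import ray

@ray.remote
class EpisodeBuffer:
    def __init__(self):
        self.episodes = []

    def push(self, episode):
        self.episodes.append(episode)

    def sample(self, batch_size):
        return random.sample(self.episodes, min(batch_size, len(self.episodes)))

@ray.remote
def rollout_worker(buffer, worker_id, num_episodes):
    # A real rollout worker would pull the latest policies from the Policy Server
    # and step the GRF simulator; here we just push dummy episodes.
    for i in range(num_episodes):
        ray.get(buffer.push.remote({"worker": worker_id, "episode": i, "return": random.random()}))

@ray.remote
def trainer(buffer, num_updates, batch_size):
    for step in range(num_updates):
        batch = ray.get(buffer.sample.remote(batch_size))
        # A real trainer would compute gradients here and push updated weights
        # back to the Policy Server.
        print(f"update {step}: sampled {len(batch)} episodes")

if __name__ == "__main__":
    ray.init()
    buffer = EpisodeBuffer.remote()
    ray.get([rollout_worker.remote(buffer, i, 10) for i in range(4)])
    ray.get(trainer.remote(buffer, num_updates=3, batch_size=8))
    ray.shutdown()
```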
Besides training against a fixed opponent, Light-MALib also supports population-based training, such as Policy-Space Response Oracle (PSRO). An illustration of a PSRO trial is given below:
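Conceptually, each PSRO generation evaluates the current population, derives an opponent (meta) distribution from the payoff results, and trains a new best response against it. The loop below is a toy sketch with stand-in functions, not the actual Task Scheduler logic:

```python
# Toy PSRO loop (conceptual sketch; the meta-solver and training step are stand-ins).
import numpy as np

def evaluate(population) -> np.ndarray:
    # Stand-in for pairwise matches; returns a random antisymmetric payoff table.
    a = np.random.uniform(-1, 1, size=(len(population), len(population)))
    return (a - a.T) / 2

def solve_meta_game(payoff: np.ndarray) -> np.ndarray:
    # Placeholder meta-solver: uniform opponent distribution. A (rectified) Nash
    # solver could be plugged in here instead.
    return np.ones(payoff.shape[0]) / payoff.shape[0]

def train_best_response(population, opponent_dist) -> str:
    # Stand-in for an RL run (e.g. MAPPO) against opponents sampled from
    # `opponent_dist`; returns a label for the newly trained policy.
    return f"policy_{len(population)}"

population = ["policy_0"]
for generation in range(3):
    payoff = evaluate(population)
    meta = solve_meta_game(payoff)
    population.append(train_best_response(population, meta))
    print(f"generation {generation}: opponent distribution {np.round(meta, 2)}")
```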
Currently, we provide the following tools to support research on football AI.
A data structure that represents a game as a tree, with branches marking important events such as goals or interceptions. See its usage in the README.
A single-step graphical debugger that renders both 3D and 2D frames together with detailed frame data, such as the movements of players and the ball. See its usage in the README.
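As a purely illustrative sketch of the Game Graph idea (these are not the actual classes), such a structure can be thought of as a tree whose nodes are segments of play and whose branches are created at key events:

```python
# Hypothetical illustration of a game stored as an event-branching tree.
from dataclasses import dataclass, field
from typing import List

@dataclass
class GameNode:
    start_step: int
    end_step: int
    event: str = "kickoff"            # e.g. "goal", "interception"
    children: List["GameNode"] = field(default_factory=list)

    def branch(self, child: "GameNode") -> "GameNode":
        self.children.append(child)
        return child

root = GameNode(0, 120)
root.branch(GameNode(121, 180, event="goal"))
root.branch(GameNode(121, 210, event="interception"))
```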
At this stage, we release some of our trained models for use as initializations or opponents.
See documentation.
Thanks for your interest! The project is open for contributions. You can add either a new environment or a new algorithm to be tested under the framework.
For a new environment, feel free to check out this example.
For a new algorithm, put it under the directory light_malib/algorithm/ and include the following components:
- loss.py: given samples, computes the loss function and performs the gradient update;
- policy.py: the policy instance, mainly responsible for action generation;
- trainer.py: the trainer class for data preprocessing.
For each policy setting (actor/critic networks, feature settings, etc.), please check out this doc.
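As a rough orientation only, the skeleton below shows how these three files might divide responsibilities. The class names and method signatures are hypothetical and do not match Light-MALib's base classes; copy an existing algorithm folder (for example, the MAPPO one) to get the real interface:

```python
# Hypothetical skeleton of a new algorithm folder (names and signatures invented).
import torch
import torch.nn as nn


class NewAlgoPolicy(nn.Module):
    """policy.py: action generation."""
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
        self.critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def compute_action(self, obs: torch.Tensor) -> torch.Tensor:
        logits = self.actor(obs)
        return torch.distributions.Categorical(logits=logits).sample()


class NewAlgoLoss:
    """loss.py: loss computation and gradient update."""
    def __init__(self, policy: NewAlgoPolicy, lr: float = 5e-4):
        self.policy = policy
        self.optimizer = torch.optim.Adam(policy.parameters(), lr=lr)

    def step(self, batch: dict) -> float:
        values = self.policy.critic(batch["obs"]).squeeze(-1)
        loss = torch.mean((batch["returns"] - values) ** 2)  # placeholder loss only
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return loss.item()


class NewAlgoTrainer:
    """trainer.py: data preprocessing and the optimization call."""
    def __init__(self, loss: NewAlgoLoss):
        self.loss = loss

    def preprocess(self, samples: dict) -> dict:
        return {k: torch.as_tensor(v, dtype=torch.float32) for k, v in samples.items()}

    def optimize(self, samples: dict) -> float:
        return self.loss.step(self.preprocess(samples))
```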
DataServer:
- alive_usage_mean/std: mean/std usage of data samples in the buffer;
- mean_wait_time: total read waiting time divided by the number of reads;
- sample_per_minute_read: number of samples read per minute;
- sample_per_minute_write: number of samples written per minute.
PSRO:
- Elo: Elo rating during PBT;
- Payoff Table: plot of the payoff table.
Rollout:
- bad_pass, bad_shot, get_intercepted, get_tackled, good_pass, good_shot, interception, num_pass, num_shot, tackle, total_move, total_pass, total_possession, total_shot: detailed football statistics;
- goal_diff: goal difference of the training agent (positive indicates more goals scored);
- lose/win: expected lose/win rate during rollout;
- score: expected score during rollout; a single game scores 0 for a loss, 1 for a win, and 0.5 for a draw.
RolloutTimer:
- batch: timer for getting a rollout batch;
- env_core_step: timer for simulator stepping;
- env_step: total timer for an environment step;
- feature: timer for feature encoding;
- inference: timer for policy inference;
- policy_update: timer for pulling policies from remote;
- reward: timer for reward calculation;
- rollout: total timer for one rollout;
- sample: timer for policy sampling;
- stats: timer for collecting statistics.
Training:
- Old_V_max/min/mean/std: value estimates at rollout time;
- V_max/min/mean/std: current value estimates;
- advantage_max/min/mean/std: advantage values;
- approx_kl: KL divergence between old and new action distributions;
- clip_ratio: proportion of clipped entries;
- delta_max/min/mean/std: TD errors;
- entropy: entropy value;
- imp_weights_max/min/mean/std: importance weights;
- kl_diff: variation of approx_kl;
- lower_clip_ratio: proportion of up-clipped entries;
- upper_clip_ratio: proportion of down-clipped entries;
- policy_loss: policy loss;
- training_epoch: number of training epochs at each iteration;
- value_loss: value loss.
TrainingTimer:
- compute_return: timer for GAE computation;
- data_copy: timer for copying data during processing;
- data_generator: timer for generating data;
- loss: total timer for loss computation;
- move_to_gpu: timer for sending data to the GPU;
- optimize: total timer for an optimization step;
- push_policy: timer for pushing trained policies to the remote;
- train_step: total timer for a training step;
- trainer_data: timer for getting data from local_queue;
- trainer_optimize: timer for an optimization step in the trainer.
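To pull these scalars out of the TensorBoard event files programmatically (e.g. for custom plots), the sketch below uses TensorBoard's EventAccumulator. The log directory and the tag name are placeholders; call ea.Tags() to see how tags are actually namespaced in your run:

```python
# Sketch: read scalar tags from a TensorBoard run directory.
from tensorboard.backend.event_processing import event_accumulator

ea = event_accumulator.EventAccumulator("logs/my_experiment")  # hypothetical path
ea.Reload()

print(ea.Tags()["scalars"])                     # list all scalar tags in this run
for event in ea.Scalars("Rollout/goal_diff"):   # hypothetical tag name
    print(event.step, event.value)
```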
If you have any questions about this repo, feel free to open an issue. You can also contact the current maintainers, YanSong97 and DiligentPanda, by email.