Major features of this toolbox:
This toolbox contains algorithms, environments, evaluation tools, and helper functions to conduct research on bargaining
in MARL.
This toolbox relies on the Ray/Tune/RLLib framework to provide the basic RL components and research functionalities.
Additional features of using the Ray/Tune/RLLib research framework:
- use components from RLLib with extensive configuration available (e.g. using a PPO policy or a priority replay buffer)
- track your experiments, log easily in TensorBoard, run hyperparameter search
- be agnostic to the deep learning framework
- create new algorithms using the very simple Tune API or the RLLib API
- use the RLLib API to take advantage of a fully customizable training pipeline
- create distributed algorithms (e.g. by using the policy factory of RLLib)
Philosophy: Implement when needed. Improve at each new use. Keep it simple. Keep it flexible. Keep the maintenance cost low.
Support: We actively support researchers by adding tools that they consider relevant for research on bargaining in MARL.
Introduction
marltoolbox is a toolbox that you should fork/clone and customize for yourself. You can create new experiments by
starting from the existing examples. You should edit/inherit any functionality that doesn't exactly fit your needs. This
repository is intended as a toolbox that can be shared in a research team. It is not intended to be used in
production.
marltoolbox is not a framework that provides a simple API to run experiments in a few lines of code (this is a feature
of RLLib).
RLLib is built on top of Tune, and Tune is built on top of Ray. This toolbox, marltoolbox, is built to work
with RLLib but also lets you fall back to Tune only, if needed, at the cost of some functionalities.
To speed up research, we advise taking advantage of the functionalities of Tune and RLLib.
- Ray README (< 5 min)
- Tune's key concepts (< 5 min)
- RLlib in 60 seconds (< 5 min)
Without any local installation, you can work through 2 tutorials that introduce marltoolbox together with Tune and RLLib.
Please use Google Colab to run them:
- Basic - How to use the toolbox (~ 30 mins) (in Colab)
- Evaluations - "Level 1 best-response" and "self-play and cross-play" (~ 30 mins) (in Colab)
Advanced introduction
To explore Tune further:
To explore RLLib further:
- a simple tutorial where RLLib is used to train a PPO algorithm
- RLLib documentation
- RLLib tutorials
- RLLib examples
To explore the toolbox marltoolbox further, take a look at our examples.
The installation is tested with Ubuntu 18.04 LTS (preferred) and 20.04 LTS.
It requires less than 20 GB of disk space, including all the dependencies (PyTorch, etc.).
(Optional) Connect to your virtual machine(VM) on Google Cloud Platform(GCP)
gcloud compute ssh {replace-by-instance-name}
(Usually optional) Do some basic upgrade and install some basic requirements (e.g. needed on a new VM)
sudo apt update
sudo apt upgrade
sudo apt-get install build-essential
# Run this command another time (especially needed with Ubuntu 20.04 LTS)
sudo apt-get install build-essential
(Optional) Use a virtual environment
# If needed, install conda:
## Follow the instructions at https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html
## For example:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# Enter. Enter... yes. Enter. yes.
exit
# Connect again to the VM or open a new terminal
gcloud compute ssh {replace-by-instance-name}
# Check your conda installation
conda list
# Create a virtual environment:
conda create -y -n marltoolbox python=3.8.5
conda activate marltoolbox
pip install --upgrade pip
Install the toolbox: marltoolbox
## Install dependencies
### For RLLib
conda install -y psutil
### (optional) To be able to use most of the gym environments
sudo apt-get install -y libglu1-mesa-dev libgl1-mesa-dev libosmesa6-dev xvfb ffmpeg curl patchelf libglfw3 libglfw3-dev cmake zlib1g zlib1g-dev swig
## Install marltoolbox
git clone https://github.com/longtermrisk/marltoolbox.git
cd marltoolbox
## Here are different installation instructions to support different algorithms
### Default install
pip install -e .
### If you are planning to use LOLA then run instead:
conda install -y python=3.6
pip install -e ".[lola]"
Test the installation
# Check that RLLib is working
## Use RLLib built-in training functionalities
rllib train --run=PPO --env=CartPole-v0 --torch
## Ctrl+C to stop the training
# Check that the toolbox is working
python ./marltoolbox/examples/rllib_api/pg_ipd.py
## You should get the status TERMINATED
# Visualize the logs
tensorboard --logdir ~/ray_results
## If working on GCP: forward the connection from a Virtual Machine(VM) to your machine
## Run this command on your local machine from another terminal (not in the VM)
gcloud compute ssh {replace-by-instance-name} -- -NfL 6006:localhost:6006
## Go to your browser to visualize the url http://localhost:6006/
(Optional) Install additional deep learning libraries (PyTorch CPU only is installed by default)
# Install PyTorch with GPU
# Check cuda version
nvidia-smi
# Look for "CUDA Version: XX.X"
# With the right cuda version:
conda install pytorch torchvision cudatoolkit=[cuda version like 10.2] -c pytorch
# Check PyTorch installation and if your GPU is available to PyTorch
python
import torch
torch.__version__
torch.cuda.is_available()
exit()
# Install Tensorflow
pip install tensorflow
Probably the greatest value of using RLLib/Tune and this toolbox is that you can use the provided environments,
policies and some components of marltoolbox and RLLib (like a PPO agent)
anywhere (e.g. without using Tune or RLLib for anything else).
Yet we recommend using Tune and, if possible, RLLib. There are mainly 3 ways to run experiments with Tune or RLLib.
They support increasing functionalities but also use more and more constrained APIs.
Tune function API (the least constrained, not recommended)
- Constraints: With the Tune function API, you only need to provide the training function. See the Tune documentation.
- Best used: If you want to very quickly run some code from an external repository.
- Functionalities: Running several seeds in parallel and comparing their results. Easily plotting values to TensorBoard and visualizing the plots live. Tracking your experiments and hyperparameters. Hyperparameter search. Early stopping.
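As a rough illustration of the function API described above, the sketch below writes a training loop as a single function that receives a `config` dict and reports one set of metrics per iteration. It is a minimal sketch and not toolbox code: the names `training_function` and `run_seeds` and the toy "loss" are hypothetical, and the `report` argument is a stand-in for Tune's reporting call so that the snippet runs without Ray installed. With Tune, the function would receive only `config`, and Tune itself would schedule the trials (in parallel) and log the reported values.

```python
import random


def training_function(config, report):
    """Toy training loop written as a single function.

    `report` is a stand-in (assumption) for the framework's reporting call.
    """
    rng = random.Random(config["seed"])
    loss = 1.0
    for iteration in range(config["n_iterations"]):
        # Fake "learning": shrink the loss and add a little seed-dependent noise.
        loss = loss * (1.0 - config["lr"]) + rng.uniform(0.0, 0.01)
        report(training_iteration=iteration + 1, loss=loss)


def run_seeds(seeds, lr=0.1, n_iterations=20):
    """Run one trial per seed (sequentially here) and keep the last reported loss."""
    final_losses = {}
    for seed in seeds:
        history = []
        training_function(
            {"seed": seed, "lr": lr, "n_iterations": n_iterations},
            report=lambda **metrics: history.append(metrics),
        )
        final_losses[seed] = history[-1]["loss"]
    return final_losses


final_losses = run_seeds(seeds=[0, 1, 2])
for seed, loss in final_losses.items():
    print(f"seed={seed} final_loss={loss:.4f}")
```

Keeping the whole loop inside one function is what makes this API convenient for wrapping code from an external repository: only the reporting calls need to be added.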
Tune class API (very few constraints, recommended)
- Constraints: You need to provide a Trainer class with at minimum a setup method and a step method. See the Tune documentation.
- Best used: If you want to run some code from an external repository and you need checkpoints. Helpers in this toolbox (marltoolbox.utils.policy.get_tune_policy_class) will also allow you to transform this class (already trained) into frozen RLLib policies. This is useful to produce evaluation against other RLLib algorithms or when using experimentation tools from marltoolbox.utils.
- Additional functionalities: Cleaner format. Checkpoints. Allow conversion to the RLLib policy API.
  The trained agents can be converted to the RLLib policy API for evaluation only. This allows you to use functionalities which rely on the RLLib API (but not training).
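To make the "setup method and step method" constraint concrete, here is a minimal sketch of a class in that shape. It is plain Python so that it runs without Ray: the class name `CoinFlipTrainer`, the toy estimation task and the `get_state` helper are all hypothetical. With Ray installed, such a class would typically inherit from `ray.tune.Trainable`, and Tune would call `setup` once and `step` repeatedly instead of the hand-written loop below.

```python
import random


class CoinFlipTrainer:
    """Plain-Python stand-in for a class-style trainer (hypothetical example)."""

    def setup(self, config):
        # Called once: build everything the training loop needs.
        self.rng = random.Random(config["seed"])
        self.lr = config["lr"]
        self.p_heads_estimate = 0.5
        self.iteration = 0

    def step(self):
        # Called repeatedly: do one unit of training and return metrics.
        sample = 1.0 if self.rng.random() < 0.8 else 0.0  # true p(heads) = 0.8
        self.p_heads_estimate += self.lr * (sample - self.p_heads_estimate)
        self.iteration += 1
        return {
            "training_iteration": self.iteration,
            "p_heads_estimate": self.p_heads_estimate,
        }

    def get_state(self):
        # Minimal "checkpoint": the values needed to resume or freeze the agent.
        return {
            "p_heads_estimate": self.p_heads_estimate,
            "iteration": self.iteration,
        }


trainer = CoinFlipTrainer()
trainer.setup({"seed": 0, "lr": 0.05})
for _ in range(200):
    metrics = trainer.step()
checkpoint = trainer.get_state()
print(metrics)
print(checkpoint)
```

Splitting the loop into `setup` and `step` is what lets an outer framework pause, checkpoint and resume training between steps.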
RLLib API (quite constrained, recommended)
- Constraints: You need to use the RLLib API (trainer, policy, callbacks, etc.). For information, RLLib trainer classes are specific implementations of the Tune class API (just above). See the RLLib documentation.
- Best used: If you are creating a new training setup or policy from scratch. Or if you want a seamless integration with all RLLib components. Or if you need distributed training.
- Additional functionalities: Easily using all components from RLLib (models, environments, algorithms, exploration, schedulers, preprocessing, etc.). Using the customizable trainer and policy factories from RLLib.
Fall back to the Tune APIs when using the RLLib API is too costly
If the setup you want to train already exists, has a training loop, and the cost of converting it to RLLib is too
high, then you can use Tune with minimal changes.
When is the conversion cost to RLLib too high?
- If the algorithm has a complex unusual dataflow
- If the algorithm has an unusual training process
  - like LOLA: performing "virtual" opponent updates
  - like LTFT: nested algorithms
- If you don't need to change the algorithm
- If you don't plan to run the algorithm against policies from RLLib
- If you do not plan to work much with the algorithm and thus do not want to invest time in the conversion to RLLib
- If some of the points above apply and you are only starting to use RLLib
- etc.
- Tutorial_Basics_How_to_use_the_toolbox.ipynb
You can find such examples in marltoolbox.examples.tune_class_api and in marltoolbox.examples.tune_function_api.
Using components directly provided by RLLib or marltoolbox
- Tutorial_Basics_How_to_use_the_toolbox.ipynb
- Using an A3C policy: amd.py with use_rllib_policy = True (toolbox example)
- Using (custom or not) environments:
  - IPD and coin game environments: amd.py (toolbox example)
  - Asymmetric coin game environment: lola_pg_official.py (toolbox example)
  - IPD environments: pg_ipd.py (toolbox example)
  - Coin game environment: ppo_coin_game.py (toolbox example)
  - APEX_DDPG and the water world environment: multi_agent_independent_learning.py
  - MADDPG and the two step game environment: two_step_game.py
  - Policy Gradient (PG) and the rock paper scissors environment: rock_paper_scissors_multiagent.py (in the run_same_policy function)
Customizing existing algorithms from RLLib
- Customize policy's postprocessing (processing after env.step) and trainer: inequity_aversion.py (toolbox example)
- Change the loss function of the Policy Gradient (PG) Policy: rock_paper_scissors_multiagent.py (in the run_with_custom_entropy_loss function)
Creating and using new custom policies in RLLib
In RLLib, customizing a policy allows you to change its training and evaluation logic.
- Hardcoded random Policy: multi_agent_custom_policy.py
- Hardcoded fixed Policy: rock_paper_scissors_multiagent.py (in the run_heuristic_vs_learned function)
- Policy with nested Policies: ltft_with_various_env.py (toolbox example)
Using custom dataflows in RLLib (custom Trainer or Trainer's execution_plan)
- Training 2 different policies with 2 different Trainers (less complex but less sample efficient than the 2nd method below): multi_agent_two_trainers.py
- Training 2 different policies with a custom Trainer (more complex, more sample efficient): two_trainer_workflow.py
Using experimentation tools from the toolbox
- Evaluations_Level_1_best_response_and_self_play_and_cross_play.ipynb
- Training a level 1 best response: l1br_amtft.py (toolbox example)
- Evaluating same-play and cross-play performances: amtft_various_env.py (toolbox example)
Environments
- various matrix social dilemmas
- various coin games
- bargaining with alternating offers (Emergent Communication through Negotiation)
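For readers new to matrix social dilemmas, the sketch below plays an iterated prisoner's dilemma (IPD) between two hand-written strategies. It is only an illustration of the kind of game these environments model, not the toolbox's environment API: the payoff values, the function names and the two strategies are assumptions chosen for the example.

```python
COOPERATE, DEFECT = 0, 1

# Payoffs as (player A, player B). Illustrative values, not taken from the toolbox.
PAYOFFS = {
    (COOPERATE, COOPERATE): (-1.0, -1.0),
    (COOPERATE, DEFECT): (-3.0, 0.0),
    (DEFECT, COOPERATE): (0.0, -3.0),
    (DEFECT, DEFECT): (-2.0, -2.0),
}


def tit_for_tat(opponent_history):
    # Cooperate first, then copy the opponent's previous action.
    return COOPERATE if not opponent_history else opponent_history[-1]


def always_defect(opponent_history):
    return DEFECT


def play_iterated_game(policy_a, policy_b, n_steps):
    """Play n_steps rounds and return the total payoff of each player."""
    history_a, history_b = [], []
    total_a = total_b = 0.0
    for _ in range(n_steps):
        # Each policy only sees the actions the other player took so far.
        action_a = policy_a(history_b)
        action_b = policy_b(history_a)
        reward_a, reward_b = PAYOFFS[(action_a, action_b)]
        total_a += reward_a
        total_b += reward_b
        history_a.append(action_a)
        history_b.append(action_b)
    return total_a, total_b


print(play_iterated_game(tit_for_tat, tit_for_tat, n_steps=10))
print(play_iterated_game(tit_for_tat, always_defect, n_steps=10))
```

With these payoffs, mutual cooperation gives both players a better total than mutual defection, yet defecting against a cooperator pays more in a single round, which is what makes the game a social dilemma.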
Algorithms
- AMD (Adaptive Mechanism Design)
- amTFT (Approximate Markov Tit-For-Tat)
- LTFT (Learning Tit-For-Tat, simplified version)
- LOLA-Exact, LOLA-PG, LOLA-DICE
- supervised learning
- population
  - This policy plays an episode by sampling a policy from a population of similar policies
- hierarchical
  - It is a base policy class which allows the use of nested algorithms
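The population idea above (one sub-policy sampled per episode) can be sketched in a few lines. This is a hypothetical stand-alone illustration, not the toolbox's population policy class: the class name, the `on_episode_start` hook and the use of plain callables as sub-policies are assumptions.

```python
import random


class PopulationPolicy:
    """Acts with one sub-policy per episode, sampled from a fixed population."""

    def __init__(self, policies, seed=None):
        if not policies:
            raise ValueError("The population must contain at least one policy.")
        self.policies = list(policies)
        self.rng = random.Random(seed)
        self.active_policy = self.policies[0]

    def on_episode_start(self):
        # Sample the sub-policy that will act during the whole next episode.
        self.active_policy = self.rng.choice(self.policies)

    def compute_action(self, observation):
        # Delegate to the sub-policy selected for the current episode.
        return self.active_policy(observation)


# Two trivial sub-policies: one always plays action 0, the other action 1.
population = PopulationPolicy(policies=[lambda obs: 0, lambda obs: 1], seed=0)
episodes = []
for _ in range(5):
    population.on_episode_start()
    episodes.append([population.compute_action(observation=None) for _ in range(3)])
print(episodes)
```

Because the sub-policy is only resampled at episode boundaries, the actions within one episode always come from the same member of the population.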
Utils
- exploration
  - SoftQ with temperature schedule
  - SoftQ with clustering of the Q values
- log
  - callbacks to log values from environments and policies
- lvl1_best_response
  - helper functions to train level 1 exploiters
- policy
  - helper to transform a trained Tune Trainer into frozen RLLib policies
- postprocessing
  - helpers to compute welfare functions and add this data in the evaluation batch (the batches sampled by the evaluation workers)
- restore
  - helpers to load a checkpoint only for a chosen policy (instead of for all existing policies as RLLib does)
- rollout
  - a rollout runner function which can be called from inside a RLLib policy
- self_and_cross_perf
  - a helper to evaluate the performance in self-play and cross-play.
    "self-play": playing against agents from the same training run.
    "cross-play": playing against agents from different training runs.
- plot
  - helpers to plot results
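The self-play and cross-play definitions given above for self_and_cross_perf can be illustrated with a small computation. The sketch below is not the toolbox helper: the function name and the payoff numbers are made up, and it assumes that `payoffs[i][j]` holds the mean payoff observed when the first player comes from training run `i` and the second player from training run `j`.

```python
from statistics import mean


def self_and_cross_play_means(payoffs):
    """Return (mean self-play payoff, mean cross-play payoff)."""
    n_runs = len(payoffs)
    if n_runs < 2:
        raise ValueError("Cross-play needs at least two training runs.")
    # Self-play: both players come from the same training run (diagonal).
    self_play = [payoffs[i][i] for i in range(n_runs)]
    # Cross-play: the players come from different training runs (off-diagonal).
    cross_play = [
        payoffs[i][j] for i in range(n_runs) for j in range(n_runs) if i != j
    ]
    return mean(self_play), mean(cross_play)


# Made-up payoffs for 3 training runs.
payoffs = [
    [3.0, 1.0, 0.5],
    [1.5, 3.0, 1.0],
    [0.5, 1.5, 3.0],
]
self_play_mean, cross_play_mean = self_and_cross_play_means(payoffs)
print(f"self-play: {self_play_mean}, cross-play: {cross_play_mean}")
```

A large gap between the two means, as in this made-up table, suggests that agents coordinate well with partners from their own run but poorly with independently trained ones.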
Scripts
- aggregate_and_plot_tensorboard_data
  - a script to aggregate the logged values from several seeds (into mean, std, etc.) and to create summary plots
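The kind of aggregation this script performs can be sketched as follows. This is not the script itself: the function name, the input format (one list of logged values per seed) and the choice of the sample standard deviation are assumptions made for the example.

```python
from statistics import mean, stdev


def aggregate_over_seeds(curves):
    """Aggregate per-step values over seeds.

    `curves` maps a seed to the list of values logged at each step
    (needs at least two seeds of equal length to compute a sample std).
    """
    lengths = {len(values) for values in curves.values()}
    if len(lengths) != 1:
        raise ValueError("All seeds must have logged the same number of steps.")
    summary = []
    # zip(*...) groups together the values logged at the same step.
    for step_values in zip(*curves.values()):
        summary.append(
            {
                "mean": mean(step_values),
                "std": stdev(step_values),
                "min": min(step_values),
                "max": max(step_values),
            }
        )
    return summary


# Made-up reward curves for two seeds.
curves = {0: [1.0, 2.0, 3.0], 1: [3.0, 2.0, 5.0]}
summary = aggregate_over_seeds(curves)
for step, stats in enumerate(summary):
    print(step, stats)
```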
Improvements
- Add unit tests for the algorithms
- Refactor the algorithms to make them more readable
- Use the logger everywhere
- Add and improve docstrings
- Set good hyper-parameters in the custom examples
- Report all results directly in Weights&Biases (saving download time from VM)
New algorithms
- Multi-agent adversarial IRL
- Multi-agent generative adversarial imitation learning
- Model-based RL like PETS, MPC
- Opponent modeling like k-level
- Capability to use algorithms from OpenSpiel like MCTS
New functionalities
- Reward uncertainty
- Full / partial observability of opponent actions
- (partial) Parameter transparency
- Easy benchmarking with metrics specific to MARL
- (more on) Exploitability evaluation
- Performance against a suite of other MARL algorithms
New environments
- Capability to use environments from OpenSpiel
- (iterated) Ultimatum game (including variants)