SRL is an efficient, scalable, and extensible distributed Reinforcement Learning system. SRL supports running several state-of-the-art RL algorithms on common environments from a single configuration file, and also exposes general APIs for users to develop their own environments, policies, and algorithms. SRL even allows users to implement new system components to support their algorithm designs if the current system architecture is not sufficient.
- Our support for multi-agent training goes beyond classic MAPPO experiments. To unleash full control over your agents, check out our experiment configuration doc.
- We provide a quick start for algorithm developers. Users can now migrate their environments and write customized policies and trainers without knowing the details of the system implementation; a minimal sketch follows.
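To give a flavor of what the quick start covers, a user-defined environment can be as small as the sketch below. This is purely illustrative: the method names (`reset`, `step`) and the `StepResult` container are assumptions made for this sketch, not the exact interfaces exposed by the `api` package; see the quick start for the real ones.

```python
# Hypothetical sketch of a user-defined environment. The real base classes and
# data containers live in the `api` package; the names used here are assumptions.
from dataclasses import dataclass, field
from typing import Any, Dict

import numpy as np


@dataclass
class StepResult:
    obs: np.ndarray
    reward: float
    done: bool
    info: Dict[str, Any] = field(default_factory=dict)


class MyGridWorld:
    """A toy single-agent environment: step right until reaching the goal."""

    def __init__(self, size: int = 10):
        self.size = size
        self.pos = 0

    def reset(self) -> np.ndarray:
        self.pos = 0
        return np.array([self.pos], dtype=np.float32)

    def step(self, action: int) -> StepResult:
        # Action 1 moves right; any other action stays put.
        self.pos += 1 if action == 1 else 0
        done = self.pos >= self.size
        return StepResult(
            obs=np.array([self.pos], dtype=np.float32),
            reward=1.0 if done else 0.0,
            done=done,
        )
```

In the running system, the actor workers are the ones that run such environments; the quick start explains how to register an environment and pair it with a customized policy and trainer.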
RL system components (an illustrative sketch of how an experiment might declare them follows the list):
- Controller: a "control panel" connected to all workers.
- Actor worker: runs several environments.
- Policy worker: generates actions for the agents.
- Buffer worker: prepares data for trainers; optional in most cases.
- Trainer worker: computes gradients and updates parameters.
- Eval manager: logs evaluation results and updates parameter metadata.
- Population manager: controls the progress of population-based experiments.
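The sketch below shows, under assumed names, how an experiment configuration might declare how many of each worker to launch. None of the dataclass fields or the registry here are the actual SRL configuration API; they are placeholders to make the division of labor concrete. The real knobs are described in the experiment configuration doc.

```python
# Illustrative only: every name below is an assumption, not the actual SRL API.
from dataclasses import dataclass, field


@dataclass
class WorkerCounts:
    actors: int = 32        # actor workers, each running several environments
    policies: int = 4       # policy workers generating actions for the agents
    trainers: int = 1       # trainer workers computing gradients and updating parameters
    buffers: int = 0        # buffer workers are optional in most cases
    eval_managers: int = 1  # logs evaluation results and updates parameter metadata


@dataclass
class MyAtariExperiment:
    env_name: str = "PongNoFrameskip-v4"  # assumed environment name
    workers: WorkerCounts = field(default_factory=WorkerCounts)
    timeout_days: int = 3                 # experiments time out after 3 days by default


# A registry along these lines is what lets `-e my-atari-exp` resolve to a configuration.
EXPERIMENT_REGISTRY = {"my-atari-exp": MyAtariExperiment}
```

The controller (the "control panel") coordinates all of these workers once the experiment is launched; population-based experiments additionally involve the population manager.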
Scheduler-related:
- experiment_name (-e): the name registered by an experiment configuration.
- trial_name (-f): the name given when launching an experiment.
- `api`: Development API for algorithms and environments.
- `apps`: Main entry point.
- `base`: The base library, including anything unrelated to the RL logic, e.g. networking utils and general data structures & algorithms.
- `codespace`: Where developers should place their code.
- `distributed`: Directory for the distributed system.
- `legacy`: Implementations of classic algorithms / environments.
- `local`: A local version of the distributed system.
- `scripts`: Scripts for developers.
See code-style.md for a guide on development. See cluster.md for a description of our cluster.
Prerequisites
- Ask the administrators for an account on the cluster.
- Set up your VPN. Ask the administrators for details.
- On your PC, add the following lines to your ~/.ssh/config:

```
Host prod
    HostName 10.210.14.4
    User {YOUR_USER_NAME}
```
First, sync the repo to frlcpu001:

```bash
scripts/sync_repo prod
```
Alternatively, you can check out the repo on the server. Make sure to sync or check out the code to /home so that it is visible on all nodes.
To run a mini experiment:
```bash
python3 -m apps.main start -e my-atari-exp -f $(whoami)-test --mode slurm --wandb_mode offline
```
This runs the experiment `my-atari-exp` with the trial name `username-test`. The mode should be `slurm` unless you are running the code on your PC or within a container. You can also configure your wandb API key in a terminal to allow `--wandb_mode online`:
```bash
# Get your WANDB API key from: https://wandb.ai/authorize
echo "export WANDB_API_KEY=<set your WANDB_API_KEY here>" >> ~/.profile
# Set the wandb host to our proxy.
echo 'export WANDB_BASE_URL="http://proxy.newfrl.com:8081"' >> ~/.profile
```
By default, experiments time out after 3 days. You can change this value in your experiment configuration.
System documentation has been moved to the system documentation page.
We use both wandb and Prometheus. Run `wandb init` and use `--wandb_mode online` to use the former. The Prometheus login is the same as for the cluster.
Check out the W&B configuration doc for how to customize your `wandb_run` setup.
Check out optimize_your_experiment.md for how to improve the efficiency of your experiment.