This repository contains the code accompanying the paper *Multi-Task Batch Reinforcement Learning with Metric Learning*.
Code for the full model and for each baseline and ablation can be found under its corresponding folder. For ease of use, each method is kept in its own folder; as a result, some files are duplicated across the repository.
For software dependencies, please have a look inside the `environment` folder; a conda environment can be created from `environment.yml`.

To create the conda environment, `cd` into the `environment` folder and run:

```
python install_mujoco.py
conda env create -f environment.yml
```
**Note:** To reproduce the results of Fig. 5 in our paper (the MetaGenRL comparison), one should create a separate conda environment and run the following commands:

```
pip3 install "ray[tune]==0.7.7" "gym[all]" "mujoco_py>=2" tensorflow-gpu==1.15.2 scipy numpy
python3 -c 'import ray; from pyarrow import plasma as plasma; plasma.build_plasma_tensorflow_op()'
```

This differs slightly from the instructions in the official MetaGenRL repo.
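For completeness, a minimal sketch of creating and activating such an environment before running the commands above (the environment name and Python version are our assumptions, not specified by the original instructions; TensorFlow 1.15 requires Python 3.7 or earlier):

```
# create and activate a dedicated environment for MetaGenRL (names are illustrative)
conda create -n metagenrl python=3.7
conda activate metagenrl
```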
To reproduce the results, we provide on Google Drive the collected transition buffers for each training task, the trained BCQ models, and the ensemble predictors, i.e., the outputs of the first phase of the training pipeline. Please download all the data and put it in the `data_and_trained_models` folder; otherwise, one should take care to specify the correct locations when running the experiments below. After downloading the files from Google Drive, unzip them with:

```
tar -xvf data_and_trained_model.tar.gz
```
To reproduce the results of Fig. 11, one should further download the multi-task policy trained without the reward ensemble from the link. After downloading the file, unzip it with

```
tar -xvf full_model_no_ensemble_results.tar.gz
```

and move the extracted folder into the `data_and_trained_models` folder.
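For example, assuming the archive extracts to a folder named `full_model_no_ensemble_results` (an assumption based on the archive name):

```
# move the extracted folder next to the other downloaded data
mv full_model_no_ensemble_results data_and_trained_models/
```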
Experiments are configured via `.py` configuration files located in `./configs`. To explain how to evaluate the different methods, we take `AntDir` as an example. Except for `WalkerParam`, one can obtain the corresponding results for the other task distributions by simply changing the distribution name. We specify the differences for `WalkerParam` and provide the code to reproduce its results below.
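For example, evaluating a method on a different task distribution only requires passing a different config name. The two names below appear elsewhere in this README; check each method's `configs` folder for the full set of available files:

```
python main.py --config=ant-dir      # AntDir
python main.py --config=maze-umaze   # UmazeGoal-M
```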
- To get the results of our model, one should go to the `full_model` folder and run the experiments:

  ```
  cd full_model
  python main.py --config=ant-dir
  ```
- Note that to get the results of our model on `WalkerParam`, one should go to the `full_model_walker_param` folder instead of `full_model`, i.e.

  ```
  cd full_model_walker_param
  python main.py --config=walker-param
  ```
- Similarly, to get the results of Contextual BCQ, one should go to the `contextual_bcq` folder and run the experiments:

  ```
  cd contextual_bcq
  python main.py --config=ant-dir
  ```
- To get the results of PEARL, one should go to the `batch_pearl` folder and run the experiments:

  ```
  cd batch_pearl
  python launch_experiment.py './configs/ant-dir.json'
  ```
- After repeating the procedures above for all 6 task distributions, one can plot the results by running

  ```
  python -m plotting.evaluate_against_baseline
  ```

  Note that to reproduce the results, one needs to activate the correct conda environment.
- Run the following to obtain the training results of MetaGenRL on `AntDir`:

  ```
  cd metagenrl
  python ray_experiments.py train
  ```
- Open TensorBoard with the following command:

  ```
  tensorboard --logdir ./ray_results/metagenrl
  ```
- Download the results of `custom_mean_episode_reward` in TensorBoard for either `agent0` or `agent1`, and save the file as `custom_mean_episode_reward.csv`.
- Plot the results by running

  ```
  python -m plotting.evaluate_metagenrl
  ```

  Note that we have already obtained the results of our full model by following the procedures to reproduce Fig. 4 in our paper.
- To get the results of No transition relabelling, one should go to the `no_transition_relabelling` folder and run the experiments:

  ```
  cd no_transition_relabelling
  python main.py --config=ant-dir
  ```
- To get the results of No triplet loss, one should go to the `no_triplet_loss` folder and run the experiments:

  ```
  cd no_triplet_loss
  python main.py --config=ant-dir
  ```
- Note that to get the results of No triplet loss on `WalkerParam`, one should go to the `no_triplet_loss_walker_param` folder instead of `no_triplet_loss`, i.e.

  ```
  cd no_triplet_loss_walker_param
  python main.py --config=walker-param
  ```
- To get the results of neither, one should go to the `neither` folder and run the experiments:

  ```
  cd neither
  python main.py --config=ant-dir
  ```
- For UmazeGoal-M, we also need the results of GT; one should go to the `full_model_ground_truth_label` folder and run the experiments:

  ```
  cd full_model_ground_truth_label
  python main.py --config=maze-umaze
  ```
- After repeating the procedures above for all 6 task distributions, one can plot the results by running

  ```
  python -m plotting.evaluate_against_ablations
  ```
To reproduce the results of Fig. 8 and Fig. 13 in the paper, one needs to run the original SAC (`oac-explore`) on **all training tasks**, and to run both SAC initialized by our method (`sac_with_initialization`) and a variant of SAC (`sac_baseline`) with two identically initialized Q functions trained on different mini-batches on **all testing tasks**.
There are 8 testing tasks for every task distribution, with `GOAL_ID` ranging from 1 to 8. However, the lists of training-task `GOAL_ID`s vary from task distribution to task distribution, as specified below:
- AntDir: `[0, 1, 4, 10, 12, 14, 17, 21, 26, 27]`
- AntGoal: 0 to 9
- HumanoidDir-M: `[0, 1, 4, 10, 12, 14, 17, 21, 26, 27]`
- HalfCheetahVel: `[3, 5, 8, 15, 16, 17, 23, 24, 29, 31]`
- WalkerParam: 0 to 29
- UmazeGoal-M: 0 to 9
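Since each command below takes a single `--goal`, a shell loop can be used to sweep a whole list of training tasks. A minimal sketch using the `AntDir` training list above (the loop itself is our suggestion, not part of the original instructions):

```
cd oac-explore
# iterate over the AntDir training GOAL_IDs listed above
for GOAL_ID in 0 1 4 10 12 14 17 21 26 27; do
    python main.py --config=ant-dir --goal=$GOAL_ID
done
```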
- To get the results of the original SAC, one should go to `oac-explore` and run the experiments on all training tasks:

  ```
  cd oac-explore
  python main.py --config=ant-dir --goal=GOAL_ID
  ```

  Note that one should repeat the experiments with `GOAL_ID` traversing `[0, 1, 4, 10, 12, 14, 17, 21, 26, 27]`.
- To get the results of SAC initialized by our method, one should go to the `sac_with_initialization` folder and run the experiments on all testing tasks, varying `GOAL_ID` from 1 to 8:

  ```
  cd sac_with_initialization
  python main.py --config=ant-dir --goal=GOAL_ID
  ```
- To get the results of the SAC variant with two identically initialized Q functions trained on different mini-batches, one should go to the `sac_baseline` folder and run the experiments on all testing tasks, varying `GOAL_ID` from 1 to 8:

  ```
  cd sac_baseline
  python main.py --config=ant-dir --goal=GOAL_ID
  ```
- After repeating the procedures above for all 6 task distributions, one can plot the results by running

  ```
  python -m plotting.evaluate_sac_init
  ```
The configuration files are listed in `full_model/configs` and `full_model_walker_param/configs`. The file name specifies the value of the triplet margin. In the paper, we set the triplet margin to the values `[0.0, 2.0, 4.0, 8.0]` and show the results on the five task distributions other than `HalfCheetahVel` (a loop sweeping all four margins is sketched after the list below).
- For example, to get the results with triplet margin 0.0, one can `cd` to `full_model` and run the experiment:

  ```
  cd full_model
  python main.py --config=ant-dir-triplet-margin-0p0
  ```
- To get the results on `WalkerParam`, one should go to `full_model_walker_param` instead:

  ```
  cd full_model_walker_param
  python main.py --config=walker-param-triplet-margin-0p0
  ```
- After repeating the procedures above for all 5 task distributions, one can plot the results by running

  ```
  python -m plotting.evaluate_ablate_triplet_margin
  ```
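A minimal sketch of sweeping all four margins on `AntDir` (config names other than `ant-dir-triplet-margin-0p0` are assumptions based on the naming pattern; verify against the files in `full_model/configs`):

```
cd full_model
# "0p0" encodes margin 0.0, "2p0" margin 2.0, and so on
for MARGIN in 0p0 2p0 4p0 8p0; do
    python main.py --config=ant-dir-triplet-margin-$MARGIN
done
```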
- To obtain the results of SAC initialized by the multi-task policy trained without the reward ensemble (downloaded above), one should go to the `sac_with_initialization` folder and run the experiments on all testing tasks, varying `GOAL_ID` from 1 to 8:

  ```
  cd sac_with_initialization
  python main.py --config=humanoid-openai-dir --goal=GOAL_ID --model_root=../data_and_trained_models/full_model_no_ensemble_results --base_log_dir=./data_without_ensemble
  ```
- Note that we have already obtained the results of SAC initialized by our full model and of standard SAC when reproducing Fig. 8 and Fig. 13 in the paper. After running the command above for all the testing tasks, one can reproduce Fig. 11 by running

  ```
  python -m plotting.evaluate_ablate_reward_ensemble
  ```
If you would like to generate the results of the training phase yourself, first go to the `oac-explore` folder and run the following command to obtain the training buffers. Note that the list of `GOAL_ID`s varies across task distributions; the training `GOAL_ID` lists are detailed above, where we describe how to reproduce Fig. 8 and Fig. 13.

```
python main.py --config=ant-dir --goal=GOAL_ID
```
Then one can go to the `BCQ` folder and run the following command to train the task-specific BCQ models:

```
python main.py --config=ant-dir --goal=GOAL_ID
```
In parallel, one can train the reward prediction ensembles by going to the `reward_prediction_ensemble` folder and running

```
python main.py --config=ant-dir
```

and the next-state prediction ensembles by going to the `transition_prediction_ensemble` folder and running

```
python main.py --config=walker-param
```
Note that one should pay extra attention to `--data_models_root`.
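For example, if the downloaded data sits in the `data_and_trained_models` folder at the repository root, the flag would be set along these lines (a sketch; the exact relative path is an assumption and depends on where the command is run from):

```
# point --data_models_root at the downloaded phase-1 data
python main.py --config=ant-dir --data_models_root=../data_and_trained_models
```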
This repository is based on rlkit, oac-explore, BCQ, PEARL, and MetaGenRL. The code that generates the transition batch for each task and the code that accelerates conventional SAC were obtained by modifying the code provided in oac-explore. The batch RL part of this paper is based on the code provided in BCQ. The code in `batch_pearl` was obtained by modifying PEARL, and the code in `metagenrl` by modifying MetaGenRL. The environment files in the `env` folder are adapted from PEARL. Note that the `rand_param_envs` folder in each method folder is copied from rand_param_envs.