This project was developed using Python 3.8. Install dependencies using pip:
pip install -r requirements.txt
Please also install torch==1.8.1 and torchvision==0.9.1 separately following the instructions here. Follow additional instructions for setting up mujoco from https://github.com/openai/mujoco-py.
The kitchen, maze, and ant environments require a modified version of d4rl (https://github.com/rail-berkeley/d4rl) that has been set up to support a learned reward function. That repo can be found here (https://github.com/MatthewChang/d4rl_learned_reward). Install it by cloning the repo and installing it with pip:
git clone git@github.com:MatthewChang/d4rl_learned_reward.git
pip install "git+https://github.com/aravindr93/mjrl@master#egg=mjrl"
pip install -e ./d4rl_learned_reward
Experiments in visual navigation are based on the 'Semantic Visual Navigation by Watching Youtube Videos' repo (VLV repo, https://github.com/MatthewChang/video-dqn). Follow the setup instructions in that repo for installation.
To generate data for the kitchen environment, run
python kitchen/gen_data.py
For the maze environment, run
python maze2d/gen_data.py maze2d-medium-v1 --skip 2 --num-actions 4
These scripts render data into their respective sub-folders.
To generate data for visual navigation, use the command below:
VLV_LOCATION=[VLV_LOCATION] GIBSON_LOCATION=[GIBSON_LOCATION] python vis_nav/generate_data.py [LOCATION_TO_WRITE_DATA]
where VLV_LOCATION is the location of the VLV repo and GIBSON_LOCATION is the location of the Gibson meshes, installed by following the instructions in the VLV repo.
Data for the gridworld environment is already included in this repo.
To generate data for freeway, we use the code in An Optimistic Perspective on Offline Reinforcement Learning, crucially with sticky actions turned off. Store this data in freeway/batch_rl_data.
cd env
python latent_action_mining.py --gpu [gpu_id]
env can be one of {vis_nav, freeway, maze2d, kitchen}.
This writes the models to env/lam_runs/repro.
Train the latent action model on gridworld data with
python gridworld/latent_action_mining.py --gpu 1 --bottleneck_size 8 --batch_norm --logdir grid-bs8-ss6-center-clean --step_size 6 --logdir_prefix output-grid-v2
cd env
python save_actions.py
env can be one of {vis_nav, freeway, maze2d, kitchen}.
cd env
python ./train_q_network.py -g [gpu_id] configs/experiments/real_data
For gridworld, the value function is then trained with
python gridworld/generate_value_function.py --model learned --model_path output-grid-v2/grid-bs8-ss6-center-clean/model-70000.pth --bottleneck_size 8 --batch_norm
cd env
python spearman.py
For visual navigation, please follow the 'Semantic Visual Navigation by Watching Youtube Videos' repo (VLV repo, https://github.com/MatthewChang/video-dqn) to obtain the ground truth value function.
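As a rough illustration of what a Spearman comparison measures, the sketch below computes the rank correlation between two lists of values, for example learned values and ground truth values for the same states. This is not the code in spearman.py; the function names are hypothetical and the pairing of learned and ground truth values is an assumption.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of entries tied with the entry at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical learned vs. ground truth values for four states.
learned = [0.2, 0.9, 0.4, 0.7]
ground_truth = [1.0, 16.0, 4.0, 9.0]
print(spearman(learned, ground_truth))
```

Because only the ordering matters, a value function that ranks states in the same order as the ground truth scores 1 even if its scale is different.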
spearman.py writes out a file, spearman.npy. Using the Spearman values in this file, copy the 95th percentile checkpoint file to value_fuctions/env/
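One possible way to pick that checkpoint is sketched below. It assumes (this is not confirmed by the repo) that spearman.npy holds one Spearman value per checkpoint, in checkpoint order, and it interprets "95th percentile checkpoint" as the checkpoint whose value is closest to the 95th percentile of all values. The function name is hypothetical.

```python
import math


def percentile_checkpoint(scores, q=95.0):
    """Return the index of the score closest to the q-th percentile.

    The percentile uses linear interpolation between sorted neighbours.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    ordered = sorted(scores)
    pos = (len(ordered) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = math.ceil(pos)
    target = ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)
    # Pick the checkpoint whose score is nearest to the percentile value.
    return min(range(len(scores)), key=lambda i: abs(scores[i] - target))


# Hypothetical Spearman values for five checkpoints (assumed layout).
scores = [0.1, 0.5, 0.9, 0.3, 0.7]
best = percentile_checkpoint(scores)
print(best)
```

With the real file, the scores could be obtained with something like numpy.load("spearman.npy").tolist(), again assuming a one-dimensional array; the returned index then has to be mapped to the matching checkpoint file by hand.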
Note: If you have trouble loading the maze2d DDPG model, try using stable-baselines==2.9.0.
After learning a value function, you can evaluate agent performance with densified reward using
python ./evaluate.py --env [env_name] --model [path_to_model]
where env_name is 'kitchen', 'ant', or 'maze', and path_to_model is the path to the model generated by the value function generating script. For example, if you generated a value function using the above command for the maze environment, you can run
python ./evaluate.py --env maze --model [path_to_model]
Results are written into tensorboard archives in ./runs.
Evaluation of the gridworld is done with tabular Q-learning and can be launched with
python gridworld/evaluate.py output-grid-v2/grid-bs8-ss6-center-clean/value_function.npy
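For readers unfamiliar with tabular Q-learning, here is a minimal sketch on a toy five-state corridor. It is not the code in gridworld/evaluate.py. The environment, the hyperparameters, and the optional `values` argument (which adds a potential-based shaping bonus from a per-state value table, one plausible way to densify a sparse reward) are all assumptions made for illustration.

```python
import random


def corridor_step(state, action, n=5):
    """Toy environment: action 1 moves right, action 0 moves left.

    Reaching the last state ends the episode with reward 1.
    """
    next_state = max(0, min(n - 1, state + (1 if action == 1 else -1)))
    done = next_state == n - 1
    return next_state, (1.0 if done else 0.0), done


def q_learning(n_states, n_actions, step, values=None, episodes=200,
               alpha=0.5, gamma=0.9, epsilon=0.1, max_steps=50, seed=0):
    """Epsilon-greedy tabular Q-learning; every episode starts in state 0."""
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        for _ in range(max_steps):
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                best = max(q[s])
                # Break ties between equally good actions at random.
                a = rng.choice([i for i, v in enumerate(q[s]) if v == best])
            s2, r, done = step(s, a)
            if values is not None:
                # Assumed shaping term: gamma * V(s') - V(s).
                r += gamma * values[s2] - values[s]
            target = r if done else r + gamma * max(q[s2])
            q[s][a] += alpha * (target - q[s][a])
            s = s2
            if done:
                break
    return q


q_table = q_learning(n_states=5, n_actions=2, step=corridor_step)
greedy = [row.index(max(row)) for row in q_table[:4]]
print(greedy)
```

The greedy action in each non-terminal state is read off the table by taking the column with the largest Q-value.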