Use OpenAI Baselines with Dart Env
Tested with Python 3 on Ubuntu 14.04 and OSX 10.12.
To keep Python packages manageable, it is recommended to use a virtual environment, either through virtualenv or Anaconda.
- Install virtualenv:
pip install virtualenv
- Create a virtual environment:
virtualenv /path/to/venv --python=python3
- Activate the virtual environment:
. /path/to/venv/bin/activate
Anaconda manages virtual environments, packages, notebooks, and more. However, it may conflict with Homebrew on macOS, so be careful if you intend to install both. To set up a virtual environment with Anaconda, follow these steps:
1. Download and install Anaconda for Python 3.6 from: https://www.continuum.io/downloads
2. Create a virtual environment:
conda create --name ENV_NAME python=3.6
3. Activate the virtual environment:
source activate ENV_NAME
Please refer to https://github.com/DartEnv/dart-env/wiki for instructions on installing Dart Env.
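After installing Dart Env, a quick optional sanity check (a minimal sketch; it assumes dart-env's fork of gym registers DartHopper-v1, the environment used in the example below) is to construct an environment from Python:
import gym

# Creating and resetting a Dart environment confirms the installation works end to end.
env = gym.make('DartHopper-v1')
obs = env.reset()
print('observation space:', env.observation_space)
print('action space:', env.action_space)
env.close()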
Detailed instructions can be found in the original Baselines repository. The key commands are listed below.
On OSX, install the prerequisites with Homebrew:
brew install cmake openmpi
On Ubuntu, install them with apt-get:
sudo apt-get update && sudo apt-get install cmake libopenmpi-dev python3-dev zlib1g-dev
Then clone and install Baselines:
git clone https://github.com/openai/baselines.git
cd baselines
pip install -e .
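To quickly verify the install (an optional sanity check, not part of the original Baselines instructions), confirm that the PPO1 modules used in the training script below can be imported:
import baselines
from baselines.ppo1 import mlp_policy, pposgd_simple

# If these imports succeed, the editable install and its core dependencies are in place.
print('baselines imported from', baselines.__file__)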
We provide an example of training a single-legged robot to move forward using the Proximal Policy Optimization (PPO) algorithm. To perform training, first create a new file under the baselines root directory, e.g. run_dart.py, then copy the following code into it:
from baselines.common.cmd_util import make_mujoco_env, mujoco_arg_parser
from baselines.common import tf_util as U
from baselines import logger


def callback(localv, globalv):
    # Save the policy parameters every 10 iterations.
    import joblib
    if localv['iters_so_far'] % 10 != 0:
        return
    save_dict = {}
    variables = localv['pi'].get_variables()
    for i in range(len(variables)):
        cur_val = variables[i].eval()
        save_dict[variables[i].name] = cur_val
    joblib.dump(save_dict, logger.get_dir() + '/policy_params_' + str(localv['iters_so_far']) + '.pkl', compress=True)
    joblib.dump(save_dict, logger.get_dir() + '/policy_params.pkl', compress=True)


def train(env_id, num_timesteps, seed):
    from baselines.ppo1 import mlp_policy, pposgd_simple
    U.make_session(num_cpu=1).__enter__()

    def policy_fn(name, ob_space, ac_space):
        # MLP policy with two hidden layers of 64 units each.
        return mlp_policy.MlpPolicy(name=name, ob_space=ob_space, ac_space=ac_space,
                                    hid_size=64, num_hid_layers=2)

    env = make_mujoco_env(env_id, seed)
    pposgd_simple.learn(env, policy_fn,
                        max_timesteps=num_timesteps,
                        timesteps_per_actorbatch=4000,
                        clip_param=0.2, entcoeff=0.0,
                        optim_epochs=10, optim_stepsize=3e-4, optim_batchsize=64,
                        gamma=0.99, lam=0.95, schedule='linear', callback=callback,
                        )
    env.close()


def main():
    args = mujoco_arg_parser().parse_args()
    logger.configure('data/ppo_' + args.env + '_results')
    train(args.env, num_timesteps=args.num_timesteps, seed=args.seed)


if __name__ == '__main__':
    main()
Then run:
mpirun -np 2 python run_dart.py --env DartHopper-v1 --seed 0
You should find a folder named ppo_DartHopper-v1_results inside the data folder. The learning progress is logged in progress.csv, and the policies at different learning iterations are saved as policy_params_ITER.pkl, where ITER is the iteration number. For this example, the policy should reach a total reward (EpRewMean) of 2000+ by the end of training.
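For example, the learning curve can be plotted directly from progress.csv. A minimal sketch, assuming pandas and matplotlib are installed and using the EpRewMean column that Baselines logs:
import pandas as pd
import matplotlib.pyplot as plt

# Plot mean episode reward per iteration from the Baselines progress log.
progress = pd.read_csv('data/ppo_DartHopper-v1_results/progress.csv')
plt.plot(progress['EpRewMean'])
plt.xlabel('iteration')
plt.ylabel('EpRewMean')
plt.title('PPO on DartHopper-v1')
plt.show()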
Finally, to visualize the policy controlling the simulated robot, first create a new file for the testing code, then copy the following code into the file:
import gym, sys, joblib, numpy as np, tensorflow as tf
from baselines.common import set_global_seeds, tf_util as U
from baselines.ppo1 import mlp_policy

if __name__ == '__main__':
    env = gym.make(sys.argv[1])
    # Re-enable the viewer, which Dart Env disables by default during training.
    if hasattr(env.env, 'disableViewer'):
        env.env.disableViewer = False
    sess = tf.InteractiveSession()

    # If a policy file is given, build the policy network and load the saved parameters;
    # otherwise fall back to random actions.
    policy = None
    if len(sys.argv) > 2:
        policy_params = joblib.load(sys.argv[2])
        policy = mlp_policy.MlpPolicy(name="pi", ob_space=env.observation_space, ac_space=env.action_space,
                                      hid_size=64, num_hid_layers=2)
        U.initialize()
        # The saved variables may live under a different scope name, so remap it.
        cur_scope = policy.get_variables()[0].name[0:policy.get_variables()[0].name.find('/')]
        orig_scope = list(policy_params.keys())[0][0:list(policy_params.keys())[0].find('/')]
        for i in range(len(policy.get_variables())):
            assign_op = policy.get_variables()[i].assign(
                policy_params[policy.get_variables()[i].name.replace(cur_scope, orig_scope, 1)])
            sess.run(assign_op)

    traj_num, rew, ct, d = 1, 0, 0, False
    o = env.reset()
    while ct < traj_num:
        if policy is not None:
            ac, vpred = policy.act(False, o)
            act = ac
        else:
            act = env.action_space.sample()
        o, r, d, env_info = env.step(act)
        rew += r
        env.render()
        if d:
            ct += 1
            print('reward: ', rew)
            o = env.reset()
    print('avg rew ', rew / traj_num)
Assuming that you named the file test_policy.py and put it under the baselines directory, you can then run the following command to visualize the hopper policy you just trained:
python test_policy.py DartHopper-v1 data/ppo_DartHopper-v1_results/policy_params.pkl
You should be able to see the hopper hopping forward. Here is a video of what it might look like (with a different camera angle).
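If you only want to inspect the saved parameters without opening the viewer, the policy file written by the training callback can be loaded directly with joblib (a minimal sketch using the path from the command above):
import joblib

# The training callback stores a dict mapping TensorFlow variable names to numpy arrays.
params = joblib.load('data/ppo_DartHopper-v1_results/policy_params.pkl')
for name, value in params.items():
    print(name, value.shape)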
If you see an error similar to the following:
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
try running:
pip install mpi4py==2.0.0
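To confirm which version ended up installed (a small optional check), print it from Python:
# Should print 2.0.0 after the downgrade above.
import mpi4py
print(mpi4py.__version__)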