Author's PyTorch implementation of the ICLR 2023 paper Behavior Proximal Policy Optimization (BPPO). BPPO uses the loss function from Proximal Policy Optimization (PPO) to improve the behavior policy estimated by behavior cloning.
Compared to the loss function of PPO, BPPO does not introduce any extra constraint or regularization. The only difference is the advantage approximation, which corresponds to the code difference between ppo.py (lines 88-89) and bppo.py (lines 151-155).
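For orientation, the PPO clipped surrogate loss that both scripts build on can be sketched as follows. This is a framework-free illustration in plain Python, not the code in ppo.py or bppo.py; the function name ppo_clip_loss, its arguments, and the default clip_ratio of 0.2 are assumptions made for this example. The advantages are passed in as an input because how they are approximated is the only point where the two scripts differ.

```python
import math


def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_ratio=0.2):
    """Mean PPO clipped surrogate loss (the negated objective) over a batch.

    Hypothetical helper for illustration only; the repository works on
    PyTorch tensors, while this sketch uses plain Python floats.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio between the updated policy and the old policy.
        ratio = math.exp(new_lp - old_lp)
        # Keep the ratio inside [1 - clip_ratio, 1 + clip_ratio].
        clipped = max(1.0 - clip_ratio, min(ratio, 1.0 + clip_ratio))
        # Pessimistic (lower) bound of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    # Negate so that minimizing the loss maximizes the surrogate objective.
    return -total / len(advantages)
```

Under this sketch, swapping PPO for BPPO would change only the values passed as advantages, while the clipping logic stays the same.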
The code consists of 7 Python scripts. The file main.py contains the parameter settings, which are explained in our paper.
torch 1.12.0
mujoco 2.2.1
mujoco-py 2.1.2.14
d4rl 1.1
python main.py: trains the network, storing checkpoints along the way.
Example:
python main.py --env hopper-medium-v2
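To train on several datasets in sequence, a small launcher along the following lines may be convenient. It is a sketch, not part of the repository: the helper names build_command and run_all are hypothetical, only the --env flag shown above is used, and the dataset names other than hopper-medium-v2 are D4RL names chosen for illustration. By default it only prints the commands; pass dry_run=False to actually start the runs.

```python
import subprocess
import sys

# Illustrative dataset list; only hopper-medium-v2 appears in this README.
ENVS = ["hopper-medium-v2", "walker2d-medium-v2", "halfcheetah-medium-v2"]


def build_command(env_name):
    """Return the argument list for one training run of main.py."""
    return [sys.executable, "main.py", "--env", env_name]


def run_all(env_names, dry_run=True):
    """Print each command, and also execute it unless dry_run is True."""
    for env_name in env_names:
        command = build_command(env_name)
        print(" ".join(command))
        if not dry_run:
            # check=True stops the loop at the first failing run.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_all(ENVS)  # dry run: prints the commands only
```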
If you use BPPO, please cite our paper as follows:
@article{zhuang2023behavior,
title={Behavior proximal policy optimization},
author={Zhuang, Zifeng and Lei, Kun and Liu, Jinxin and Wang, Donglin and Guo, Yilang},
journal={arXiv preprint arXiv:2302.11312},
year={2023}
}