Uni-O4: Unifying Online and Offline Deep Reinforcement Learning with Multi-Step On-Policy Optimization
Kun Lei · Zhengmao He* · Chenhao Lu* · Kaizhe Hu · Yang Gao · Huazhe Xu
Project Page | arXiv | Twitter
We evaluate Uni-O4 on the standard D4RL benchmarks during both the offline and the online fine-tuning phases. In addition, we use Uni-O4 to enable rapid adaptation of our quadrupedal robot to new and challenging environments. This repo contains five branches:
master (default) -> Uni-O4
go1_sdk -> sdk set-up for go1 robot
data_collecting_deployment -> Deploying go1 in the real world for data collection
unio4-offline-robot -> Running Uni-O4 on the dataset collected by the real-world robot dog
go1-online-finetuning -> Fine-tuning the robot online in the real world
Clone each branch:
git clone -b [Branch Name] https://github.com/Lei-Kun/Uni-O4.git
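For example, to fetch the default Uni-O4 branch and the go1 SDK branch into separate local directories (the local directory names here are just examples):

```bash
# Default branch: the Uni-O4 algorithm itself
git clone -b master https://github.com/Lei-Kun/Uni-O4.git uni-o4

# SDK set-up for the go1 robot
git clone -b go1_sdk https://github.com/Lei-Kun/Uni-O4.git go1_sdk
```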
Main dependencies:
- torch 1.12.0
- mujoco 2.2.1
- mujoco-py 2.1.2.14
- d4rl 1.1
To install all the required dependencies (a command sketch follows the list):
- Install MuJoCo from here.
- Install the Python packages listed in `requirements.txt` using `pip install -r requirements.txt`. You should specify the version of `mujoco-py` in `requirements.txt` depending on the version of the MuJoCo engine you have installed.
- Manually download and install the `d4rl` package from here.
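A minimal sketch of the package-installation steps above, assuming MuJoCo is already installed. The `d4rl` repository URL below is an assumption (the commonly used upstream source) standing in for the "here" link:

```bash
# Install the pinned Python dependencies; edit the mujoco-py pin in
# requirements.txt first so it matches your installed MuJoCo engine version
pip install -r requirements.txt

# Manually download and install the d4rl package from source
git clone https://github.com/Farama-Foundation/D4RL.git
cd D4RL && pip install -e .
```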
`main.py`: trains the network, storing checkpoints along the way. Set-up for other domains is coming soon.

Example - for offline pre-training:
./scripts/mujoco_loco/hm.sh

Example - for online fine-tuning:
./ppo_finetune/scripts/mujoco_loco/hm.sh
NOTE1: The key hyper-parameters for the offline phase are whether state normalization is used, the number of rollout steps for offline policy evaluation (OPE), and the policy-improvement learning rate.
NOTE2: During the offline policy improvement stage, if the OPE score (i.e., 'q mean') becomes excessively large or unstable, consider reducing the number of rollout steps.
NOTE3: The performance of online PPO largely depends on the hyper-parameters and some well-known tricks, see here.
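As a purely illustrative sketch of two of those well-known tricks (Generalized Advantage Estimation and per-batch advantage normalization), not the implementation used in this repo:

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    rewards, dones: arrays of length T; values: array of length T + 1
    (the last entry is the bootstrap value of the final state).
    """
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    gae = 0.0
    for t in reversed(range(T)):
        # TD error; zero out the bootstrap term at episode boundaries
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    returns = advantages + values[:-1]
    # Well-known trick: normalize advantages per batch before the PPO update
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return advantages, returns
```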
See INSTALL.md for installation instructions.
For real-world adaptation tasks with quadrupedal robots, our approach follows a three-step process. First, we pre-train a policy in a simulator, which takes only a few minutes. We then fine-tune the policy in the real world, first offline and then online, using the Uni-O4 algorithm.
- Pre-training in Isaac Gym:
cd ./unio4-offline-robot
pip install -e .
cd ./scripts
python train.py
- Offline fine-tuning with Uni-O4 - collecting data (build the SDK following INSTALL.md):
1) Start up the go1 SDK:
cd ./go1_sdk/build
./lcm_position
2) Run:
cd ./data_collecting_deployment
pip install -e .
cd ./data_collecting_deployment/go1_gym_deploy/scripts
python deploy_policy.py --deploy_policy 'sim'
'sim' -> the policy pre-trained in the simulator
'offline' -> the policy fine-tuned offline in the real world
'online' -> the policy fine-tuned online in the real world
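For example, after the offline fine-tuning stage, the same deployment script would be launched again with the corresponding flag value:

```bash
cd ./data_collecting_deployment/go1_gym_deploy/scripts
python deploy_policy.py --deploy_policy 'offline'
```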
- Offline fine-tuning with Uni-O4 - run Uni-O4 on the collected dataset:
Copy the dataset to unio4-offline-robot, then:
cd ./unio4-offline-robot
./run.sh
- Online fine-tuning with PPO:
cd ./go1_sdk/build
./lcm_position
cd ./go1-online-finetuning
python off2on.py
If you use Uni-O4, please cite our paper as follows:
@inproceedings{
lei2024unio,
title={Uni-O4: Unifying Online and Offline Deep Reinforcement Learning with Multi-Step On-Policy Optimization},
author={Kun LEI and Zhengmao He and Chenhao Lu and Kaizhe Hu and Yang Gao and Huazhe Xu},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=tbFBh3LMKi}
}