Uni-O4: Unifying Online and Offline Deep Reinforcement Learning with Multi-Step On-Policy Optimization

ICLR 2024

Kun Lei · Zhengmao He* · Chenhao Lu* · Kaizhe Hu · Yang Gao · Huazhe Xu

Project Page | arXiv | Twitter

Code Overview

We evaluate Uni-O4 on standard D4RL benchmarks in both the offline and the online fine-tuning phases. We also use Uni-O4 to rapidly adapt our quadrupedal robot dog to new and challenging environments. This repo contains five branches:

master (default) -> Uni-O4
go1_sdk -> SDK set-up for the Go1 robot
data_collecting_deployment -> Deploying the Go1 in the real world for data collection
unio4-offline-robot -> Run Uni-O4 on the dataset collected by the real-world robot dog
go1-online-finetuning -> Fine-tuning the robot online in the real world

Clone each branch: git clone -b [Branch Name] https://github.com/Lei-Kun/Uni-O4.git
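
If you want all five branches side by side, the clone command above can be scripted. The following is a minimal sketch, not part of the repository: the helper names (`clone_command`, `clone_all`) are hypothetical, and cloning each branch into a directory named after the branch is an assumption chosen to match the `cd ./<branch>` paths used later in this README.

```python
import subprocess

REPO_URL = "https://github.com/Lei-Kun/Uni-O4.git"

# The five branches listed in the Code Overview.
BRANCHES = [
    "master",
    "go1_sdk",
    "data_collecting_deployment",
    "unio4-offline-robot",
    "go1-online-finetuning",
]


def clone_command(branch):
    """Build the `git clone -b <branch>` command for one branch."""
    # Assumption: one directory per branch, named after the branch
    # ("Uni-O4" for master), so later `cd ./<branch>` steps resolve.
    target = "Uni-O4" if branch == "master" else branch
    return ["git", "clone", "-b", branch, REPO_URL, target]


def clone_all(dry_run=True):
    """Print the clone commands, or run them when dry_run is False."""
    for branch in BRANCHES:
        cmd = clone_command(branch)
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    clone_all(dry_run=True)
```

Run it once with `dry_run=True` to inspect the commands before cloning for real.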

For D4RL benchmarks

Requirements

torch 1.12.0
mujoco 2.2.1
mujoco-py 2.1.2.14
d4rl 1.1

To install all the required dependencies:

Install MuJoCo from here.
Install Python packages listed in requirements.txt using pip install -r requirements.txt. You should specify the version of mujoco-py in requirements.txt depending on the version of MuJoCo engine you have installed.
Manually download and install d4rl package from here.
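
After installation, it can help to confirm that the installed versions match the pinned ones listed under Requirements. The sketch below uses only the standard-library `importlib.metadata`; the function names are hypothetical, and the distribution names passed to `metadata.version` are assumed to match the names in the list above (a package installed under a different distribution name will be reported as missing).

```python
from importlib import metadata

# Versions pinned in the Requirements list of this README.
PINNED = {
    "torch": "1.12.0",
    "mujoco": "2.2.1",
    "mujoco-py": "2.1.2.14",
    "d4rl": "1.1",
}


def matches(found, wanted):
    """Compare versions, ignoring a local suffix such as '+cu113'."""
    return found.split("+")[0] == wanted


def check_versions(pinned=PINNED):
    """Return {name: (wanted, found_or_None, ok)} for each pinned package."""
    report = {}
    for name, wanted in pinned.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None  # not installed, or installed under another name
        report[name] = (wanted, found, found is not None and matches(found, wanted))
    return report


if __name__ == "__main__":
    for name, (wanted, found, ok) in check_versions().items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: wanted {wanted}, found {found} [{status}]")
```

A mismatch on mujoco-py is not necessarily an error, since its version should follow the MuJoCo engine you installed.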

Running the code

main.py: trains the network, storing checkpoints along the way. Set-up for other domains is coming soon.
Example - for offline pre-training:

./scripts/mujoco_loco/hm.sh

Example - for online fine-tuning:

./ppo_finetune/scripts/mujoco_loco/hm.sh

NOTE1: The key hyper-parameters for the offline phase are whether state normalization is used, the number of rollout steps for offline policy evaluation, and the policy improvement learning rate.

NOTE2: During the offline policy improvement stage, if the OPE score (i.e., 'q mean') becomes excessively large or unstable, consider reducing the number of rollout steps.
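
One way to act on NOTE2 is to watch the logged 'q mean' values and halve the rollout steps when they blow up. The sketch below is only an illustration of that idea, not the repository's logic: the function names, the window size, and the 10x growth threshold are all hypothetical choices, and `q_means` is assumed to be a plain list of logged OPE scores.

```python
import math


def should_reduce_rollout(q_means, window=5, growth_limit=10.0):
    """Flag an OPE trace that is non-finite or has grown excessively.

    Compares the mean |q| over the last `window` entries with the mean
    over the first `window` entries (both thresholds are assumptions).
    """
    if any(not math.isfinite(q) for q in q_means):
        return True
    if len(q_means) < 2 * window:
        return False  # not enough history to judge
    early = sum(abs(q) for q in q_means[:window]) / window
    recent = sum(abs(q) for q in q_means[-window:]) / window
    return recent > growth_limit * max(early, 1e-8)


def next_rollout_steps(current, q_means, minimum=1):
    """Halve the rollout steps when the trace looks unstable."""
    if should_reduce_rollout(q_means):
        return max(minimum, current // 2)
    return current


if __name__ == "__main__":
    stable = [100.0] * 10
    exploding = [100.0] * 5 + [5000.0] * 5
    print(next_rollout_steps(1000, stable))
    print(next_rollout_steps(1000, exploding))
```

The reduced value would then be passed to the offline policy evaluation on the next run; how that is configured depends on the training script.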

NOTE3: The performance of online PPO largely depends on the hyper-parameters and some well-known tricks, see here.

Real-world tasks set-up

See INSTALL.md for installation instructions.

For real-world adaptation tasks involving quadrupedal robots, our approach is a three-step process. First, we pre-train a policy in a simulator, which takes several minutes. Then, we fine-tune the policy in the real-world environment with the Uni-O4 algorithm, offline and then online.

Pre-training in Isaac Gym:

cd ./unio4-offline-robot
pip install -e .
cd ./scripts
python train.py

Fine-tuning by Uni-O4 offline - collecting data (build the SDK following INSTALL.md):

1) Start up the Go1 SDK:

cd ./go1_sdk/build
./lcm_position

2) Run:

cd ./data_collecting_deployment
pip install -e .
cd ./data_collecting_deployment/go1_gym_deploy/scripts
python deploy_policy --deploy_policy 'sim'

'sim' -> policy pre-trained in the simulator
'offline' -> policy fine-tuned offline in the real world
'online' -> policy fine-tuned online in the real world
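
To make the three options concrete, here is a minimal sketch of how a `--deploy_policy` flag restricted to these values could be parsed with `argparse`. It is not the repository's deployment script: `build_parser` and the `POLICY_CHOICES` table are hypothetical, and the default of 'sim' is an assumption.

```python
import argparse

# The three values described above and what each one selects.
POLICY_CHOICES = {
    "sim": "policy pre-trained in the simulator",
    "offline": "policy fine-tuned offline in the real world",
    "online": "policy fine-tuned online in the real world",
}


def build_parser():
    """Create a parser that accepts only the three documented values."""
    parser = argparse.ArgumentParser(description="Select which policy to deploy.")
    parser.add_argument(
        "--deploy_policy",
        choices=sorted(POLICY_CHOICES),
        default="sim",  # assumption: fall back to the simulator policy
        help="which policy checkpoint to deploy",
    )
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--deploy_policy", "offline"])
print(f"{args.deploy_policy}: {POLICY_CHOICES[args.deploy_policy]}")
```

Restricting the flag with `choices` makes a typo such as 'offine' fail at start-up rather than after the robot is already running.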

Fine-tuning by Uni-O4 offline - run Uni-O4 on the collected dataset:

Copy the dataset to unio4-offline-robot, then run:

cd ./unio4-offline-robot
./run.sh

Fine-tuning by PPO online:

cd ./go1_sdk/build
./lcm_position
cd ./go1-online-finetuning
python off2on.py

Citation

If you use Uni-O4, please cite our paper as follows:

@inproceedings{
lei2024unio,
title={Uni-O4: Unifying Online and Offline Deep Reinforcement Learning with Multi-Step On-Policy Optimization},
author={Kun LEI and Zhengmao He and Chenhao Lu and Kaizhe Hu and Yang Gao and Huazhe Xu},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=tbFBh3LMKi}
}
