
Validity of Reported Results and Hyperparameter Details

Deep offline reinforcement learning (RL) algorithms are known to be highly sensitive to hyperparameters and small implementation details. This can be observed by skimming through various papers and comparing the reported results. Surprisingly, even different deep neural network (DNN) libraries can produce different results despite identical code logic.

In such a context, ensuring consistent performance across different codebases is challenging. Essentially, there is no single, unified performance value for an algorithm like Conservative Q-Learning (CQL). Instead, what exists is the performance of CQL with specific hyperparameters, as implemented in a particular codebase.

Given this situation, our approach was as follows:

  • We chose a single reliable existing codebase for each algorithm.
  • We ported that codebase to a single-file JAX implementation with the same hyperparameters (a minimal sketch of how such hyperparameters can be pinned is shown below).
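As a rough illustration only (the config class, field names, and default values below are placeholders and not taken from JAX-CoRL itself), pinning a reference codebase's hyperparameters in a single-file JAX implementation can look like this:

```python
# Minimal sketch, NOT the actual JAX-CoRL code: hyperparameters copied from a
# reference codebase are pinned in one frozen config inside the single file.
from dataclasses import dataclass

import optax


@dataclass(frozen=True)
class TD3BCConfig:
    # Illustrative values mirroring a reference codebase's defaults.
    actor_lr: float = 3e-4
    critic_lr: float = 3e-4
    gamma: float = 0.99      # discount factor
    tau: float = 0.005       # target-network update rate
    policy_noise: float = 0.2
    noise_clip: float = 0.5
    alpha: float = 2.5       # TD3+BC behavior-cloning weight


config = TD3BCConfig()
# Optimizers are built directly from the pinned hyperparameters.
actor_optimizer = optax.adam(config.actor_lr)
critic_optimizer = optax.adam(config.critic_lr)
```

Keeping every hyperparameter in one frozen config makes it easy to diff the single-file implementation against the reference codebase's defaults.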

For each algorithm, we report:

  • The codebase we referred to (also listed in the README)
  • Published papers using the codebase for baseline experiments (if available)
  • The performance reported in that paper

Although we could rerun each referenced codebase ourselves, results from published papers provide more reliable evidence for those who wish to use JAX-CoRL as a baseline in their own research. For detailed performance reports of our implementations, please refer to the README.

AWAC

  • Codebase: jaxrl
  • Paper using the codebase: Cal-QL [2]
  • Results: From Table 5 (Only mean)
| ver | halfcheetah-m | halfcheetah-me | hopper-m | hopper-me | walker2d-m | walker2d-me |
| --- | --- | --- | --- | --- | --- | --- |
| Reference | 49 | 72 | 58 | 30 | 75 | 86 |
| Ours | 42 | 77 | 51 | 52 | 68 | 91 |

CQL

  • Codebase: JaxCQL
  • Paper using the codebase: Cal-QL [2]
  • Results: From Table 5 (Only mean)
| ver | halfcheetah-m | halfcheetah-me | hopper-m | hopper-me | walker2d-m | walker2d-me |
| --- | --- | --- | --- | --- | --- | --- |
| Reference | 53 | 59 | 78 | 86 | 80 | 100 |
| Ours | 49 | 54 | 78 | 90 | 80 | 110 |

IQL

  • Codebase: Original (the authors' implementation)
  • Paper using the codebase: TD7 [3]
  • Results: From Table 2 (Only mean)
| ver | halfcheetah-m | halfcheetah-me | hopper-m | hopper-me | walker2d-m | walker2d-me | Note |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Reference | 47.4 | 89.6 | 63.9 | 64.2 | 84.2 | 108.9 | 1M steps, average over 10 episodes |
| Ours | 43.3 | 92.9 | 52.2 | 53.4 | 75.3 | 109.2 | 1M steps, average over 5 episodes |
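The Note column describes the evaluation protocol. As a hedged sketch of the usual D4RL convention behind such numbers (the helper name, environment name, and episode count below are illustrative assumptions, not code from either codebase), evaluation averages the normalized return over a fixed number of episodes:

```python
# Sketch of a standard D4RL evaluation loop; not taken from the referenced codebases.
import gym
import d4rl  # noqa: F401  # importing d4rl registers the offline envs
import numpy as np


def evaluate(policy, env_name="halfcheetah-medium-v2", n_episodes=10):
    """Return the mean D4RL normalized score (0 = random policy, 100 = expert)."""
    env = gym.make(env_name)
    returns = []
    for _ in range(n_episodes):
        obs, done, episode_return = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = env.step(policy(obs))
            episode_return += reward
        returns.append(episode_return)
    return 100.0 * float(np.mean([env.get_normalized_score(r) for r in returns]))
```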

TD3+BC

  • Codebase: Original (the authors' implementation)
  • Paper using the codebase: TD7 [3]
  • Results: From Table 2 (Only mean)
| ver | halfcheetah-m | halfcheetah-me | hopper-m | hopper-me | walker2d-m | walker2d-me | Note |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Reference | 48.1 | 93.7 | 59.1 | 98.1 | 84.3 | 110.5 | 1M steps, average over 10 episodes |
| Ours | 48.1 | 93.0 | 46.5 | 105.5 | 72.7 | 109.2 | 1M steps, average over 5 episodes |

DT

| ver | halfcheetah-m | halfcheetah-me | hopper-m | hopper-me | walker2d-m | walker2d-me |
| --- | --- | --- | --- | --- | --- | --- |
| - | - | - | - | - | - | - |
| - | - | - | - | - | - | - |

TD7

| ver | halfcheetah-m | halfcheetah-me | hopper-m | hopper-me | walker2d-m | walker2d-me |
| --- | --- | --- | --- | --- | --- | --- |
| - | - | - | - | - | - | - |
| - | - | - | - | - | - | - |

References

  • [1] Tarasov, Denis, et al. "CORL: Research-oriented Deep Offline Reinforcement Learning Library." Advances in Neural Information Processing Systems 36 (2024).
  • [2] Nakamoto, Mitsuhiko, et al. "Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning." Advances in Neural Information Processing Systems 36 (2024).
  • [3] Fujimoto, Scott, et al. "For SALE: State-Action Representation Learning for Deep Reinforcement Learning." Advances in Neural Information Processing Systems 36 (2024).