
[Bug Report] Non-deterministic training (Isaac-Velocity-Rough-Anymal-C-v0) #275

Closed
vassil-atn opened this issue Mar 8, 2024 · 9 comments
Labels: bug (Something isn't working), duplicate (This issue or pull request already exists)

Comments

@vassil-atn

Describe the bug

The training is non-deterministic even with the same random seed.

Steps to reproduce

>> ./orbit.sh -p source/standalone/workflows/rsl_rl/train.py --headless --task "Isaac-Velocity-Rough-Anymal-C-v0" --seed 42

If you then rerun the same command and train for a number of iterations, the following two sets of curves can be seen:
[image: training curves from two runs of the rough-terrain environment with seed 42]

Despite nothing changing between the two runs and using the same random seed, the resulting curves differ. The same can be observed if you run the flat-terrain environment:
[image: training curves from two runs of the flat-terrain environment with seed 42]

Overall the same trends emerge (which is good), but each training run has some stochasticity, which makes hyperparameter and reward tuning problematic.

In my experience, Isaac Gym had the same problem (isaac-sim/IsaacGymEnvs#189). Interestingly, in Isaac Gym this was only an issue when training on trimesh terrains; on plane terrains training was deterministic.

Based on the documentation, this can be somewhat expected (https://isaac-orbit.github.io/orbit/source/refs/issues.html#non-determinism-in-physics-simulation) when randomising physical properties at run-time. However, to my knowledge the rough_env_cfg does not do this: friction and mass are only randomised at startup. In any case, I tested it with both physics_material and base_mass commented out in the RandomizationCfg (see the sketch below) and it was still non-deterministic.
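For reference, disabling those two startup terms looks roughly like the following. This is only a sketch: the task module path and the exact term names (physics_material, base_mass vs. add_base_mass) are assumptions that may differ between Orbit versions, and setting a term to None is assumed to disable it.

```python
# Sketch: disable the startup randomisation terms before creating the environment.
# Module path and attribute names are assumptions and may differ in your Orbit version.
from omni.isaac.orbit_tasks.locomotion.velocity.config.anymal_c.rough_env_cfg import (
    AnymalCRoughEnvCfg,
)

env_cfg = AnymalCRoughEnvCfg()
# Keep friction and mass at their default values for the determinism test.
env_cfg.randomization.physics_material = None
env_cfg.randomization.add_base_mass = None
```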

Is there something I'm missing, or is it an inherent issue with GPU-based training?

System Info

  • Commit: c86481b
  • Isaac Sim Version: 2023.1.1-rc.8+2023.1.688.573e0291.tc(orbit)
  • OS: Ubuntu 20.04
  • GPU: RTX 3060
  • CUDA: 12.4
  • GPU Driver: 550.40.07

Checklist

  • I have checked that there is no similar issue in the repo (required)

  • I have checked that the issue is not in running Isaac Sim itself and is related to the repo

Acceptance Criteria

  • Deterministic training can be achieved in Isaac Sim Orbit
@Mayankm96
Contributor

Can you try setting the PhysX flag enable_enhanced_determinism=True in the simulation config and see if that helps?
@vassil-atn
Author

Unfortunately, it doesn't seem to have much of an effect:
[image: training curves for four runs]
The gray and cyan curves are two consecutive runs with enable_enhanced_determinism=True; the magenta and yellow curves are two consecutive runs with enable_enhanced_determinism=False.
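For reference, toggling this flag on the simulation config looks roughly like the sketch below (assuming Orbit's SimulationCfg/PhysxCfg API at this commit; where the config is plugged into the task is omitted):

```python
from omni.isaac.orbit.sim import PhysxCfg, SimulationCfg

# Ask PhysX for its enhanced-determinism mode; all other settings keep their defaults.
sim_cfg = SimulationCfg(physx=PhysxCfg(enable_enhanced_determinism=True))
```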

@Mayankm96
Contributor

I realized one more change that might affect you:

https://github.com/NVIDIA-Omniverse/orbit/blob/main/source/extensions/omni.isaac.orbit/omni/isaac/orbit/envs/base_env.py#L297-L298

Can you change this to the following and see if that improves things?

# set seed for torch and other libraries
return torch_utils.set_seed(seed, torch_deterministic=True)
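For context, torch_deterministic=True roughly amounts to the following (a sketch of typical deterministic seeding, not the exact Isaac Sim torch_utils implementation):

```python
import os
import random

import numpy as np
import torch


def set_seed_deterministic(seed: int) -> int:
    """Seed the common RNGs and force deterministic kernels (sketch only)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Force deterministic CUDA kernels; this is the switch that surfaces the
    # indexing assert reported further down the thread on torch 2.0.1.
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Required by some CUDA ops (e.g. cuBLAS matmuls) in deterministic mode.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    return seed
```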

@vassil-atn
Author

Interestingly, calling torch.use_deterministic_algorithms(True) seems to break it with the following error:

2024-03-11 10:02:35 [95,796ms] [Error] [__main__] linearIndex.numel()*sliceSize*nElemBefore == expandedValue.numel() INTERNAL ASSERT FAILED at "../aten/src/ATen/native/cuda/Indexing.cu":389, please report a bug to PyTorch. number of flattened indices did not match number of elements in the value tensor: 102 vs 2
2024-03-11 10:02:35 [95,797ms] [Error] [__main__] Traceback (most recent call last):
  File "/home/user/orbit/source/standalone/workflows/rsl_rl/train.py", line 142, in <module>
    main()
  File "/home/user/orbit/source/standalone/workflows/rsl_rl/train.py", line 133, in main
    runner.learn(num_learning_iterations=agent_cfg.max_iterations, init_at_random_ep_len=True)
  File "/home/user/miniconda3/envs/orbit/lib/python3.10/site-packages/rsl_rl/runners/on_policy_runner.py", line 112, in learn
    obs, rewards, dones, infos = self.env.step(actions)
  File "/home/user/orbit/source/extensions/omni.isaac.orbit_tasks/omni/isaac/orbit_tasks/utils/wrappers/rsl_rl/vecenv_wrapper.py", line 161, in step
    obs_dict, rew, terminated, truncated, extras = self.env.step(actions)
  File "/home/user/miniconda3/envs/orbit/lib/python3.10/site-packages/gymnasium/wrappers/order_enforcing.py", line 56, in step
    return self.env.step(action)
  File "/home/user/orbit/source/extensions/omni.isaac.orbit/omni/isaac/orbit/envs/rl_task_env.py", line 192, in step
    self._reset_idx(reset_env_ids)
  File "/home/user/orbit/source/extensions/omni.isaac.orbit/omni/isaac/orbit/envs/rl_task_env.py", line 311, in _reset_idx
    self.scene.reset(env_ids)
  File "/home/user/orbit/source/extensions/omni.isaac.orbit/omni/isaac/orbit/scene/interactive_scene.py", line 222, in reset
    articulation.reset(env_ids)
  File "/home/user/orbit/source/extensions/omni.isaac.orbit/omni/isaac/orbit/assets/articulation/articulation.py", line 143, in reset
    super().reset(env_ids)
  File "/home/user/orbit/source/extensions/omni.isaac.orbit/omni/isaac/orbit/assets/rigid_object/rigid_object.py", line 114, in reset
    self._external_force_b[env_ids] = 0.0
RuntimeError: linearIndex.numel()*sliceSize*nElemBefore == expandedValue.numel() INTERNAL ASSERT FAILED at "../aten/src/ATen/native/cuda/Indexing.cu":389, please report a bug to PyTorch. number of flattened indices did not match number of elements in the value tensor: 102 vs 2

It seems to be an issue with the indexed assignment that only pops up when running in deterministic mode.

For reference, I am running torch == 2.0.1+cu118.
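The failing pattern boils down to something like the following (an untested minimal sketch; the assert was reportedly specific to the indexed scalar assignment on CUDA with torch 2.0.x):

```python
import torch

torch.use_deterministic_algorithms(True)

# Shape mirrors _external_force_b: (num_envs, num_bodies, 3).
external_force_b = torch.zeros(64, 1, 3, device="cuda")
env_ids = torch.tensor([0, 5, 7], device="cuda")

# Scalar assignment through advanced indexing lowers to index_put_; this is
# the call that reportedly hit the INTERNAL ASSERT on torch 2.0.1+cu118.
external_force_b[env_ids] = 0.0
```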

@vassil-atn
Author

Hi,

Have there been any developments regarding this? Is the issue reproducible on your end, @Mayankm96?

ADebor pushed a commit to ADebor/IsaacLab that referenced this issue Apr 8, 2024
@MuhongGuo
Contributor

It seems the issue with torch.use_deterministic_algorithms(True) has been fixed in this PyTorch PR, which is included in PyTorch v2.1.0.

@MuhongGuo
Contributor

It seems difficult to upgrade PyTorch, as it is the version shipped with Isaac Sim 2023.1.1. This is my workaround to avoid the error without upgrading PyTorch: change self._external_force_b[env_ids] = 0.0 to self._external_force_b[env_ids].zero_(). However, the results are still non-deterministic :(
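A note of caution on this workaround: with a tensor index, external_force_b[env_ids] uses advanced indexing and returns a copy, so .zero_() may not write back to the original buffer. An untested alternative (not the fix used in the repo) that zeroes the rows in place while avoiding the scalar index_put_ path is index_fill_:

```python
import torch

torch.use_deterministic_algorithms(True)

external_force_b = torch.zeros(64, 1, 3, device="cuda")
env_ids = torch.tensor([0, 5, 7], device="cuda")

# Zero the selected environment rows in place along dim 0; this skips the
# scalar index_put_ path and keeps the write on the original buffer.
external_force_b.index_fill_(0, env_ids, 0.0)
```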

fatimaanes pushed a commit to fatimaanes/omniperf that referenced this issue Aug 8, 2024
@hojae-io

Maybe you want to check #904.

@Mayankm96
Contributor

Closing this issue, as #904 raises the same concern. The fix is under review: #940

Mayankm96 added the bug (Something isn't working) and duplicate (This issue or pull request already exists) labels on Sep 6, 2024