
improve deterministic engine #2756

Merged: 3 commits merged into pytorch:master on Oct 30, 2022

Conversation

@louis-she (Contributor) commented Oct 29, 2022

#2754

The neptune-client has moved some APIs to their legacy package, see neptune-ai/neptune-client#1039

@github-actions bot added the module: engine and module: utils labels on Oct 29, 2022
@louis-she (Contributor, Author) commented:

@vfdev-5 could you take a look at the failed tests?

As for the failed RL examples, I cannot reproduce them on my own machine.

@vfdev-5 (Collaborator) commented Oct 29, 2022

Thanks for the PR, @louis-she!
I'll check the failures in detail a bit later. One RL job failure complains about the seed arg, so it could be related to the gym version...
Related PR: #2706

@louis-she (Contributor, Author) commented:

> Thanks for the PR, @louis-she! I'll check the failures in detail a bit later. One RL job failure complains about the seed arg, so it could be related to the gym version...

Yes, I think it should be related to the version. I upgraded to the latest gym (0.26.2) but still cannot reproduce the error.

@vfdev-5 (Collaborator) commented Oct 29, 2022

Did you check with Python 3.7?
Can you please check which version the CI is installing? Maybe they simply dropped support for py37.

@louis-she (Contributor, Author) commented:

I'm using 3.7.10; the CI uses 3.7.15. I can't see which gym version the CI uses because of the pip install ... -qq in the workflow: https://github.com/pytorch/ignite/blob/master/.github/workflows/unit-tests.yml#L90

Can we remove the -qq from pip install to make debugging easier?

@vfdev-5 (Collaborator) commented Oct 29, 2022

Yes, let's remove -qq. I propose creating a separate PR for the CI fix. For the Neptune logger fix, we can ping one of their folks for review.

@louis-she (Contributor, Author) commented:

OK, then I'll create another PR to remove the -qq in CI. It would be great if the Neptune folks could review the code.

@louis-she mentioned this pull request Oct 29, 2022
Comment on lines 197 to 200
# according to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility
# CUBLAS_WORKSPACE_CONFIG must be set to let cuBLAS behave deterministic.
# **the behavior is expected to change in a future release of cuBLAS**.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
@vfdev-5 (Collaborator) commented Oct 29, 2022

I'm not a fan of doing that in ignite. If this call is necessary, it should be done by pytorch...
Reading the docs:

> set the debug environment variable CUBLAS_WORKSPACE_CONFIG to ":16:8" (may limit overall performance) or ":4096:8" (will increase library footprint in GPU memory by approximately 24MiB).

I do not think that we want to set a debug env variable.

@louis-she let's remove that.

EDIT: I checked the pytorch docs https://pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html?highlight=use_deterministic_algorithms#torch.use_deterministic_algorithms and saw this suggestion:

> If one of these environment variable configurations is not set, a RuntimeError will be raised from these operations when called with CUDA tensors:

# CUBLAS_WORKSPACE_CONFIG must be set to let cuBLAS behave deterministic.
# **the behavior is expected to change in a future release of cuBLAS**.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
torch.use_deterministic_algorithms(True)
A Collaborator commented:

Maybe we should set warn_only=True so that we do not break existing user code but only raise warnings about non-deterministic implementations.
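
A minimal sketch of what the combined change could look like (the helper name _setup_deterministic is illustrative, not the actual ignite code):

import os
import torch

def _setup_deterministic(warn_only: bool = True) -> None:
    # cuBLAS only behaves deterministically on CUDA >= 10.2 when this env var is set,
    # see https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    # warn_only=True keeps existing user code running and only emits a warning
    # when an operation has no deterministic implementation
    torch.use_deterministic_algorithms(True, warn_only=warn_only)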

@louis-she force-pushed the deterministic-engine-improvement branch from 69b29fd to ce1d363 on October 30, 2022 01:55
@vfdev-5 (Collaborator) left a review comment:

LGTM, thanks @louis-she

@vfdev-5 merged commit 117529e into pytorch:master on Oct 30, 2022
@vfdev-5 (Collaborator) commented Oct 30, 2022

@louis-she you were right, GPU tests are failing for the deterministic engine:

>       return torch.matmul(kernel_x.t(), kernel_y)  # (kernel_size, 1) * (1, kernel_size)
E       RuntimeError: Deterministic behavior was enabled with either `torch.use_deterministic_algorithms(True)` or `at::Context::setDeterministicAlgorithms(true)`, but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility

=========================== short test summary info ============================
FAILED tests/ignite/engine/test_deterministic.py::test_gradients_on_resume_on_cuda - RuntimeError: Deterministic behavior was enabled with either `torch.use_det...
FAILED tests/ignite/metrics/test_ssim.py::test_ssim[shape0-7-False-True-cuda] - RuntimeError: Deterministic behavior was enabled with either `torch.use_det...
FAILED tests/ignite/metrics/test_ssim.py::test_ssim[shape1-11-True-False-cuda] - RuntimeError: Deterministic behavior was enabled with either `torch.use_det...
FAILED tests/ignite/metrics/gan/test_utils.py::test_device_mismatch_cuda - RuntimeError: Deterministic behavior was enabled with either `torch.use_det...
===== 4 failed, 20 passed, 5 skipped, 1920 deselected, 1 warning in 22.48s =====

Exited with code exit status 1

I do not quite understand why it fails with a RuntimeError when we asked for a warning only.

@louis-she (Contributor, Author) commented:

It looks like it is related to the multi-GPU nodes. Let me look into this.

@louis-she (Contributor, Author) commented:

I'm not sure if this is a PyTorch bug, but I can reproduce it with this very straightforward snippet:

import torch

torch.use_deterministic_algorithms(True, warn_only=True)
assert torch.is_deterministic_algorithms_warn_only_enabled()
# a single Linear forward on CUDA goes through a cuBLAS matmul
torch.nn.Linear(10, 10, device="cuda")(torch.rand(1, 10, device="cuda"))

which raises a RuntimeError:

Traceback (most recent call last):
  File "test.py", line 4, in <module>
    torch.nn.Linear(10, 10, device="cuda")((torch.rand(1, 10, device="cuda")))
  File "/environment/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/environment/miniconda3/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: Deterministic behavior was enabled with either `torch.use_deterministic_algorithms(True)` or `at::Context::setDeterministicAlgorithms(true)`, but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility

The error is not raised with torch==1.13.0. I had a look at the torch source, and it differs between the two versions:

v1.12.1 https://github.com/pytorch/pytorch/blob/v1.12.1/aten/src/ATen/Context.cpp#L126
v1.13.0 https://github.com/pytorch/pytorch/blob/v1.13.0/aten/src/ATen/Context.cpp#L142

In v1.12.1 there is no if statement checking the warn-only flag, but v1.13.0 does have it.
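
Paraphrasing the difference as a Python sketch (the real check is C++ in aten/src/ATen/Context.cpp; the function name and message below are illustrative):

import warnings

def alert_cublas_config_not_deterministic(deterministic: bool,
                                          warn_only: bool,
                                          workspace_config_ok: bool) -> None:
    # sketch only: the real check lives in C++ and is triggered from cuBLAS-backed ops
    if not deterministic or workspace_config_ok:
        return
    msg = ("... not deterministic because it uses CuBLAS ... set "
           "CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8 ...")
    if warn_only:  # this branch exists in v1.13.0 but not in v1.12.1
        warnings.warn(msg)
    else:
        raise RuntimeError(msg)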

@vfdev-5 mentioned this pull request Oct 30, 2022
@vfdev-5 (Collaborator) commented Oct 30, 2022

@louis-she does your code sample show a warning with 1.13.0 and CUDA 11.6?
In my case, it does not show anything... It seems this is also related to the CUDA version, as 1.12.1 with CUDA 11.6 does not raise any error...

@louis-she (Contributor, Author) commented:

Hmm, it seems NVIDIA made torch.nn.Linear deterministic in cu116. Another test snippet:

import torch
import torchvision

warn_only = False
torch.use_deterministic_algorithms(True, warn_only=warn_only)
if warn_only:
    assert torch.is_deterministic_algorithms_warn_only_enabled()

model = torchvision.models.swin_s().cuda()
model(torch.rand(2, 3, 224, 224, device="cuda"))

Here are some experiment results:

| torch version       | warn_only | behavior     |
|---------------------|-----------|--------------|
| torch==1.12.1+cu113 | True      | RuntimeError |
| torch==1.12.1+cu113 | False     | RuntimeError |
| torch==1.12.1+cu116 | True      | RuntimeError |
| torch==1.12.1+cu116 | False     | RuntimeError |
| torch==1.13.0+cu116 | True      | UserWarning  |
| torch==1.13.0+cu116 | False     | RuntimeError |
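
As a side note, a minimal sketch for checking the UserWarning case programmatically, assuming the warning text matches the RuntimeError message shown above:

import warnings
import torch

torch.use_deterministic_algorithms(True, warn_only=True)
model = torch.nn.Linear(10, 10, device="cuda")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    model(torch.rand(1, 10, device="cuda"))

# On torch==1.13.0+cu116 with warn_only=True we expect a warning mentioning CuBLAS
# instead of a RuntimeError; the exact message text is an assumption.
assert any("CuBLAS" in str(w.message) for w in caught)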
