improve deterministic engine #2756
@vfdev-5 could you take a look at the failed tests? For the failed RL examples, I cannot reproduce them on my own machine.
Thanks for the PR, @louis-she!
Yes, I think it is related to the version. I upgraded to the latest version of gym.
Did you check with Python 3.7?
I'm using Can we remove the
Yes, let's remove -qq. I propose to create a separate PR for the CI fix. For the Neptune logger fix we can ping one of their folks for review.
OK, then I'll create another PR to remove the
ignite/engine/deterministic.py
Outdated
# according to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility
# CUBLAS_WORKSPACE_CONFIG must be set to let cuBLAS behave deterministic.
# **the behavior is expected to change in a future release of cuBLAS**.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
I'm not a fan of doing that in ignite. If this call is necessary, it should be done by pytorch...
Reading the docs:
set the debug environment variable CUBLAS_WORKSPACE_CONFIG to ":16:8" (may limit overall performance) or ":4096:8" (will increase library footprint in GPU memory by approximately 24MiB).
I do not think that we want to set a debug env variable.
@louis-she let's remove that.
EDIT: I checked pytorch docs https://pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html?highlight=use_deterministic_algorithms#torch.use_deterministic_algorithms and see this suggestion...
If one of these environment variable configurations is not set, a RuntimeError will be raised from these operations when called with CUDA tensors:
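The concern raised in this thread (not overriding a debug variable on the user's behalf) can be illustrated with a minimal sketch. It only sets CUBLAS_WORKSPACE_CONFIG when the user has not already chosen a value. The helper name `ensure_cublas_workspace_config` and the `environ` parameter are hypothetical and not part of ignite or PyTorch; the two allowed values are the ones quoted from the docs above. This is not the code merged in this PR.

```python
import os

# The two configurations quoted from the cuBLAS / PyTorch docs above.
ALLOWED_CUBLAS_CONFIGS = (":4096:8", ":16:8")


def ensure_cublas_workspace_config(default=":4096:8", environ=os.environ):
    """Set CUBLAS_WORKSPACE_CONFIG unless it is already set.

    Hypothetical helper for illustration only. Returns the value in effect.
    """
    if default not in ALLOWED_CUBLAS_CONFIGS:
        raise ValueError(f"unsupported CUBLAS_WORKSPACE_CONFIG: {default!r}")
    # setdefault leaves an existing, user-provided value untouched.
    return environ.setdefault("CUBLAS_WORKSPACE_CONFIG", default)


def enable_strict_determinism():
    """Assumed usage: set the variable first, then ask PyTorch for determinism."""
    ensure_cublas_workspace_config()
    import torch  # imported lazily so the helper above works without PyTorch

    torch.use_deterministic_algorithms(True)
```

To be safe, the variable should be in place before the first CUDA operation of the process, which is why the sketch sets it before touching torch.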
ignite/engine/deterministic.py
Outdated
# CUBLAS_WORKSPACE_CONFIG must be set to let cuBLAS behave deterministic.
# **the behavior is expected to change in a future release of cuBLAS**.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
torch.use_deterministic_algorithms(True)
Maybe we should set warn_only=True so that we do not break existing code and only raise warnings about non-deterministic implementations.
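The suggestion above could look roughly like the following sketch. It passes `warn_only` only when the given function accepts it, on the assumption that older PyTorch releases lack that keyword. The helper name `enable_determinism` is hypothetical, and the function is passed in as an argument so the sketch can be tried without PyTorch installed.

```python
import inspect


def enable_determinism(use_deterministic_algorithms, warn_only=True):
    """Enable deterministic algorithms, preferring warnings over errors.

    Hypothetical helper. `use_deterministic_algorithms` is expected to be
    `torch.use_deterministic_algorithms` or a compatible callable.
    Returns True if the warn_only keyword was passed, False otherwise.
    """
    params = inspect.signature(use_deterministic_algorithms).parameters
    if "warn_only" in params:
        use_deterministic_algorithms(True, warn_only=warn_only)
        return True
    # Fallback for callables without the warn_only keyword (strict mode only).
    use_deterministic_algorithms(True)
    return False
```

Assumed usage would be `enable_determinism(torch.use_deterministic_algorithms)`.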
69b29fd to ce1d363
LGTM, thanks @louis-she
@louis-she you were right, GPU tests are failing for the deterministic engine:
I do not quite understand why it fails with a RuntimeError when we asked for a warning.
Looks like it was related to multiple GPU nodes. Let me look at this.
I'm not sure if this is a bug of
import torch
torch.use_deterministic_algorithms(True, warn_only=True)
assert torch.is_deterministic_algorithms_warn_only_enabled()
torch.nn.Linear(10, 10, device="cuda")(torch.rand(1, 10, device="cuda"))
raise
The error will not be raised with v1.12.1 https://github.com/pytorch/pytorch/blob/v1.12.1/aten/src/ATen/Context.cpp#L126 For v1.12.1, there is no
@louis-she does your code sample show a warning with 1.13.0 and CUDA 11.6?
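A question like this can be checked with a small probe that runs an operation and reports whether it raised a RuntimeError, emitted a warning, or ran silently. This is a minimal sketch; the helper name `probe` is hypothetical, and it assumes that PyTorch's warn_only messages surface through Python's `warnings` module.

```python
import warnings


def probe(fn):
    """Run fn() and classify the outcome as 'error', 'warning' or 'ok'.

    Hypothetical helper for comparing behaviour across library versions.
    Returns a (status, message) tuple.
    """
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # do not let filters hide warnings
        try:
            fn()
        except RuntimeError as exc:
            return "error", str(exc)
        if caught:
            return "warning", str(caught[0].message)
    return "ok", ""


# Assumed usage (requires PyTorch with CUDA, so it is left commented out):
# import torch
# torch.use_deterministic_algorithms(True, warn_only=True)
# layer = torch.nn.Linear(10, 10, device="cuda")
# print(torch.__version__, probe(lambda: layer(torch.rand(1, 10, device="cuda"))))
```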
Hmm seems like NVIDIA makes
import torch
import torchvision
warn_only = False
torch.use_deterministic_algorithms(True, warn_only=warn_only)
if warn_only:
    assert torch.is_deterministic_algorithms_warn_only_enabled()
model = torchvision.models.swin_s().cuda()
model(torch.rand(2, 3, 224, 224, device="cuda"))
Here are some experiment results:
#2754
The neptune-client has moved some APIs to their legacy package, see neptune-ai/neptune-client#1039