torch.linalg.eigh fails on GPU and corrupts memory #105359
Comments
I can repro this issue. Now, let's take the SVD of the matrix. Its smallest singular values are
It's clear that this matrix is very close to being singular. In particular, this falls under https://pytorch.org/docs/main/notes/numerical_accuracy.html#extremal-values-in-linalg, so the failure is expected. As recommended there, if you are working with very singular matrices, there are two ways of going about it:
Option 1 works in this case. All this being said, cusolver should not crash in an unrecoverable way if possible. cc @IvanYashchuk @xwang233 This issue was already reported in #94772, as you mentioned. Let's continue the discussion there.
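For readers who want to run the kind of singular-value check described in this comment, here is a minimal sketch. It does not use the matrix attached to the issue: `A` is a synthetic rank-7 float32 stand-in, and the size `n = 256` is an assumption chosen only for illustration. It runs on CPU so it does not depend on any particular cuSOLVER version.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-in for the attached matrix: a symmetric PSD float32
# matrix of rank 7 (the real matrix from the issue is not reproduced here).
n, rank = 256, 7
factor = torch.randn(n, rank, dtype=torch.float32)
A = factor @ factor.T

# torch.linalg.svdvals returns the singular values in descending order.
# Computing them in float64 keeps the tiny trailing values meaningful.
s = torch.linalg.svdvals(A.double())

print("largest singular values: ", s[:3])
print("smallest singular values:", s[-3:])
print("smallest / largest:      ", (s[-1] / s[0]).item())
```

Only the first `rank` singular values are of order one; the rest come from float32 rounding in building `A`, which is what "very close to being singular" looks like numerically.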
…conditioned, in some cusolver version (#107082)
Related: #94772, #105359
I can locally reproduce this crash with pytorch 2.0.1 stable pip binary. The test already passes with the latest cuda 12.2 release.
Re: #94772 (comment)
> From discussion in triage review:
> - [x] we should add a test to prevent regressions
> - [x] properly document support wrt different CUDA versions
> - [x] possibly add support using MAGMA
Pull Request resolved: #107082
Approved by: https://github.com/lezcano
🐛 Describe the bug
torch.linalg.eigh fails on some large low-rank float32 matrices on GPU, but succeeds on CPU or when the matrix is cast to float64 (see the similar issue #94772). After the failure, the matrix cannot be accessed again without causing a CUDA illegal memory access error. An example matrix that fails can be found here: rank7_idx0.1.3.0_iter100_factor.pt.zip
This matrix was generated when applying the Shampoo optimizer to HF T5 finetuning.
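Since the report says the decomposition succeeds when the matrix is cast to float64, one possible workaround is sketched below. The helper name `eigh_in_float64` is hypothetical (it is not a PyTorch API), and the test matrix is a synthetic rank-7 stand-in rather than the attached file. Whether float64 avoids the crash on every cuSOLVER version is not established here; the issue only reports that it worked for this matrix.

```python
import torch


def eigh_in_float64(matrix: torch.Tensor):
    """Hypothetical helper: run eigh in float64, then cast back to the input dtype."""
    eigenvalues, eigenvectors = torch.linalg.eigh(matrix.to(torch.float64))
    return eigenvalues.to(matrix.dtype), eigenvectors.to(matrix.dtype)


torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Synthetic low-rank symmetric float32 matrix (assumed shape, for illustration only).
factor = torch.randn(512, 7, device=device)
A = factor @ factor.T
A = (A + A.T) / 2  # enforce exact symmetry before calling eigh

L, Q = eigh_in_float64(A)

# Sanity check: Q diag(L) Q^T should reconstruct A up to float32 rounding.
residual = (Q @ torch.diag(L) @ Q.T - A).abs().max().item()
print("max reconstruction error:", residual)
```

The cast costs extra memory and time for large matrices, so it is a trade-off rather than a fix for the underlying cuSOLVER behaviour.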
cusolver eigendecomposition error:
CUDA illegal memory access error:
cc @ezyang @gchanan @zou3519 @ptrblck @jianyuh @nikitaved @pearu @mruberry @walterddr @IvanYashchuk @xwang233 @lezcano @hjmshi @mikerabbat @dmudigere @tsunghsienlee @awgu @wanchaol @gallego-posada
Versions