[flashv2/BW] nan in some configurations #443

Closed
danthe3rd opened this issue Aug 11, 2023 · 7 comments

danthe3rd (Contributor) commented Aug 11, 2023

Hi, after upgrading Flash v2 to 2.0.4 (in facebookresearch/xformers#816), we still see some test failures in xformers. Here is a simple repro.

Repro code:

# Tested with
# flash-attn 2.0.4 (v2.0.4 - d30f2e1cd50185c98ed88c0684b4a603f15bee37)
# torch==2.0.0
# NVIDIA A100-SXM4-80GB
# cuda 11.8
import torch
import flash_attn


q_cuseqlen = torch.tensor([0, 76, 110, 256], device='cuda', dtype=torch.int32)  # 3 query sequences (lengths 76, 34, 146)
k_cuseqlen = torch.tensor([0, 1, 2, 3], device='cuda', dtype=torch.int32)  # 3 key sequences of 1 key each
Mq = 256  # total queries across sequences (also passed as max_seqlen_q below)
Mk = 3  # total keys across sequences (also passed as max_seqlen_k below)
H = 1  # number of heads
K = 32  # head dimension

torch.manual_seed(0)
q = torch.randn([Mq, H, K], dtype=torch.float16, device="cuda") * 3
k, v = [torch.randn([Mk, H, K], dtype=torch.float16, device="cuda") * 3 for _ in range(2)]
q.requires_grad_(True)
k.requires_grad_(True)
v.requires_grad_(True)

grad = torch.full_like(q, 1.0)

out = flash_attn.flash_attn_varlen_func(q, k, v, q_cuseqlen, k_cuseqlen, Mq, Mk, causal=True)  # causal=True is what triggers the bug (see below)
out.backward(grad)

print("flash_attn", flash_attn.__version__)
print("Q gradient:", "NaNs!" if q.grad.isnan().any().item() else "OK")
print("K gradient:", "NaNs!" if k.grad.isnan().any().item() else "OK")
print("V gradient:", "NaNs!" if v.grad.isnan().any().item() else "OK")

Output:

flash_attn 2.0.4
Q gradient: NaNs!
K gradient: OK
V gradient: OK

tmm1 (Contributor) commented Aug 11, 2023

I replicated the results above on a 3090 as well. The result is the same when changing float16 to bfloat16.
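
For reference, a minimal sketch of that variant, assuming the setup from the repro script above (only the dtype changes):

# Hedged sketch: same repro with bfloat16 instead of float16; per the
# report above, q.grad still comes back with NaNs.
q = torch.randn([Mq, H, K], dtype=torch.bfloat16, device="cuda") * 3
k, v = [torch.randn([Mk, H, K], dtype=torch.bfloat16, device="cuda") * 3 for _ in range(2)]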

tridao (Contributor) commented Aug 11, 2023

Thanks for the bug report.
Something I don't quite understand: the cuseqlen says there are 2 sequences, one from index 0 to 46 and one from 46 to 256. However, the K and V passed in only have length 2, so they don't agree with what cuseqlen describes.

When I changed K and V to have length 256, the gradients were OK.

Do you mean to pass in a different cuseqlen for K and V?
When I pass in a different cuseqlen_k = torch.tensor([0, 1, 2], device=device, dtype=torch.int32), the gradients are also OK.
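
In other words (a hedged reading of the comment above), each cu_seqlens tensor must be consistent with the packed tensor it describes. A quick sanity check, reusing the names from the repro script:

# Hedged sketch: invariants the varlen interface appears to expect
# (my reading of the comment above, not an official check from the library).
assert q_cuseqlen[-1].item() == q.shape[0]  # last entry == total number of queries
assert k_cuseqlen[-1].item() == k.shape[0]  # last entry == total number of keys
assert len(q_cuseqlen) == len(k_cuseqlen)   # same number of sequences for q and k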

danthe3rd (Contributor, Author) commented

Whoops, my bad indeed. Let me close this and reopen once I figure out my issue.

danthe3rd (Contributor, Author) commented Aug 16, 2023

Reopening: I fixed the repro script above.
The issue only happens with causal=True (although in this case, with 1 key per sequence, it is equivalent to setting causal=False); see the sketch below.
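
A quick way to see this, appended to the end of the repro script (a hedged sketch, not part of the original report):

# Hedged sketch: rerun the same call with causal=False; per the comment
# above, the gradients then come out clean.
q.grad = k.grad = v.grad = None  # reset the gradients from the first run
out = flash_attn.flash_attn_varlen_func(q, k, v, q_cuseqlen, k_cuseqlen, Mq, Mk, causal=False)
out.backward(grad)
print("Q gradient (causal=False):", "NaNs!" if q.grad.isnan().any().item() else "OK")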

danthe3rd reopened this Aug 16, 2023
tridao (Contributor) commented Aug 16, 2023

I can reproduce the bug now, thank you @danthe3rd! I'm investigating.

tridao (Contributor) commented Aug 16, 2023

I've (hopefully) fixed this in v2.0.8. CI is building all the CUDA wheels now. Thanks again for the bug report!
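
To confirm the fix is picked up locally, a minimal sketch (assuming the packaging module is available; the version bound is the v2.0.8 mentioned above):

# Hedged sketch: check that the installed flash-attn is at least v2.0.8,
# the release said to contain the fix.
from packaging import version
import flash_attn

assert version.parse(flash_attn.__version__) >= version.parse("2.0.8")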

danthe3rd (Contributor, Author) commented

Confirming that all xformers tests pass now on A100 :)
Thanks a lot for the prompt fix!
