Enable float8 attention support (q/k/v) #1382
base: main
Conversation
Summary: att. Right now we need to manually add a quantize call for q/k/v before the sdpa op, but we can explore other APIs in the future.
Test Plan: TBD
Reviewers:
Subscribers:
Tasks:
Tags:
q_float8_data = q_tensor_impl.float8_data
# change from scalar to tensor of size [1]
q_scale = q_tensor_impl.scale
q_scale = torch.tensor([q_scale], device=q_scale.device)
are the scales on host?
You mean q_scale before we call torch.tensor? They should be on the same device as the original weight, I think, so they should already be on CUDA.
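For context, a minimal standalone sketch (hypothetical values, not code from this PR) of what the conversion above does: the scale starts out as a scalar tensor on the same device as the quantized data, and wrapping it keeps it on that device.

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
scale = torch.tensor(0.05, device=device)                      # scalar (0-dim) scale
scale_1d = torch.tensor([scale.item()], device=scale.device)   # tensor of size [1]
assert scale_1d.shape == (1,) and scale_1d.device == scale.device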
Overall looks good, left a few comments. I think we can also add the int8 kernel for CPU when it becomes available.
from torchao.quantization.quant_api import _float8_symmetric_per_tensor_quant
original_dtype = v.dtype
if q.shape[-1] in [64, 128, 256]:
    q = _float8_symmetric_per_tensor_quant(q)
We also likely need/want to apply the Hadamard transform. I don't remember offhand if this is included in the FA3 API.
Didn't see it there; maybe we can add it after SpinQuant is integrated.
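For background (not part of this PR): rotating q and k with the same orthonormal Hadamard matrix leaves the attention scores unchanged while spreading activation outliers across channels, which is why it helps before low-precision quantization. A minimal sketch, with the matrix built by the Sylvester construction:

import torch

def hadamard(n, dtype=torch.float32, device="cpu"):
    # Sylvester construction; n must be a power of two
    assert n & (n - 1) == 0
    H = torch.ones(1, 1, dtype=dtype, device=device)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1), torch.cat([H, -H], dim=1)], dim=0)
    return H / n ** 0.5  # orthonormal: H @ H.T == I

q = torch.randn(2, 8, 16, 64)  # (batch, heads, seq, head_dim)
k = torch.randn(2, 8, 16, 64)
H = hadamard(q.shape[-1])
q_rot, k_rot = q @ H, k @ H
# attention scores are preserved because H @ H.T == I
torch.testing.assert_close(
    q @ k.transpose(-1, -2), q_rot @ k_rot.transpose(-1, -2), rtol=1e-4, atol=1e-3
)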
Can we update the current SAM2 readme with all the ao optimizations we have introduced?
Summary:
This PR integrates the FlashAttention 3 kernel (https://github.com/Dao-AILab/flash-attention/blob/1feb711f46563960fc10a8e659c93c300619504b/flash_attn/flash_attn_interface.py#L1102) with the float8 affine quantized tensor.
To use the kernel, we currently need to manually add a quantize call for q/k/v before the sdpa op, but we can explore other APIs in the future.
@sijiac is working on adding new variations of the attention implementation in the future (per-row, per-column, per-block scaling, etc.).
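A hypothetical usage sketch of the flow described above (shapes, dtypes, and the dispatch comment are assumptions; _float8_symmetric_per_tensor_quant is the helper that appears in this PR's diff, and the exact public API may change):

import torch
import torch.nn.functional as F
from torchao.quantization.quant_api import _float8_symmetric_per_tensor_quant

# toy q/k/v with an FA3-friendly head dimension (64)
q = torch.randn(1, 8, 128, 64, dtype=torch.bfloat16, device="cuda")
k = torch.randn(1, 8, 128, 64, dtype=torch.bfloat16, device="cuda")
v = torch.randn(1, 8, 128, 64, dtype=torch.bfloat16, device="cuda")

# manual quantize call for q/k/v before the sdpa op
q_f8 = _float8_symmetric_per_tensor_quant(q)
k_f8 = _float8_symmetric_per_tensor_quant(k)
v_f8 = _float8_symmetric_per_tensor_quant(v)

# per the PR description, sdpa on float8 affine quantized tensors dispatches
# to the FlashAttention 3 float8 kernel (attn_mask is not supported there)
out = F.scaled_dot_product_attention(q_f8, k_f8, v_f8)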
Test Plan:
python test/dtypes/test_affine_quantized_float.py -k test_float8_attention
SAM2
Tested on SAM2; it seems to be a bit slower than before. This is reasonable because SAM2 uses head dimensions of 16 and 32, while FA3 requires a minimum head dimension of 64, so we need to do some padding to make this work (e.g. pad 32 to 64), which is expected to increase runtime significantly.
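An illustrative sketch of the padding described above (not the code in this PR, shown on unquantized tensors for simplicity): the head dimension is zero-padded up to 64, the softmax scale is pinned to the original head_dim so numerics match the unpadded op, and the padded value columns are sliced off the output.

import torch
import torch.nn.functional as F

head_dim, min_dim = 32, 64
q = torch.randn(1, 8, 128, head_dim)
k = torch.randn(1, 8, 128, head_dim)
v = torch.randn(1, 8, 128, head_dim)

pad = (0, min_dim - head_dim)                  # pad only the last dimension
q_p, k_p, v_p = (F.pad(t, pad) for t in (q, k, v))
out = F.scaled_dot_product_attention(q_p, k_p, v_p, scale=head_dim ** -0.5)
out = out[..., :head_dim]                      # drop the padded value columns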
llama2
llama2 without fallback: doesn't work because attn_mask is not supported.
llama2 numerics only (just for testing, code is not checked in): tested on llama2 with fallback to check numerics. Since attn_mask is not supported in the FlashAttention 3 kernel, it uses the fallback path: https://github.com/pytorch/ao/pull/1382/files#diff-3019e8f38b0919dbaba5aa1329a697e89fc98749e35a7bdc274c71a0d3738ec2R285
Reviewers:
Subscribers:
Tasks:
Tags:
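For readers following the fallback link above, a hypothetical sketch of the shape of that dispatch (names and the dequantize step are illustrative assumptions, not the PR's code): when a mask is present the float8 inputs are dequantized and routed through the regular sdpa path; otherwise the FA3 float8 kernel is used.

import torch.nn.functional as F

def float8_sdpa(q_f8, k_f8, v_f8, attn_mask=None):
    # hypothetical wrapper: the FA3 float8 kernel has no attn_mask support
    if attn_mask is not None:
        return F.scaled_dot_product_attention(
            q_f8.dequantize(), k_f8.dequantize(), v_f8.dequantize(),
            attn_mask=attn_mask,
        )
    return _fa3_float8_attention(q_f8, k_f8, v_f8)  # hypothetical kernel call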