Where is the code about "remaining layers use faster half precision accumulate"? #10

Open
goldhuang opened this issue Sep 3, 2024 · 5 comments

Comments

@goldhuang

Flux diffusion model implementation using quantized fp8 matmul & remaining layers use faster half precision accumulate, which is ~2x faster on consumer devices.
Hello there!
Thanks for sharing your quantization implementation of Flux!
I have a question about "remaining layers use faster half precision accumulate". Could you help point out the lines in the repo that enable the "faster half precision accumulate"?
Thanks in advance!

@aredden (Owner) commented Sep 4, 2024

It's the CublasLinear layers. CublasLinear comes from a repo I made which allows matmuls to run with half precision accumulate inside the matmul kernel, which doubles the TFLOPS on most consumer GPUs. The source is here: https://github.com/aredden/torch-cublas-hgemm. The CublasLinear replacements happen in the float8_quantize.py file, I believe, so that's where it occurs.
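
For readers following along, here's a rough sketch of what that kind of swap could look like; it is not code from this repo (the real logic lives in float8_quantize.py), and the actual CublasLinear import path and constructor in torch-cublas-hgemm may differ:

```python
import torch.nn as nn

# Hypothetical import -- check torch-cublas-hgemm for the real module path/name.
# from cublas_ops import CublasLinear

def swap_linears(module: nn.Module, linear_cls) -> nn.Module:
    """Recursively replace nn.Linear children with `linear_cls`, copying weights."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            replacement = linear_cls(
                child.in_features, child.out_features, bias=child.bias is not None
            )
            replacement.weight.data.copy_(child.weight.data)
            if child.bias is not None:
                replacement.bias.data.copy_(child.bias.data)
            # fp16 weights, since fp16 accumulate requires fp16 inputs
            setattr(module, name, replacement.half())
        else:
            swap_linears(child, linear_cls)
    return module
```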

@goldhuang (Author)

@aredden Thanks for your detailed answer!
I have two follow-up questions:

  1. Why do you only replace the linear layers in the single/double blocks with fp8?
  2. Why does CublasLinear only support float16?

@aredden (Owner) commented Sep 5, 2024

  1. You can optionally quantize the others by setting "quantize_flow_embedder_layers": true, but it reduces quality pretty considerably and doesn't save much extra VRAM or increase it/s. The non-single-or-double-block layers only make up ~2% of the model's actual weights, but they have a considerable effect on quality.

  2. Well, if you check out the Ada whitepaper, you'll find that the top theoretical TFLOPS for fp16 w/ fp32 accumulate is ~160 for the 4090, but ~330 for fp16 w/ fp16 accumulate. Unfortunately you cannot use fp16 accumulate with anything other than fp16 tensors, and bf16 cannot be used as the accumulation datatype, so the only way to achieve those TFLOPS on consumer GPUs is via fp16. It's actually the same speed as fp8 matmul! (See the rough benchmark sketch below.)
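
To get a feel for the gap on your own card, here's a rough micro-benchmark sketch (not from this repo). It toggles PyTorch's `torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction` flag, which permits reduced-precision (fp16) accumulation in fp16 GEMMs; note this is only a hint to cuBLAS, whereas torch-cublas-hgemm requests fp16 accumulate explicitly in its kernels:

```python
import time
import torch

def bench_matmul(n: int = 8192, iters: int = 50) -> float:
    """Return measured TFLOPS for an n x n fp16 matmul on the current GPU."""
    a = torch.randn(n, n, device="cuda", dtype=torch.float16)
    b = torch.randn(n, n, device="cuda", dtype=torch.float16)
    for _ in range(5):  # warmup
        a @ b
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return 2 * n**3 * iters / elapsed / 1e12

torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False
print(f"fp32 accumulate: {bench_matmul():.1f} TFLOPS")
torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = True
print(f"fp16 accumulate (allowed): {bench_matmul():.1f} TFLOPS")
```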

@spejamas

Hey @aredden, will a datacenter GPU (an L40S, for example) get any benefit from the cublas swap?

@aredden (Owner) commented Oct 31, 2024

Not really. It has enough SRAM that it gets the same TFLOPS for fp16 w/ fp32 accumulate as it does for fp16 w/ fp16 accumulate. @spejamas
