int8 dynamic quant + bsr support #821

jcaip · 2024-09-06T13:36:15Z

This PR, based on #891 adds in int8 dynamicquant + bsr support. In this PR is a pretty large refactor of the superblock code, mainly removing duplicated code. There's still some work to be done but this PR is already getting pretty big, so want to check it in.

Changes:

Use i8i8 -> bf16 matmul to maintain accuracy
Added a block sparse layout type to AffineQuantizedTensor + check/impl.
Cleaned up benchmark.py script and add a single line benchmark.sh file for acceleration numbers
Updated eval.py and added a single line evaluate.sh file for accuracy numbers
Lots of lint formatting and README updates
torch.compile now working

TODO:

move sparse layouts to their own folder
promote blocksparse.py from prototype and add some tests
move evaluate.py/benchmark.py out of superblock and into the top level folders?
move superblock.py into training folder?

This PR is a prototype of adding int8 dynamic quant with BSR support.

Currently it works, but is not performant. (1.6 img/s vs 2.0 baseline) on Vit-H

This is because the block sparse mm is added as a custom op, since we cannot trace through the creation of a sparse bsr tensor in torch.compile:
Instead, we created a custom op that takes in the data tensors of a bsr tensor (crow_indicies, col_indicies, values) and wrapper subclass that holds these values. See here

This causes an issue for the quantization workflow, since we cannot fuse across the custom op. It's pretty much the same issue we face with 2:4 sparsity, where we materialize the full int32 intermediary matrix, which eats into our composed quant + sparse speedups.

Normally we avoid materializing the int32 intermediary matrix by fusing the dequant into the matmul, but we can't do so here because we use a custom op. I think the long term solution here is to just rip out the BSR triton code in core and have it in AO, in that case we can change the function to take in the data tensors themselves instead of a sparse bsr tensor, avoiding this need for a custom op.

Good news though, accuracy is not degraded at all when composing int8 and block sparsity

cc @cpuhrsch

pytorch-bot · 2024-09-06T13:36:18Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/821

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit a2519d8 with merge base 7dff17a ():
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

cpuhrsch · 2024-09-06T18:16:08Z

cc @pearu

We could do epilogue fusion in core if we're adding _int_bsr_dense_addmm as a Template to https://github.com/pytorch/pytorch/tree/196748d49193eed2b000aa99b2c7bb1bff576d35/torch/_inductor/kernel .

We can do that similar to how we did it for _int_mm: https://github.com/pytorch/pytorch/pull/111125/files#diff-e1a7b1be52a02ad5595c2f38c1e5c1ac3086fd49b6638482ae192249f070f4dc

jerryzh168 · 2024-09-10T03:56:37Z

torchao/dtypes/affine_quantized_tensor.py

@@ -613,6 +620,163 @@ def from_plain(
        int_data_compressed = torch._cslt_compress(int_data)
        return cls(int_data_compressed, scale, zero_point, layout_type)

+@register_layout_cls(BlockSparseLayoutType)


@jcaip do you want to move these to sparsity folder? now we should be able to do everything outside of affine_quantized_tensor after #783

(meaning register layout_type/layout tensor and dispatch/impl for quantized linear

No, not for this PR at this time. I think we should resolve the perf issues before we merge.

cpuhrsch · 2024-09-10T21:31:13Z

I wonder if a sufficiently large batch size would eventually show a gain even without epilogue fusion.

jcaip · 2024-09-10T22:53:20Z

cc @cpuhrsch It's possible, the ImageNet benchmark is not the kindest benchmark either since we can't quantize the attention layers as well. So we may be able to see perf gains with SAM as well. I added this as a hackathon project for CUDA-MODE so maybe we can pick up a contributor there.

pearu

When running

python benchmark.py --model vit_h_14   --batch-size 256   --sparsity-linear 0.8   --sp-linear-tile-size 64 --bsr 64 --sparsity bsr --quantization

notice the warning messages:

torch/sparse/_triton_ops.py:795: UserWarning: bsr_dense_addmm uses non-optimal triton kernel parameters for M=5120 K=1280 N=65792 Ms=64, Ks=64 beta=0 alpha=1
  warn_once(
torch/sparse/_triton_ops.py:795: UserWarning: bsr_dense_addmm uses non-optimal triton kernel parameters for M=1280 K=5120 N=65792 Ms=64, Ks=64 beta=0 alpha=1

When using the tuned parameters, the performance increase is about 0.1 img/s.

I'll create a torch PR with the corresponding tuned parameters shortly.

pearu · 2024-09-13T14:51:51Z

torchao/sparsity/prototype/superblock/blocksparse.py

+
+    # original_batch_dims_broadcasted = broadcast_batch_dims("_int_bsr_dense_addmm", weight_bsr, A)
+    # input = torch.zeros(M, N, dtype=torch.int32, device=A.device)
+    return bsr_dense_mm(weight_bsr, A).t().contiguous()


Suggested change

return bsr_dense_mm(weight_bsr, A).t().contiguous()

return bsr_dense_mm(weight_bsr, A).t()

That should increase the performance of

benchmark.py --model vit_h_14 --batch-size 256 --sparsity-linear 0.8 --sp-linear-tile-size 64 --bsr 64 --sparsity bsr --quantization

by ~0.9 img/s.

pearu · 2024-09-13T14:52:47Z

torchao/dtypes/affine_quantized_tensor.py

+    w_vals = weight_tensor.layout_tensor
+    w_scales = weight_tensor.layout_tensor.scale
+    tmp = x_vals_int8.reshape(-1, x_vals_int8.shape[-1])
+    tmp_t = tmp.t().contiguous()


Suggested change

tmp_t = tmp.t().contiguous()

tmp_t = tmp.t()

to increase the performance about 0.2 img/s.

pearu · 2024-09-14T21:10:00Z

torchao/dtypes/affine_quantized_tensor.py

+    y = torch.ops.blocksparse.int_mm(w_vals.crow_indices(),
+                                          w_vals.col_indices(),
+                                          w_vals.values(),
+                                          w_vals.shape[0],
+                                          w_vals.shape[1],
+                                          tmp_t)
+
+    # breakpoint()
+
+
+    y = x_scales.reshape(-1, 1) * y
+
+    y = (y * w_scales).reshape(
+        *x_vals_int8.shape[:-1], y.shape[-1]
+    )


TODO: move scaling operations to bsr_dense_addmm kernel.

pytorch/pytorch#136104

jcaip · 2024-09-25T06:48:16Z

cc @pearu @cpuhrsch

I've pulled in your changes to this PR here, seeing some nice speedups! Unfortunately I think I'm running into some saturation issues when doing the int8 int8 -> int8 BSR matmul.

This is causing poor accuracy when running BSR with quantization, evaluate.sh should print out the results. I'm seeing 0 accuracy when running with BSR+ quant. I can reproduce this is the test too, when I bump up the matrix sizes, the results I'm seeing for the quantized mm is quite different from the reference result:

We ran into something similar with cuSPASRELt and we solved that by adding support for i8i8->bf16 matmul. Would it be possible to add something similar to the BSR kernels? Something like aout_dtype option would work.

pearu · 2024-09-25T07:35:40Z

Would it be possible to add something similar to the BSR kernels? Something like aout_dtype option would work.

We have it in terms of out argument. In blocksparse_int_addmm, try changing

out = A.new_empty(original_batch_dims_broadcasted + (M, N))

to

out = A.new_empty(original_batch_dims_broadcasted + (M, N), dtype=torch.int32)

that should resolve the saturation problem. I did't use bfloat16 as out dtype because triton dot accumulator uses int32.

Notice that we don't have tuned parameters for bsr_dense_addmm(<int8>)->int32, so the performance is expected to be lower. Using --tune-kernel-params will not be helpful either because optimize_bsr_dense_addmm assumes that out.dtype is int8 if dtype is int8.
There is an easy fix though: we'll need to add out_dtype parameter to optimize_bsr_dense_addmm. I'll cook up a PR for this..

jcaip · 2024-09-25T08:51:03Z

Nice, that makes sense.I still run into numerical issues with torch.int32, but passing in torch.bfloat16 seems to fix those :)

i8i8->bf16 eval:

Test:  Acc@1 77.087 Acc@5 105.867
vit_b_16,256,bfloat16,bsr,64,0.8,True,77.08733974358974

i8i8->i8/32 eval:

Test:  Acc@1 0.495 Acc@5 1.945                                                                                       
vit_b_16,256,bfloat16,bsr,64,0.8,True,0.4947916666666667

Will spend some time benchmarking perf + closing out the PR tomorrow, but preliminary seeing 436.59 ms for bf16 vs 362.064 ms for int8.

As in the title. Addresses the task in pytorch/ao#821 (comment) [ghstack-poisoned]

pearu · 2024-09-25T10:59:42Z

There is an easy fix though: we'll need to add out_dtype parameter to optimize_bsr_dense_addmm. I'll cook up a PR for this..

Done in pytorch/pytorch#136626

pearu · 2024-09-25T11:12:25Z

I still run into numerical issues with torch.int32

Can you provide a simple reproducer?

What is the size of inputs to bsr_dense_addmm? In the worst case scenario (all entries have value 2**7), the maximal intermediate size for int8 @ int8 -> int32 is about 2**31 / 2**14 = 131072 when int32 accumulator would not overflow.

jcaip · 2024-09-25T15:19:39Z

@pearu

I don't think it is a saturation issue, but rather a numerical one. I think the outputs of the layer are small values to close 0. When we run with an int intermediately, these all values all get quantized to 0. But this leads to the model spitting out all zeros which ruins the model accuracy.

If you change the dtype to int32 in my branch, you can run sh evaluate.sh after setting IMAGENET_PATH to reproduce.

This PR, adds in int8 dynamicquant + bsr support. Changes: * Use i8i8 -> bf16 matmul to maintain accuracy * Added a block sparse layout type to AffineQuantizedTensor + check/impl. * Cleaned up benchmark.py script and add a single line `benchmark.sh` file for acceleration numbers * Updated eval.py and added a single line `evaluate.sh` file for accuracy numbers * Lots of lint formatting and README updates * torch.compile now working and is correct

…th torch.compile (#904) * [float8] improve eager numerics for dynamic scales * leave torch.linalg.vector_norm for another PR Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * cuda Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * remove _data and investigate Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * remove _data comment Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * upcast to float32 is enough Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * explain why float32 Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * _data parity Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * handle sm8.9 Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * fix transformer unit test Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * print if error Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * Add tutorial for trainable tensor subclass (#908) Summary: The new tutorial provides an example of how to implement a trainable tensor subclass that wraps quantized data. This extends the existing `MyDTypeTensor` with a few necessary steps to ensure proper gradient updates, namely: 1. Define a differentiable constructor 2. Define backward pass for ops of interest (e.g. torch.nn.functional.linear) 3. Handle special ops used by the optimizer (e.g. aten.add, aten.add_) Test Plan: python tutorials/developer_api_guide/my_trainable_tensor_subclass.py * Introducing 1-bit quantization for Llama in torchchat (#910) Differential Revision: D63052325 Pull Request resolved: #911 * Rename Floating point to fp8 (#909) * [float8] fix typo in bitwise_identical unit test (#918) Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * Adding example for quantized tensor + tensor parallelism (#785) * [WIP] Adding example for quantized tensor + tensor parallelism Summary: This PR adds an example of how quantized tensor subclass can work with DTensor: https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/README.md End goal is to rewrite https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/llama2.py with normal llama2 implementation and show case with DTensor + AffineQuantizedTensor + torch.compile we can get on par performance with the custom tensor parallel implementation Test Plan: torchrun --standalone --nnodes=1 --nproc-per-node=4 tutorials/developer_api_guide/tensor_parallel.py Reviewers: Subscribers: Tasks: Tags: * tensor parallel file * Use DTensor.from instead of distribute_tensor * implementing aten.slice.Tensor (WIP) * working * some shape fix and use more quant primitive ops * Add rowwise test * make rowwise sharding work * compile still not working yet * fake tensor didn't pick up shape changes from transpose * backend='eager' * change transpose to non-inplace op * add error message * works now with torch nightly * remove print * ruff * Clean up * Fix device id --------- Co-authored-by: Ke Wen <kw2501@meta.com> * rename cuda mode -> gpu mode (#925) * Add workaround to recover the perf for quantized vit in torch.compile (#926) Add temporary workaround to recover the perf for quantized vit under torch.compile Summary: Recently we found a perf drop in quantized vit due to #898 (comment) This PR add a temp fix until we figure out the longer term fix. I think ideally we should figure out why the tensor subclass check failed in torch.compile (https://github.com/pytorch/pytorch/blob/e4d294221b140fdbb49a64f297bc60c9fcc2f80e/torch/nn/modules/activation.py#L1286) and fix that Test Plan: python tutorials/quantize_vit/run_vit_b_quant.py Reviewers: Subscribers: Tasks: Tags: * clean up device checks in float8 unit test files (#923) Summary: While working on rowwise scaling I noticed that some of the CUDA device capability checks we had in the test files did not make sense, cleaning this up. Test Plan: tests pass on my H100 CI, it should skip less tests now since CI only has CUDA capability 8, 9 Reviewers: Subscribers: Tasks: Tags: * [low-bit optim] Change 8-bit and FP8 optim block size from 2048 to 256 to match new bnb v0.44 (#927) * Float8 autoquant weight only (#866) * Fix failing FP6 benchmark (#931) * Remove two if statements in fp8 padding (#935) Reviewed By: vkuzo Differential Revision: D63051205 Pull Request resolved: #935 Approved by: https://github.com/vkuzo * [Distributed] Improve sharding example (#937) * [Distributed] Improve sharding example * Add comment * Add composable QAT quantizer (#938) Summary: This is a utility for users who wish to apply multiple QAT quantizers to their models. In the near future, we expect to add an embedding QAT quantizer that composes with the existing linear QAT quantizers. Test Plan: python test/quantization/test_qat.py -k test_composable_qat_quantizer * resolve conflict with latest main Differential Revision: D63048850 Pull Request resolved: #912 * Add torchchat quantizer Differential Revision: D62394341 Pull Request resolved: #897 * Add compile tests to test suite (#906) * Add compile tests to test suite Summary: This is a follow up PR addressing #839 (comment) We can add more compiler related tests in the future. Next * refactor a bit to use quantize_ API directly * use the test suite in existing API tests Test Plan: python torchao/testing/utils.py Reviewers: Subscribers: Tasks: Tags: * rename * add result check * Fix up CMakeLists and reorganize some code locations Differential Revision: D62711903 Pull Request resolved: #948 * [float8] all-reduce amax on dp mesh instead of global pg (#933) * [float8] all-reduce amax on dp mesh instead of global pg Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * liner Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * improve comments Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * move hp tensor inside if Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * linter Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * linter Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * linter Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * linter Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * linter Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * int8 dynamic quant + bsr support (#821) This PR, adds in int8 dynamicquant + bsr support. Changes: * Use i8i8 -> bf16 matmul to maintain accuracy * Added a block sparse layout type to AffineQuantizedTensor + check/impl. * Cleaned up benchmark.py script and add a single line `benchmark.sh` file for acceleration numbers * Updated eval.py and added a single line `evaluate.sh` file for accuracy numbers * Lots of lint formatting and README updates * torch.compile now working and is correct * fixing some issues with our support for 70/405B models (#941) Summary: download and convert scripts needed to be updated alongside model.py config files Test Plan: python generate.py --checkpoint_path ../../../checkpoints/meta-llama/Meta-Llama-3.1-70B/model.pth Reviewers: Subscribers: Tasks: Tags: * Update INT8 mixed-precision training test to be less flaky (#950) * Add executorch parallel Differential Revision: D62711909 Pull Request resolved: #953 * test CI Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * better comment on why upcasting Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * control seed Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * move unit test to test_compile Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * fix typo Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * float64 upcasting after allreduce Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * use LinearMMConfig Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: --------- Co-authored-by: andrewor14 <andrewor14@gmail.com> Co-authored-by: Vaishnavi Gupta <vaishnavi10367@gmail.com> Co-authored-by: Apurva Jain <apurvajain.kota@gmail.com> Co-authored-by: Jerry Zhang <jerryzh168@gmail.com> Co-authored-by: Ke Wen <kw2501@meta.com> Co-authored-by: Mark Saroufim <marksaroufim@meta.com> Co-authored-by: Vasiliy Kuznetsov <vkuzo@users.noreply.github.com> Co-authored-by: Thien Tran <gau.nernst@yahoo.com.sg> Co-authored-by: Tobias van der Werff <33268192+tobiasvanderwerff@users.noreply.github.com> Co-authored-by: Shuqi Yang <shuqiyang@meta.com> Co-authored-by: Scott Roy <161522778+metascroy@users.noreply.github.com> Co-authored-by: Jesse Cai <jessecai@meta.com> Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>

This PR, adds in int8 dynamicquant + bsr support. Changes: * Use i8i8 -> bf16 matmul to maintain accuracy * Added a block sparse layout type to AffineQuantizedTensor + check/impl. * Cleaned up benchmark.py script and add a single line `benchmark.sh` file for acceleration numbers * Updated eval.py and added a single line `evaluate.sh` file for accuracy numbers * Lots of lint formatting and README updates * torch.compile now working and is correct

…th torch.compile (pytorch#904) * [float8] improve eager numerics for dynamic scales * leave torch.linalg.vector_norm for another PR Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * cuda Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * remove _data and investigate Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * remove _data comment Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * upcast to float32 is enough Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * explain why float32 Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * _data parity Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * handle sm8.9 Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * fix transformer unit test Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * print if error Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * Add tutorial for trainable tensor subclass (pytorch#908) Summary: The new tutorial provides an example of how to implement a trainable tensor subclass that wraps quantized data. This extends the existing `MyDTypeTensor` with a few necessary steps to ensure proper gradient updates, namely: 1. Define a differentiable constructor 2. Define backward pass for ops of interest (e.g. torch.nn.functional.linear) 3. Handle special ops used by the optimizer (e.g. aten.add, aten.add_) Test Plan: python tutorials/developer_api_guide/my_trainable_tensor_subclass.py * Introducing 1-bit quantization for Llama in torchchat (pytorch#910) Differential Revision: D63052325 Pull Request resolved: pytorch#911 * Rename Floating point to fp8 (pytorch#909) * [float8] fix typo in bitwise_identical unit test (pytorch#918) Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * Adding example for quantized tensor + tensor parallelism (pytorch#785) * [WIP] Adding example for quantized tensor + tensor parallelism Summary: This PR adds an example of how quantized tensor subclass can work with DTensor: https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/README.md End goal is to rewrite https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/llama2.py with normal llama2 implementation and show case with DTensor + AffineQuantizedTensor + torch.compile we can get on par performance with the custom tensor parallel implementation Test Plan: torchrun --standalone --nnodes=1 --nproc-per-node=4 tutorials/developer_api_guide/tensor_parallel.py Reviewers: Subscribers: Tasks: Tags: * tensor parallel file * Use DTensor.from instead of distribute_tensor * implementing aten.slice.Tensor (WIP) * working * some shape fix and use more quant primitive ops * Add rowwise test * make rowwise sharding work * compile still not working yet * fake tensor didn't pick up shape changes from transpose * backend='eager' * change transpose to non-inplace op * add error message * works now with torch nightly * remove print * ruff * Clean up * Fix device id --------- Co-authored-by: Ke Wen <kw2501@meta.com> * rename cuda mode -> gpu mode (pytorch#925) * Add workaround to recover the perf for quantized vit in torch.compile (pytorch#926) Add temporary workaround to recover the perf for quantized vit under torch.compile Summary: Recently we found a perf drop in quantized vit due to pytorch#898 (comment) This PR add a temp fix until we figure out the longer term fix. I think ideally we should figure out why the tensor subclass check failed in torch.compile (https://github.com/pytorch/pytorch/blob/e4d294221b140fdbb49a64f297bc60c9fcc2f80e/torch/nn/modules/activation.py#L1286) and fix that Test Plan: python tutorials/quantize_vit/run_vit_b_quant.py Reviewers: Subscribers: Tasks: Tags: * clean up device checks in float8 unit test files (pytorch#923) Summary: While working on rowwise scaling I noticed that some of the CUDA device capability checks we had in the test files did not make sense, cleaning this up. Test Plan: tests pass on my H100 CI, it should skip less tests now since CI only has CUDA capability 8, 9 Reviewers: Subscribers: Tasks: Tags: * [low-bit optim] Change 8-bit and FP8 optim block size from 2048 to 256 to match new bnb v0.44 (pytorch#927) * Float8 autoquant weight only (pytorch#866) * Fix failing FP6 benchmark (pytorch#931) * Remove two if statements in fp8 padding (pytorch#935) Reviewed By: vkuzo Differential Revision: D63051205 Pull Request resolved: pytorch#935 Approved by: https://github.com/vkuzo * [Distributed] Improve sharding example (pytorch#937) * [Distributed] Improve sharding example * Add comment * Add composable QAT quantizer (pytorch#938) Summary: This is a utility for users who wish to apply multiple QAT quantizers to their models. In the near future, we expect to add an embedding QAT quantizer that composes with the existing linear QAT quantizers. Test Plan: python test/quantization/test_qat.py -k test_composable_qat_quantizer * resolve conflict with latest main Differential Revision: D63048850 Pull Request resolved: pytorch#912 * Add torchchat quantizer Differential Revision: D62394341 Pull Request resolved: pytorch#897 * Add compile tests to test suite (pytorch#906) * Add compile tests to test suite Summary: This is a follow up PR addressing pytorch#839 (comment) We can add more compiler related tests in the future. Next * refactor a bit to use quantize_ API directly * use the test suite in existing API tests Test Plan: python torchao/testing/utils.py Reviewers: Subscribers: Tasks: Tags: * rename * add result check * Fix up CMakeLists and reorganize some code locations Differential Revision: D62711903 Pull Request resolved: pytorch#948 * [float8] all-reduce amax on dp mesh instead of global pg (pytorch#933) * [float8] all-reduce amax on dp mesh instead of global pg Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * liner Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * improve comments Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * move hp tensor inside if Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * linter Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * linter Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * linter Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * linter Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * linter Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * int8 dynamic quant + bsr support (pytorch#821) This PR, adds in int8 dynamicquant + bsr support. Changes: * Use i8i8 -> bf16 matmul to maintain accuracy * Added a block sparse layout type to AffineQuantizedTensor + check/impl. * Cleaned up benchmark.py script and add a single line `benchmark.sh` file for acceleration numbers * Updated eval.py and added a single line `evaluate.sh` file for accuracy numbers * Lots of lint formatting and README updates * torch.compile now working and is correct * fixing some issues with our support for 70/405B models (pytorch#941) Summary: download and convert scripts needed to be updated alongside model.py config files Test Plan: python generate.py --checkpoint_path ../../../checkpoints/meta-llama/Meta-Llama-3.1-70B/model.pth Reviewers: Subscribers: Tasks: Tags: * Update INT8 mixed-precision training test to be less flaky (pytorch#950) * Add executorch parallel Differential Revision: D62711909 Pull Request resolved: pytorch#953 * test CI Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * better comment on why upcasting Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * control seed Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * move unit test to test_compile Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * fix typo Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * float64 upcasting after allreduce Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * use LinearMMConfig Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: --------- Co-authored-by: andrewor14 <andrewor14@gmail.com> Co-authored-by: Vaishnavi Gupta <vaishnavi10367@gmail.com> Co-authored-by: Apurva Jain <apurvajain.kota@gmail.com> Co-authored-by: Jerry Zhang <jerryzh168@gmail.com> Co-authored-by: Ke Wen <kw2501@meta.com> Co-authored-by: Mark Saroufim <marksaroufim@meta.com> Co-authored-by: Vasiliy Kuznetsov <vkuzo@users.noreply.github.com> Co-authored-by: Thien Tran <gau.nernst@yahoo.com.sg> Co-authored-by: Tobias van der Werff <33268192+tobiasvanderwerff@users.noreply.github.com> Co-authored-by: Shuqi Yang <shuqiyang@meta.com> Co-authored-by: Scott Roy <161522778+metascroy@users.noreply.github.com> Co-authored-by: Jesse Cai <jessecai@meta.com> Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>

…dense_addmm" As in the title. Addresses the task in pytorch/ao#821 (comment) [ghstack-poisoned]

As in the title. Addresses the task in pytorch/ao#821 (comment) [ghstack-poisoned]

As in the title. Addresses the task in pytorch/ao#821 (comment) Pull Request resolved: #136626 Approved by: https://github.com/amjames, https://github.com/cpuhrsch

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 6, 2024

jerryzh168 reviewed Sep 10, 2024

View reviewed changes

pearu reviewed Sep 13, 2024

View reviewed changes

pearu reviewed Sep 14, 2024

View reviewed changes

pearu mentioned this pull request Sep 16, 2024

Add blocksparse_int_addmm. Eliminate unnecessary contiguous calls which leads to performance increase. #891

Draft

wip

9a0e918

jcaip force-pushed the jcaip/int8-bsr branch from b6fc991 to 9a0e918 Compare September 25, 2024 00:14

jcaip added 7 commits September 24, 2024 17:22

updated README

f3bebe9

update benchmarks

a7937f6

wip

898aa08

updated README and script

e304b9c

wip eval script

9662497

removed duplicated files

860123a

consolidated all args

a5575f9

update evaluate script

f9b0ca1

working now with bfloat16

e451f8c

jcaip added 2 commits September 25, 2024 03:28

more cleanup + results

606e88e

fix ruff

843cd5f

jcaip changed the title ~~[wip] int8 dynamic quant + bsr support~~ int8 dynamic quant + bsr support Sep 25, 2024

jcaip marked this pull request as ready for review September 25, 2024 10:34

pearu mentioned this pull request Sep 25, 2024

Add out_dtype kw argument to optimize_bsr_dense_addmm pytorch/pytorch#136626

Closed

pearu added a commit to pytorch/pytorch that referenced this pull request Sep 25, 2024

Update on "Add out_dtype kw argument to optimize_bsr_dense_addmm"

340d7ce

As in the title. Addresses the task in pytorch/ao#821 (comment) [ghstack-poisoned]

fix get_plain

5b7ca84

jcaip added 3 commits September 25, 2024 11:55

update eval script to capture encoder time

2fdcb45

update

5acff84

fix version guards

46aeb9b

jcaip force-pushed the jcaip/int8-bsr branch from ea64906 to 46aeb9b Compare September 26, 2024 00:06

fix version

a2519d8

cpuhrsch approved these changes Sep 26, 2024

View reviewed changes

jcaip merged commit 4b5b5ee into main Sep 26, 2024
17 checks passed

pearu added a commit to pytorch/pytorch that referenced this pull request Oct 21, 2024

Update base for Update on "Add out_dtype kw argument to optimize_bsr_…

7e31d28

…dense_addmm" As in the title. Addresses the task in pytorch/ao#821 (comment) [ghstack-poisoned]

pearu added a commit to pytorch/pytorch that referenced this pull request Oct 21, 2024

Update on "Add out_dtype kw argument to optimize_bsr_dense_addmm"

c8f347a

As in the title. Addresses the task in pytorch/ao#821 (comment) [ghstack-poisoned]

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

int8 dynamic quant + bsr support #821

int8 dynamic quant + bsr support #821

jcaip commented Sep 6, 2024 •

edited

Loading

pytorch-bot bot commented Sep 6, 2024 •

edited

Loading

cpuhrsch commented Sep 6, 2024

jerryzh168 Sep 10, 2024

jcaip Sep 10, 2024

cpuhrsch commented Sep 10, 2024

jcaip commented Sep 10, 2024

pearu left a comment

pearu Sep 13, 2024

pearu Sep 13, 2024

pearu Sep 14, 2024

pearu Sep 14, 2024

jcaip commented Sep 25, 2024

pearu commented Sep 25, 2024

jcaip commented Sep 25, 2024

pearu commented Sep 25, 2024

pearu commented Sep 25, 2024 •

edited

Loading

jcaip commented Sep 25, 2024

	return bsr_dense_mm(weight_bsr, A).t().contiguous()
	return bsr_dense_mm(weight_bsr, A).t()

int8 dynamic quant + bsr support #821

int8 dynamic quant + bsr support #821

Conversation

jcaip commented Sep 6, 2024 • edited Loading

pytorch-bot bot commented Sep 6, 2024 • edited Loading

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/821

✅ No Failures

cpuhrsch commented Sep 6, 2024

jerryzh168 Sep 10, 2024

Choose a reason for hiding this comment

jcaip Sep 10, 2024

Choose a reason for hiding this comment

cpuhrsch commented Sep 10, 2024

jcaip commented Sep 10, 2024

pearu left a comment

Choose a reason for hiding this comment

pearu Sep 13, 2024

Choose a reason for hiding this comment

pearu Sep 13, 2024

Choose a reason for hiding this comment

pearu Sep 14, 2024

Choose a reason for hiding this comment

pearu Sep 14, 2024

Choose a reason for hiding this comment

jcaip commented Sep 25, 2024

pearu commented Sep 25, 2024

jcaip commented Sep 25, 2024

pearu commented Sep 25, 2024

pearu commented Sep 25, 2024 • edited Loading

jcaip commented Sep 25, 2024

jcaip commented Sep 6, 2024 •

edited

Loading

pytorch-bot bot commented Sep 6, 2024 •

edited

Loading

pearu commented Sep 25, 2024 •

edited

Loading