int8 dynamic quant + bsr support #821

Merged: 17 commits merged into main on Sep 26, 2024
Conversation

@jcaip (Contributor) commented on Sep 6, 2024

This PR, based on #891, adds int8 dynamic quant + BSR support. It also contains a fairly large refactor of the superblock code, mainly removing duplicated code. There's still some work to be done, but this PR is already getting pretty big, so I want to check it in.

Changes:

  • Use i8i8 -> bf16 matmul to maintain accuracy
  • Added a block sparse layout type to AffineQuantizedTensor + check/impl (a usage sketch follows this list)
  • Cleaned up the benchmark.py script and added a single-line benchmark.sh script for acceleration numbers
  • Updated eval.py and added a single-line evaluate.sh script for accuracy numbers
  • Lots of lint formatting and README updates
  • torch.compile now working
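
For context, a rough sketch of how the new block-sparse layout might be composed with int8 dynamic quant through torchao's quantize_ API. The constructor and keyword names here (layout_type, blocksize, the import path for BlockSparseLayoutType) are illustrative assumptions, not the final API:

import torch
from torchao.quantization import quantize_, int8_dynamic_activation_int8_weight
from torchao.dtypes import BlockSparseLayoutType  # import path assumed for illustration

model = torch.nn.Sequential(torch.nn.Linear(1280, 5120)).cuda().to(torch.bfloat16)

# Dynamically quantize activations + weights to int8, storing the weight in the
# block-sparse (BSR) layout added by this PR; 64 matches the --bsr blocksize used below.
quantize_(
    model,
    int8_dynamic_activation_int8_weight(layout_type=BlockSparseLayoutType(blocksize=64)),
)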

TODO:

  • move sparse layouts to their own folder
  • promote blocksparse.py from prototype and add some tests
  • move evaluate.py/benchmark.py out of superblock and into the top level folders?
  • move superblock.py into training folder?

This PR is a prototype of adding int8 dynamic quant with BSR support.

Currently it works, but is not performant (1.6 img/s vs. the 2.0 img/s baseline on ViT-H).

This is because the block sparse mm is added as a custom op, since we cannot trace through the creation of a sparse BSR tensor in torch.compile.
Instead, we created a custom op that takes in the data tensors of a BSR tensor (crow_indices, col_indices, values) and a wrapper subclass that holds these values. See here.
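
A minimal sketch of that custom-op approach, assuming torch.library.custom_op is available; the op name, signature, and fake-kernel shapes are illustrative, not the exact code in this PR:

import torch
from torch.sparse._triton_ops import bsr_dense_mm

@torch.library.custom_op("blocksparse_sketch::int_mm", mutates_args=())
def blocksparse_int_mm_sketch(
    crow_indices: torch.Tensor,
    col_indices: torch.Tensor,
    values: torch.Tensor,
    M: int,
    K: int,
    A: torch.Tensor,
) -> torch.Tensor:
    # Rebuild the BSR tensor from its plain components inside the op body, so
    # torch.compile only sees an opaque call instead of sparse-tensor construction.
    w_bsr = torch.sparse_bsr_tensor(crow_indices, col_indices, values, size=(M, K))
    return bsr_dense_mm(w_bsr, A)

@blocksparse_int_mm_sketch.register_fake
def _(crow_indices, col_indices, values, M, K, A):
    # Shape-only stub for tracing: (M, K) @ (K, N) -> (M, N).
    return A.new_empty((M, A.shape[-1]))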

This causes an issue for the quantization workflow, since we cannot fuse across the custom op. It's pretty much the same issue we face with 2:4 sparsity, where we materialize the full int32 intermediate matrix, which eats into our composed quant + sparse speedups.

Normally we avoid materializing the int32 intermediate matrix by fusing the dequant into the matmul, but we can't do so here because we use a custom op. I think the long-term solution is to rip the BSR triton code out of core and host it in AO; in that case we can change the function to take the data tensors themselves instead of a sparse BSR tensor, avoiding the need for a custom op.

The good news: accuracy is not degraded at all when composing int8 quantization and block sparsity.

cc @cpuhrsch

pytorch-bot commented on Sep 6, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/821

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit a2519d8 with merge base 7dff17a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the "CLA Signed" label on Sep 6, 2024.
@cpuhrsch (Contributor) commented on Sep 6, 2024

cc @pearu

We could do epilogue fusion in core if we're adding _int_bsr_dense_addmm as a Template to https://github.com/pytorch/pytorch/tree/196748d49193eed2b000aa99b2c7bb1bff576d35/torch/_inductor/kernel .

We can do that similar to how we did it for _int_mm: https://github.com/pytorch/pytorch/pull/111125/files#diff-e1a7b1be52a02ad5595c2f38c1e5c1ac3086fd49b6638482ae192249f070f4dc

@@ -613,6 +620,163 @@ def from_plain(
int_data_compressed = torch._cslt_compress(int_data)
return cls(int_data_compressed, scale, zero_point, layout_type)

@register_layout_cls(BlockSparseLayoutType)
Review comment (Contributor):

@jcaip do you want to move these to the sparsity folder? Now we should be able to do everything outside of affine_quantized_tensor after #783

(meaning: register the layout_type/layout tensor and the dispatch/impl for quantized linear)

@jcaip (Contributor, Author) replied:

No, not for this PR at this time. I think we should resolve the perf issues before we merge.

@cpuhrsch (Contributor) commented:
I wonder if a sufficiently large batch size would eventually show a gain even without epilogue fusion.

@jcaip (Contributor, Author) commented on Sep 10, 2024

cc @cpuhrsch It's possible; the ImageNet benchmark isn't the kindest one either, since we can't quantize the attention layers. We may be able to see perf gains with SAM as well. I added this as a hackathon project for CUDA MODE, so maybe we can pick up a contributor there.

@pearu left a comment:

When running

python benchmark.py --model vit_h_14   --batch-size 256   --sparsity-linear 0.8   --sp-linear-tile-size 64 --bsr 64 --sparsity bsr --quantization

notice the warning messages:

torch/sparse/_triton_ops.py:795: UserWarning: bsr_dense_addmm uses non-optimal triton kernel parameters for M=5120 K=1280 N=65792 Ms=64, Ks=64 beta=0 alpha=1
  warn_once(
torch/sparse/_triton_ops.py:795: UserWarning: bsr_dense_addmm uses non-optimal triton kernel parameters for M=1280 K=5120 N=65792 Ms=64, Ks=64 beta=0 alpha=1

When using the tuned parameters, the performance increase is about 0.1 img/s.

I'll create a torch PR with the corresponding tuned parameters shortly.


# original_batch_dims_broadcasted = broadcast_batch_dims("_int_bsr_dense_addmm", weight_bsr, A)
# input = torch.zeros(M, N, dtype=torch.int32, device=A.device)
return bsr_dense_mm(weight_bsr, A).t().contiguous()
pearu suggested a change:
Suggested change
return bsr_dense_mm(weight_bsr, A).t().contiguous()
return bsr_dense_mm(weight_bsr, A).t()

That should increase the performance of

benchmark.py --model vit_h_14   --batch-size 256   --sparsity-linear 0.8   --sp-linear-tile-size 64 --bsr 64 --sparsity bsr --quantization

by ~0.9 img/s.

w_vals = weight_tensor.layout_tensor
w_scales = weight_tensor.layout_tensor.scale
tmp = x_vals_int8.reshape(-1, x_vals_int8.shape[-1])
tmp_t = tmp.t().contiguous()
pearu suggested a change:
Suggested change
tmp_t = tmp.t().contiguous()
tmp_t = tmp.t()

to increase performance by about 0.2 img/s.

Comment on lines 1200 to 1214
y = torch.ops.blocksparse.int_mm(w_vals.crow_indices(),
w_vals.col_indices(),
w_vals.values(),
w_vals.shape[0],
w_vals.shape[1],
tmp_t)

# breakpoint()


y = x_scales.reshape(-1, 1) * y

y = (y * w_scales).reshape(
*x_vals_int8.shape[:-1], y.shape[-1]
)
pearu commented:
TODO: move scaling operations to bsr_dense_addmm kernel.

@jcaip (Contributor, Author) commented on Sep 25, 2024

cc @pearu @cpuhrsch

I've pulled in your changes to this PR here, and I'm seeing some nice speedups! Unfortunately, I think I'm running into saturation issues when doing the i8i8 -> i8 BSR matmul.

This is causing poor accuracy when running BSR with quantization; evaluate.sh should print out the results. I'm seeing 0 accuracy when running with BSR + quant. I can reproduce this in the test too: when I bump up the matrix sizes, the results I'm seeing for the quantized mm are quite different from the reference result:

[Screenshot: quantized mm output diverging from the reference result]

We ran into something similar with cuSPARSELt and solved it by adding support for an i8i8 -> bf16 matmul. Would it be possible to add something similar to the BSR kernels? Something like an out_dtype option would work.

@pearu commented on Sep 25, 2024

Would it be possible to add something similar to the BSR kernels? Something like an out_dtype option would work.

We have it, in the form of the out argument. In blocksparse_int_addmm, try changing

out = A.new_empty(original_batch_dims_broadcasted + (M, N))

to

out = A.new_empty(original_batch_dims_broadcasted + (M, N), dtype=torch.int32)

that should resolve the saturation problem. I didn't use bfloat16 as the out dtype because the Triton dot accumulator uses int32.

Notice that we don't have tuned parameters for bsr_dense_addmm(<int8>) -> int32, so the performance is expected to be lower. Using --tune-kernel-params will not help either, because optimize_bsr_dense_addmm assumes that out.dtype is int8 if dtype is int8.
There is an easy fix though: we'll need to add an out_dtype parameter to optimize_bsr_dense_addmm. I'll cook up a PR for this.
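
A hedged sketch of the suggestion above; the (input, bsr, dense) argument order of torch.sparse._triton_ops.bsr_dense_addmm and the wrapper name are assumptions, and the only change actually being proposed is the dtype of the allocated out buffer:

import torch
from torch.sparse._triton_ops import bsr_dense_addmm

def blocksparse_int_addmm_sketch(A, weight_bsr, out_dtype=torch.int32):
    M = weight_bsr.shape[-2]
    N = A.shape[-1]
    # Before: out = A.new_empty((M, N))  -> int8 output buffer, saturates.
    # After: allocate the output in a wider dtype so int8 x int8 accumulation survives.
    out = A.new_empty((M, N), dtype=out_dtype)
    # beta=0 matches the warning logs above, so `input` only carries shape/device here.
    dummy_input = torch.zeros((M, N), dtype=out_dtype, device=A.device)
    return bsr_dense_addmm(dummy_input, weight_bsr, A, beta=0, alpha=1, out=out)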

@jcaip (Contributor, Author) commented on Sep 25, 2024

Nice, that makes sense. I still run into numerical issues with torch.int32, but passing in torch.bfloat16 seems to fix those :)

i8i8->bf16 eval:

Test:  Acc@1 77.087 Acc@5 105.867
vit_b_16,256,bfloat16,bsr,64,0.8,True,77.08733974358974

i8i8->i8/32 eval:

Test:  Acc@1 0.495 Acc@5 1.945                                                                                       
vit_b_16,256,bfloat16,bsr,64,0.8,True,0.4947916666666667 

Will spend some time benchmarking perf and closing out the PR tomorrow, but preliminarily I'm seeing 436.59 ms for bf16 vs. 362.064 ms for int8.

@jcaip changed the title from "[wip] int8 dynamic quant + bsr support" to "int8 dynamic quant + bsr support" on Sep 25, 2024
@jcaip marked this pull request as ready for review on September 25, 2024 at 10:34
pearu added a commit to pytorch/pytorch that referenced this pull request Sep 25, 2024
As in the title.

Addresses the task in pytorch/ao#821 (comment)
@pearu commented on Sep 25, 2024

There is an easy fix though: we'll need to add an out_dtype parameter to optimize_bsr_dense_addmm. I'll cook up a PR for this.

Done in pytorch/pytorch#136626

@pearu commented on Sep 25, 2024

I still run into numerical issues with torch.int32

Can you provide a simple reproducer?

What are the sizes of the inputs to bsr_dense_addmm? In the worst-case scenario (all entries have magnitude 2**7), the maximal reduction size for int8 @ int8 -> int32 before the int32 accumulator could overflow is about 2**31 / 2**14 = 131072.
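
A quick back-of-the-envelope restatement of that bound in plain Python:

# Worst case: every int8 entry has magnitude 2**7, so each product is 2**14.
# An int32 accumulator holds values up to about 2**31, so roughly
# 2**31 / 2**14 terms can be summed before overflow becomes possible.
max_product = (2 ** 7) * (2 ** 7)        # 16384
safe_reduction = (2 ** 31) // max_product
print(safe_reduction)                    # 131072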

@jcaip (Contributor, Author) commented on Sep 25, 2024

@pearu

I don't think it is a saturation issue, but rather a numerical one. I think the outputs of the layer are small values close to 0. When we run with an int intermediate, these values all get quantized to 0, which leads to the model spitting out all zeros and ruins the model accuracy.

If you change the dtype to int32 in my branch, you can run sh evaluate.sh after setting IMAGENET_PATH to reproduce.

@jcaip jcaip merged commit 4b5b5ee into main Sep 26, 2024
17 checks passed
weifengpy pushed a commit to weifengpy/ao that referenced this pull request Sep 26, 2024
This PR adds int8 dynamic quant + BSR support.

Changes:
* Use i8i8 -> bf16 matmul to maintain accuracy
* Added a block sparse layout type to AffineQuantizedTensor + check/impl.
* Cleaned up the benchmark.py script and added a single-line `benchmark.sh` script for acceleration numbers
* Updated eval.py and added a single-line `evaluate.sh` script for accuracy numbers
* Lots of lint formatting and README updates
* torch.compile now working and is correct
weifengpy added a commit that referenced this pull request Oct 1, 2024
…th torch.compile (#904)

* [float8] improve eager numerics for dynamic scales
* leave torch.linalg.vector_norm for another PR
* cuda
* remove _data and investigate
* remove _data comment
* upcast to float32 is enough
* explain why float32
* _data parity
* handle sm8.9
* fix transformer unit test
* print if error
* Add tutorial for trainable tensor subclass (#908)
* Introducing 1-bit quantization for Llama in torchchat (#910)
* Rename Floating point to fp8 (#909)
* [float8] fix typo in bitwise_identical unit test (#918)
* Adding example for quantized tensor + tensor parallelism (#785)
* rename cuda mode -> gpu mode (#925)
* Add workaround to recover the perf for quantized vit in torch.compile (#926)
* clean up device checks in float8 unit test files (#923)
* [low-bit optim] Change 8-bit and FP8 optim block size from 2048 to 256 to match new bnb v0.44 (#927)
* Float8 autoquant weight only (#866)
* Fix failing FP6 benchmark (#931)
* Remove two if statements in fp8 padding (#935)
* [Distributed] Improve sharding example (#937)
* Add composable QAT quantizer (#938)
* resolve conflict with latest main (#912)
* Add torchchat quantizer (#897)
* Add compile tests to test suite (#906)
* Fix up CMakeLists and reorganize some code locations (#948)
* [float8] all-reduce amax on dp mesh instead of global pg (#933)
* int8 dynamic quant + bsr support (#821)
* fixing some issues with our support for 70/405B models (#941)
* Update INT8 mixed-precision training test to be less flaky (#950)
* Add executorch parallel (#953)
* test CI, better comment on why upcasting, control seed, move unit test to test_compile, fix typo, float64 upcasting after allreduce, use LinearMMConfig

---------

Co-authored-by: andrewor14 <andrewor14@gmail.com>
Co-authored-by: Vaishnavi Gupta <vaishnavi10367@gmail.com>
Co-authored-by: Apurva Jain <apurvajain.kota@gmail.com>
Co-authored-by: Jerry Zhang <jerryzh168@gmail.com>
Co-authored-by: Ke Wen <kw2501@meta.com>
Co-authored-by: Mark Saroufim <marksaroufim@meta.com>
Co-authored-by: Vasiliy Kuznetsov <vkuzo@users.noreply.github.com>
Co-authored-by: Thien Tran <gau.nernst@yahoo.com.sg>
Co-authored-by: Tobias van der Werff <33268192+tobiasvanderwerff@users.noreply.github.com>
Co-authored-by: Shuqi Yang <shuqiyang@meta.com>
Co-authored-by: Scott Roy <161522778+metascroy@users.noreply.github.com>
Co-authored-by: Jesse Cai <jessecai@meta.com>
Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
melvinebenezer pushed a commit to melvinebenezer/ao that referenced this pull request Oct 3, 2024
int8 dynamic quant + bsr support (#821)
melvinebenezer pushed a commit to melvinebenezer/ao that referenced this pull request Oct 7, 2024
…th torch.compile (pytorch#904) — same commit message as the weifengpy commit above.
pearu added a commit to pytorch/pytorch that referenced this pull request Oct 21, 2024
…dense_addmm"

As in the title.

Addresses the task in pytorch/ao#821 (comment)
pearu added a commit to pytorch/pytorch that referenced this pull request Oct 21, 2024
As in the title.

Addresses the task in pytorch/ao#821 (comment)
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Oct 22, 2024
SamGinzburg pushed a commit to pytorch/pytorch that referenced this pull request Oct 28, 2024