-
Notifications
You must be signed in to change notification settings - Fork 177
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for AQTLayout
, PlainAQTLayout
and TensorCoreTiledAQTLayout
#278
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/278
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit 4758982 with merge base 62745fc (): This comment was automatically generated by Dr. CI and updates every 15 minutes. |
4d232b8
to
5dd3200
Compare
AQTStorage
and PlainAQTStorage
AQTStorage
, PlainAQTStorage
and TiledAQTStorage
AQTStorage
, PlainAQTStorage
and TiledAQTStorage
AQTStorage
, PlainAQTStorage
and TensorCoreTiledAQTStorage
Nice! Nit: Let's please call it Layout :) |
Summary: Today `AffineQuantizedTensor` has hardcoded storage format of `int_data`, `scale`, `zero_point`. But this does not work if we want to support packed weight. In this PR, we added support to hide the storage details for `AffineQuantizedTensor` in a family of tensor subclasses, all should inherit from the base Storage type: `AQTStorage` (affine quantized tensor storage) This PR just added support for a plain storage tensor (`PlainAQTStorage`) that stores `int_data`, `scale` and `zero_point` tensors directly, in the next PR we'll also support storing packed weight (result of `torch.ops.aten._convert_weight_to_int4pack`) in a different type of `AQTStorage`. `AffineQuantizedTensor` will have the following: - storage_tensor: AQTStorage (can store data of different storage formats) - storage_layout: str (a string represents the type of storage_tensor we have, can be used in dispatch) Test Plan: python test/quantization/test_quant_api.py Reviewers: Subscribers: Tasks: Tags:
AQTStorage
, PlainAQTStorage
and TensorCoreTiledAQTStorage
AQTLayout
, PlainAQTLayout
and TensorCoreTiledAQTLayout
@@ -496,7 +496,7 @@ def test_quantized_tensor_subclass_int8_dyn_quant(self): | |||
m = ToyLinearModel(1024, 1024, 1024).eval().to(torch.bfloat16).to("cuda") | |||
m_copy = copy.deepcopy(m) | |||
# setting batch_size to 20 to be compatible with the kernel | |||
example_inputs = tuple(map(lambda x: x.to(torch.bfloat16).to("cuda"), m.example_inputs(batch_size=20))) | |||
example_inputs = m.example_inputs(batch_size=20, dtype=torch.bfloat16, device="cuda") | |||
m = quantize(m, get_apply_int8dyn_quant()) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any tests explicitly testing TensorCoreTiledAQTLayout
self.int_data = int_data | ||
self.scale = scale | ||
self.zero_point = zero_point | ||
self.layout_tensor = layout_tensor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the benefit of group the above 3 configs into the same object but I don't think I'm quite clear on why this needs to be a subclass? Is it because we need to override the behavior of detach to do bit unpacking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With a non-standard layout, I don't doubt the need for a subclass (since you need to override the behavior of operations to know how to deal with the layout), but I doubt the need for a wrapper subclass, just do a basic storage and have all the smarts in the tensor. That's how non-contiguous layouts work on dense tensor!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ezyang IIUC you mean we just put the layout related logic to tensor subclass itself and create different tensor subclasses for different layout?
the problem here is that we still want the same tensor (AffineQuantizedTensor) but can be packed into different formats (different storage)
btw, this is just a use case I want to make sure you are aware, not a feature request since we can workaround with tensor subclass, even though not that "naturally"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the tensor is repacked multiple ways, what kind of lifetime do you have for the packing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh sorry, I meant we still use the same tensor subclass, not the same tensor instance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ezyang - the analogy here is changing from strided to COO to BSR to strided etc.
Essentially we want to introduce new layouts for this Tensor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will also be useful for https://github.com/pytorch/ao/blob/4c1d568263582db3e48299d2eba78d8ad1f6274d/torchao/dtypes/uint4.py to implement different bitpacking formats (different layouts).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are many ways to skin this cat. But the way I would think about it is that if you want a bunch of different behaviors to live in a single class (no extra subclass hierarchy), well, you can't use the class dispatch mechanism to dispatch them. So you need your own dispatch mechanism. Luckily, in Python, that's pretty easy to build.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, agreed. Coming up with a good way to skin this cat could help us make it a generic recipe for PyTorch users. @jerryzh168 and I have been chatting a bit about it.
@@ -348,7 +348,7 @@ def apply_int4wo_quant(weight): | |||
preserve_zero = False | |||
zero_point_dtype = torch.bfloat16 | |||
zero_point_domain = ZeroPointDomain.FLOAT | |||
return to_aq(weight, mapping_type, block_size, target_dtype, quant_min, quant_max, eps, zero_point_dtype=zero_point_dtype, preserve_zero=preserve_zero, zero_point_domain=zero_point_domain) | |||
return to_aq(weight, mapping_type, block_size, target_dtype, quant_min, quant_max, eps, zero_point_dtype=zero_point_dtype, preserve_zero=preserve_zero, zero_point_domain=zero_point_domain, extended_layout="tensor_core_tiled") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@msaroufim "tensor_core_tiled" is tested here
…Layout` (pytorch#278) Summary: Today `AffineQuantizedTensor` has hardcoded storage format of `int_data`, `scale`, `zero_point`. But this does not work if we want to support packed weight. In this PR, we added support to hide the storage details for `AffineQuantizedTensor` in a family of tensor subclasses, all should inherit from the base Storage type: `AQTStorage` (affine quantized tensor storage) This PR just added support for a plain storage tensor (`PlainAQTStorage`) that stores `int_data`, `scale` and `zero_point` tensors directly, in the next PR we'll also support storing packed weight (result of `torch.ops.aten._convert_weight_to_int4pack`) in a different type of `AQTStorage`. `AffineQuantizedTensor` will have the following: - storage_tensor: AQTStorage (can store data of different storage formats) - storage_layout: str (a string represents the type of storage_tensor we have, can be used in dispatch) Test Plan: python test/quantization/test_quant_api.py Reviewers: Subscribers: Tasks: Tags:
Summary:
Today
AffineQuantizedTensor
has hardcoded storage format ofint_data
,scale
,zero_point
. But this does not work if we want to support packed weight. In this PR, we added support to hide the storage details forAffineQuantizedTensor
in a family of tensor subclasses, all should inherit from the base Layout storage type:AQTLayout
(affine quantized tensor storage). This is part of the project of refactoringAffineQuantizedTensor
to be a proper clean API for both developers (as a showcase of how people can add new dtypes/quantization techniques) and modeling users.This PR just added support for
PlainAQTLayout
) that storesint_data
,scale
andzero_point
tensors directlyTensorCoreTiledAQTLayout
) storing packed weight (result oftorch.ops.aten._convert_weight_to_int4pack
) and the combinedscale_and_zero
tensor, for now we only have numeric checks, we'll do performance checks on example models laterthis improves performance for current AffineQuantizedTensor because we are doing packing beforehand, instead of on the fly.
AffineQuantizedTensor
will have the following:Users can add a new type of layout tensor for a specific layout with:
Target Audience:
As mentioned before,
(1). this serves as an example for developers who are developing new quantization techniques/dtypes, and who also need packing / special storage layout for the weight, they can copy paste what we are doing here and apply to their own use cases
(2). for modeling users, packing weight before hand would give a performance boost to the model, but they'll need to specify the specific extended_layout they want to pack into, e.g.
tensor_core_tiled
Test Plan:
python test/quantization/test_quant_api.py
Reviewers:
Subscribers:
Tasks:
Tags: