Add static quantization as an example for calibration flow #487

jerryzh168 · 2024-07-08T23:19:33Z

Summary:
So far, the quantization flow API that we provide (`quantize_`) does not require calibration (running a model on sample data to collect statistics). This PR adds a static quantization example that demonstrates the calibration flow:

1. Prepare the model for calibration
2. Calibrate the prepared model with sample data
3. Convert the calibrated model to a quantized model
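
The three steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in `static_quant.py`: `ToyObservedLayer`, `prepare`, `calibrate`, and `convert` are hypothetical names, and the sketch uses a symmetric int8 activation scale for brevity, whereas the PR's example uses a `MinMaxObserver` with `torch.uint8` and per-tensor affine quantization.

```python
from typing import List, Optional


class ToyObservedLayer:
    """Hypothetical stand-in for a linear layer: y = weight * x per element."""

    def __init__(self, weight: float) -> None:
        self.weight = weight
        self.observing = False                   # True only while calibrating
        self.max_abs = 0.0                       # statistic gathered in step 2
        self.act_scale: Optional[float] = None   # fixed in step 3

    def __call__(self, xs: List[float]) -> List[float]:
        if self.observing:
            # Step 2: record the largest activation magnitude seen so far.
            self.max_abs = max([self.max_abs] + [abs(x) for x in xs])
        if self.act_scale is not None:
            # After step 3: quantize activations to int8 with the fixed scale,
            # then dequantize so the output is comparable to the float path.
            xs = [
                max(-128, min(127, round(x / self.act_scale))) * self.act_scale
                for x in xs
            ]
        return [self.weight * x for x in xs]


def prepare(layer: ToyObservedLayer) -> None:
    """Step 1: switch the layer into observing mode."""
    layer.observing = True


def calibrate(layer: ToyObservedLayer, batches: List[List[float]]) -> None:
    """Step 2: run sample data through the prepared layer."""
    for batch in batches:
        layer(batch)


def convert(layer: ToyObservedLayer) -> None:
    """Step 3: freeze the observed range into a static activation scale."""
    layer.observing = False
    layer.act_scale = max(layer.max_abs / 127, 1e-12)  # avoid a zero scale


layer = ToyObservedLayer(weight=2.0)
prepare(layer)
calibrate(layer, [[-1.0, 0.5], [0.0, 3.0]])
convert(layer)
quantized_out = layer([0.5, 3.0])
```

The point of the split is that the activation scale is computed once from sample data and then reused at inference time, unlike dynamic quantization, which recomputes it per input.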

Test Plan:
python torchao/prototype/calibration_flow/static_quant.py

pytorch-bot · 2024-07-08T23:19:36Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/487

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 780c1f9 with merge base aef7e09:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

jerryzh168 · 2024-07-08T23:23:55Z

cc @drisspg @vkuzo this flow can be used for smoothquant as well, probably also float8

also QAT for static quant, cc @andrewor14

vkuzo · 2024-07-09T02:27:57Z

torchao/prototype/calibration_flow/static_quant.py

+    replacement_fn = lambda m: QuantizedLinear.from_calibrating(m)
+    _replace_with_custom_fn_if_matches_filter(model, replacement_fn, _is_calibrating_linear)
+
+act_obs = MinMaxObserver(dtype=torch.uint8, qscheme=torch.per_tensor_affine).to("cuda")


I'd vote for rewriting this without old concepts like qscheme, instead of trying to reuse the existing code.

Agreed, this is temporary. The end goal is to implement a generic observer for blockwise quantization.

What's the blocker for getting rid of the torch.ao dependency now? Are there toy observers we can include for now?

I think we want to eventually define a general observer that works with the new quant primitives, so that we can replace this.

We can define a toy observer, I think, although I'm not sure why we want to get rid of the torch.ao dependency since we already depend on pytorch.

It's more that torchao is supposed to be a full replacement. I'd make an exception if some piece of code were very large and complex, but that doesn't seem to be the case here.
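
For reference, a toy observer of the kind discussed above could look roughly like the following. This is a dependency-free sketch, not a proposed torchao API: `ToyMinMaxObserver` and its methods are hypothetical names, it works on plain Python floats rather than tensors, and it takes explicit `quant_min`/`quant_max` bounds instead of a qscheme.

```python
from typing import Iterable, Tuple


class ToyMinMaxObserver:
    """Tracks the running min/max of observed values (hypothetical helper)."""

    def __init__(self, quant_min: int = 0, quant_max: int = 255) -> None:
        if quant_min >= quant_max:
            raise ValueError("quant_min must be smaller than quant_max")
        self.quant_min = quant_min
        self.quant_max = quant_max
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, values: Iterable[float]) -> None:
        """Update the running range with one batch of values."""
        batch = list(values)
        if not batch:
            return
        self.min_val = min(self.min_val, min(batch))
        self.max_val = max(self.max_val, max(batch))

    def calculate_qparams(self, eps: float = 1e-12) -> Tuple[float, int]:
        """Return (scale, zero_point) for per-tensor affine quantization."""
        if self.min_val > self.max_val:
            raise ValueError("observe() must be called before calculate_qparams()")
        # Extend the range to include 0.0 so that zero is exactly representable.
        lo = min(self.min_val, 0.0)
        hi = max(self.max_val, 0.0)
        scale = max((hi - lo) / (self.quant_max - self.quant_min), eps)
        zero_point = self.quant_min - round(lo / scale)
        zero_point = max(self.quant_min, min(self.quant_max, zero_point))
        return scale, zero_point


obs = ToyMinMaxObserver(quant_min=0, quant_max=255)
obs.observe([-1.0, 0.5])
obs.observe([0.0, 3.0])
scale, zero_point = obs.calculate_qparams()
```

A real replacement would also need to handle tensors, devices, and block sizes; the sketch only shows that the min/max bookkeeping itself is small.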

jerryzh168 · 2024-07-10T18:27:46Z

@vkuzo @msaroufim I added a comment to update observers later, please take a look again

msaroufim · 2024-07-15T20:28:17Z

torchao/dtypes/affine_quantized_tensor.py

+        quant_min: Optional[int] = None,
+        quant_max: Optional[int]  = None,
+        zero_point_domain: ZeroPointDomain = ZeroPointDomain.INT,
+        extended_layout: str = "plain",


this seems like it's using the old API? So I'm guessing you're landing this PR first?

yeah, we can land the other PR first and then update this

msaroufim · 2024-07-15T20:29:53Z

torchao/dtypes/affine_quantized_tensor.py

+        original_shape = input_float.shape
+        if extended_layout == "tensor_core_tiled":
+            orig_out_features, orig_in_features = input_float.shape
+            in_features = find_multiple(orig_in_features, 1024)


Re the 1024 and 8 padding heuristics: this is what the NVIDIA docs say: https://x.com/marksaroufim/status/1621580671776092160

So this heuristic tends to be dtype- and device-dependent. It's possible that 1024 and 8 are fine, but that would be right mostly by luck.

This should only be used by the tinygemm use case, I think, so we should probably add some verification.
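
To make the padding heuristic concrete, here is a small sketch of rounding a weight shape up to fixed multiples. It assumes that `find_multiple(n, k)` returns the smallest multiple of `k` that is at least `n`, and that the 8 mentioned above applies to `out_features`; neither is shown in the hunk. `round_up_to_multiple` and `padded_shape` are hypothetical helpers, not torchao functions.

```python
from typing import Tuple


def round_up_to_multiple(n: int, k: int) -> int:
    """Smallest multiple of k that is >= n (assumed find_multiple behavior)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return ((n + k - 1) // k) * k


def padded_shape(
    out_features: int,
    in_features: int,
    out_multiple: int = 8,      # assumption: 8 pads the output dimension
    in_multiple: int = 1024,    # taken from the diff hunk above
) -> Tuple[int, int]:
    """Shape of the weight after padding both dimensions."""
    return (
        round_up_to_multiple(out_features, out_multiple),
        round_up_to_multiple(in_features, in_multiple),
    )


# A weight that is already aligned is left unchanged; others grow.
aligned = padded_shape(4096, 4096)
unaligned = padded_shape(11, 1000)
```

Because the multiples are hard-coded defaults here, the sketch also shows where a dtype- or device-specific choice would need to be plugged in.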

msaroufim · 2024-07-15T20:33:59Z

torchao/prototype/calibration_flow/static_quant.py

+    replacement_fn = lambda m: QuantizedLinear.from_calibrating(m)
+    _replace_with_custom_fn_if_matches_filter(model, replacement_fn, _is_calibrating_linear)
+
+act_obs = MinMaxObserver(dtype=torch.uint8, qscheme=torch.per_tensor_affine).to("cuda")


What's the blocker for getting rid of the torch.ao dependency now? Are there toy observers we can include for now?

msaroufim · 2024-07-15T20:36:02Z

torchao/prototype/calibration_flow/static_quant.py

+    m(*example_inputs)
+
+after_obs = m(*example_inputs)
+to_quantized_(m)


Not sure I love the to_calibrating and to_quantized names. Why not calibrate() and quantize()?

I think calibrate is the name for the calibration process rather than for the model transformation step, and quantize is already used by the weight-only and dynamic quant flows.

msaroufim

Approving, assuming we move the static_quant.py file to either docs or tutorials/ and do a fast follow to remove the torch.ao dependency.

As far as renaming:

to_calibrating -> insert_observers()
to_quantized -> quantize_()

jerryzh168 · 2024-07-17T01:25:13Z

torchao/quantization/quant_api.py

@@ -259,12 +259,12 @@ def insert_subclass(lin):

    return insert_subclass

-def quantize_(model: torch.nn.Module, apply_tensor_subclass: Callable[[torch.Tensor], torch.Tensor], filter_fn: Optional[Callable[[torch.nn.Module, str], bool]]=None, set_inductor_config: bool=True):
+def quantize_(model: torch.nn.Module, apply_tensor_subclass: Callable[[torch.nn.Module], torch.nn.Module], filter_fn: Optional[Callable[[torch.nn.Module, str], bool]]=None, set_inductor_config: bool=True):


@msaroufim I made some changes to apply_tensor_subclass to accommodate static quant use cases, please take a look again
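
One way to read the signature change above: a callable that receives the whole module can use state stored on that module, such as statistics collected during calibration, which a tensor-to-tensor callable applied to the weight alone cannot see. That interpretation is an assumption, not something stated in the PR. The sketch below uses hypothetical stand-ins (`ToyModule`, `ToyLinear`, `ToyQuantizedLinear`, `toy_quantize_`) rather than `torch.nn.Module` or the real `quantize_`.

```python
from typing import Callable, Dict, Optional


class ToyModule:
    """Hypothetical stand-in for torch.nn.Module: a named tree of children."""

    def __init__(self, **children: "ToyModule") -> None:
        self.children: Dict[str, "ToyModule"] = dict(children)


class ToyLinear(ToyModule):
    def __init__(self, weight: float, act_scale: Optional[float] = None) -> None:
        super().__init__()
        self.weight = weight
        self.act_scale = act_scale  # e.g. recorded during calibration


class ToyQuantizedLinear(ToyModule):
    def __init__(self, weight: float, act_scale: Optional[float]) -> None:
        super().__init__()
        self.weight = weight
        self.act_scale = act_scale


def toy_quantize_(
    model: ToyModule,
    apply_fn: Callable[[ToyModule], ToyModule],
    filter_fn: Callable[[ToyModule, str], bool],
    prefix: str = "",
) -> None:
    """Replace, in place, every child accepted by filter_fn with apply_fn(child)."""
    for name, child in list(model.children.items()):
        fqn = f"{prefix}.{name}" if prefix else name
        if filter_fn(child, fqn):
            model.children[name] = apply_fn(child)
        else:
            toy_quantize_(child, apply_fn, filter_fn, fqn)


def to_static(module: ToyModule) -> ToyModule:
    # The callable sees the module, so it can read the calibrated act_scale.
    assert isinstance(module, ToyLinear)
    return ToyQuantizedLinear(module.weight, module.act_scale)


model = ToyModule(
    block=ToyModule(fc=ToyLinear(0.5, act_scale=0.02)),
    head=ToyLinear(2.0, act_scale=0.1),
)
toy_quantize_(model, to_static, lambda m, fqn: isinstance(m, ToyLinear))
```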

facebook-github-bot added the CLA Signed label Jul 8, 2024

jerryzh168 requested review from andrewor14, msaroufim, jcaip and HDCharles July 8, 2024 23:23

jerryzh168 force-pushed the static branch from 478d14a to 5eddaa3 Compare July 8, 2024 23:25

jerryzh168 requested review from vkuzo and drisspg July 8, 2024 23:26

vkuzo reviewed Jul 9, 2024

vkuzo reviewed Jul 9, 2024

jerryzh168 force-pushed the static branch from 5eddaa3 to e4f3e74 Compare July 10, 2024 18:13

jerryzh168 requested a review from vkuzo July 10, 2024 18:27

drisspg mentioned this pull request Jul 10, 2024

[RFC] Float8 Inference pytorch-labs/float8_experimental#314

Closed

msaroufim requested changes Jul 15, 2024

msaroufim approved these changes Jul 16, 2024

jerryzh168 force-pushed the static branch from e4f3e74 to b63fae5 Compare July 17, 2024 01:22

jerryzh168 commented Jul 17, 2024

jerryzh168 force-pushed the static branch from b63fae5 to 780c1f9 Compare July 17, 2024 02:03

jerryzh168 merged commit 6dd82d8 into pytorch:main Jul 17, 2024
13 checks passed

jerryzh168 deleted the static branch July 17, 2024 20:02

jerryzh168 mentioned this pull request Jul 19, 2024

Refactor smoothquant implementation to use tensor subclasses #528

Open

drisspg mentioned this pull request Jul 30, 2024

[RFC]: Float8 Inference #574

Open
