
Expose hqq through uintx_weight_only API #786

Merged: 1 commit merged into pytorch:main from the expose_hqq branch on Sep 6, 2024

Conversation

@jerryzh168 (Contributor) commented Aug 31, 2024:

Summary:
att, this is a follow-up to #605 to make hqq available in the quantize_ API:

quantize_(model, uintx_weight_only(dtype, group_size, use_hqq=True))

which will use TensorCoreTiledLayoutType for uint4 and PlainLayoutType for other dtypes.
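
For reference, a minimal end-to-end sketch of the call above (the toy model, import paths, and exact dtype handle are illustrative assumptions, not part of this PR):

# Minimal usage sketch; assumes a torchao build that includes this PR, a torch
# version that exposes torch.uint4, and a CUDA device for the tinygemm path.
import torch
from torchao.quantization import quantize_, uintx_weight_only

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024, bias=False)).to(torch.bfloat16).cuda()

# uint4 picks the tensor-core tiled layout; other uintx dtypes fall back to the plain layout.
quantize_(model, uintx_weight_only(torch.uint4, group_size=64, use_hqq=True))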

Test Plan:

python generate.py --compile --quantization uintx-4-64-hqq --precision bfloat16
Average tokens/sec: 45.13
Average Bandwidth: 316.81 GB/s
Peak Memory Usage: 9.43 GB
Model Size: 7.02 GB

python eval.py --compile --quantization uintx-4-64-hqq --precision bfloat16

wikitext: {'word_perplexity,none': 12.774482203447983, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 1.6102441441484696, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 0.6872794453888409, 'bits_per_byte_stderr,none': 'N/A', 'alias': 'wikitext'}


pytorch-bot (bot) commented Aug 31, 2024:

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/786

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit b837fd0 with merge base e05635e:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Aug 31, 2024.
# you can enable [hqq](https://github.com/mobiusml/hqq/tree/master) quantization, which is expected to improve accuracy,
# through the use_hqq flag for `int4_weight_only` quantization
use_hqq = False
quantize_(model, int4_weight_only(group_size=group_size, use_hqq=use_hqq))
Contributor:

Why is this different from the way we enable auto-round, which is its own function like apply_auto_round?

@jerryzh168 (Contributor, Author):

I think this depends on whether we want to expose just int4 weight-only quant or all bitwidths. This PR only enables hqq for int4, so it's more convenient to add it to the existing int4_weight_only quant. But if we want to support all bitwidths, then we should follow what auto_round is doing.

cc @mobicham please let me know which one makes more sense

@mobicham (Collaborator) commented Sep 2, 2024:

Maybe we can keep that flag in int4_weight_only and have a call like this for the more general intx case?

# assumes torch and torchao's to_affine_quantized_intx, MappingType, ZeroPointDomain,
# TensorCoreTiledLayoutType, and PlainLayoutType are already in scope
def to_hqq_quantized(input_float, nbits: int, group_size: int):
    return to_affine_quantized_intx(
        input_float=input_float,
        mapping_type=MappingType.ASYMMETRIC,
        block_size=(1, group_size),
        target_dtype=torch.bfloat16,
        quant_min=0,
        quant_max=2**nbits - 1,
        zero_point_domain=ZeroPointDomain.FLOAT,
        preserve_zero=False,
        layout_type=TensorCoreTiledLayoutType(inner_k_tiles=8) if nbits == 4 else PlainLayoutType(),
        use_hqq=True,
    )
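
For illustration, such a helper could then be applied to a linear weight roughly like this (to_hqq_quantized is only the sketch above, not an existing torchao API):

# hypothetical usage of the sketch above; requires a CUDA device
linear = torch.nn.Linear(4096, 4096, bias=False).to(torch.bfloat16).cuda()
linear.weight = torch.nn.Parameter(
    to_hqq_quantized(linear.weight, nbits=4, group_size=64),
    requires_grad=False,
)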

Member:

My sense is we should separate out the implementation details from the algorithm name. Internally, HQQ can be implemented by calling int4_weight_only, but there is no reason to leak this detail to end users.

@jerryzh168 (Contributor, Author):

@mobicham sure, that would align with what auto_round is doing now, I think.

@msaroufim you are also suggesting to have a separate hqq_weight_only(dtype, group_size, layout_type) method, right?

@mobicham (Collaborator):

Thank you @jerryzh168! Can you please add it to the test file in test/hqq/test_hqq_affine.py? I made a full gist here: https://gist.github.com/mobicham/26f76a9cb06b59d775c97f57a53108c5 (feel free to change the names of the functions etc.).

By the way, the test was failing for 3-bit/7-bit on the 4090 specifically, so I also updated the ref_dot_product_error for those two cases. The test should fail when use_hqq=False since the error is higher.

@jerryzh168 mentioned this pull request on Sep 2, 2024
# you can enable [hqq](https://github.com/mobiusml/hqq/tree/master) quantization, which is expected to improve accuracy,
# through the use_hqq flag for `int4_weight_only` quantization
use_hqq = False
quantize_(model, int4_weight_only(group_size=group_size, use_hqq=use_hqq))
Contributor:

Since this is user-facing, should the current bool use_hqq instead be an enum, something like SingleWeightQuantizationAlgorithm.HQQ?
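
A rough sketch of what the enum-based option could look like (the enum name and values are hypothetical, taken from the suggestion above; not an existing torchao API):

from enum import Enum

class SingleWeightQuantizationAlgorithm(Enum):
    # plain round-to-nearest affine quantization (the current default behavior)
    DEFAULT = "default"
    # half-quadratic quantization (HQQ) of the affine parameters
    HQQ = "hqq"

# e.g. int4_weight_only(group_size=64, algorithm=SingleWeightQuantizationAlgorithm.HQQ)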

@jerryzh168 (Contributor, Author):

Yeah, this is not ideal; I'm planning to just have a separate hqq config and remove the flag.

@jerryzh168 (Contributor, Author):

@vkuzo I'm integrating hqq into the uintx_weight_only API now, and I'm keeping the boolean flag for now to keep it simpler. We can make this an enum if there are more algorithms in the future, I think. Please let me know if that sounds OK.

Contributor:

if you're ok with potentially changing it later, sgtm

@jerryzh168 force-pushed the expose_hqq branch 2 times, most recently from f60e8c0 to fa78b5b, on September 5, 2024 00:20
@jerryzh168 changed the title from "Expose hqq through int4_weight_only API" to "Expose hqq through hqq_uintx_weight_only API" on Sep 5, 2024
@HDCharles (Contributor) commented Sep 5, 2024:

It feels really weird to me to have both hqq_uintx_weight_only and int4_weight_only as user-facing APIs.

As is, we have a strict hierarchy where we pick the quantization technique and bitwidth first, i.e.

int8_weight_only, int4_weight_only, int8_dynamic, ...etc., and then the user adds configuration like group size and layout or whatever.

It feels extremely odd to then have a secondary API where you swap the order and first think about the quantization algorithm and only afterwards think about the quantization type/bitwidth. It makes it hard for a user to navigate: if they want to do int4 with hqq, do they look for an hqq configuration in the int4 API or an int4 configuration in the hqq API?

I think we should have one or the other, not both.

My suggested design would be: instead of having hqq be an intrinsic part of the API function name, have it as an optional configuration in uintx_weight_only and do it through there.
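
For concreteness, the two shapes being compared look roughly like this (both signatures are illustrative; hqq_uintx_weight_only was only a proposed name, not an existing torchao API):

# algorithm-first shape: pick the algorithm, then the dtype and other config
# (hqq_uintx_weight_only is hypothetical here)
quantize_(model, hqq_uintx_weight_only(torch.uint4, group_size=64))

# dtype-first shape: keep the existing hierarchy and pass hqq as an optional flag
# (the direction suggested here, and what the PR ended up doing via uintx_weight_only)
quantize_(model, uintx_weight_only(torch.uint4, group_size=64, use_hqq=True))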

@HDCharles (Contributor) left a review:

see comment

@jerryzh168 changed the title from "Expose hqq through hqq_uintx_weight_only API" to "Expose hqq through uintx_weight_only API" on Sep 5, 2024
@jerryzh168 (Contributor, Author), replying on the use_hqq = False line:

As discussed offline, the main thing is that the API today does not correspond to a dtype; e.g. int4_weight_only quant is actually the specific type of int4 weight-only quant that works with the "tinygemm" kernel. I think to make the API easier to understand, it would be better for each of the APIs to correspond to a kernel or a group of kernels; maybe we can document this better.

As for the hqq APIs, yeah, I can merge it into uintx_weight_only, since hqq is reusing existing kernels currently. If we have kernels that work with hqq's "raw_output" arg in the future, then it might make sense to have a separate config for it.

@jerryzh168 force-pushed the expose_hqq branch 3 times, most recently from 43ab845 to ded74c8, on September 6, 2024 02:25
@jerryzh168 (Contributor, Author):

Considering auto-round, it seems that the user will have to think about the algorithm, then dtypes and other configs... but hqq is simple enough to be added to the existing quant methods.

Summary:
att, this is a follow-up to pytorch#605 to make hqq available in the quantize_ API

`quantize_(model, int4_weight_only(group_size, use_hqq=True))`

Test Plan:

python generate.py --compile --quantization int4wo-hqq-64 --precision bfloat16
Average tokens/sec: 195.24
Average Bandwidth: 729.40 GB/s
Peak Memory Usage: 5.09 GB
Model Size: 3.74 GB

python eval.py --compile --quantization int4wo-hqq-64 --precision bfloat16

wikitext: {'word_perplexity,none': 12.823631773497512, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 1.611400903914048, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 0.6883154699192412, 'bits_per_byte_stderr,none': 'N/A', 'alias': 'wikitext'}

@jerryzh168 merged commit 0601b5c into pytorch:main on Sep 6, 2024
17 checks passed
@jerryzh168 deleted the expose_hqq branch on September 6, 2024 18:04
andrewor14 pushed a commit that referenced this pull request Sep 6, 2024
Expose hqq through `int4_weight_only` API

jainapurva pushed a commit that referenced this pull request Sep 9, 2024
Expose hqq through `int4_weight_only` API
