
Expose hqq through uintx_weight_only API #786

Merged: 1 commit merged into pytorch:main from the expose_hqq branch on Sep 6, 2024

Conversation

@jerryzh168 (Contributor) commented Aug 31, 2024:

Summary:
att, this is a follow-up to #605 to make hqq available in the quantize_ API:

quantize_(model, uintx_weight_only(dtype, group_size, use_hqq=True))

which will use TensorCoreTiledLayoutType for uint4 and PlainLayoutType for other dtypes.
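
For reference, a minimal end-to-end sketch of the call above (the toy model, import paths, and exact dtype handle are illustrative assumptions, not part of this PR):

# Minimal usage sketch; assumes a torchao build that includes this PR, a torch
# version that exposes torch.uint4, and a CUDA device for the tinygemm path.
import torch
from torchao.quantization import quantize_, uintx_weight_only

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024, bias=False)).to(torch.bfloat16).cuda()

# uint4 picks the tensor-core tiled layout; other uintx dtypes fall back to the plain layout.
quantize_(model, uintx_weight_only(torch.uint4, group_size=64, use_hqq=True))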

Test Plan:

python generate.py --compile --quantization uintx-4-64-hqq --precision bfloat16
Average tokens/sec: 45.13
Average Bandwidth: 316.81 GB/s
Peak Memory Usage: 9.43 GB
Model Size: 7.02 GB

python eval.py --compile --quantization uintx-4-64-hqq --precision bfloat16

wikitext: {'word_perplexity,none': 12.774482203447983, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 1.6102441441484696, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 0.6872794453888409, 'bits_per_byte_stderr,none': 'N/A', 'alias': 'wikitext'}


pytorch-bot (bot) commented Aug 31, 2024:

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/786

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit b837fd0 with merge base e05635e:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Aug 31, 2024.
# you can enable [hqq](https://github.com/mobiusml/hqq/tree/master) quantization, which is expected to improve accuracy,
# through the use_hqq flag for `int4_weight_only` quantization
use_hqq = False
quantize_(model, int4_weight_only(group_size=group_size, use_hqq=use_hqq))
Contributor:

Why is this different from the way we enable auto-round, which is its own function like apply_auto_round?

@jerryzh168 (Contributor, Author):

I think this depends on whether we want to expose just int4 weight-only quant or all bitwidths. This PR only enables hqq for int4, so it's more convenient to add it to the existing int4_weight_only quant. But if we want to support all bitwidths, then we should follow what auto_round is doing.

cc @mobicham please let me know which one makes more sense

@mobicham (Collaborator) commented Sep 2, 2024:

Maybe we can keep that flag in int4_weight_only and have a call like this for the more general intx case?

# assumes torch and torchao's to_affine_quantized_intx, MappingType, ZeroPointDomain,
# TensorCoreTiledLayoutType, and PlainLayoutType are already in scope
def to_hqq_quantized(input_float, nbits: int, group_size: int):
    return to_affine_quantized_intx(
        input_float=input_float,
        mapping_type=MappingType.ASYMMETRIC,
        block_size=(1, group_size),
        target_dtype=torch.bfloat16,
        quant_min=0,
        quant_max=2**nbits - 1,
        zero_point_domain=ZeroPointDomain.FLOAT,
        preserve_zero=False,
        layout_type=TensorCoreTiledLayoutType(inner_k_tiles=8) if nbits == 4 else PlainLayoutType(),
        use_hqq=True,
    )
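
For illustration, such a helper could then be applied to a linear weight roughly like this (to_hqq_quantized is only the sketch above, not an existing torchao API):

# hypothetical usage of the sketch above; requires a CUDA device
linear = torch.nn.Linear(4096, 4096, bias=False).to(torch.bfloat16).cuda()
linear.weight = torch.nn.Parameter(
    to_hqq_quantized(linear.weight, nbits=4, group_size=64),
    requires_grad=False,
)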

Member:

My sense is we should separate out the implementation details from the algorithm name. Internally, HQQ can be implemented by calling int4_weight_only, but there is no reason to leak this detail to end users.

@jerryzh168 (Contributor, Author):

@mobicham sure, that would align with what auto_round is doing now, I think.

@msaroufim you are also suggesting to have a separate hqq_weight_only(dtype, group_size, layout_type) method, right?

@mobicham (Collaborator):

Thank you @jerryzh168! Can you please add it to the test file in test/hqq/test_hqq_affine.py? I made a full gist here: https://gist.github.com/mobicham/26f76a9cb06b59d775c97f57a53108c5 (feel free to change the names of the functions etc.).

By the way, the test was failing for 3-bit/7-bit on the 4090 specifically, so I also updated the ref_dot_product_error for those two cases. The test should fail when use_hqq=False since the error is higher.

@jerryzh168 mentioned this pull request on Sep 2, 2024
# you can enable [hqq](https://github.com/mobiusml/hqq/tree/master) quantization, which is expected to improve accuracy,
# through the use_hqq flag for `int4_weight_only` quantization
use_hqq = False
quantize_(model, int4_weight_only(group_size=group_size, use_hqq=use_hqq))
Contributor:

Since this is user-facing, should the current bool use_hqq instead be an enum, something like SingleWeightQuantizationAlgorithm.HQQ?
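
A rough sketch of what the enum-based option could look like (the enum name and values are hypothetical, taken from the suggestion above; not an existing torchao API):

from enum import Enum

class SingleWeightQuantizationAlgorithm(Enum):
    # plain round-to-nearest affine quantization (the current default behavior)
    DEFAULT = "default"
    # half-quadratic quantization (HQQ) of the affine parameters
    HQQ = "hqq"

# e.g. int4_weight_only(group_size=64, algorithm=SingleWeightQuantizationAlgorithm.HQQ)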

@jerryzh168 (Contributor, Author):

Yeah, this is not ideal; I'm planning to just have a separate hqq config and remove the flag.

@jerryzh168 (Contributor, Author):

@vkuzo I'm integrating hqq into the uintx_weight_only API now, and I'm keeping the boolean flag for now to keep it simpler. We can make this an enum if there are more algorithms in the future, I think. Please let me know if that sounds OK.

Contributor:

if you're ok with potentially changing it later, sgtm

@jerryzh168 force-pushed the expose_hqq branch 2 times, most recently from f60e8c0 to fa78b5b, on September 5, 2024 00:20
@jerryzh168 changed the title from "Expose hqq through int4_weight_only API" to "Expose hqq through hqq_uintx_weight_only API" on Sep 5, 2024
@HDCharles (Contributor) commented Sep 5, 2024:

It feels really weird to me to have both hqq_uintx_weight_only and int4_weight_only as user-facing APIs.

As is, we have a strict hierarchy where we pick the quantization technique and bitwidth first, i.e.

int8_weight_only, int4_weight_only, int8_dynamic, ...etc., and then the user adds configuration like group size and layout or whatever.

It feels extremely odd to then have a secondary API where you swap the order and first think about the quantization algorithm and only afterwards think about the quantization type/bitwidth. It makes it hard for a user to navigate: if they want to do int4 with hqq, do they look for an hqq configuration in the int4 API or an int4 configuration in the hqq API?

I think we should have one or the other, not both.

My suggested design would be: instead of having hqq be an intrinsic part of the API function name, have it as an optional configuration in uintx_weight_only and do it through there.
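
For concreteness, the two shapes being compared look roughly like this (both signatures are illustrative; hqq_uintx_weight_only was only a proposed name, not an existing torchao API):

# algorithm-first shape: pick the algorithm, then the dtype and other config
# (hqq_uintx_weight_only is hypothetical here)
quantize_(model, hqq_uintx_weight_only(torch.uint4, group_size=64))

# dtype-first shape: keep the existing hierarchy and pass hqq as an optional flag
# (the direction suggested here, and what the PR ended up doing via uintx_weight_only)
quantize_(model, uintx_weight_only(torch.uint4, group_size=64, use_hqq=True))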

@HDCharles (Contributor) left a review:

see comment

@jerryzh168 changed the title from "Expose hqq through hqq_uintx_weight_only API" to "Expose hqq through uintx_weight_only API" on Sep 5, 2024
@jerryzh168 (Contributor, Author), replying on the use_hqq = False line:

As discussed offline, the main thing is that the API today does not correspond to a dtype; e.g. int4_weight_only quant is actually the specific type of int4 weight-only quant that works with the "tinygemm" kernel. I think to make the API easier to understand, it would be better for each of the APIs to correspond to a kernel or a group of kernels; maybe we can document this better.

As for the hqq APIs, yeah, I can merge it into uintx_weight_only, since hqq is reusing existing kernels currently. If we have kernels that work with hqq's "raw_output" arg in the future, then it might make sense to have a separate config for it.

@jerryzh168 force-pushed the expose_hqq branch 3 times, most recently from 43ab845 to ded74c8, on September 6, 2024 02:25
@jerryzh168 (Contributor, Author):

Considering auto-round, it seems that the user will have to think about the algorithm, then dtypes and other configs... but hqq is simple enough to be added to the existing quant methods.

Summary:
att, this is a follow-up to pytorch#605 to make hqq available in the quantize_ API

`quantize_(model, int4_weight_only(group_size, use_hqq=True))`

Test Plan:

python generate.py --compile --quantization int4wo-hqq-64 --precision bfloat16
Average tokens/sec: 195.24
Average Bandwidth: 729.40 GB/s
Peak Memory Usage: 5.09 GB
Model Size: 3.74 GB

python eval.py --compile --quantization int4wo-hqq-64 --precision bfloat16

wikitext: {'word_perplexity,none': 12.823631773497512, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 1.611400903914048, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 0.6883154699192412, 'bits_per_byte_stderr,none': 'N/A', 'alias': 'wikitext'}

@jerryzh168 merged commit 0601b5c into pytorch:main on Sep 6, 2024
17 checks passed
@jerryzh168 deleted the expose_hqq branch on September 6, 2024 18:04
andrewor14 pushed a commit that referenced this pull request Sep 6, 2024
Expose hqq through `int4_weight_only` API

jainapurva pushed a commit that referenced this pull request Sep 9, 2024
Expose hqq through `int4_weight_only` API
