Expose hqq through uintx_weight_only API #786
Dr. CI: ✅ No failures as of commit b837fd0 with merge base e05635e. See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/786.
# you can enable [hqq](https://github.com/mobiusml/hqq/tree/master) quantization, which is expected to improve accuracy, through the
# use_hqq flag for `int4_weight_only` quantization
use_hqq = False
quantize_(model, int4_weight_only(group_size=group_size, use_hqq=use_hqq))
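For context, a minimal self-contained sketch of what that snippet assumes around it; the imports, the toy model, and the group size are illustrative, and use_hqq is the flag discussed in this PR:

import torch
import torch.nn as nn
from torchao.quantization import quantize_, int4_weight_only

# the int4 tinygemm path expects bfloat16 weights on CUDA
model = nn.Sequential(nn.Linear(1024, 1024)).to(device="cuda", dtype=torch.bfloat16)
quantize_(model, int4_weight_only(group_size=64, use_hqq=True))  # HQQ-based int4 weight-only quant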
Why is this different from the way we enable auto-round, which is its own function like `apply_auto_round`?
I think this depends on whether we want to expose just int4 weight-only quant or all bit widths. This PR only enables hqq for int4, so it's more convenient to add it to the existing int4_weight_only quant. But if we want to support all bit widths, then we should follow what auto_round is doing.
cc @mobicham please let me know which one makes more sense
Maybe we can keep that flag in int4_weight_only and have some call like this for the more general intx case?

# assumes the torchao helpers (to_affine_quantized_intx, MappingType, ZeroPointDomain, and the layout types) are in scope
def to_hqq_quantized(input_float: torch.Tensor, nbits: int, group_size: int):
    return to_affine_quantized_intx(
        input_float=input_float,
        mapping_type=MappingType.ASYMMETRIC,
        block_size=(1, group_size),
        target_dtype=torch.bfloat16,
        quant_min=0,
        quant_max=2**nbits - 1,
        zero_point_domain=ZeroPointDomain.FLOAT,
        preserve_zero=False,
        # the tensor-core tiled layout only supports 4-bit, so fall back to the plain layout otherwise
        layout_type=TensorCoreTiledLayoutType(inner_k_tiles=8) if nbits in [4] else PlainLayoutType(),
        use_hqq=True,
    )
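For illustration, usage of that sketch might look like the following; the weight shape and group size are arbitrary, and the helper is the hypothetical one proposed above:

weight = torch.randn(4096, 4096, dtype=torch.bfloat16, device="cuda")
w4 = to_hqq_quantized(weight, nbits=4, group_size=64)  # tensor-core tiled layout
w3 = to_hqq_quantized(weight, nbits=3, group_size=64)  # plain layout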
My sense is that we should separate the implementation details from the algorithm name. Internally HQQ can be implemented by calling int4_weight_only, but there's no reason to leak this detail to end users.
@mobicham sure, that would align with what auto_round is doing now, I think.
@msaroufim you are also suggesting to have a separate `hqq_weight_only(dtype, group_size, layout_type)` method, right?
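For reference, such a factory might look roughly like the sketch below; the name hqq_weight_only, the bit-width mapping, and the default arguments are assumptions taken from this thread, not an existing torchao API:

# hypothetical factory in the style of the existing *_weight_only helpers,
# keyed on the algorithm (HQQ) first and the bit width second
_ASSUMED_BIT_WIDTH = {torch.uint2: 2, torch.uint3: 3, torch.uint4: 4}  # assumed dtype -> nbits mapping

def hqq_weight_only(dtype=torch.uint4, group_size: int = 64):
    # returns a weight-transform callable; wiring it into quantize_ would need the same
    # module-wrapping glue the existing factories use internally
    def apply_hqq(weight: torch.Tensor):
        return to_hqq_quantized(weight, nbits=_ASSUMED_BIT_WIDTH[dtype], group_size=group_size)
    return apply_hqq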
Thank you @jerryzh168! By the way, the test was failing for 3-bit/7-bit on the 4090 specifically, so I also updated the
# you can enable [hqq](https://github.com/mobiusml/hqq/tree/master) quantization, which is expected to improve accuracy, through the
# use_hqq flag for `int4_weight_only` quantization
use_hqq = False
quantize_(model, int4_weight_only(group_size=group_size, use_hqq=use_hqq))
Since this is user facing, should the current bool `use_hqq` instead be an enum, something like `SingleWeightQuantizationAlgorithm.HQQ`?
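For reference, the enum-based alternative could look roughly like this; the enum name comes from the comment above and the member names are assumptions, not existing torchao identifiers:

from enum import Enum

class SingleWeightQuantizationAlgorithm(Enum):  # hypothetical
    MIN_MAX = "min_max"  # the default scale/zero-point choice
    HQQ = "hqq"          # half-quadratic quantization

# flag-based call today vs. a possible enum-based call:
# int4_weight_only(group_size=64, use_hqq=True)
# int4_weight_only(group_size=64, algorithm=SingleWeightQuantizationAlgorithm.HQQ)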
yeah this is not ideal, I'm planning to just have a separate hqq config and remove the flag
@vkuzo I'm integrating hqq into the `uintx_weight_only` API now, and I'm keeping the boolean flag for now to keep it simpler. We can make this an enum if there are more algorithms in the future, I think; please let me know if that sounds OK.
if you're ok with potentially changing it later, sgtm
Title changed: `int4_weight_only` API → `hqq_uintx_weight_only` API
It feels really weird to me to have both hqq_uintx_weight_only and int4_weight_only as user-facing APIs. As is, we have a strict hierarchy where we pick the quantization technique and bit width first, i.e. int8_weight_only, int4_weight_only, int8_dynamic, etc., and then the user adds configuration like group size and layout. It feels extremely odd to then have a secondary API where you swap the order and first think about the quantization algorithm and only thereafter think about the quantization type/bit width. It makes it hard for a user to navigate: if they want to do int4 with hqq, do they look for an hqq configuration in the int4 API or an int4 configuration in the hqq API? I think we should have one or the other, not both. My suggested design would be, instead of having hqq be an intrinsic part of the API function call, to have it as an optional configuration in uintx_weight_only and do it through there.
see comment
Title changed: `hqq_uintx_weight_only` API → `uintx_weight_only` API
As discussed offline, the main thing is that the API today does not correspond to a dtype; e.g. int4_weight_only quant is actually the specific type of int4 weight-only quant that works with the "tinygemm" kernel. I think to make the API easier to understand, it would be better for each of the APIs to correspond to a kernel or a group of kernels; maybe we can document this better. As for the hqq APIs, yeah, I can merge them into uintx_weight_only, since hqq is reusing existing kernels currently. If we have kernels that work with hqq's "raw_output" arg in the future, then it might make sense to have a separate config for it.
Considering auto-round, it seems that the user will have to think about the algorithm first, then dtypes and other configs... but hqq is simple enough to be added to existing quant methods, though.
Expose hqq through `int4_weight_only` API
Summary:
att, this is a follow up for #605 to make hqq available in the quantize_ API:
`quantize_(model, int4_weight_only(group_size, use_hqq=True))`
Test Plan:
python generate.py --compile --quantization int4wo-hqq-64 --precision bfloat16
Average tokens/sec: 195.24
Average Bandwidth: 729.40 GB/s
Peak Memory Usage: 5.09 GB
Model Size: 3.74 GB
python eval.py --compile --quantization int4wo-hqq-64 --precision bfloat16
wikitext: {'word_perplexity,none': 12.823631773497512, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 1.611400903914048, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 0.6883154699192412, 'bits_per_byte_stderr,none': 'N/A', 'alias': 'wikitext'}
Reviewers:
Subscribers:
Tasks:
Tags:
Summary:
att, this is a follow up for #605 to make hqq available in the quantize_ API:
quantize_(model, uintx_weight_only(dtype, group_size, use_hqq=True))
which will use TensorCoreTiledLayoutType for uint4, and PlainLayoutType for other bit widths
Test Plan:
python generate.py --compile --quantization uintx-4-64-hqq --precision bfloat16
Average tokens/sec: 45.13
Average Bandwidth: 316.81 GB/s
Peak Memory Usage: 9.43 GB
Model Size: 7.02 GB
python eval.py --compile --quantization uintx-4-64-hqq --precision bfloat16
wikitext: {'word_perplexity,none': 12.774482203447983, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 1.6102441441484696, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 0.6872794453888409, 'bits_per_byte_stderr,none': 'N/A', 'alias': 'wikitext'}
Reviewers:
Subscribers:
Tasks:
Tags:
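Putting it together, an end-to-end call through the merged API looks roughly like the sketch below; it assumes a CUDA device, bfloat16 weights, the use_hqq flag added in this PR, and an arbitrary toy model:

import torch
import torch.nn as nn
from torchao.quantization import quantize_, uintx_weight_only

model = nn.Sequential(nn.Linear(1024, 1024)).to(device="cuda", dtype=torch.bfloat16)

# uint4 routes to TensorCoreTiledLayoutType; other bit widths use PlainLayoutType
quantize_(model, uintx_weight_only(torch.uint4, group_size=64, use_hqq=True))

x = torch.randn(1, 1024, device="cuda", dtype=torch.bfloat16)
y = model(x)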