Feedback on quantize() API #384

Open
gau-nernst opened this issue Jun 16, 2024 · 13 comments

@gau-nernst (Collaborator)

Previously we did this:

import torch
from torchao.quantization.quant_api import change_linear_weights_to_int8_woqtensors

model = torch.compile(model, mode="max-autotune", fullgraph=True)
change_linear_weights_to_int8_woqtensors(model)
# compile after quant also works

With the new quantization API, we have to do this:

import torch
from torchao.quantization.quant_api import quantize, int8wo, unwrap_tensor_subclass

model = quantize(model, int8wo())  # or "int8_weight_only"
model = unwrap_tensor_subclass(model)
model = torch.compile(model, mode='max-autotune', fullgraph=True)  # must compile after unwrap

I think the new API is less user-friendly than the previous one.

  1. int8wo(), int4wo() is a bit unintuitive. I understand it is a mechanism to pass params like group size to the quantization. Alternatives: a full-blown class with a __call__() method, e.g. Int8WeightOnlyConfig (kinda verbose, but the intention is clear), or just passing quant params as extra args/kwargs, e.g. quantize("int4wo", groupsize=128). See the sketch after this list.
  2. It's not clear what unwrap_tensor_subclass() does. Also, why do we need it now to compile the model, but not previously?
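To make the first alternative concrete, here's a rough sketch of what a config class could look like. Int8WeightOnlyConfig and apply_int8_weight_only are names I'm inventing for illustration, not the current torchao API:

from dataclasses import dataclass

import torch

def apply_int8_weight_only(weight: torch.Tensor, group_size: int) -> torch.Tensor:
    # stand-in for the real transform; a real implementation would return a
    # quantized tensor subclass wrapping int8 data plus per-group scales
    raise NotImplementedError

@dataclass
class Int8WeightOnlyConfig:
    group_size: int = 128  # quant params become documented, discoverable fields

    def __call__(self, weight: torch.Tensor) -> torch.Tensor:
        return apply_int8_weight_only(weight, self.group_size)

# usage would then read: quantize(model, Int8WeightOnlyConfig(group_size=64))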

@jerryzh168

@msaroufim added the enhancement (New feature or request) label Jun 16, 2024
@msaroufim (Member) commented Jun 16, 2024

So my understanding of unwrap_tensor_subclass() is that it's primarily there to deal with some limitation of torch.export(), but perhaps @tugsbayasgalan can shed some more light. If that's the case, then we should ONLY recommend people do that for export(), since the API is indeed strange: it introduces concepts like unwrapping and subclasses, which are implementation details. So I see 2 options here:

  1. Either remove the call to unwrap() by default and only recommend people do it for export(), linking to an issue in export() explaining why this is needed so people can follow progress
  2. Call unwrap automatically as part of the quantize() function so end users aren't aware of it (see the sketch below)
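A rough sketch of option 2, reusing the quantize and unwrap_tensor_subclass functions from the snippet above; quantize_and_unwrap is a hypothetical name, not a real API:

import torch.nn as nn
from torchao.quantization.quant_api import quantize, unwrap_tensor_subclass

def quantize_and_unwrap(model: nn.Module, apply_fn) -> nn.Module:
    # hypothetical wrapper: run the existing quantize(), then unwrap the
    # tensor subclasses so users never see that implementation detail
    model = quantize(model, apply_fn)
    return unwrap_tensor_subclass(model)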

Regarding the int8wo() point: the problem is that not all quantization algorithms share the same parameters, and with kwargs it's difficult for users to figure out what's actually supported. Granted, we did explore some other ideas, like:

  1. Don't have a top-level user API for quantization
  2. Don't try to shorten the names, so for example just say Int8WeightOnlyConfig. I personally strongly dislike our abbreviations like wo and qat, since they're familiar to people who work with quantization all the time but no one else

On your point around compilation: it is indeed unclear when a user should vs. must compile, and we need to communicate the benefits; the necessity of compilation might drive users back to a module swap API

@gau-nernst (Collaborator, Author)

Using the new quantize() API, unwrap_tensor_subclass() is a MUST. Without it, I'm getting this error (running the snippet above)

  File "/home/---/code/ao/torchao/dtypes/aqt.py", line 240, in __torch_function__
    return _ATEN_OP_OR_TORCH_FN_TABLE[cls][func](*args, **kwargs)
  File "/home/---/code/ao/torchao/dtypes/utils.py", line 25, in wrapper
    return func(*args, **kwargs)
  File "/home/---/code/ao/torchao/dtypes/aqt.py", line 685, in functional_linear
    weight_tensor = weight_tensor.dequantize()
  File "/home/---/code/ao/torchao/dtypes/aqt.py", line 160, in dequantize
    int_data, scale, zero_point = self.layout_tensor.get_plain()
torch._dynamo.exc.TorchRuntimeError: Failed running call_method forward(*(Linear(in_features=4096, out_features=4096, bias=False), FakeTensor(..., device='cuda:0', size=(1, 4096), dtype=torch.float16)), **{}):
'FakeTensor' object has no attribute 'get_plain'

from user code:
   File "/home/---/miniconda3/envs/dev_nightly/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 43, in inner
    return fn(*args, **kwargs)

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

@tugsbayasgalan

@msaroufim I am working on the support for unwrapping/wrapping nested tensor subclasses in PT2. In general, we expect we should be able to preserve the tensor subclasses if users are targeting our training IR, and they shouldn't have to rely on unwrap_tensor_subclass().
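For context, the export flow today (before this work lands) looks roughly like the following; model and example_input are placeholders, and the unwrap step is the part that should eventually go away:

import torch
from torchao.quantization.quant_api import quantize, int8wo, unwrap_tensor_subclass

model = quantize(model, int8wo())
model = unwrap_tensor_subclass(model)  # still required until subclasses are preserved
exported = torch.export.export(model, (example_input,))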

@yiliu30 (Contributor) commented Jun 17, 2024

Hi, I noticed that the GPTQ-related API was marked to be moved to prototype. Is there any alternative API to use, or are there any plans to support GPTQ formally?

@jerryzh168 (Contributor) commented Jun 17, 2024

@gau-nernst thanks for the feedback.

  1. Makes sense. Configs sound reasonable to me; I'll gather a bit more feedback on this one. I think we don't want to pass around kwargs since we can't document them.
  2. The reason we need it for torch.compile now is that we have multiple levels of tensor subclass, whereas the previous implementation did not. This is a temporary workaround and I hope it gets fixed soon; @tugsbayasgalan is working on this one. (See the inspection sketch below.)
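To see the nesting concretely, you can inspect a quantized weight after quantize(); the layout_tensor attribute name is taken from the aqt.py traceback above and may change across torchao versions:

import torch.nn as nn

for module in model.modules():  # model quantized as in the snippet above
    if isinstance(module, nn.Linear):
        w = module.weight
        print(type(w).__name__)                # outer quantized-tensor subclass
        print(type(w.layout_tensor).__name__)  # inner layout subclass it wraps
        break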

@jerryzh168 (Contributor)

> Hi, I noticed that the GPTQ-related API was marked to be moved to prototype. Is there any alternative API to use, or are there any plans to support GPTQ formally?

We are thinking of deprecating GPTQ when we make HQQ work. cc @HDCharles to confirm that HQQ is better than GPTQ in general.

@jerryzh168 (Contributor)

> Hi, I noticed that the GPTQ-related API was marked to be moved to prototype. Is there any alternative API to use, or are there any plans to support GPTQ formally?

Can you also describe your use case for GPTQ?

@HDCharles (Contributor) commented Jun 17, 2024

> Hi, I noticed that the GPTQ-related API was marked to be moved to prototype. Is there any alternative API to use, or are there any plans to support GPTQ formally?

@yiliu30 to add on to what @jerryzh168 is saying: we haven't seen a lot of people interested in this API at the moment, so it's not something we've invested a ton of effort into. There are some limitations in the existing API/implementation that make it not work on some parts of some models unless they're carefully handled (https://github.com/pytorch/ao/blob/main/torchao/_models/llama/model.py#L89-L96). We could fix those if we rewrote the whole thing, but until we do that, it hasn't been tested as thoroughly and isn't expected to work as widely as something like int8 weight-only quantization. If you have a significant use case for GPTQ, that may change what we do with it.

@yiliu30 (Contributor) commented Jun 18, 2024

> Hi, I noticed that the GPTQ-related API was marked to be moved to prototype. Is there any alternative API to use, or are there any plans to support GPTQ formally?
>
> Can you also describe your use case for GPTQ?

@jerryzh168 @HDCharles My reason for keeping GPTQ support is that it is quite popular within the community :). For instance, Hugging Face currently includes 3000+ GPTQ models.

jerryzh168 added a commit to jerryzh168/ao that referenced this issue Jun 19, 2024 (the same commit message recurred in several pushes through Jun 21, 2024, including one to the main repo):

Summary:
Addressing feedback from pytorch#384 and pytorch#375

Test Plan:
regression tests

python test/quantization/test_quant_api.py
python test/integration/test_integration.py
@gau-nernst (Collaborator, Author)

@jerryzh168 Just visiting this issue again, particularly about unwrap_tensor_subclass(). When I tested with the latest main (96d49cd), unwrap_tensor_subclass() is still needed. Are there any drawbacks if we include it inside quantize() so that users don't need to care about it? (as suggested by @msaroufim in #384 (comment))

@jerryzh168 (Contributor)

> @jerryzh168 Just visiting this issue again, particularly about unwrap_tensor_subclass(). When I tested with the latest main (96d49cd), unwrap_tensor_subclass() is still needed. Are there any drawbacks if we include it inside quantize() so that users don't need to care about it? (as suggested by @msaroufim in #384 (comment))

The main thing is that it makes things a bit harder to debug, I think. We'll be removing this soon though, in the next couple of days; stay tuned. We are waiting for pytorch/pytorch#127431 to land, and then I'll put up a PR to remove it.

@gau-nernst (Collaborator, Author)

@jerryzh168 that's good to hear! However, users of previous versions of PyTorch (e.g. v2.3) will still need to unwrap tensor subclasses? Might not be that important.

@jerryzh168 (Contributor)

> @jerryzh168 that's good to hear! However, users of previous versions of PyTorch (e.g. v2.3) will still need to unwrap tensor subclasses? Might not be that important.

Yeah, that's true. I hope at some point we can just stop supporting 2.2 and 2.3 so we can deprecate the old APIs as well.

Also, we have an updated timeline of weeks to a month; see the comments in #462 (comment) for more details.
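In the meantime, a version-gated workaround along these lines should keep both old and new PyTorch working; the 2.4 cutoff here is an assumption about where the fix lands, not a confirmed release target:

import torch
from torchao.quantization.quant_api import quantize, int8wo, unwrap_tensor_subclass

model = quantize(model, int8wo())
# assumption: builds that include pytorch/pytorch#127431 no longer need the
# unwrap step; older releases such as 2.3 still do
if torch.__version__ < (2, 4):  # TorchVersion supports tuple comparison
    model = unwrap_tensor_subclass(model)
model = torch.compile(model, mode="max-autotune", fullgraph=True)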

dbyoung18 pushed a commit to dbyoung18/ao that referenced this issue Jul 31, 2024 (…orch#400), with the same commit message as above.