
GPTQ Quantized-weight Sequential Updating #177

Merged: 17 commits merged into main from kylesayrs/true_sequential on Sep 24, 2024

Conversation

@kylesayrs (Collaborator) commented Sep 16, 2024

Background

There are several ways in which sequential updating can be implemented (each option is sketched in code after this list):

  1. (off/False) Since running non-sequentially, i.e. one forward pass, gives no accuracy benefits and uses the most memory, I would prefer to remove support for this.
  2. (block_unquantized) Running sequentially, block by block but without outputs from quantized weights, saves memory but does not give accuracy benefits. Mark Kurtz referred to this as essentially a bug in the sequential_update implementation, so I'd prefer not to support this.
  3. (block/layer) Unfortunately the names conflict badly here (LC refers to transformer blocks as layers), but this option refers to using quantized-weight outputs for each transformer block (i.e. true_sequential=False). I think this should be supported.
  4. (linear/true/True) This option refers to quantizing each linear layer within each block separately and continuing with weight-quantized outputs. I think this should be supported.
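
To make the options above concrete, here is a minimal sketch (not the LayerCompressor implementation) of how options 1, 3, and 4 differ in which activations feed calibration. `blocks`, `quantize_weights`, and `block.linears` are hypothetical stand-ins; option 2 is just option 3 without propagating the quantized-weight outputs.

```python
import torch

# Hypothetical stand-ins: `blocks` is an iterable of transformer blocks (callables),
# `quantize_weights(module, activations)` runs GPTQ on that module's weights.

def option_1_off(blocks, quantize_weights, activations):
    # off/False: every block is calibrated on activations produced with the
    # original (unquantized) weights from a single conceptual forward pass,
    # so all per-block inputs must be held in memory at once.
    captured = []
    with torch.no_grad():
        for block in blocks:
            captured.append(activations)
            activations = block(activations)
    for block, acts in zip(blocks, captured):
        quantize_weights(block, acts)

def option_3_block_sequential(blocks, quantize_weights, activations):
    # block/layer: quantize a block, then recompute its outputs with the
    # quantized weights so later blocks calibrate on quantized activations.
    for block in blocks:
        quantize_weights(block, activations)
        with torch.no_grad():
            activations = block(activations)
    return activations

def option_4_true_sequential(blocks, quantize_weights, activations):
    # linear/true/True: quantize each linear inside the block one at a time,
    # so later linears see inputs produced by already-quantized earlier linears
    # (the per-linear input capture is elided in this sketch).
    for block in blocks:
        for linear in block.linears:  # hypothetical attribute
            quantize_weights(linear, activations)
        with torch.no_grad():
            activations = block(activations)
    return activations
```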

Purpose

  • Implement options 1 and 3 for better replication of AutoGPTQ

Changes

  • Always pre_compress during initialize_compression (this has no functional effect)
  • Run an additional weight-quantized forward pass to compute the weight-quantized intermediate outputs passed to the next block (see the sketch after this list)
  • Clean up typing in LayerCompressor
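
A minimal sketch, with hypothetical helper names, of the per-block flow described by the second bullet; the real hook management and cache handling in LayerCompressor are elided:

```python
import torch

def compress_block(block, inputs, apply_gptq):
    # Sketch only; `apply_gptq` is a hypothetical stand-in for the GPTQ step.
    # 1) forward pass with original weights while hooks collect GPTQ statistics
    with torch.no_grad():
        _ = block(inputs)
    # 2) quantize the block's weights in place
    apply_gptq(block)
    # 3) additional weight-quantized forward pass; its outputs become the
    #    calibration inputs for the next block
    with torch.no_grad():
        return block(inputs)
```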

Testing

  • Compressed a model with sequential_update=True and with sequential_update=False
  • Compared the sequential_update=False evaluation from this branch against the sequential_update=False evaluation from main and found the model evaluations to be the same (a sketch of the evaluation invocation follows)
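
For reference, evaluations of this shape can be reproduced with lm-evaluation-harness. The exact command isn't recorded here, but an invocation along the following lines matches the logged configuration (the model path and batch size are copied from the log below; treat the call itself as an assumption rather than the script that was actually run):

```python
import lm_eval

# Hedged reconstruction of the evaluation below: vllm backend, gsm8k, 5-shot, batch size 32.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=/home/ksayers/llm-compressor/actorder20240917_144813,"
        "dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1"
    ),
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size=32,
)
print(results["results"]["gsm8k"])
```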

This branch sequential_update=False

vllm (pretrained=/home/ksayers/llm-compressor/actorder20240917_144813,dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 32
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.2358|±  |0.0117|
|     |       |strict-match    |     5|exact_match|↑  |0.2335|±  |0.0117|

main sequential_update=False

vllm (pretrained=/home/ksayers/llm-compressor/actorder20240917_145431,dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 32
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.2358|±  |0.0117|
|     |       |strict-match    |     5|exact_match|↑  |0.2335|±  |0.0117|


👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

@kylesayrs changed the title from WIP: Sequential to WIP: Sequential Methods on Sep 16, 2024
@kylesayrs marked this pull request as draft on September 16, 2024
@kylesayrs marked this pull request as ready for review on September 17, 2024
@kylesayrs changed the title from WIP: Sequential Methods to Quantized-weight Sequential Methods on Sep 17, 2024
@dsikka (Collaborator) left a comment

It's not very clear from this PR how you're differentiating between the layer and block cases you're adding support for.

@Satrat (Contributor) commented Sep 18, 2024

I think we should keep the current non-sequential implementation (option 1), because it is faster by ~20%. Does option 4 provide any accuracy, memory, or speed boost compared to option 3? I know AutoGPTQ supports option 4, but unless we have a strong reason I don't think it's a priority to add it now.

As for the change from a boolean to an enum, I don't think it's necessary until we decide what to do about option 4; can we leave it as a bool for now? No need to add extra scaffolding for something that isn't on the roadmap.
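
For concreteness, a sketch of the two config shapes being weighed here; both classes are hypothetical illustrations rather than the actual GPTQModifier definition:

```python
from enum import Enum
from pydantic import BaseModel

class KeepBool(BaseModel):
    # The shape proposed above: keep a plain boolean until option 4 is on the roadmap.
    sequential_update: bool = True

class SequentialUpdateMode(str, Enum):
    # What an enum would need to express if all supported options were exposed.
    OFF = "off"        # option 1
    BLOCK = "block"    # option 3
    LINEAR = "linear"  # option 4 (not implemented)

class UseEnum(BaseModel):
    sequential_update: SequentialUpdateMode = SequentialUpdateMode.BLOCK
```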

@kylesayrs (Collaborator, Author) replied to @dsikka:

> It's not very clear from this PR how you're differentiating between the layer and block cases you're adding support for.

Not sure what you're referring to. The PR description specifies implementations for options 1 and 3.

@dsikka (Collaborator) commented Sep 18, 2024

> Not sure what you're referring to. The PR description specifies implementations for options 1 and 3.

We shouldn't be adding it to the enum until it's supported. We should keep a bool for now.

@kylesayrs (Collaborator, Author) commented Sep 19, 2024

Re-performed replication

vllm (pretrained=/home/ksayers/llm-compressor/actorder20240919_133905,dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 32
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.2358|±  |0.0117|
|     |       |strict-match    |     5|exact_match|↑  |0.2335|±  |0.0117|

Evaluated with sequential_update=True

vllm (pretrained=/home/ksayers/llm-compressor/actorder20240919_134528,dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 32
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.2229|±  |0.0115|
|     |       |strict-match    |     5|exact_match|↑  |0.2221|±  |0.0114|

@kylesayrs requested review from dsikka and Satrat on September 19, 2024
@kylesayrs changed the title from Quantized-weight Sequential Methods to Quantized-weight Sequential Updating on Sep 19, 2024
@Satrat (Contributor) left a comment

LGTM, could you add in eval stats for sequential_update=True to the PR description as well? Just so we can easily track any regression

@Satrat (Contributor) left a comment

Actually, looks like there is a test failure related to this PR:

FAILED tests/llmcompressor/transformers/obcq/test_consecutive_runs.py::TestConsecutiveRunsSmall_0_commit::test_consecutive_runs_small - pydantic_core._pydantic_core.ValidationError: 1 validation error for GPTQModifier
sequential_update
  Input should be a valid boolean [type=bool_type, input_value=None, input_type=NoneType]

@kylesayrs changed the title from Quantized-weight Sequential Updating to GPTQ Quantized-weight Sequential Updating on Sep 20, 2024
@kylesayrs (Collaborator, Author)
@Satrat My mistake, I forgot to return the value after validation.
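
For readers hitting the same failure, a minimal sketch of that bug and its fix, using a hypothetical stand-in for GPTQModifier (pydantic v2 style, matching the error above):

```python
from typing import Optional
from pydantic import BaseModel, field_validator

class GPTQModifierSketch(BaseModel):
    # Hypothetical stand-in, not the real GPTQModifier.
    sequential_update: bool = True

    @field_validator("sequential_update", mode="before")
    @classmethod
    def normalize(cls, value: Optional[bool]) -> bool:
        if value is None:
            value = True  # fall back to the default when unset
        return value      # omitting this return hands None back to pydantic,
                          # producing the bool_type ValidationError seen above
```

With the return in place, constructing the model with sequential_update=None falls back to True instead of raising the bool_type error.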

@kylesayrs requested a review from Satrat on September 20, 2024
@kylesayrs (Collaborator, Author)
Looks like the test is passing now

@dsikka merged commit f447ac3 into main on Sep 24, 2024 (6 of 7 checks passed)
@dsikka deleted the kylesayrs/true_sequential branch on September 24, 2024