
GPTQ Quantized-weight Sequential Updating #177

Merged: 17 commits merged into main from kylesayrs/true_sequential on Sep 24, 2024

Conversation

@kylesayrs (Collaborator) commented Sep 16, 2024

Background

There are several ways in which sequential updating can be implemented (each option is sketched in code after this list):

  1. (off/False) Since running non-sequentially, i.e. one forward pass, gives no accuracy benefits and uses the most memory, I would prefer to remove support for this.
  2. (block_unquantized) Running sequentially, block by block but without outputs from quantized weights, saves memory but does not give accuracy benefits. Mark Kurtz referred to this as essentially a bug in the sequential_update implementation, so I'd prefer not to support this.
  3. (block/layer) Unfortunately the names conflict badly here (LC refers to transformer blocks as layers), but this option refers to using quantized-weight outputs for each transformer block (i.e. true_sequential=False). I think this should be supported.
  4. (linear/true/True) This option refers to quantizing each linear layer within each block separately and continuing with weight-quantized outputs. I think this should be supported.
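
To make the options above concrete, here is a minimal sketch (not the LayerCompressor implementation) of how options 1, 3, and 4 differ in which activations feed calibration. `blocks`, `quantize_weights`, and `block.linears` are hypothetical stand-ins; option 2 is just option 3 without propagating the quantized-weight outputs.

```python
import torch

# Hypothetical stand-ins: `blocks` is an iterable of transformer blocks (callables),
# `quantize_weights(module, activations)` runs GPTQ on that module's weights.

def option_1_off(blocks, quantize_weights, activations):
    # off/False: every block is calibrated on activations produced with the
    # original (unquantized) weights from a single conceptual forward pass,
    # so all per-block inputs must be held in memory at once.
    captured = []
    with torch.no_grad():
        for block in blocks:
            captured.append(activations)
            activations = block(activations)
    for block, acts in zip(blocks, captured):
        quantize_weights(block, acts)

def option_3_block_sequential(blocks, quantize_weights, activations):
    # block/layer: quantize a block, then recompute its outputs with the
    # quantized weights so later blocks calibrate on quantized activations.
    for block in blocks:
        quantize_weights(block, activations)
        with torch.no_grad():
            activations = block(activations)
    return activations

def option_4_true_sequential(blocks, quantize_weights, activations):
    # linear/true/True: quantize each linear inside the block one at a time,
    # so later linears see inputs produced by already-quantized earlier linears
    # (the per-linear input capture is elided in this sketch).
    for block in blocks:
        for linear in block.linears:  # hypothetical attribute
            quantize_weights(linear, activations)
        with torch.no_grad():
            activations = block(activations)
    return activations
```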

Purpose

  • Implement options 1 and 3 for better replication of AutoGPTQ

Changes

  • Always pre_compress during initialize_compression (this has no functional effect)
  • Run an additional weight-quantized forward pass to compute the weight-quantized intermediate outputs passed to the next block (see the sketch after this list)
  • Clean up typing in LayerCompressor
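
A minimal sketch, with hypothetical helper names, of the per-block flow described by the second bullet; the real hook management and cache handling in LayerCompressor are elided:

```python
import torch

def compress_block(block, inputs, apply_gptq):
    # Sketch only; `apply_gptq` is a hypothetical stand-in for the GPTQ step.
    # 1) forward pass with original weights while hooks collect GPTQ statistics
    with torch.no_grad():
        _ = block(inputs)
    # 2) quantize the block's weights in place
    apply_gptq(block)
    # 3) additional weight-quantized forward pass; its outputs become the
    #    calibration inputs for the next block
    with torch.no_grad():
        return block(inputs)
```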

Testing

  • Compressed a model with sequential_update=True and with sequential_update=False
  • Compared the sequential_update=False evaluation from this branch against the sequential_update=False evaluation from main and found the model evaluations to be the same (a sketch of the evaluation invocation follows)
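
For reference, evaluations of this shape can be reproduced with lm-evaluation-harness. The exact command isn't recorded here, but an invocation along the following lines matches the logged configuration (the model path and batch size are copied from the log below; treat the call itself as an assumption rather than the script that was actually run):

```python
import lm_eval

# Hedged reconstruction of the evaluation below: vllm backend, gsm8k, 5-shot, batch size 32.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=/home/ksayers/llm-compressor/actorder20240917_144813,"
        "dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1"
    ),
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size=32,
)
print(results["results"]["gsm8k"])
```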

This branch sequential_update=False

vllm (pretrained=/home/ksayers/llm-compressor/actorder20240917_144813,dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 32
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.2358|±  |0.0117|
|     |       |strict-match    |     5|exact_match|↑  |0.2335|±  |0.0117|

main sequential_update=False

vllm (pretrained=/home/ksayers/llm-compressor/actorder20240917_145431,dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 32
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.2358|±  |0.0117|
|     |       |strict-match    |     5|exact_match|↑  |0.2335|±  |0.0117|


👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

@kylesayrs changed the title from WIP: Sequential to WIP: Sequential Methods on Sep 16, 2024
@kylesayrs marked this pull request as draft on September 16, 2024
@kylesayrs marked this pull request as ready for review on September 17, 2024
@kylesayrs changed the title from WIP: Sequential Methods to Quantized-weight Sequential Methods on Sep 17, 2024
@dsikka (Collaborator) left a comment

It's not very clear from this PR how you're differentiating between the layer and block cases you're adding support for.

@Satrat (Contributor) commented Sep 18, 2024

I think we should keep the current non-sequential implementation (option 1), because it is faster by ~20%. Does option 4 provide any accuracy, memory, or speed boost compared to option 3? I know AutoGPTQ supports option 4, but unless we have a strong reason I don't think it's a priority to add it now.

As for the change from a boolean to an enum, I don't think it's necessary until we decide what to do about option 4; can we leave it as a bool for now? No need to add extra scaffolding for something that isn't on the roadmap.
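
For concreteness, a sketch of the two config shapes being weighed here; both classes are hypothetical illustrations rather than the actual GPTQModifier definition:

```python
from enum import Enum
from pydantic import BaseModel

class KeepBool(BaseModel):
    # The shape proposed above: keep a plain boolean until option 4 is on the roadmap.
    sequential_update: bool = True

class SequentialUpdateMode(str, Enum):
    # What an enum would need to express if all supported options were exposed.
    OFF = "off"        # option 1
    BLOCK = "block"    # option 3
    LINEAR = "linear"  # option 4 (not implemented)

class UseEnum(BaseModel):
    sequential_update: SequentialUpdateMode = SequentialUpdateMode.BLOCK
```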

@kylesayrs (Collaborator, Author) replied to @dsikka:

> It's not very clear from this PR how you're differentiating between the layer and block cases you're adding support for.

Not sure what you're referring to. The PR description specifies implementations for options 1 and 3.

@dsikka (Collaborator) commented Sep 18, 2024

> Not sure what you're referring to. The PR description specifies implementations for options 1 and 3.

We shouldn't be adding it to the enum until it's supported. We should keep a bool for now.

@kylesayrs (Collaborator, Author) commented Sep 19, 2024

Re-performed replication

vllm (pretrained=/home/ksayers/llm-compressor/actorder20240919_133905,dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 32
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.2358|±  |0.0117|
|     |       |strict-match    |     5|exact_match|↑  |0.2335|±  |0.0117|

Evaluated with sequential_update=True

vllm (pretrained=/home/ksayers/llm-compressor/actorder20240919_134528,dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 32
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.2229|±  |0.0115|
|     |       |strict-match    |     5|exact_match|↑  |0.2221|±  |0.0114|

@kylesayrs requested review from dsikka and Satrat on September 19, 2024
@kylesayrs changed the title from Quantized-weight Sequential Methods to Quantized-weight Sequential Updating on Sep 19, 2024
@Satrat (Contributor) left a comment

LGTM, could you add in eval stats for sequential_update=True to the PR description as well? Just so we can easily track any regression

@Satrat (Contributor) left a comment

Actually, looks like there is a test failure related to this PR:

FAILED tests/llmcompressor/transformers/obcq/test_consecutive_runs.py::TestConsecutiveRunsSmall_0_commit::test_consecutive_runs_small - pydantic_core._pydantic_core.ValidationError: 1 validation error for GPTQModifier
sequential_update
  Input should be a valid boolean [type=bool_type, input_value=None, input_type=NoneType]

@kylesayrs changed the title from Quantized-weight Sequential Updating to GPTQ Quantized-weight Sequential Updating on Sep 20, 2024
@kylesayrs (Collaborator, Author)
@Satrat My mistake, I forgot to return the value after validation.
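
For readers hitting the same failure, a minimal sketch of that bug and its fix, using a hypothetical stand-in for GPTQModifier (pydantic v2 style, matching the error above):

```python
from typing import Optional
from pydantic import BaseModel, field_validator

class GPTQModifierSketch(BaseModel):
    # Hypothetical stand-in, not the real GPTQModifier.
    sequential_update: bool = True

    @field_validator("sequential_update", mode="before")
    @classmethod
    def normalize(cls, value: Optional[bool]) -> bool:
        if value is None:
            value = True  # fall back to the default when unset
        return value      # omitting this return hands None back to pydantic,
                          # producing the bool_type ValidationError seen above
```

With the return in place, constructing the model with sequential_update=None falls back to True instead of raising the bool_type error.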

@kylesayrs requested a review from Satrat on September 20, 2024
@kylesayrs (Collaborator, Author)
Looks like the test is passing now

@dsikka merged commit f447ac3 into main on Sep 24, 2024 (6 of 7 checks passed)
@dsikka deleted the kylesayrs/true_sequential branch on September 24, 2024