Activation Ordering Strategies #121

Merged: 80 commits from kylesayrs/actorder_static_group into main on Sep 3, 2024

Conversation

kylesayrs (Collaborator) commented on Aug 28, 2024

Activation Ordering Strategies

(Diagram: activation-ordering-diagram)

Features

  • Add group activation ordering ("group"), which yields the best accuracy recovery at the cost of increased inference latency
  • Add weight-only activation ordering ("weight"), which yields accuracy gains similar to group activation ordering but with no added latency (see the conceptual sketch after the usage snippet below)

Usage

QuantizationArgs(actorder="group")  # original actorder option, reorders groups and weight
QuantizationArgs(actorder="weight")  # "static_group" option, reorders weight but uses sequential groupings
QuantizationArgs(actorder=None)  # no activation ordering
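
To make the distinction concrete, here is a minimal conceptual sketch (a sketch only, not the llm-compressor implementation; all tensor names are made up for illustration). Both strategies quantize weight columns in order of decreasing estimated activation importance; "group" leaves the groups in permuted order, while "weight" restores the original column order and keeps sequential groupings.

# Conceptual sketch only (not the library implementation); tensors are illustrative.
import torch

def actorder_permutation(hessian_diag: torch.Tensor) -> torch.Tensor:
    # columns with larger estimated activation importance are quantized first
    return torch.argsort(hessian_diag, descending=True)

weight = torch.randn(4, 8)    # hypothetical [out_features, in_features] weight
hessian_diag = torch.rand(8)  # per-input-column importance estimated from calibration data
perm = actorder_permutation(hessian_diag)

# both strategies quantize columns in permuted order
weight_permuted = weight[:, perm]

# "group":  group scales/zero-points follow the permuted order, so the kernel must
#           gather columns through the permutation at runtime (extra latency)
# "weight": quantized columns are scattered back to their original positions and
#           sequential groupings are kept, so no runtime reordering is needed
weight_restored = torch.empty_like(weight_permuted)
weight_restored[:, perm] = weight_permuted
assert torch.equal(weight_restored, weight)

The example scripts below exercise the real implementation through the GPTQModifier recipe.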
compress_llama.py
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
from transformers import AutoTokenizer
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from datasets import Dataset, load_dataset, load_from_disk
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map
import random


model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

num_samples = 2048
max_seq_len = 8192

tokenizer = AutoTokenizer.from_pretrained(model_id)

#dataset_name = "neuralmagic/LLM_compression_calibration"
dataset_name = "cosmopedia_mix_llama3.pth"
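# "cosmopedia_mix_llama3.pth" is assumed to be a locally saved list of
# pre-tokenized sample tensors (a torch.save artifact), not a Hugging Face dataset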

input_ids = torch.stack(torch.load(dataset_name))[:, 0].to(dtype=torch.int)  # torch.Size([2048, 8192])
attention_mask = torch.ones(num_samples, max_seq_len).to(dtype=torch.int)  # torch.Size([2048, 8192])
dataset = Dataset.from_dict({
    "input_ids": input_ids,
    "attention_mask": attention_mask
})
#dataset = load_from_disk(dataset_name)
#ds = dataset.shuffle().select(range(num_samples))

model = SparseAutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)

recipe = """
    quant_stage:
        quant_modifiers:
            GPTQModifier:
                sequential_update: true
                ignore: ["lm_head"]
                config_groups:
                    group_0:
                        weights:
                            num_bits: 4
                            type: "int"
                            symmetric: true
                            strategy: "group"
                            group_size: 128
                            actorder: "weight"
                        targets: ["Linear"]
"""
# Apply algorithm
oneshot(
    model=model,
    dataset=dataset,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

name = "llama31_actorder_weight"
model.save_pretrained(name, save_compressed=True)
infer_actorder.py
from vllm import LLM

llm = LLM("nm-testing/TinyLlama-1.1B-Chat-v1.0-actorder-group")
outputs = llm.generate("The future of AI is")
print(outputs[0].outputs[0].text)
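
For reference, the gsm8k tables in the Accuracy section below follow the lm-evaluation-harness output format. A run along the following lines should produce that style of result; the lm_eval arguments are inferred from the table headers and the local checkpoint path is a placeholder, so treat this as a sketch rather than the exact command used.

# Sketch of reproducing the gsm8k evaluations with lm-evaluation-harness and the
# vLLM backend; "llama31_actorder_weight" is the local checkpoint saved above and
# stands in for whichever model is being evaluated.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=llama31_actorder_weight,add_bos_token=True",
    tasks=["gsm8k"],
    num_fewshot=5,
    limit=1000,
    batch_size="auto",
)
print(results["results"]["gsm8k"])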

Changes

  • Small cleanup to the GPTQ wrapper
  • Implement "weight" and "group" activation ordering
  • Break out _update_quantization_parameters
  • Add end-to-end tests for activation ordering strategies

Dependencies

Testing

  • Added unit tests (these require the corresponding compressed-tensors changes to land before they can be checked)
  • Performed small accuracy regression tests with different strategies

Accuracy

Full Precision

vllm (pretrained=Qwen/Qwen2-0.5B-Instruct,add_bos_token=True), gen_kwargs: (None), limit: 1000.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.394|±  |0.0155|
|     |       |strict-match    |     5|exact_match|↑  |0.393|±  |0.0155|

No Activation Ordering

vllm (pretrained=/home/ksayers/llm-compressor/qwen_group_only,add_bos_token=True), gen_kwargs: (None), limit: 1000.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.228|±  |0.0133|
|     |       |strict-match    |     5|exact_match|↑  |0.217|±  |0.0130|

Group Activation Ordering

vllm (pretrained=/home/ksayers/llm-compressor/qwen_actorder_group,add_bos_token=True), gen_kwargs: (None), limit: 1000.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.241|±  |0.0135|
|     |       |strict-match    |     5|exact_match|↑  |0.236|±  |0.0134|

Weight-only Activation Ordering

vllm (pretrained=/home/ksayers/llm-compressor/qwen_actorder_weight,add_bos_token=True), gen_kwargs: (None), limit: 1000.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.254|±  |0.0138|
|     |       |strict-match    |     5|exact_match|↑  |0.213|±  |0.0130|

Latency

Full Precision

Avg latency: 0.612304248350362 seconds
10% percentile latency: 0.6050410680472851 seconds
25% percentile latency: 0.6059907954186201 seconds
50% percentile latency: 0.6098776273429394 seconds
75% percentile latency: 0.6107742763124406 seconds
90% percentile latency: 0.6114723403006792 seconds
99% percentile latency: 0.6905647566355766 seconds

No Activation Ordering

Avg latency: 0.451185735501349 seconds
10% percentile latency: 0.4459945809096098 seconds
25% percentile latency: 0.44648625515401363 seconds
50% percentile latency: 0.4473606375977397 seconds
75% percentile latency: 0.4487249795347452 seconds
90% percentile latency: 0.45016821939498186 seconds
99% percentile latency: 0.5247877304255963 seconds

Group Activation Ordering

Avg latency: 0.47711189221590755 seconds
10% percentile latency: 0.47196690943092107 seconds
25% percentile latency: 0.4725433448329568 seconds
50% percentile latency: 0.47307876218110323 seconds
75% percentile latency: 0.4750088737346232 seconds
90% percentile latency: 0.4756721451878548 seconds
99% percentile latency: 0.5509468417987229 seconds

Weight Activation Ordering

Avg latency: 0.4507347485671441 seconds
10% percentile latency: 0.4456333613023162 seconds
25% percentile latency: 0.446005588863045 seconds
50% percentile latency: 0.44688841979950666 seconds
75% percentile latency: 0.44848238583654165 seconds
90% percentile latency: 0.4493972914293408 seconds
99% percentile latency: 0.5238330581225455 seconds

kylesayrs self-assigned this Sep 1, 2024
kylesayrs marked this pull request as ready for review September 1, 2024 21:48
kylesayrs requested review from Satrat and horheynm September 1, 2024 21:48
Satrat (Contributor) left a comment:

LGTM, but seeing test failures that should be resolved before merging

kylesayrs requested review from Satrat and horheynm September 3, 2024 19:34
Satrat previously approved these changes Sep 3, 2024
horheynm previously approved these changes Sep 3, 2024
kylesayrs dismissed stale reviews from horheynm and Satrat via 00f2fa0 September 3, 2024 20:43
kylesayrs requested review from Satrat and horheynm September 3, 2024 21:11
kylesayrs merged commit 8c32b1a into main Sep 3, 2024
7 checks passed
kylesayrs deleted the kylesayrs/actorder_static_group branch September 3, 2024 21:19
markmc pushed a commit ("Update README.md") to markmc/llm-compressor that referenced this pull request on Nov 13, 2024