[Kernel] Dynamic Per-Token Activation Quantization #5037
Conversation
…for static W8A8 per tensor (#195)

- Depending on how we end up parsing `ignore` and `targets` (layer_name vs layer_type), we may not need layer_name to be added to the linear_method. Will experiment using a compressed-tensors function in a follow-up PR.
- Initial implementation for Compressed Config support + Activation Quantization for static per-tensor W8A8.
- Includes fused kernels added by @varun-sundar-rabindranath.

```python
from vllm import LLM, SamplingParams
import torch

prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The US president is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.80, top_p=0.95)
llm = LLM(model="nm-testing/tinyllama-one-shot-static-quant-test",
          enforce_eager=True,
          dtype=torch.float32,
          quantization="sparseml")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

- Verification of the different inputs expected for `targets` and `ignore` --> use functions to parse the layer names which can be shared by sparseml and vllm; would live in compressed-tensors (https://github.com/neuralmagic/compressed-tensors/blob/67005d76107d4659787f1efd53fe7e6b1d192818/src/compressed_tensors/quantization/lifecycle/apply.py#L86)
- Updates to further optimize fake quant

---------

Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
vllm CI fixes --------- Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
lazy cutlass_gemm_dq import --------- Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
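On the `ignore`/`targets` question from the first commit above (layer_name vs layer_type): the two interpretations could be resolved by a small matcher along the lines sketched below. This is a made-up illustration of the idea, not the actual compressed-tensors or vLLM helper, and every name in it is hypothetical.

```python
from typing import List, Optional

import torch.nn as nn


def match_target(layer_name: str, module: nn.Module,
                 targets: List[str]) -> Optional[str]:
    """Hypothetical matcher: try the layer-name interpretation first, then layer type."""
    for target in targets:
        # layer_name interpretation, e.g. "lm_head" or "model.layers.0.mlp.down_proj"
        if layer_name == target or layer_name.endswith("." + target):
            return target
    for target in targets:
        # layer_type interpretation, e.g. "Linear"
        if module.__class__.__name__ == target:
            return target
    return None


# e.g. match_target("model.layers.0.self_attn.q_proj", nn.Linear(16, 16), ["Linear"]) -> "Linear"
```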
vllm/_custom_ops.py
```python
# Dynamic-per-token quantization.
input_scales = torch.empty((input.numel() // input.shape[-1], 1),
                           dtype=torch.float32,
                           device="cuda")
```
to keep consistent with fp8, maybe this should be input.device
Fixed it 👍
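For reference, the values this kernel is expected to fill in can be expressed in a few lines of PyTorch. This is only an illustrative reference of the assumed semantics (one scale per token, taken from the row's absolute maximum), not the CUDA implementation under review:

```python
import torch


def per_token_int8_quant_ref(x: torch.Tensor):
    # Flatten all leading dims: one scale per token, matching the
    # (input.numel() // input.shape[-1], 1) allocation quoted above.
    x_2d = x.reshape(-1, x.shape[-1]).float()
    scales = x_2d.abs().amax(dim=-1, keepdim=True).clamp_min(1e-10) / 127.0
    q = torch.clamp(torch.round(x_2d / scales), -128, 127).to(torch.int8)
    return q.reshape(x.shape), scales
```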
""" | ||
q = torch.empty_like(input, dtype=torch.int8) | ||
vllm_ops.static_scaled_int8_quant(q, input, scale) | ||
return q | ||
if scale is not None: |
Can we make the names of the variables used internally in this function match the `scaled_fp8_quant` function?
Renamed `q` to `output`. I believe the other variables are good as they are. Please take a look. Thanks.
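For context, here is a rough sketch of how the Python wrapper could look after the rename, mirroring the shape of `scaled_fp8_quant`. The op names and tensor shapes are taken from the snippets quoted in this thread; the import path and exact signature are assumptions, not the merged code:

```python
from typing import Optional, Tuple

import torch

from vllm._C import ops as vllm_ops  # assumed import; `vllm_ops` as used in the snippets above


def scaled_int8_quant(
        input: torch.Tensor,
        scale: Optional[torch.Tensor] = None) -> Tuple[torch.Tensor, torch.Tensor]:
    output = torch.empty_like(input, dtype=torch.int8)
    if scale is not None:
        # Static per-tensor quantization: the caller supplies the scale.
        vllm_ops.static_scaled_int8_quant(output, input, scale)
        return output, scale
    # Dynamic per-token quantization: one fp32 scale per token, filled in by the kernel.
    input_scales = torch.empty((input.numel() // input.shape[-1], 1),
                               dtype=torch.float32,
                               device=input.device)
    vllm_ops.dynamic_scaled_int8_quant(output, input, input_scales)
    return output, input_scales
```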
LGTM.
csrc/ops.h
```cpp
void dynamic_scaled_int8_quant(torch::Tensor& out, torch::Tensor& input,
                               torch::Tensor& scales);
```
Suggested change:

```diff
-void dynamic_scaled_int8_quant(torch::Tensor& out, torch::Tensor& input,
-                               torch::Tensor& scales);
+void dynamic_scaled_int8_quant(torch::Tensor& out, torch::Tensor const& input,
+                               torch::Tensor& scales);
```
```cpp
const int tid = threadIdx.x;
const int token_idx = blockIdx.x;

float amax_val = 0.0f;
```
nit: would it be more readable as `absmax_val`?
Yes, `amax` is confusing.
```cpp
const float zero = 0.0f;

for (int i = tid; i < hidden_size; i += blockDim.x) {
  float val = (float)input[token_idx * hidden_size + i];
```
nit: It's best to use `static_cast` instead of C-style casts when possible, since they are checked by the compiler.
Suggested change:

```diff
-float val = (float)input[token_idx * hidden_size + i];
+float val = static_cast<float>(input[token_idx * hidden_size + i]);
```
```cpp
out[token_idx * hidden_size + i] = float_to_int8_rn(
    ((float)input[token_idx * hidden_size + i]) * tmp_scale);
```
Suggested change:

```diff
-out[token_idx * hidden_size + i] = float_to_int8_rn(
-    ((float)input[token_idx * hidden_size + i]) * tmp_scale);
+out[token_idx * hidden_size + i] = float_to_int8_rn(
+    (static_cast<float>(input[token_idx * hidden_size + i]) * tmp_scale));
```
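Putting the quoted fragments together, the kernel under review is roughly shaped like the sketch below. This is illustrative only: `blockReduceMax` is assumed to be a block-wide max reduction (e.g. from `reduction_utils.cuh`) and `float_to_int8_rn` the saturating round-to-nearest conversion used above; a real implementation would also guard against an all-zero row.

```cpp
template <typename scalar_t>
__global__ void dynamic_scaled_int8_quant_kernel(
    const scalar_t* __restrict__ input, int8_t* __restrict__ out,
    float* __restrict__ scales, const int hidden_size) {
  const int tid = threadIdx.x;
  const int token_idx = blockIdx.x;

  // 1) Each thread scans part of this token's row for the absolute maximum.
  float absmax_val = 0.0f;
  for (int i = tid; i < hidden_size; i += blockDim.x) {
    const float val = static_cast<float>(input[token_idx * hidden_size + i]);
    absmax_val = fmaxf(absmax_val, fabsf(val));
  }

  // 2) Reduce to a single per-token absmax and publish the per-token scale.
  __shared__ float block_absmax_val;
  const float block_absmax_val_maybe = blockReduceMax(absmax_val);
  if (tid == 0) {
    block_absmax_val = block_absmax_val_maybe;
    scales[token_idx] = block_absmax_val / 127.0f;
  }
  __syncthreads();

  // 3) Quantize the row with the inverse of that scale.
  const float tmp_scale = 127.0f / block_absmax_val;
  for (int i = tid; i < hidden_size; i += blockDim.x) {
    out[token_idx * hidden_size + i] = float_to_int8_rn(
        static_cast<float>(input[token_idx * hidden_size + i]) * tmp_scale);
  }
}
```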
```cpp
// Helper function to return the next largest power of 2
static constexpr int _nextPow2(unsigned int num) {
  if (num <= 1) return num;
  return 1 << (CHAR_BIT * sizeof(num) - __builtin_clz(num - 1));
}
```
Is there a common place we can put CUDA utils like this? We have the exact same helper fn in csrc/quantization/cutlass_w8a8/scaled_mm_dq_c3x.cu
I did some sleuthing, but can't find a good place to put it. Should we create a `math_utils.cuh` file? @robertgshaw2-neuralmagic @mgoin
We definitely need another refactoring for `csrc/quantization`... but I don't have an out-of-the-box solution for this ATM.
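For what it's worth, the helper is small enough to sanity-check standalone (GCC/Clang only, because of `__builtin_clz`); the values it produces are what you'd use to pick a power-of-two thread count for the block reductions:

```cpp
#include <climits>
#include <cstdio>

// Same helper as quoted above: smallest power of two >= num.
static constexpr int _nextPow2(unsigned int num) {
  if (num <= 1) return num;
  return 1 << (CHAR_BIT * sizeof(num) - __builtin_clz(num - 1));
}

int main() {
  std::printf("%d %d %d\n", _nextPow2(1), _nextPow2(1000), _nextPow2(1024));  // 1 1024 1024
  return 0;
}
```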
```diff
@@ -10,21 +10,52 @@
 SCALE = [0.1, 0.5, 0.8, 1.2, 2.1]


 @pytest.mark.parametrize("num_tokens", NUM_TOKENS)
 @pytest.mark.parametrize("hidden_size", HIDDEN_SIZES)
```
Should we add a larger hidden size (> 1024) that's not a nice number as well? I see 5120, but it is a multiple of 256.
Added hidden sizes 5137 and 8193.
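A sketch of what the parametrized test could look like with the new sizes included. The wrapper import, the exact size lists, and the tolerances here are assumptions for illustration, not the test as merged:

```python
import pytest
import torch

from vllm._custom_ops import scaled_int8_quant  # assumed wrapper location

NUM_TOKENS = [1, 7, 83, 4096]
HIDDEN_SIZES = [16, 67, 768, 2048, 5120, 5137, 8193]  # includes sizes that are not multiples of 256


@pytest.mark.parametrize("num_tokens", NUM_TOKENS)
@pytest.mark.parametrize("hidden_size", HIDDEN_SIZES)
@torch.inference_mode()
def test_dynamic_scaled_int8_quant(num_tokens: int, hidden_size: int) -> None:
    x = torch.rand(num_tokens, hidden_size, dtype=torch.float16, device="cuda") * 1000 - 300
    # Reference: one scale per token from the row-wise absmax.
    ref_scales = x.float().abs().amax(dim=-1, keepdim=True) / 127.0
    ref_out = (x.float() / ref_scales).round().clamp(-128, 127).to(torch.int8)
    out, scales = scaled_int8_quant(x)
    torch.testing.assert_close(scales, ref_scales)
    # Rounding-mode differences mean the int8 outputs only match to within 1.
    torch.testing.assert_close(out, ref_out, atol=1, rtol=0.0)
```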
LGTM
Summary

- Adds a new `CompressedTensorsScheme`, `CompressedTensorsW8A8DynamicToken`, to support `w8a8` models with dynamic per-token activation quantization. This scheme adds support for `w8a8` dynamic per-token models quantized through sparseml and saved through compressed-tensors.
- Updates `CompressedTensorsConfig` to use `QuantizationArgs`, `QuantizationStrategy`, and `find_first_name_or_class_match` to help match the appropriate scheme to each layer.
- Adds `dynamic_int8_quant_kernel`, a CUDA kernel that performs int8 dynamic-per-token quantization; updates `reduction_utils.cuh` in support of it.

From Neural Magic, co-authored by @varun-sundar-rabindranath
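To make the numerics of the scheme concrete, here is a plain-PyTorch illustration of W8A8 with dynamic per-token activation scales and per-output-channel weight scales. It sketches the math only, under the assumptions stated in the comments, and is not the fused CUTLASS path this PR actually dispatches to:

```python
import torch


def w8a8_dynamic_per_token_matmul(x: torch.Tensor, w_q: torch.Tensor,
                                  w_scale: torch.Tensor) -> torch.Tensor:
    # Dynamic per-token activation quantization: one scale per row of x.
    x_scale = x.abs().amax(dim=-1, keepdim=True).clamp_min(1e-10) / 127.0
    x_q = torch.clamp(torch.round(x / x_scale), -128, 127).to(torch.int8)
    # int8 x int8 matmul with int32 accumulation, then dequantize with both scales.
    acc = x_q.to(torch.int32) @ w_q.to(torch.int32).t()
    return acc.to(torch.float32) * x_scale * w_scale


# Toy usage: quantize a weight matrix per output channel, then run the "layer".
x = torch.randn(4, 64)
w = torch.randn(128, 64)
w_scale_rows = w.abs().amax(dim=-1, keepdim=True) / 127.0            # shape (128, 1)
w_q = torch.clamp(torch.round(w / w_scale_rows), -128, 127).to(torch.int8)
y = w8a8_dynamic_per_token_matmul(x, w_q, w_scale_rows.t())           # w_scale shape (1, 128)
print(y.shape)  # torch.Size([4, 128]), approximates x @ w.t()
```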