Add KV-Cache int8 quant support #10354
base: main
Conversation
Signed-off-by: Yanyun Duan <duanyanyun@inspur.com>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
This pull request has merge conflicts that must be resolved before it can be merged.
Would it be viable to hasten the review process?
This pull request has merge conflicts that must be resolved before it can be merged.
@YanyunDuanIEI Hello, is the KV cache int8 quantization in this PR online or offline? The code requires calibration datasets such as 'c4'. Could the model conversion steps be documented, as was done in PR #1507?
It is offline; the demo is located in the `examples/int8` directory.
Thank you for your answer.
@YanyunDuanIEI Hello, may I ask if there is a download path for the calibration set files "ceval_val_cmcc.jsonl" and "mapping.json" used for "ceval_val_cmcc" and "ceval"?
Hi, if you are still interested in getting this in, please fix the merge conflict, thank you!
Most of the datasets are from LLaMA-Factory.
@YanyunDuanIEI This doesn't seem to support models from the Qwen2 series. Does it?
examples/int8/calib_dataloader.py
This examples directory has a lot of lines, especially due to the scales in the work_dir. If you want to keep this example, please try to:
- Rename dir to int8_kv_cache
- Write a README describing how to use
- Cleanup/consolidate these scripts if possible
- Possibly remove the work_dir? I think it is reasonable to keep one set of scales as demonstration, but I don't see a reason to keep so many
I think once this support lands, we can easily update llmcompressor with examples to produce calibrated int8 kv cache scales, similar to what we have for FP8 now: https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_kv_cache
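For reference, a rough sketch of what such an llmcompressor example might look like, assuming the recipe format of the linked FP8 example carries over and that an `int` `kv_cache_scheme` type would be accepted (both are assumptions, not documented behavior):

```python
# Hypothetical sketch only: calibrate int8 KV cache scales with llm-compressor,
# mirroring the linked FP8 KV cache example. `type: int` and the exact oneshot
# keyword arguments are assumptions, not documented behavior.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Recipe mirrors the FP8 kv_cache_scheme, with the type switched to int.
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            kv_cache_scheme:
                num_bits: 8
                type: int
                strategy: tensor
                dynamic: false
                symmetric: true
"""

oneshot(
    model=model,
    tokenizer=tokenizer,
    dataset="open_platypus",          # any registered calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir=MODEL_ID.split("/")[-1] + "-KV-Cache-INT8",
)
```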
```cpp
float k_scale = 0;
float v_scale = 0;
if constexpr (KV_DTYPE == Fp8KVCacheDataType::kInt8Group128) {
  int64_t tgt_kvs_idx = floor((kv_head_idx*HEAD_SIZE)/quant_group);
```
I think there is no need to keep `quant_group` as an argument since we have the KVCacheDataType as template parameter. We know that in `kInt8Group128` the `quant_group` will be 128, so I think we can remove this parameter completely.
```cpp
// printf("\n dequant scale= %f, zero_point= %f \n", scale, zero_point);
// if(abs(res+1.268555)<=0.01)
//   printf("\nI am here int8_to_float, x = %d, a= %d, res=%f, scale=%f, zero_point=%f \n",
//          x, a, res, scale, zero_point);
```
Leftover cruft
```cpp
// printf("\n quant scale= %f \n", scale);
// if(abs(x+1.268555)<=0.00001)
//   printf("\nI am here float_to_int8, x = %f, fx= %d, res=%d, scale=%f, zero_point=%f, (x-zero_point) / scale)=%f \n",
//          x, fx, res, scale, zero_point, (x-zero_point) / scale);
```
ditto
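As context for the two commented-out debug paths above, the conversions they trace amount to plain scale/zero-point int8 quantization; a minimal NumPy sketch of that round trip (standalone, not vLLM code):

```python
# Standalone sketch (not vLLM code) of the scale/zero-point round trip that the
# float_to_int8 / int8_to_float device helpers above appear to implement.
import numpy as np

def float_to_int8(x: np.ndarray, scale: float, zero_point: float) -> np.ndarray:
    """Quantize: q = round((x - zero_point) / scale), clamped to the int8 range."""
    q = np.rint((x - zero_point) / scale)
    return np.clip(q, -128, 127).astype(np.int8)

def int8_to_float(q: np.ndarray, scale: float, zero_point: float) -> np.ndarray:
    """Dequantize: x ~= q * scale + zero_point."""
    return q.astype(np.float32) * scale + zero_point

x = np.array([-1.27, 0.0, 0.64], dtype=np.float32)
scale, zero_point = 0.01, 0.0
roundtrip = int8_to_float(float_to_int8(x, scale, zero_point), scale, zero_point)
assert np.allclose(roundtrip, x, atol=scale)
```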
```cpp
template <typename Tout, typename Tin>
__inline__ __device__ Tout scaled_vec_conversion_int8(const Tin& x,
    const float scale, const float zero_point) {
  return x;
}
```
This does not seem right, what is the purpose of this definition?
```python
    k_scales: torch.Tensor,
    v_scales: torch.Tensor,
```
nit: we tend to just use `scale` rather than `scales`, even in the case of using tensors; see these kernels as an example:
vllm/vllm/model_executor/layers/quantization/utils/w8a8_utils.py, lines 204 to 212 in c2d1b07
```python
def apply_int8_linear(
    input: torch.Tensor,
    weight: torch.Tensor,
    weight_scale: torch.Tensor,
    input_scale: Optional[torch.Tensor] = None,
    input_zero_point: Optional[torch.Tensor] = None,
    azp_adj: Optional[torch.Tensor] = None,
    bias: Optional[torch.Tensor] = None,
):
```
```python
    k_scale=k_scale,
    v_scale=v_scale,
    quant_group,
    k_scales,
    v_scales,
```
Please use named assignment of args here
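For instance, a toy sketch of the requested keyword style (the function here is a hypothetical stand-in, not the actual attention op touched by this PR):

```python
# Toy sketch of the requested style: pass every new argument by keyword so the
# positional order cannot silently shift. The function is a hypothetical
# stand-in, not the actual attention op touched by this PR.
def attention_op_stub(*, k_scale, v_scale, quant_group, k_scales=None, v_scales=None):
    return k_scale, v_scale, quant_group, k_scales, v_scales

attention_op_stub(
    k_scale=1.0,
    v_scale=1.0,
    quant_group=128,
    k_scales=None,
    v_scales=None,
)
```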
```python
    k_scale=k_scale,
    v_scale=v_scale,
    quant_group,
    k_scales,
    v_scales,
```
Please use named assignment of args here
```python
k_scales_lists = v_scales_lists = [1.0]
# k_scales_lists = [0.16]
# v_scales_lists = [0.005]
self._k_scales = torch.Tensor(k_scales_lists).type(torch.float32).to("cuda")
self._v_scales = torch.Tensor(v_scales_lists).type(torch.float32).to("cuda")
self._quant_group = cache_config.kv_quant_group
if cache_config.cache_dtype.startswith("int8"):
    if cache_config.kv_quant_params_path is not None:
        k_scales_lists = cache_config.kv_quant_params[0].pop(0)
        v_scales_lists = cache_config.kv_quant_params[1].pop(0)
        self._k_scales = torch.Tensor(k_scales_lists).type(torch.float32).to("cuda")
        self._v_scales = torch.Tensor(v_scales_lists).type(torch.float32).to("cuda")
    if self._quant_group != 0:
        self._k_scales = self._k_scales.reshape((-1, num_kv_heads, head_size // self._quant_group))
        self._v_scales = self._v_scales.reshape((-1, num_kv_heads, head_size // self._quant_group))
```
Suggested change:

```diff
-k_scales_lists = v_scales_lists = [1.0]
-# k_scales_lists = [0.16]
-# v_scales_lists = [0.005]
-self._k_scales = torch.Tensor(k_scales_lists).type(torch.float32).to("cuda")
-self._v_scales = torch.Tensor(v_scales_lists).type(torch.float32).to("cuda")
-self._quant_group = cache_config.kv_quant_group
-if cache_config.cache_dtype.startswith("int8"):
-    if cache_config.kv_quant_params_path is not None:
-        k_scales_lists = cache_config.kv_quant_params[0].pop(0)
-        v_scales_lists = cache_config.kv_quant_params[1].pop(0)
-        self._k_scales = torch.Tensor(k_scales_lists).type(torch.float32).to("cuda")
-        self._v_scales = torch.Tensor(v_scales_lists).type(torch.float32).to("cuda")
-    if self._quant_group != 0:
-        self._k_scales = self._k_scales.reshape((-1, num_kv_heads, head_size // self._quant_group))
-        self._v_scales = self._v_scales.reshape((-1, num_kv_heads, head_size // self._quant_group))
+default_scale = [1.0]
+self._k_scales = torch.Tensor(default_scale).type(torch.float32).to("cuda")
+self._v_scales = torch.Tensor(default_scale).type(torch.float32).to("cuda")
+self._quant_group = cache_config.kv_quant_group
+if cache_config.cache_dtype.startswith("int8"):
+    if cache_config.kv_quant_params_path is not None:
+        k_scales_lists = cache_config.kv_quant_params[0].pop(0)
+        v_scales_lists = cache_config.kv_quant_params[1].pop(0)
+        self._k_scales = torch.Tensor(default_scale).type(torch.float32).to("cuda")
+        self._v_scales = torch.Tensor(default_scale).type(torch.float32).to("cuda")
+    if self._quant_group != 0:
+        self._k_scales = self._k_scales.reshape((-1, num_kv_heads, head_size // self._quant_group))
+        self._v_scales = self._v_scales.reshape((-1, num_kv_heads, head_size // self._quant_group))
```
```python
# v_scales_lists = [0.005]
self._k_scales = torch.Tensor(k_scales_lists).type(torch.float32).to("cuda")
self._v_scales = torch.Tensor(v_scales_lists).type(torch.float32).to("cuda")
self._quant_group = cache_config.kv_quant_group
```
We can deduce `kv_quant_group` from the `cache_config.cache_dtype`, as mentioned in the kernels.
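A minimal sketch of what that deduction could look like on the Python side; the dtype spellings (`int8`, `int8_group128`) and the helper name are assumptions for illustration:

```python
# Sketch: derive the quant group from the cache dtype string instead of passing
# a separate kv_quant_group. The dtype spellings below are assumptions.
def deduce_kv_quant_group(cache_dtype: str) -> int:
    if not cache_dtype.startswith("int8"):
        return 0                      # int8 KV cache quantization disabled
    if cache_dtype == "int8":
        return 0                      # layer_level scales, no per-group split
    return int(cache_dtype.rsplit("group", 1)[-1])   # e.g. "int8_group128" -> 128

assert deduce_kv_quant_group("auto") == 0
assert deduce_kv_quant_group("int8") == 0
assert deduce_kv_quant_group("int8_group128") == 128
```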
This pull request has merge conflicts that must be resolved before it can be merged.
@YanyunDuanIEI May I ask: on ROCm, when performing accuracy verification, the variable `quantization_param_path` needs to specify a file path. Is it the same as the variable `kv_quant_params_path`? Or can we specify the generated JSON files `kv_cache_scales_layer_level.json` and `kv_cache_scales_quant_group128.json` separately?
Add KV-Cache int8 quant support
Support `layer_level` and `group_level` KV-Cache int8 quant:
- `layer_level`: use common scale factors for each layer.
- `group_level`: group the `head_size` by `group_size`; within each group, the key/value elements share the same scaling factor (a minimal sketch of the two layouts is shown below).
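To make the two granularities concrete, here is a small illustrative sketch of the scale shapes and quantization under each mode; the max-abs calibration and exact shapes are assumptions for illustration, not the PR's exact code path:

```python
# Illustrative sketch of the two scale layouts (not the PR's exact code path).
import torch

num_kv_heads, head_size, group_size = 8, 128, 128
kv = torch.randn(1024, num_kv_heads, head_size)   # [tokens, kv_heads, head_size]

# layer_level: a single scale per layer, shared by every head and channel.
layer_scale = kv.abs().max() / 127.0
kv_int8_layer = torch.clamp(torch.round(kv / layer_scale), -128, 127).to(torch.int8)

# group_level: head_size is split into head_size // group_size groups, and all
# elements of a group (per kv head) share one scale, matching the reshape
# (-1, num_kv_heads, head_size // quant_group) used in the PR.
groups = head_size // group_size
kv_grouped = kv.reshape(-1, num_kv_heads, groups, group_size)
group_scale = kv_grouped.abs().amax(dim=(0, 3)) / 127.0     # [num_kv_heads, groups]
kv_int8_group = torch.clamp(
    torch.round(kv_grouped / group_scale[None, :, :, None]), -128, 127
).to(torch.int8).reshape_as(kv)
```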
KV-Cache int8 quant

Get the scaling factors by calibration

Calibrating the KV cache on a dataset is supported:
- `examples/int8/calibrate.py`: calibrate and save the results to a .pth file.
- `export_kv_params.py`: export the scaling factors to JSON.

Using KV-Cache int8
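The usage details are truncated in this view; as a rough, hypothetical sketch based on the cache_config fields referenced in this PR (`cache_dtype`, `kv_quant_params_path`, `kv_quant_group`), offline inference might be launched like this (the exact argument names exposed by `vllm.LLM` may differ):

```python
# Hypothetical usage sketch based on the cache_config fields referenced in this
# PR (cache_dtype, kv_quant_params_path, kv_quant_group); the exact argument
# names exposed by vllm.LLM may differ from the final implementation.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",        # any supported model
    kv_cache_dtype="int8",            # assumed int8 KV cache dtype string
    kv_quant_params_path="work_dir/kv_cache_scales_quant_group128.json",
    kv_quant_group=128,               # group_level quantization
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```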