Add KV-Cache int8 quant support #10354
base: main
Conversation
Signed-off-by: Yanyun Duan <duanyanyun@inspur.com>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
This pull request has merge conflicts that must be resolved before it can be merged.
Would it be viable to hasten the review process?
This pull request has merge conflicts that must be resolved before it can be merged.
@YanyunDuanIEI Hello, is the KV cache int8 quantization in this PR online or offline? The code requires calibration datasets such as 'c4'. Could the model conversion steps be documented, as was done in PR #1507?
It is offline; the demo is located in the `examples/int8` directory.
Thank you for your answer.
@YanyunDuanIEI Hello, may I ask if there is a download path for the calibration set files "ceval_val_cmcc.jsonl" and "mapping.json" used for "ceval_val_cmcc" and "ceval"?
Hi, if you are still interested in getting this in, please fix the merge conflict, thank you!
Most of the datasets are from LLaMA-Factory.
@YanyunDuanIEI This doesn't seem to support models from the Qwen2 series. Does it?
examples/int8/calib_dataloader.py
This examples directory has a lot of lines, especially due to the scales in the work_dir. If you want to keep this example, please try to:
- Rename dir to int8_kv_cache
- Write a README describing how to use
- Cleanup/consolidate these scripts if possible
- Possibly remove the work_dir? I think it is reasonable to keep one set of scales as demonstration, but I don't see a reason to keep so many
I think once this support lands, we can easily update llmcompressor with examples to produce calibrated int8 kv cache scales, similar to what we have for FP8 now: https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_kv_cache
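For reference, a rough sketch of what such an llmcompressor example might look like, assuming the recipe format of the linked FP8 example carries over and that an `int` `kv_cache_scheme` type would be accepted (both are assumptions, not documented behavior):

```python
# Hypothetical sketch only: calibrate int8 KV cache scales with llm-compressor,
# mirroring the linked FP8 KV cache example. `type: int` and the exact oneshot
# keyword arguments are assumptions, not documented behavior.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Recipe mirrors the FP8 kv_cache_scheme, with the type switched to int.
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            kv_cache_scheme:
                num_bits: 8
                type: int
                strategy: tensor
                dynamic: false
                symmetric: true
"""

oneshot(
    model=model,
    tokenizer=tokenizer,
    dataset="open_platypus",          # any registered calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir=MODEL_ID.split("/")[-1] + "-KV-Cache-INT8",
)
```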
```cpp
float k_scale = 0;
float v_scale = 0;
if constexpr (KV_DTYPE == Fp8KVCacheDataType::kInt8Group128) {
  int64_t tgt_kvs_idx = floor((kv_head_idx*HEAD_SIZE)/quant_group);
```
I think there is no need to keep `quant_group` as an argument since we have the KVCacheDataType as template parameter. We know that in `kInt8Group128` the `quant_group` will be 128, so I think we can remove this parameter completely.
```cpp
// printf("\n dequant scale= %f, zero_point= %f \n", scale, zero_point);
// if(abs(res+1.268555)<=0.01)
//   printf("\nI am here int8_to_float, x = %d, a= %d, res=%f, scale=%f, zero_point=%f \n",
//          x, a, res, scale, zero_point);
```
Leftover cruft
```cpp
// printf("\n quant scale= %f \n", scale);
// if(abs(x+1.268555)<=0.00001)
//   printf("\nI am here float_to_int8, x = %f, fx= %d, res=%d, scale=%f, zero_point=%f, (x-zero_point) / scale)=%f \n",
//          x, fx, res, scale, zero_point, (x-zero_point) / scale);
```
ditto
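As context for the two commented-out debug paths above, the conversions they trace amount to plain scale/zero-point int8 quantization; a minimal NumPy sketch of that round trip (standalone, not vLLM code):

```python
# Standalone sketch (not vLLM code) of the scale/zero-point round trip that the
# float_to_int8 / int8_to_float device helpers above appear to implement.
import numpy as np

def float_to_int8(x: np.ndarray, scale: float, zero_point: float) -> np.ndarray:
    """Quantize: q = round((x - zero_point) / scale), clamped to the int8 range."""
    q = np.rint((x - zero_point) / scale)
    return np.clip(q, -128, 127).astype(np.int8)

def int8_to_float(q: np.ndarray, scale: float, zero_point: float) -> np.ndarray:
    """Dequantize: x ~= q * scale + zero_point."""
    return q.astype(np.float32) * scale + zero_point

x = np.array([-1.27, 0.0, 0.64], dtype=np.float32)
scale, zero_point = 0.01, 0.0
roundtrip = int8_to_float(float_to_int8(x, scale, zero_point), scale, zero_point)
assert np.allclose(roundtrip, x, atol=scale)
```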
```cpp
template <typename Tout, typename Tin>
__inline__ __device__ Tout scaled_vec_conversion_int8(const Tin& x,
    const float scale, const float zero_point) {
  return x;
}
```
This does not seem right, what is the purpose of this definition?
```python
    k_scales: torch.Tensor,
    v_scales: torch.Tensor,
```
nit: we tend to just use `scale` rather than `scales`, even in the case of using tensors; see these kernels as an example:
vllm/vllm/model_executor/layers/quantization/utils/w8a8_utils.py, lines 204 to 212 in c2d1b07
```python
def apply_int8_linear(
    input: torch.Tensor,
    weight: torch.Tensor,
    weight_scale: torch.Tensor,
    input_scale: Optional[torch.Tensor] = None,
    input_zero_point: Optional[torch.Tensor] = None,
    azp_adj: Optional[torch.Tensor] = None,
    bias: Optional[torch.Tensor] = None,
):
```
```python
    k_scale=k_scale,
    v_scale=v_scale,
    quant_group,
    k_scales,
    v_scales,
```
Please use named assignment of args here
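For instance, a toy sketch of the requested keyword style (the function here is a hypothetical stand-in, not the actual attention op touched by this PR):

```python
# Toy sketch of the requested style: pass every new argument by keyword so the
# positional order cannot silently shift. The function is a hypothetical
# stand-in, not the actual attention op touched by this PR.
def attention_op_stub(*, k_scale, v_scale, quant_group, k_scales=None, v_scales=None):
    return k_scale, v_scale, quant_group, k_scales, v_scales

attention_op_stub(
    k_scale=1.0,
    v_scale=1.0,
    quant_group=128,
    k_scales=None,
    v_scales=None,
)
```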
```python
    k_scale=k_scale,
    v_scale=v_scale,
    quant_group,
    k_scales,
    v_scales,
```
Please use named assignment of args here
```python
k_scales_lists = v_scales_lists = [1.0]
# k_scales_lists = [0.16]
# v_scales_lists = [0.005]
self._k_scales = torch.Tensor(k_scales_lists).type(torch.float32).to("cuda")
self._v_scales = torch.Tensor(v_scales_lists).type(torch.float32).to("cuda")
self._quant_group = cache_config.kv_quant_group
if cache_config.cache_dtype.startswith("int8"):
    if cache_config.kv_quant_params_path is not None:
        k_scales_lists = cache_config.kv_quant_params[0].pop(0)
        v_scales_lists = cache_config.kv_quant_params[1].pop(0)
        self._k_scales = torch.Tensor(k_scales_lists).type(torch.float32).to("cuda")
        self._v_scales = torch.Tensor(v_scales_lists).type(torch.float32).to("cuda")
    if self._quant_group != 0:
        self._k_scales = self._k_scales.reshape((-1, num_kv_heads, head_size // self._quant_group))
        self._v_scales = self._v_scales.reshape((-1, num_kv_heads, head_size // self._quant_group))
```
Suggested change:

```diff
-k_scales_lists = v_scales_lists = [1.0]
-# k_scales_lists = [0.16]
-# v_scales_lists = [0.005]
-self._k_scales = torch.Tensor(k_scales_lists).type(torch.float32).to("cuda")
-self._v_scales = torch.Tensor(v_scales_lists).type(torch.float32).to("cuda")
-self._quant_group = cache_config.kv_quant_group
-if cache_config.cache_dtype.startswith("int8"):
-    if cache_config.kv_quant_params_path is not None:
-        k_scales_lists = cache_config.kv_quant_params[0].pop(0)
-        v_scales_lists = cache_config.kv_quant_params[1].pop(0)
-        self._k_scales = torch.Tensor(k_scales_lists).type(torch.float32).to("cuda")
-        self._v_scales = torch.Tensor(v_scales_lists).type(torch.float32).to("cuda")
-    if self._quant_group != 0:
-        self._k_scales = self._k_scales.reshape((-1, num_kv_heads, head_size // self._quant_group))
-        self._v_scales = self._v_scales.reshape((-1, num_kv_heads, head_size // self._quant_group))
+default_scale = [1.0]
+self._k_scales = torch.Tensor(default_scale).type(torch.float32).to("cuda")
+self._v_scales = torch.Tensor(default_scale).type(torch.float32).to("cuda")
+self._quant_group = cache_config.kv_quant_group
+if cache_config.cache_dtype.startswith("int8"):
+    if cache_config.kv_quant_params_path is not None:
+        k_scales_lists = cache_config.kv_quant_params[0].pop(0)
+        v_scales_lists = cache_config.kv_quant_params[1].pop(0)
+        self._k_scales = torch.Tensor(default_scale).type(torch.float32).to("cuda")
+        self._v_scales = torch.Tensor(default_scale).type(torch.float32).to("cuda")
+    if self._quant_group != 0:
+        self._k_scales = self._k_scales.reshape((-1, num_kv_heads, head_size // self._quant_group))
+        self._v_scales = self._v_scales.reshape((-1, num_kv_heads, head_size // self._quant_group))
```
```python
# v_scales_lists = [0.005]
self._k_scales = torch.Tensor(k_scales_lists).type(torch.float32).to("cuda")
self._v_scales = torch.Tensor(v_scales_lists).type(torch.float32).to("cuda")
self._quant_group = cache_config.kv_quant_group
```
We can deduce `kv_quant_group` from the `cache_config.cache_dtype`, as mentioned in the kernels.
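A minimal sketch of what that deduction could look like on the Python side; the dtype spellings (`int8`, `int8_group128`) and the helper name are assumptions for illustration:

```python
# Sketch: derive the quant group from the cache dtype string instead of passing
# a separate kv_quant_group. The dtype spellings below are assumptions.
def deduce_kv_quant_group(cache_dtype: str) -> int:
    if not cache_dtype.startswith("int8"):
        return 0                      # int8 KV cache quantization disabled
    if cache_dtype == "int8":
        return 0                      # layer_level scales, no per-group split
    return int(cache_dtype.rsplit("group", 1)[-1])   # e.g. "int8_group128" -> 128

assert deduce_kv_quant_group("auto") == 0
assert deduce_kv_quant_group("int8") == 0
assert deduce_kv_quant_group("int8_group128") == 128
```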
This pull request has merge conflicts that must be resolved before it can be merged.
@YanyunDuanIEI May I ask: on ROCm, when performing accuracy verification, the variable `quantization_param_path` needs to specify a file path. Is it the same as the variable `kv_quant_params_path`? Or can we specify the generated JSON files `kv_cache_scales_layer_level.json` and `kv_cache_scales_quant_group128.json` separately?
Add KV-Cache int8 quant support
Support `layer_level` and `group_level` KV-Cache int8 quant:
- `layer_level`: use common scale factors for each layer.
- `group_level`: group the `head_size` by `group_size`; within each group, the key/value elements share the same scaling factor (a minimal sketch of the two layouts is shown below).
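To make the two granularities concrete, here is a small illustrative sketch of the scale shapes and quantization under each mode; the max-abs calibration and exact shapes are assumptions for illustration, not the PR's exact code path:

```python
# Illustrative sketch of the two scale layouts (not the PR's exact code path).
import torch

num_kv_heads, head_size, group_size = 8, 128, 128
kv = torch.randn(1024, num_kv_heads, head_size)   # [tokens, kv_heads, head_size]

# layer_level: a single scale per layer, shared by every head and channel.
layer_scale = kv.abs().max() / 127.0
kv_int8_layer = torch.clamp(torch.round(kv / layer_scale), -128, 127).to(torch.int8)

# group_level: head_size is split into head_size // group_size groups, and all
# elements of a group (per kv head) share one scale, matching the reshape
# (-1, num_kv_heads, head_size // quant_group) used in the PR.
groups = head_size // group_size
kv_grouped = kv.reshape(-1, num_kv_heads, groups, group_size)
group_scale = kv_grouped.abs().amax(dim=(0, 3)) / 127.0     # [num_kv_heads, groups]
kv_int8_group = torch.clamp(
    torch.round(kv_grouped / group_scale[None, :, :, None]), -128, 127
).to(torch.int8).reshape_as(kv)
```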
KV-Cache int8 quant

Get the scaling factors by calibration

Calibrating the KV cache on a dataset is supported:
- `examples/int8/calibrate.py`: calibrate and save the results to a .pth file.
- `export_kv_params.py`: export the scaling factors to JSON.

Using KV-Cache int8
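The usage details are truncated in this view; as a rough, hypothetical sketch based on the cache_config fields referenced in this PR (`cache_dtype`, `kv_quant_params_path`, `kv_quant_group`), offline inference might be launched like this (the exact argument names exposed by `vllm.LLM` may differ):

```python
# Hypothetical usage sketch based on the cache_config fields referenced in this
# PR (cache_dtype, kv_quant_params_path, kv_quant_group); the exact argument
# names exposed by vllm.LLM may differ from the final implementation.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",        # any supported model
    kv_cache_dtype="int8",            # assumed int8 KV cache dtype string
    kv_quant_params_path="work_dir/kv_cache_scales_quant_group128.json",
    kv_quant_group=128,               # group_level quantization
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```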