moe quantization support int8 and fp8 #702
base: main_perf
Conversation
tensor = tensor * scale
tensor = tensor.round_()
tensor.clamp_(-max_repr_val, max_repr_val)
tensor_quantized = tensor.to(torch.int8)
So this is returning the quantized tensor as int8, but the dtype can be fp8 as well, right?
You are right... I've fixed it now. Apparently the torch test still passed because both paths were using the same quantized input.
Also, thanks a lot for spotting it! Otherwise my code could have been badly wrong.
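For reference, a minimal sketch of what a dtype-aware version of the helper could look like (the name quantize_tensor and its arguments are illustrative, not necessarily the code in this PR):

import torch

def quantize_tensor(tensor: torch.Tensor, scale: torch.Tensor,
                    max_repr_val: float, quant_dtype: torch.dtype) -> torch.Tensor:
    # Scale into the representable range of the target format.
    tensor = tensor * scale
    # Rounding only makes sense for integer targets; fp8 keeps the scaled value.
    if quant_dtype == torch.int8:
        tensor = tensor.round_()
    tensor = tensor.clamp_(-max_repr_val, max_repr_val)
    # Cast to int8 or an fp8 dtype (e.g. torch.float8_e4m3fn / torch.float8_e4m3fnuz).
    return tensor.to(quant_dtype)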
LGTM
@zhanglx13 would it be possible to review this? A customer is asking for it...
max_vals[max_vals == 0] = 1e-8

# Compute scale factors for each channel
scale: torch.Tensor = max_repr_val / max_vals.to(torch.float32)
So for a tensor of shape M x K, what is the shape of the scale?
The tensor shapes:

a = torch.randn((M, K), dtype=dtype, device='cuda')
b = torch.randn((E, N, K), dtype=dtype, device='cuda')

In the case of fp8_w8a8:
- a_descale is a scalar
- b_descale is (E,), per expert

In the case of use_int8_w8a16:
- b_descale is (E, N), per expert and per N
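As a rough sketch of how those shapes arise (illustrative sizes and variable names, not the exact code in this PR):

import torch

E, N, K, M = 8, 512, 256, 64
a = torch.randn((M, K), dtype=torch.float16)
b = torch.randn((E, N, K), dtype=torch.float16)

max_repr_val = 127.0  # int8; for fp8, use that format's max representable value

# Per-tensor (scalar) scale for the activations, as in the fp8_w8a8 case.
a_max = a.abs().max().to(torch.float32).clamp_(min=1e-8)
a_scale = max_repr_val / a_max                   # scalar

# Per-expert, per-output-channel scale for the weights, as in the use_int8_w8a16
# case: reduce over K so each of the E x N output channels gets its own scale.
b_max = b.abs().amax(dim=-1).to(torch.float32)   # shape (E, N)
b_max[b_max == 0] = 1e-8
b_scale = max_repr_val / b_max                   # shape (E, N)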
@vgokhale @Chi-Chu319 How long does it take to run these unit tests? Upstream vLLM usually asks for shorter unit tests because of CI. Would it be possible to get a set of parameterizations that can be run in upstream vLLM but won't take a long time?
@rasmith I don't know. How long can we take in the CI? This is just the regular MoE kernel adapted to support int8. You can use the same set of UTs as the current vLLM MoE kernel.
30 seconds according to upstream |
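If 30 seconds is the budget, one option is to trim the parameter sweep for CI. The sketch below is hypothetical (the test name, parameters, and values are placeholders) and only illustrates what a reduced set could look like:

import pytest

# A trimmed-down sweep intended to finish well within a ~30-second CI budget.
@pytest.mark.parametrize("m, n, k, e, topk", [
    (1, 128, 256, 8, 2),
    (33, 256, 128, 4, 2),
])
@pytest.mark.parametrize("quant_mode", ["fp8_w8a8", "int8_w8a16"])
def test_fused_moe_quantized(m, n, k, e, topk, quant_mode):
    ...  # run the quantized fused MoE kernel and compare against a torch reference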
MoE int8 and fp8 quantization support
FP8_W8A8 benchmark and model results: [figures in the original PR]
INT8_W8A16 benchmark and model results: [figures in the original PR]
Baseline kernel and model performance: [figures in the original PR]
- I am not making a trivial change, such as fixing a typo in a comment.
- I have written a PR description following these rules.
- I have run pre-commit run --from-ref origin/main --to-ref HEAD.
- Select one of the following:
  - I have added tests:
    - /test for lit tests
    - /unittest for C++ tests
    - /python/test for end-to-end tests
  - This PR does not need a test because FILL THIS IN.
- Select one of the following:
  - I have not added any lit tests.
  - The lit tests I have added follow these best practices, including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)