[Kernel] Dynamic Per-Token Activation Quantization #5037

Merged Jun 7, 2024 (80 commits)
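For orientation: dynamic per-token activation quantization computes one int8 scale per token (per row of the activation matrix) at inference time, rather than using a single precomputed scale for the whole tensor. Below is a minimal PyTorch sketch of that idea; it is illustrative only, the helper name is made up, and the kernels added in this PR (see the quant_per_token and static_scaled_int8_quant commits) are CUDA ops, not this Python code.

    import torch

    def dynamic_per_token_int8_quant(x: torch.Tensor):
        # One scale per row: scale_i = max(|x_i|) / 127, computed on the fly.
        scales = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
        # Quantize each row with its own scale and clamp to the int8 range.
        x_q = torch.clamp(torch.round(x / scales), -128, 127).to(torch.int8)
        return x_q, scales

    # Example: 4 token activations with hidden size 8.
    x = torch.randn(4, 8)
    x_q, scales = dynamic_per_token_int8_quant(x)

Per-token scales keep a single outlier token from inflating the scale applied to every other token, at the cost of computing scales at runtime.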
Commits (changes below are shown from 1 commit)
4d27a2c
Initial `CompressedTensors` config + Activation Quantization support …
dsikka Apr 30, 2024
92b3703
add get_quant method to compressed tensors config
dsikka Apr 30, 2024
2a3eb83
small rebase fixed
dsikka Apr 30, 2024
3dd1fe8
format
dsikka Apr 30, 2024
f2f8c52
fix mypy complaints
Apr 30, 2024
c9308eb
Merge branch 'main' into ds-quant
dsikka Apr 30, 2024
d9d49b5
format fixes
dsikka Apr 30, 2024
b111ee6
Merge branch 'main' into ds-quant
dsikka May 1, 2024
c31a7af
format fix post rebase
dsikka May 1, 2024
ca01b39
lazy import CompressedTensorsW8A8StaticTensor (#220)
varun-sundar-rabindranath May 1, 2024
f0197d4
lazy cutlass_gemm_dq import (#221)
varun-sundar-rabindranath May 1, 2024
4624b46
fix asm
May 1, 2024
75757d5
update shape change
dsikka May 2, 2024
e1df0eb
add todo
dsikka May 2, 2024
bc0991c
Rename quant_per_tensor -> static_scaled_int8_quant
May 2, 2024
74ad650
Remove cruft
May 2, 2024
43c43f3
Merge branch 'main' into ds-quant
dsikka May 14, 2024
cf5600f
fixes : typo
May 14, 2024
169ce7f
py-cutlass temporary hack for num_prompts==1
May 15, 2024
03b53e7
yapf
May 15, 2024
f9df31b
add test_int8_quant
May 16, 2024
ba4b6b3
call cpp cutlass
May 17, 2024
3c223c6
Merge branch 'main' into ds-quant
dsikka May 17, 2024
b27f31a
remove cutlass py interface
May 17, 2024
b589cdd
format.sh
May 17, 2024
98159cf
remove fake-quant
May 17, 2024
8dbeb31
add compressed tensors test
dsikka May 17, 2024
5eeb40a
remove torch.int8
dsikka May 17, 2024
c55e023
format
dsikka May 17, 2024
f5cbbd3
fix config parsing to match new model
dsikka May 20, 2024
a685957
revert parsing to use default pathway
dsikka May 20, 2024
4dfb37f
PR comments
dsikka May 21, 2024
de81f9e
Fix scales/zero-points device allocation
May 21, 2024
15f1863
ruff
May 21, 2024
bd53847
add better comments
May 21, 2024
b2926f3
add comment
dsikka May 22, 2024
1274386
Merge branch 'main' into ds-quant
dsikka May 22, 2024
18640c8
clang format
dsikka May 22, 2024
5c5dc84
clang format again
dsikka May 22, 2024
a44b4a0
address PR comments
May 22, 2024
6f0e6e1
clang-format
May 22, 2024
0090454
remove layer name
dsikka May 23, 2024
4b10fd7
remove unused import
dsikka May 23, 2024
68a59c7
remove parent name
dsikka May 23, 2024
b0afe67
Fix rounding
May 22, 2024
4f4951e
comment
May 23, 2024
869de3f
cruft
May 23, 2024
e68e391
yapf
May 23, 2024
d77cf50
remove unquantized check
dsikka May 23, 2024
51a4e59
update parsing to use compressed-tensors; add dynamic per token parsi…
dsikka May 2, 2024
6777319
add dynamic quantization arg, fill out create_weights/apply
dsikka May 2, 2024
54c797a
Add quant_per_token kernels
May 2, 2024
6bcab22
make changes to config parsing based on sparseml updates; test dynami…
dsikka May 3, 2024
ece93e1
fix shape for cutlass issues
dsikka May 6, 2024
1d87a99
remove dicts; use quantization args directly
dsikka May 6, 2024
3dd1b5f
update compressed-tensors; add docstring
dsikka May 7, 2024
fed7cdd
Dyn per token varun cleanup (#227)
varun-sundar-rabindranath May 13, 2024
66719a9
add test_int8_quant
May 16, 2024
2ec6a2c
remove fake quant
dsikka May 24, 2024
0c7f870
Merge branch 'main' into dyn-per-token
dsikka May 24, 2024
34e2e12
format
dsikka May 24, 2024
e79517e
combine static and dynamic quant computation
May 24, 2024
39e66d1
TORCH_CHECK and nits
May 24, 2024
59f8ec1
use Union
May 24, 2024
7a83601
clang-format
May 24, 2024
9ea47c8
fix typo
May 24, 2024
7abb2c8
isort
May 24, 2024
eb4e119
update test case
dsikka May 28, 2024
d62930d
fix isort
dsikka May 28, 2024
80b6fac
store input scales in gpu
May 29, 2024
fa1ceef
Merge branch 'main' into dyn-per-token
dsikka Jun 5, 2024
7075318
tensor device location fixes
Jun 6, 2024
60a6d73
format.sh
Jun 6, 2024
f36519b
remove compressed tensors
dsikka Jun 6, 2024
2c6e580
format fix
dsikka Jun 6, 2024
b3d692a
add comments; some clean-up
dsikka Jun 6, 2024
f3bf9e3
review comments
Jun 6, 2024
2bd62e0
review comments and const correctness
Jun 6, 2024
460f514
format.sh
Jun 6, 2024
dfcd61a
nit fixes
dsikka Jun 7, 2024
Commit d77cf5044fa74d419d039497ef17d6c804197356 ("remove unquantized check"), committed by dsikka on May 23, 2024
tests/quantization/test_compressed_tensors.py (1 addition, 4 deletions)

@@ -6,8 +6,7 @@
 import torch
 
 from vllm.model_executor.layers.quantization.compressed_tensors.compressed_tensors import ( # noqa: E501
-    CompressedTensorsLinearMethod, CompressedTensorsUnquantized,
-    CompressedTensorsW8A8StaticTensor)
+    CompressedTensorsLinearMethod, CompressedTensorsW8A8StaticTensor)
 
 
 def test_compressed_tensors_w8a8_static_setup(vllm_runner):
@@ -27,12 +26,10 @@ def test_compressed_tensors_w8a8_static_setup(vllm_runner):
     assert isinstance(down_proj.quant_method, CompressedTensorsLinearMethod)
 
     assert isinstance(qkv_proj.scheme, CompressedTensorsW8A8StaticTensor)
-    assert isinstance(down_proj.scheme, CompressedTensorsUnquantized)
 
     assert qkv_proj.weight.dtype is torch.int8
     assert o_proj.weight.dtype is torch.int8
     assert gate_up_proj.weight.dtype is torch.int8
-    assert down_proj.weight.dtype is torch.float16
 
     assert qkv_proj.weight_scale.shard_splitter is not None
     assert qkv_proj.weight_scale.logical_widths is not None
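For contrast, the static per-tensor scheme that this test exercises (CompressedTensorsW8A8StaticTensor) quantizes activations with a single scale fixed ahead of time; the commit log above renames the corresponding kernel from quant_per_tensor to static_scaled_int8_quant. A rough Python-level sketch of that static path, assuming a precomputed scale (the real op is a CUDA kernel and this is not its actual signature):

    import torch

    def static_scaled_int8_quant_ref(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        # A single calibration-time scale is shared by every token.
        return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)

    x = torch.randn(4, 8)
    scale = torch.tensor(0.05)  # illustrative value; normally derived from calibration
    x_q = static_scaled_int8_quant_ref(x, scale)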