[Kernel] Initial Machete W4A8 support + Refactors #9855

LucasWilkinson · 2024-10-30T20:57:46Z

Add Machete kernel support for QQQ-style W4A8 quantization (including FP8 activations)
Refactor the Machete dispatching logic
Refactor Machete file generation
Reduce the number of prepack kernels generated (e.g. separate kernels for uint4b8 and uint4 are unnecessary, since prepacking only shuffles data around)

TODO (Future PR):

End-to-end integration
Perf improvements (mostly hoping to land this now for the refactor)
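
To illustrate the point about prepack kernels in the description above, here is a minimal sketch (not the actual Machete prepack layout). It assumes uint4b8 denotes a 4-bit unsigned code with a bias of 8; the permutation `PERM` and all helper names are hypothetical. A prepack step that only permutes 4-bit codes gives the same result regardless of how those codes are later decoded, so one kernel per bit width could suffice.

```python
# Hypothetical sketch: prepacking modeled as a pure permutation of 4-bit codes.
# PERM is an arbitrary interleave chosen for illustration, not Machete's layout.
PERM = [0, 2, 4, 6, 1, 3, 5, 7]


def unpack_nibbles(word: int) -> list[int]:
    """Split a 32-bit word into eight 4-bit codes, lowest nibble first."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


def pack_nibbles(codes: list[int]) -> int:
    """Inverse of unpack_nibbles."""
    word = 0
    for i, code in enumerate(codes):
        word |= (code & 0xF) << (4 * i)
    return word


def prepack(word: int) -> int:
    """Reorder the raw 4-bit codes; the values themselves are never decoded."""
    codes = unpack_nibbles(word)
    return pack_nibbles([codes[p] for p in PERM])


def decode_uint4(code: int) -> int:
    return code


def decode_uint4b8(code: int) -> int:
    # Assumption: uint4b8 stores value + 8 in an unsigned 4-bit code.
    return code - 8


def shuffle_commutes_with_decode(word: int, decode) -> bool:
    """Decoding after the shuffle equals shuffling the decoded values."""
    before = [decode(c) for c in unpack_nibbles(word)]
    after = [decode(c) for c in unpack_nibbles(prepack(word))]
    return after == [before[p] for p in PERM]
```

Because `prepack` never inspects the decoded value, the same routine serves both interpretations of the 4-bit codes.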

github-actions · 2024-10-30T20:57:57Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

ProExpertProg

2 minor comments, looks good otherwise

csrc/cutlass_extensions/epilogue/broadcast_load_epilogue_c2x.hpp

csrc/cutlass_extensions/vllm_numeric_conversion.cuh

mergify · 2024-11-06T07:13:48Z

This pull request has merge conflicts that must be resolved before it can be
merged. @LucasWilkinson please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

benchmarks/kernels/benchmark_machete.py

varun-sundar-rabindranath · 2024-11-12T22:23:08Z

benchmarks/kernels/weight_shapes.py

+        ([16384, 16384], 0),
+        ([16384, 106496], 1),
+        ([53248, 16384], 0),
+    ],


nit: for big models, I have found it useful to also have their realistic TPn counterparts (e.g. for the 70B case, add a 70B-TP4 case). That way we can just list that version in the 1-GPU model benchmarking.

You mean so you can list it as a string as opposed to using the --tp-sizes args?
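
As a rough sketch of what a "TPn counterpart" would look like, the snippet below derives per-GPU shapes from the entries in the diff above. It assumes each entry has the form `([K, N], tp_split_dim)`, where the second element names the dimension divided across tensor-parallel ranks; `shapes_for_tp` is a hypothetical helper, not a function in the benchmark scripts.

```python
# Shapes copied from the diff above; assumed format: ([K, N], tp_split_dim).
WEIGHT_SHAPES = [
    ([16384, 16384], 0),
    ([16384, 106496], 1),
    ([53248, 16384], 0),
]


def shapes_for_tp(shapes, tp_size: int) -> list[list[int]]:
    """Return the per-GPU [K, N] shapes for a given tensor-parallel size."""
    per_gpu = []
    for kn, split_dim in shapes:
        kn = list(kn)  # copy so the input table is left untouched
        if kn[split_dim] % tp_size != 0:
            raise ValueError(f"{kn} is not divisible by tp_size={tp_size}")
        kn[split_dim] //= tp_size
        per_gpu.append(kn)
    return per_gpu


if __name__ == "__main__":
    for shape in shapes_for_tp(WEIGHT_SHAPES, 4):
        print(shape)
```

A pre-computed "-TP4" entry in the shape table would hold exactly these derived shapes, which is what makes it listable by name in a single-GPU benchmark run.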

vllm/model_executor/layers/quantization/utils/quant_utils.py

vllm/_custom_ops.py

benchmarks/kernels/benchmark_machete.py

tlrmchlsmth · 2024-11-12T22:11:27Z

csrc/cutlass_extensions/vllm_numeric_conversion.cuh

+          r;
+      uint32_t src = src_[0];
+      // Determines if to get from the signed or unsigned candidates
+      uint32_t sign = (src & 0x88888888) >> 1;


What is the right shift for?

Updated the comment to be more verbose; it now reads:

// Determines if to get from the signed or unsigned candidates
// move into bit position 0x4 of each nibble so when or'd with
// final_prmt_base it selects the correct candidate, when elements
// in final_prmt_base are >= 0x4, the negative candidate is selected
// (i.e. from NEG_INT8_REG{1}{2}), when elements are < 0x4, the positive
// candidate is selected (i.e. from POS_INT8_REG{1}{2})
uint32_t sign = (src & 0x88888888) >> 1;
// `sign` is OR'd with 0x31203120 to find the correct value in the LUT
// (selects correct positive or negative candidate)
const uint32_t final_prmt_base = 0x32103210;
// Ignore sign bit when indexing into LUT, for each 4bit value
// we index into both the positive and negative candidates then use
// sign | final_prmt_base to select the correct candidate
uint32_t lut_idx = (src & 0x77777777);

Never mind, I refactored this to use the lut_4bit_to_8bit_convert utility function; the comment now reads:

// Determines if the value is in the top half of the LUT if set or
// (i.e. LUT[8:15]) in the bottom half (i.e. LUT[0:7]) if not set. Then move
// into bit position 0x4 of each nibble so when or'd with final_prmt_base it
// selects the correct candidate. When elements in final_prmt_base
// are >= 0x4, the high candidate is selected (i.e. LUT[8:15]), when elements
// are < 0x4, the low candidate is selected (i.e. LUT[0:7])
uint32_t high_bit = (src & 0x88888888) >> 1;
// `high_bit` is OR'd with 0x31203120 to find the correct value in the LUT
// (selects correct high or low candidate)
const uint32_t final_prmt_base = 0x32103210;
// Ignore the high bit when indexing into LUT, for each 4bit value
// we index into both the high and low candidates then use
// high_bit | final_prmt_base to select the correct candidate
uint32_t lut_idx = (src & 0x77777777);
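
For readers following the right-shift question, here is a host-side Python emulation of the selection scheme the comment describes. It is a sketch under assumptions, not the kernel code: `prmt` models only the default byte-select behaviour of the PTX `prmt.b32` instruction (sign replication is ignored, which is safe here because every selector nibble stays below 8), the function and LUT names are hypothetical, and the nibble ordering in the real converter may differ. Note that the quoted comment mentions 0x31203120 while the constant itself is 0x32103210; the sketch uses the constant.

```python
def prmt(a: int, b: int, selector: int) -> int:
    """Emulate PTX prmt.b32 (default mode, no sign replication).

    Bytes 0-3 of the pool come from `a`, bytes 4-7 from `b`; each of the four
    low selector nibbles picks one pool byte for the matching output byte.
    """
    pool = [(a >> (8 * i)) & 0xFF for i in range(4)]
    pool += [(b >> (8 * i)) & 0xFF for i in range(4)]
    out = 0
    for i in range(4):
        out |= pool[(selector >> (4 * i)) & 0x7] << (8 * i)
    return out


def pack4(b0: int, b1: int, b2: int, b3: int) -> int:
    """Pack four bytes into a 32-bit word, b0 in the lowest byte."""
    return b0 | (b1 << 8) | (b2 << 16) | (b3 << 24)


def lut_4bit_to_8bit(src: int, lut: list[int]) -> list[int]:
    """Map eight packed 4-bit codes to bytes through a 16-entry LUT."""
    low0, low1 = pack4(*lut[0:4]), pack4(*lut[4:8])
    high0, high1 = pack4(*lut[8:12]), pack4(*lut[12:16])

    # Bit 3 of each nibble says "top half of the LUT"; shifting right by one
    # moves it to value 0x4 in the same nibble, i.e. "+4" for the final select.
    high_bit = (src & 0x88888888) >> 1
    final_prmt_base = 0x32103210
    lut_idx = src & 0x77777777  # low 3 bits index within a half

    result = []
    for _ in range(2):  # four nibbles (16 bits) per iteration
        low = prmt(low0, low1, lut_idx)     # candidates from LUT[0:8]
        high = prmt(high0, high1, lut_idx)  # candidates from LUT[8:16]
        word = prmt(low, high, final_prmt_base | high_bit)
        result.extend((word >> (8 * i)) & 0xFF for i in range(4))
        lut_idx >>= 16
        high_bit >>= 16
    return result


# Illustrative LUT: two's-complement int4 code -> two's-complement int8 byte.
INT8_LUT = [c if c < 8 else (c - 16) & 0xFF for c in range(16)]


def to_signed(byte: int) -> int:
    return byte - 256 if byte >= 128 else byte
```

Without the shift, the high bit would sit at value 0x8 in the selector nibble, which `prmt` does not treat as a byte index; after the shift it adds 4 to the base index `i`, switching output byte `i` from the low candidate word to the high one.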

csrc/cutlass_extensions/vllm_numeric_conversion.cuh

tlrmchlsmth · 2024-11-12T22:20:35Z

csrc/cutlass_extensions/vllm_numeric_conversion.cuh

+      static constexpr uint32_t POS_E4M3s_REG1 = 0x44403800;  // [0, 1, 2, 3]
+      static constexpr uint32_t POS_E4M3s_REG2 = 0x4E4C4A48;  // [4, 5, 6, 7]
+      static constexpr uint32_t NEG_E4M3s_REG1 = 0xCACCCED0;  // [-8,-7,-6,-5]
+      static constexpr uint32_t NEG_E4M3s_REG2 = 0xB8C0C4C8;  // [-4,-3,-2,-1]


It looks like the int4 -> fp8 and int4 -> int8 converters are the same except for these constants? If so, it might be nice to factor these out. Not a big deal though, because it'd be kind of annoying to do.

Good call, bit janky but refactored to lut_4bit_to_8bit_convert
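
As a quick sanity check on the constants in the diff above, the sketch below decodes each byte as an FP8 E4M3 value (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits) and compares it with the values listed in the trailing comments. The decoder is a minimal illustration that skips the NaN encodings; `e4m3_to_float` and `unpack_bytes` are hypothetical helper names.

```python
def e4m3_to_float(byte: int) -> float:
    """Decode one FP8 E4M3 byte (bias 7); NaN encodings are not handled."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0xF
    mantissa = byte & 0x7
    if exponent == 0:  # subnormal (includes zero)
        return sign * (mantissa / 8.0) * 2.0 ** -6
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


def unpack_bytes(reg: int) -> list[int]:
    """Split a 32-bit register into four bytes, lowest byte first."""
    return [(reg >> (8 * i)) & 0xFF for i in range(4)]


# Constants copied from the diff above.
POS_E4M3s_REG1 = 0x44403800  # [0, 1, 2, 3]
POS_E4M3s_REG2 = 0x4E4C4A48  # [4, 5, 6, 7]
NEG_E4M3s_REG1 = 0xCACCCED0  # [-8,-7,-6,-5]
NEG_E4M3s_REG2 = 0xB8C0C4C8  # [-4,-3,-2,-1]

REGS = (POS_E4M3s_REG1, POS_E4M3s_REG2, NEG_E4M3s_REG1, NEG_E4M3s_REG2)

# Entry i is the FP8 value selected by 4-bit code i (two's-complement int4).
DECODED_LUT = [e4m3_to_float(b) for reg in REGS for b in unpack_bytes(reg)]
```

Read in this order, the sixteen bytes form a single table indexed by the raw 4-bit code, which is why swapping only the table contents is enough to move between an int8 and an fp8 target.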

tlrmchlsmth · 2024-11-12T22:25:59Z

csrc/quantization/machete/machete_mainloop.cuh

+  using SmemLayoutACopy = decltype(GmemLayoutA::TVbNbKL_to_offset_copy(
+      make_shape(size<0>(TileShape_MNK{}), size<2>(TileShape_MNK{}),
+                 Int<DispatchPolicy::Stages>{})));


Could you explain this change?

It's just moved up from below (to be closer to SmemLayoutA), where the following is deleted:

using SmemLayoutACopy = decltype(tile_to_shape(
    SmemLayoutAtomARowMajor{},
    make_shape(shape<0>(TileShape{}), shape<2>(TileShape{}),
               Int<DispatchPolicy::Stages>{}),
    conditional_t<::cutlass::gemm::detail::is_major<0, StrideA>(),
                  Step<_2, _1, _3>, Step<_1, _2, _3>>{}));

The

conditional_t<::cutlass::gemm::detail::is_major<0, StrideA>(), Step<_2, _1, _3>, Step<_1, _2, _3>>{}))

argument is removed since it was just cruft from the original PR that's not actually exercised (that was my bad).

varun-sundar-rabindranath · 2024-11-12T22:35:50Z

Reviewed the cutlass refactor part - LGTM!

benchmarks/kernels/benchmark_machete.py

tlrmchlsmth

Great work, LGTM!

csrc/quantization/machete/generate.py

tests/kernels/test_machete_gemm.py

vllm/model_executor/layers/quantization/utils/quant_utils.py

mgoin

LGTM thanks for getting it green!

LucasWilkinson mentioned this pull request Oct 30, 2024

[WIP, Kernel] (3/N) Machete W4A8 #8046

Closed

LucasWilkinson changed the title from "[WIP, Kernel] (3/N) Machete W4A8 (signed)" to "[WIP, Kernel] (3/N) Machete W4A8" Oct 30, 2024

LucasWilkinson marked this pull request as ready for review November 1, 2024 03:17

LucasWilkinson requested review from tlrmchlsmth and WoosukKwon as code owners November 1, 2024 03:17

LucasWilkinson changed the title from "[WIP, Kernel] (3/N) Machete W4A8" to "[Kernel] (3/N) Machete W4A8" Nov 1, 2024

LucasWilkinson changed the title from "[Kernel] (3/N) Machete W4A8" to "[Kernel] (3/N) Initial Machete W4A8 support + Refactors" Nov 1, 2024

LucasWilkinson changed the title from "[Kernel] (3/N) Initial Machete W4A8 support + Refactors" to "[Kernel] Initial Machete W4A8 support + Refactors" Nov 1, 2024

ProExpertProg approved these changes Nov 4, 2024

mergify bot added the needs-rebase label Nov 6, 2024

LucasWilkinson force-pushed the lwilkinson/machete-w4a8-signed branch 2 times, most recently from 3555a56 to 5c09f95 Compare November 6, 2024 15:49

mergify bot removed the needs-rebase label Nov 6, 2024

LucasWilkinson force-pushed the lwilkinson/machete-w4a8-signed branch 2 times, most recently from 565770c to 630c540 Compare November 6, 2024 20:05

LucasWilkinson mentioned this pull request Nov 12, 2024

[Kernel] Refactor Cutlass c3x #10049

Open

varun-sundar-rabindranath reviewed Nov 12, 2024

varun-sundar-rabindranath reviewed Nov 12, 2024

tlrmchlsmth reviewed Nov 12, 2024

mgoin reviewed Nov 13, 2024

tlrmchlsmth approved these changes Nov 13, 2024

tlrmchlsmth added the ready label (ONLY add when PR is ready to merge/full CI is needed) Nov 13, 2024

mgoin reviewed Nov 14, 2024

LucasWilkinson force-pushed the lwilkinson/machete-w4a8-signed branch from 479cf50 to ef43d89 Compare November 14, 2024 16:08

LucasWilkinson added 3 commits November 15, 2024 16:04

rebase and sign

09a7060

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

fix format

88426d0

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

format

2f3a49e

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

LucasWilkinson added 6 commits November 15, 2024 16:04

minor cleanup

78dd9dd

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

review comments

30d0af3

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

review comments

f140152

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

minor comment tweak

1993f3b

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

review comments

563f80c

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

review comments

70ad239

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

LucasWilkinson force-pushed the lwilkinson/machete-w4a8-signed branch from 6af8654 to 70ad239 Compare November 15, 2024 16:04

mgoin approved these changes Nov 18, 2024

mgoin merged commit 96d999f into vllm-project:main Nov 18, 2024
71 checks passed

mikejuliet13 pushed a commit to mikejuliet13/vllm that referenced this pull request Nov 19, 2024

[Kernel] Initial Machete W4A8 support + Refactors (vllm-project#9855)

2b855b1

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: Manjul Mohan <manjul.mohan@ibm.com>

coolkp pushed a commit to coolkp/vllm that referenced this pull request Nov 20, 2024

[Kernel] Initial Machete W4A8 support + Refactors (vllm-project#9855)

6adaa08

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024

[Kernel] Initial Machete W4A8 support + Refactors (vllm-project#9855)

52c92c9

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

rickyyx pushed a commit to rickyyx/vllm that referenced this pull request Nov 20, 2024

[Kernel] Initial Machete W4A8 support + Refactors (vllm-project#9855)

d3c9fa3

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: rickyx <rickyx@anyscale.com>

tlrmchlsmth pushed a commit to neuralmagic/vllm that referenced this pull request Nov 23, 2024

[Kernel] Initial Machete W4A8 support + Refactors (vllm-project#9855)

d18132e

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
