Add silu mul kernel #2469
Conversation
grimoire commented on Sep 14, 2024
- Add a kernel that fuses silu and mul in the MLP (a minimal sketch of this pattern is shown below).
- Optimize the apply_rotary and rmsnorm kernels.
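The fused kernel computes silu(gate) * up in a single pass over the MLP projection output. A minimal sketch of that pattern (the names, signatures, and block-size handling are assumptions, not the PR's actual kernel):

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _silu_mul_kernel(gate_ptr, up_ptr, out_ptr, N, BLOCK_N: tl.constexpr):
    """Compute out = silu(gate) * up for one block of one row."""
    row = tl.program_id(0)
    col_block = tl.program_id(1)
    offs = col_block * BLOCK_N + tl.arange(0, BLOCK_N)
    mask = offs < N
    gate = tl.load(gate_ptr + row * N + offs, mask=mask, other=0.0).to(tl.float32)
    up = tl.load(up_ptr + row * N + offs, mask=mask, other=0.0).to(tl.float32)
    out = gate * tl.sigmoid(gate) * up  # silu(x) = x * sigmoid(x)
    tl.store(out_ptr + row * N + offs, out, mask=mask)


def silu_mul(gate: torch.Tensor, up: torch.Tensor) -> torch.Tensor:
    """Launch helper; gate and up are contiguous tensors of shape (num_tokens, N)."""
    assert gate.shape == up.shape and gate.is_contiguous() and up.is_contiguous()
    num_tokens, N = gate.shape
    out = torch.empty_like(gate)
    BLOCK_N = min(triton.next_power_of_2(N), 1024)
    grid = (num_tokens, triton.cdiv(N, BLOCK_N))
    _silu_mul_kernel[grid](gate, up, out, N, BLOCK_N=BLOCK_N)
    return out
```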
TRITON_VERSION = version.parse(triton.__version__)

if TRITON_VERSION >= version.parse('3.0.0'):
Do we support triton 3.0.0?
Tested on llama3
qeh_ptrs = qe_ptr[:, None] + feat_offset_h[None, :] * stride_qed
ql_ptrs += head_id * stride_qh
qh_ptrs += head_id * stride_qh
qel_ptrs += head_id * stride_qeh
Is it possible that stride_qeh is not equal to stride_qh?
I was wondering if it is necessary to pass stride_qes, stride_qeh, stride_qed, stride_kes, stride_keh and stride_ked.
q and k can be slices of the qkv tensor, so their strides can differ when the output is not written in place.
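For illustration, a small PyTorch example of why the strides can differ (hypothetical shapes, not the kernel's actual arguments): q taken as a view of a packed qkv tensor inherits the packed tensor's sequence stride, while a freshly allocated output is contiguous with a different stride.

```python
import torch

seq_len, num_q_heads, num_kv_heads, head_dim = 4, 8, 2, 64
qkv = torch.randn(seq_len, (num_q_heads + 2 * num_kv_heads) * head_dim)

# q is a view into qkv: its sequence stride spans the whole packed row.
q = qkv[:, :num_q_heads * head_dim].view(seq_len, num_q_heads, head_dim)
print(q.stride())      # (768, 64, 1)

# A non-inplace output tensor is contiguous, so its strides differ from q's.
q_out = torch.empty(seq_len, num_q_heads, head_dim)
print(q_out.stride())  # (512, 64, 1)
```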
def forward(self, x):
    """forward."""

    if x.size(-1) % 2048 != 0:
Why only use the fused kernel when x.size(-1) % 2048 == 0?
I fixed the block size in
BLOCK_SIZE_N = min(N, 1024)
I am so lazy.
I didn't get it
The kernel would be more complex and slow if we had to support an arbitrary input shape.
This kernel only supports aligned input; unaligned input falls back to the default implementation.
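A minimal sketch of that dispatch (the function name and the exact alignment check are assumptions based on this thread, not the PR's actual code):

```python
import torch
import torch.nn.functional as F

BLOCK_SIZE_N = 1024  # block size hard-coded in the fused Triton kernel


def silu_and_mul(x: torch.Tensor, fused_kernel=None) -> torch.Tensor:
    """Split x into (gate, up) halves and return silu(gate) * up.

    The fused kernel is used only when the last dimension is aligned to the
    fixed block size; otherwise fall back to the default eager implementation.
    """
    gate, up = x.chunk(2, dim=-1)
    if fused_kernel is None or x.size(-1) % 2048 != 0:  # 2048 = 2 * BLOCK_SIZE_N
        return F.silu(gate) * up
    return fused_kernel(gate, up)
```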