Rope embedding kernel to use AVX2 #23694

Open
wants to merge 11 commits into base: main
Conversation

liqunfu
Contributor

@liqunfu liqunfu commented Feb 14, 2025

Description

Credit to chethanpk, who provided the Rope Embedding implementation in a patch. The patch is in the first commit of this PR.

Motivation and Context

Improve GQA (GroupQueryAttention) performance on Intel x64.
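For context on what the kernel computes: RoPE rotates each pair of query/key lanes by a position-dependent angle, and the AVX2 kernel vectorizes exactly this elementwise math. Below is a minimal NumPy sketch of the non-interleaved rotation; the function and argument names are illustrative, not the PR's actual API.

```python
import numpy as np

def rope_rotate(x, position_ids, base=10000.0):
    """Apply rotary position embedding to x of shape (seq, head_dim).

    Non-interleaved layout: the first half of head_dim is rotated
    against the second half. Illustrative reference, not the PR code.
    """
    seq, head_dim = x.shape
    half = head_dim // 2
    # Per-lane inverse frequencies: base**(-2i/head_dim) for i in [0, half).
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    angles = np.outer(position_ids, inv_freq)        # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotation per pair: (x1, x2) -> (x1*cos - x2*sin, x1*sin + x2*cos)
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```

Position 0 rotates by angle 0 and is left unchanged, and the rotation preserves each row's norm, which makes a reference like this easy to sanity-check against a vectorized kernel.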

Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
@liqunfu liqunfu requested a review from a team as a code owner February 14, 2025 00:04
@github-actions github-actions bot left a comment

You can commit the suggested changes from lintrunner.

Comment on lines 235 to 238
auto ret = ApplyAttention(q_rotary, packed_qkv ? nullptr : k_rotary, packed_qkv ? nullptr : V.Get<Tensor>().Data<T>(),
past_key, past_value, output, present_k, present_v,
seqlens_k, parameters, allocator, context);
if (profiler_->IsEnabled()) {

Suggested change
auto ret = ApplyAttention(q_rotary, packed_qkv ? nullptr : k_rotary, packed_qkv ? nullptr : V.Get<Tensor>().Data<T>(),
past_key, past_value, output, present_k, present_v,
seqlens_k, parameters, allocator, context);
if (profiler_->IsEnabled()) {
auto ret = ApplyAttention(q_rotary, packed_qkv ? nullptr : k_rotary, packed_qkv ? nullptr : V.Get<Tensor>().Data<T>(),
past_key, past_value, output, present_k, present_v,
seqlens_k, parameters, allocator, context);
if (profiler_->IsEnabled()) {

@github-actions github-actions bot left a comment

You can commit the suggested changes from lintrunner.

Comment on lines 51 to 53
Status GroupQueryAttention<T>::Compute(OpKernelContext* context) const {

const std::string node_name = this->Node().Name();

Suggested change
Status GroupQueryAttention<T>::Compute(OpKernelContext* context) const {
const std::string node_name = this->Node().Name();
Status GroupQueryAttention<T>::Compute(OpKernelContext* context) const {
const std::string node_name = this->Node().Name();

Comment on lines 189 to 191
domain="com.microsoft",

),

Suggested change
domain="com.microsoft",
),
domain="com.microsoft",
),

Comment on lines 1943 to 1950
node_name = (
("packed_" if packed else "") +
("rotary_" if rotary else "") +
("rotary_interleaved_" if rotary_interleaved else "") +
"softcap_" + str(softcap) + "_" +
"smooth_softmax_" + str(use_smooth_softmax) + "_" +
"b_" + str(b) + "_sq_" + str(sq) + "_skv_" + str(skv) + "_n_" + str(n) + "_n2_" + str(n2) + "_h_" + str(h)
)

Suggested change
node_name = (
("packed_" if packed else "") +
("rotary_" if rotary else "") +
("rotary_interleaved_" if rotary_interleaved else "") +
"softcap_" + str(softcap) + "_" +
"smooth_softmax_" + str(use_smooth_softmax) + "_" +
"b_" + str(b) + "_sq_" + str(sq) + "_skv_" + str(skv) + "_n_" + str(n) + "_n2_" + str(n2) + "_h_" + str(h)
)
node_name = (
("packed_" if packed else "")
+ ("rotary_" if rotary else "")
+ ("rotary_interleaved_" if rotary_interleaved else "")
+ "softcap_"
+ str(softcap)
+ "_"
+ "smooth_softmax_"
+ str(use_smooth_softmax)
+ "_"
+ "b_"
+ str(b)
+ "_sq_"
+ str(sq)
+ "_skv_"
+ str(skv)
+ "_n_"
+ str(n)
+ "_n2_"
+ str(n2)
+ "_h_"
+ str(h)
)
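The reflow above is lintrunner's (black-style) formatting of the same concatenation. As an aside, the identical name could also be built with f-strings; the helper below is a sketch, not code from the PR:

```python
def make_node_name(packed, rotary, rotary_interleaved, softcap,
                   use_smooth_softmax, b, sq, skv, n, n2, h):
    # Builds the same string as the concatenation above, via f-strings.
    # Hypothetical helper, not part of the PR.
    prefix = (
        ("packed_" if packed else "")
        + ("rotary_" if rotary else "")
        + ("rotary_interleaved_" if rotary_interleaved else "")
    )
    return (
        f"{prefix}softcap_{softcap}_smooth_softmax_{use_smooth_softmax}"
        f"_b_{b}_sq_{sq}_skv_{skv}_n_{n}_n2_{n2}_h_{h}"
    )
```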

const std::string node_name = this->Node().Name();

// Initialize the profiler_ with a unique log file based on the node name
profiler_ = new onnxruntime::profiling::Profiler();
Member
This is test code?

Contributor Author

It is used to profile GQA. I have removed it because we need the code to be in for the release.

Signed-off-by: liqunfu <liqun.fu@microsoft.com>
@github-actions github-actions bot left a comment

You can commit the suggested changes from lintrunner.

Comment on lines 196 to 199
auto ret = ApplyAttention(q_rotary, packed_qkv ? nullptr : k_rotary, packed_qkv ? nullptr : V.Get<Tensor>().Data<T>(),
past_key, past_value, output, present_k, present_v,
seqlens_k, parameters, allocator, context);
return ret;

Suggested change
auto ret = ApplyAttention(q_rotary, packed_qkv ? nullptr : k_rotary, packed_qkv ? nullptr : V.Get<Tensor>().Data<T>(),
past_key, past_value, output, present_k, present_v,
seqlens_k, parameters, allocator, context);
return ret;
auto ret = ApplyAttention(q_rotary, packed_qkv ? nullptr : k_rotary, packed_qkv ? nullptr : V.Get<Tensor>().Data<T>(),
past_key, past_value, output, present_k, present_v,
seqlens_k, parameters, allocator, context);
return ret;

Signed-off-by: liqunfu <liqun.fu@microsoft.com>
@github-actions github-actions bot left a comment

You can commit the suggested changes from lintrunner.

Comment on lines 92 to 94
sequence_length, seqlen_past_kv_cache, seqlen_present_kv_cache, head_size, past_key_data,
present_key_data, past_present_share_buffer, packed_qkv, is_prompt, tp, allocator);

present_key_data, past_present_share_buffer, packed_qkv, is_prompt, tp, allocator);
// Compute the attentionScore * Value: out(B, N, S, H_v) = attention_probs(B, N, S, T) x V(B, N, T, H_v)

Suggested change
sequence_length, seqlen_past_kv_cache, seqlen_present_kv_cache, head_size, past_key_data,
present_key_data, past_present_share_buffer, packed_qkv, is_prompt, tp, allocator);
present_key_data, past_present_share_buffer, packed_qkv, is_prompt, tp, allocator);
// Compute the attentionScore * Value: out(B, N, S, H_v) = attention_probs(B, N, S, T) x V(B, N, T, H_v)
sequence_length, seqlen_past_kv_cache, seqlen_present_kv_cache, head_size, past_key_data,
present_key_data, past_present_share_buffer, packed_qkv, is_prompt, tp, allocator);
// Compute the attentionScore * Value: out(B, N, S, H_v) = attention_probs(B, N, S, T) x V(B, N, T, H_v)
