Rope embedding kernel to use AVX2 #23694

Open
wants to merge 11 commits into base: main
Conversation

liqunfu
Contributor

@liqunfu liqunfu commented Feb 14, 2025

Description

Credit to chethanpk, who provided the Rope Embedding implementation in a patch. The patch is in the first commit of this PR.

Motivation and Context

Improve GQA (GroupQueryAttention) performance on Intel x64.
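For context on what the kernel computes: RoPE rotates each pair of query/key lanes by a position-dependent angle, and the AVX2 kernel vectorizes exactly this elementwise math. Below is a minimal NumPy sketch of the non-interleaved rotation; the function and argument names are illustrative, not the PR's actual API.

```python
import numpy as np

def rope_rotate(x, position_ids, base=10000.0):
    """Apply rotary position embedding to x of shape (seq, head_dim).

    Non-interleaved layout: the first half of head_dim is rotated
    against the second half. Illustrative reference, not the PR code.
    """
    seq, head_dim = x.shape
    half = head_dim // 2
    # Per-lane inverse frequencies: base**(-2i/head_dim) for i in [0, half).
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    angles = np.outer(position_ids, inv_freq)        # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotation per pair: (x1, x2) -> (x1*cos - x2*sin, x1*sin + x2*cos)
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```

Position 0 rotates by angle 0 and is left unchanged, and the rotation preserves each row's norm, which makes a reference like this easy to sanity-check against a vectorized kernel.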

Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
@liqunfu liqunfu requested a review from a team as a code owner February 14, 2025 00:04
@github-actions github-actions bot left a comment

You can commit the suggested changes from lintrunner.

Comment on lines 235 to 238
auto ret = ApplyAttention(q_rotary, packed_qkv ? nullptr : k_rotary, packed_qkv ? nullptr : V.Get<Tensor>().Data<T>(),
past_key, past_value, output, present_k, present_v,
seqlens_k, parameters, allocator, context);
if (profiler_->IsEnabled()) {

Suggested change
auto ret = ApplyAttention(q_rotary, packed_qkv ? nullptr : k_rotary, packed_qkv ? nullptr : V.Get<Tensor>().Data<T>(),
past_key, past_value, output, present_k, present_v,
seqlens_k, parameters, allocator, context);
if (profiler_->IsEnabled()) {
auto ret = ApplyAttention(q_rotary, packed_qkv ? nullptr : k_rotary, packed_qkv ? nullptr : V.Get<Tensor>().Data<T>(),
past_key, past_value, output, present_k, present_v,
seqlens_k, parameters, allocator, context);
if (profiler_->IsEnabled()) {

@github-actions github-actions bot left a comment

You can commit the suggested changes from lintrunner.

Comment on lines 51 to 53
Status GroupQueryAttention<T>::Compute(OpKernelContext* context) const {

const std::string node_name = this->Node().Name();

Suggested change
Status GroupQueryAttention<T>::Compute(OpKernelContext* context) const {
const std::string node_name = this->Node().Name();
Status GroupQueryAttention<T>::Compute(OpKernelContext* context) const {
const std::string node_name = this->Node().Name();

Comment on lines 189 to 191
domain="com.microsoft",

),

Suggested change
domain="com.microsoft",
),
domain="com.microsoft",
),

Comment on lines 1943 to 1950
node_name = (
("packed_" if packed else "") +
("rotary_" if rotary else "") +
("rotary_interleaved_" if rotary_interleaved else "") +
"softcap_" + str(softcap) + "_" +
"smooth_softmax_" + str(use_smooth_softmax) + "_" +
"b_" + str(b) + "_sq_" + str(sq) + "_skv_" + str(skv) + "_n_" + str(n) + "_n2_" + str(n2) + "_h_" + str(h)
)

Suggested change
node_name = (
("packed_" if packed else "") +
("rotary_" if rotary else "") +
("rotary_interleaved_" if rotary_interleaved else "") +
"softcap_" + str(softcap) + "_" +
"smooth_softmax_" + str(use_smooth_softmax) + "_" +
"b_" + str(b) + "_sq_" + str(sq) + "_skv_" + str(skv) + "_n_" + str(n) + "_n2_" + str(n2) + "_h_" + str(h)
)
node_name = (
("packed_" if packed else "")
+ ("rotary_" if rotary else "")
+ ("rotary_interleaved_" if rotary_interleaved else "")
+ "softcap_"
+ str(softcap)
+ "_"
+ "smooth_softmax_"
+ str(use_smooth_softmax)
+ "_"
+ "b_"
+ str(b)
+ "_sq_"
+ str(sq)
+ "_skv_"
+ str(skv)
+ "_n_"
+ str(n)
+ "_n2_"
+ str(n2)
+ "_h_"
+ str(h)
)
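The reflow above is lintrunner's (black-style) formatting of the same concatenation. As an aside, the identical name could also be built with f-strings; the helper below is a sketch, not code from the PR:

```python
def make_node_name(packed, rotary, rotary_interleaved, softcap,
                   use_smooth_softmax, b, sq, skv, n, n2, h):
    # Builds the same string as the concatenation above, via f-strings.
    # Hypothetical helper, not part of the PR.
    prefix = (
        ("packed_" if packed else "")
        + ("rotary_" if rotary else "")
        + ("rotary_interleaved_" if rotary_interleaved else "")
    )
    return (
        f"{prefix}softcap_{softcap}_smooth_softmax_{use_smooth_softmax}"
        f"_b_{b}_sq_{sq}_skv_{skv}_n_{n}_n2_{n2}_h_{h}"
    )
```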

const std::string node_name = this->Node().Name();

// Initialize the profiler_ with a unique log file based on the node name
profiler_ = new onnxruntime::profiling::Profiler();
Member
This is test code?

Contributor Author

It is used to profile GQA. I have removed it because we need the code to be in for the release.

Signed-off-by: liqunfu <liqun.fu@microsoft.com>
@github-actions github-actions bot left a comment

You can commit the suggested changes from lintrunner.

Comment on lines 196 to 199
auto ret = ApplyAttention(q_rotary, packed_qkv ? nullptr : k_rotary, packed_qkv ? nullptr : V.Get<Tensor>().Data<T>(),
past_key, past_value, output, present_k, present_v,
seqlens_k, parameters, allocator, context);
return ret;

Suggested change
auto ret = ApplyAttention(q_rotary, packed_qkv ? nullptr : k_rotary, packed_qkv ? nullptr : V.Get<Tensor>().Data<T>(),
past_key, past_value, output, present_k, present_v,
seqlens_k, parameters, allocator, context);
return ret;
auto ret = ApplyAttention(q_rotary, packed_qkv ? nullptr : k_rotary, packed_qkv ? nullptr : V.Get<Tensor>().Data<T>(),
past_key, past_value, output, present_k, present_v,
seqlens_k, parameters, allocator, context);
return ret;

Signed-off-by: liqunfu <liqun.fu@microsoft.com>
@github-actions github-actions bot left a comment

You can commit the suggested changes from lintrunner.

Comment on lines 92 to 94
sequence_length, seqlen_past_kv_cache, seqlen_present_kv_cache, head_size, past_key_data,
present_key_data, past_present_share_buffer, packed_qkv, is_prompt, tp, allocator);

present_key_data, past_present_share_buffer, packed_qkv, is_prompt, tp, allocator);
// Compute the attentionScore * Value: out(B, N, S, H_v) = attention_probs(B, N, S, T) x V(B, N, T, H_v)

Suggested change
sequence_length, seqlen_past_kv_cache, seqlen_present_kv_cache, head_size, past_key_data,
present_key_data, past_present_share_buffer, packed_qkv, is_prompt, tp, allocator);
present_key_data, past_present_share_buffer, packed_qkv, is_prompt, tp, allocator);
// Compute the attentionScore * Value: out(B, N, S, H_v) = attention_probs(B, N, S, T) x V(B, N, T, H_v)
sequence_length, seqlen_past_kv_cache, seqlen_present_kv_cache, head_size, past_key_data,
present_key_data, past_present_share_buffer, packed_qkv, is_prompt, tp, allocator);
// Compute the attentionScore * Value: out(B, N, S, H_v) = attention_probs(B, N, S, T) x V(B, N, T, H_v)
