
sycl : Reenabled mmvq path for the SYCL Nvidia Backend #8372

Merged (2 commits into ggerganov:master on Jul 9, 2024)

Conversation

@Alcpz (Collaborator) commented Jul 8, 2024


The Intel and Nvidia backends have different "good" paths in the mulmat dispatch. Recent changes to the SYCL backend's mulmat dispatch improved performance on the Level Zero backend, but negatively affected the CUDA backend by disabling the mmvq path (forcing dmmv instead).

This patch changes the preferred path for sycl::backend::ext_oneapi_cuda without touching the changes made to the Intel backend, where re-enabling mmvq showed a significant performance drop.

Summary of performance changes:

  • Level Zero backend unaffected.
  • Prompt processing unaffected.
  • Text generation on the CUDA backend improved. Benchmarking below.

Hardware: Nvidia A100
llama-bench params: tg 128, ngl 81.

| Model | Tokens/sec (master) | Tokens/sec (PR) |
| --- | --- | --- |
| Llama 70B | 11.07 | 12.46 |
| Llama 13B | 51.99 | 54.47 |
| Llama 7B | 88.46 | 87.60 |

Benchmarking

Rows are paired (PR above, master below) to compare performance.

| build_commit | gpu_info | model_type | n_prompt | n_gen | avg_ts | stddev_ts |
| --- | --- | --- | --- | --- | --- | --- |
| f288ae1 | NVIDIA A100-PCIE-40GB | llama 7B Q4_K - Medium | 512 | 0 | 5792.442985 | 36.675407 |
| 3f2d538 | NVIDIA A100-PCIE-40GB | llama 7B Q4_K - Medium | 512 | 0 | 5771.42 | 36.64 |
| f288ae1 | NVIDIA A100-PCIE-40GB | llama 13B Q4_K - Medium | 512 | 0 | 3396.702887 | 3.969368 |
| 3f2d538 | NVIDIA A100-PCIE-40GB | llama 13B Q4_K - Medium | 512 | 0 | 3388.65 | 16.12 |
| f288ae1 | NVIDIA A100-PCIE-40GB | llama 70B Q4_K - Medium | 512 | 0 | 643.724402 | 1.861034 |
| 3f2d538 | NVIDIA A100-PCIE-40GB | llama 70B Q4_K - Medium | 512 | 0 | 641.64 | 1.48 |
| f288ae1 | NVIDIA A100-PCIE-40GB | llama 7B Q4_K - Medium | 0 | 128 | 87.600585 | 0.031495 |
| 3f2d538 | NVIDIA A100-PCIE-40GB | llama 7B Q4_K - Medium | 0 | 128 | 88.46 | 0.04 |
| f288ae1 | NVIDIA A100-PCIE-40GB | llama 13B Q4_K - Medium | 0 | 128 | 54.474619 | 0.088287 |
| 3f2d538 | NVIDIA A100-PCIE-40GB | llama 13B Q4_K - Medium | 0 | 128 | 51.99 | 0.21 |
| f288ae1 | NVIDIA A100-PCIE-40GB | llama 70B Q4_K - Medium | 0 | 128 | 12.463416 | 0.041176 |
| 3f2d538 | NVIDIA A100-PCIE-40GB | llama 70B Q4_K - Medium | 0 | 128 | 11.07 | 0.03 |
| f288ae1 | Intel(R) Data Center GPU Max 1100 | llama 7B Q4_K - Medium | 512 | 0 | 3872.411894 | 17.213968 |
| 3f2d538 | Intel(R) Data Center GPU Max 1100 | llama 7B Q4_K - Medium | 512 | 0 | 3883.631806 | 27.968402 |
| f288ae1 | Intel(R) Data Center GPU Max 1100 | llama 13B Q4_K - Medium | 512 | 0 | 2168.610024 | 6.74671 |
| 3f2d538 | Intel(R) Data Center GPU Max 1100 | llama 13B Q4_K - Medium | 512 | 0 | 2173.704648 | 8.524752 |
| f288ae1 | Intel(R) Data Center GPU Max 1100 | llama 70B Q4_K - Medium | 512 | 0 | 488.779443 | 0.435489 |
| 3f2d538 | Intel(R) Data Center GPU Max 1100 | llama 70B Q4_K - Medium | 512 | 0 | 487.903189 | 1.091783 |
| f288ae1 | Intel(R) Data Center GPU Max 1100 | llama 7B Q4_K - Medium | 0 | 128 | 46.417299 | 0.2024 |
| 3f2d538 | Intel(R) Data Center GPU Max 1100 | llama 7B Q4_K - Medium | 0 | 128 | 46.535687 | 0.094065 |
| f288ae1 | Intel(R) Data Center GPU Max 1100 | llama 13B Q4_K - Medium | 0 | 128 | 29.541657 | 0.036522 |
| 3f2d538 | Intel(R) Data Center GPU Max 1100 | llama 13B Q4_K - Medium | 0 | 128 | 29.527599 | 0.066409 |
| f288ae1 | Intel(R) Data Center GPU Max 1100 | llama 70B Q4_K - Medium | 0 | 128 | 6.398397 | 0.00904 |
| 3f2d538 | Intel(R) Data Center GPU Max 1100 | llama 70B Q4_K - Medium | 0 | 128 | 6.397413 | 0.006652 |

@Alcpz (Collaborator, Author) commented Jul 8, 2024

@OuadiElfarouki @joeatodd @AidanBeltonS tagging for the discussion regarding the backend.

@airMeng @NeoZhangJianyu: What do you think of this? Is the change ok for the multi-device implementation?

I considered the alternative of re-adding a macro to enable mmvq, but a dynamic check seems simpler, as we already have plenty of macros.

@github-actions bot added labels on Jul 8, 2024: ggml (changes relating to the ggml tensor library for machine learning), SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language)
@@ -3658,6 +3658,10 @@ static void ggml_sycl_mul_mat(ggml_backend_sycl_context & ctx, const ggml_tensor
use_mul_mat_q = use_mul_mat_q && (src1->ne[1] <= MMQ_MAX_BATCH_SIZE);
#endif // SYCL_USE_XMX

// mmvq path is faster in the Nvidia backend but slower on the Intel backend
Collaborator commented on this line:

> // mmvq path is faster in the Nvidia backend but slower on the Intel backend

Suggest removing this comment:

  1. Such a detailed comment isn't needed for two lines of code; it's common practice to optimize by branching on the hardware type.

  2. We should avoid mentioning commercial brand names in code. Suggest using CUDA and SYCL instead where needed.

@Alcpz (Collaborator, Author) replied:

I disagree with 1. I agree that it makes sense to optimize for the hardware type, but the reasoning behind the change to the dispatch is not necessarily obvious to anyone who hasn't been actively developing the backend.

I agree with 2.

Collaborator replied:

OK! I see the update.

ggml/src/ggml-sycl.cpp (comment resolved)
@NeoZhangJianyu NeoZhangJianyu merged commit 5b0b8d8 into ggerganov:master Jul 9, 2024
53 checks passed
@Alcpz Alcpz deleted the Alcpz/sycl-reenable-mmvq branch July 9, 2024 14:04
3 participants