
sycl : Reenabled mmvq path for the SYCL Nvidia Backend #8372

Merged (2 commits into ggerganov:master on Jul 9, 2024)

Conversation

@Alcpz (Collaborator) commented Jul 8, 2024


The Intel and Nvidia backends have different "good" paths in the mulmat dispatch. Recent changes to the SYCL backend's mulmat dispatch improved performance on the Level Zero backend, but negatively affected the CUDA backend by disabling the mmvq path (forcing dmmv instead).

This patch changes the preferred path for sycl::backend::ext_oneapi_cuda without touching the changes made to the Intel backend, where re-enabling mmvq showed a significant performance drop.

Summary of performance changes:

  • Level Zero backend unaffected.
  • Prompt processing unaffected.
  • Text generation on the CUDA backend improved. Benchmarking below.

Hardware: Nvidia A100
llama-bench params: tg 128, ngl 81.

| Model | Tokens/sec (master) | Tokens/sec (PR) |
| --- | --- | --- |
| Llama 70B | 11.07 | 12.46 |
| Llama 13B | 51.99 | 54.47 |
| Llama 7B | 88.46 | 87.60 |

Benchmarking

Rows are paired (PR above, master below) to compare performance.

| build_commit | gpu_info | model_type | n_prompt | n_gen | avg_ts | stddev_ts |
| --- | --- | --- | --- | --- | --- | --- |
| f288ae1 | NVIDIA A100-PCIE-40GB | llama 7B Q4_K - Medium | 512 | 0 | 5792.442985 | 36.675407 |
| 3f2d538 | NVIDIA A100-PCIE-40GB | llama 7B Q4_K - Medium | 512 | 0 | 5771.42 | 36.64 |
| f288ae1 | NVIDIA A100-PCIE-40GB | llama 13B Q4_K - Medium | 512 | 0 | 3396.702887 | 3.969368 |
| 3f2d538 | NVIDIA A100-PCIE-40GB | llama 13B Q4_K - Medium | 512 | 0 | 3388.65 | 16.12 |
| f288ae1 | NVIDIA A100-PCIE-40GB | llama 70B Q4_K - Medium | 512 | 0 | 643.724402 | 1.861034 |
| 3f2d538 | NVIDIA A100-PCIE-40GB | llama 70B Q4_K - Medium | 512 | 0 | 641.64 | 1.48 |
| f288ae1 | NVIDIA A100-PCIE-40GB | llama 7B Q4_K - Medium | 0 | 128 | 87.600585 | 0.031495 |
| 3f2d538 | NVIDIA A100-PCIE-40GB | llama 7B Q4_K - Medium | 0 | 128 | 88.46 | 0.04 |
| f288ae1 | NVIDIA A100-PCIE-40GB | llama 13B Q4_K - Medium | 0 | 128 | 54.474619 | 0.088287 |
| 3f2d538 | NVIDIA A100-PCIE-40GB | llama 13B Q4_K - Medium | 0 | 128 | 51.99 | 0.21 |
| f288ae1 | NVIDIA A100-PCIE-40GB | llama 70B Q4_K - Medium | 0 | 128 | 12.463416 | 0.041176 |
| 3f2d538 | NVIDIA A100-PCIE-40GB | llama 70B Q4_K - Medium | 0 | 128 | 11.07 | 0.03 |
| f288ae1 | Intel(R) Data Center GPU Max 1100 | llama 7B Q4_K - Medium | 512 | 0 | 3872.411894 | 17.213968 |
| 3f2d538 | Intel(R) Data Center GPU Max 1100 | llama 7B Q4_K - Medium | 512 | 0 | 3883.631806 | 27.968402 |
| f288ae1 | Intel(R) Data Center GPU Max 1100 | llama 13B Q4_K - Medium | 512 | 0 | 2168.610024 | 6.74671 |
| 3f2d538 | Intel(R) Data Center GPU Max 1100 | llama 13B Q4_K - Medium | 512 | 0 | 2173.704648 | 8.524752 |
| f288ae1 | Intel(R) Data Center GPU Max 1100 | llama 70B Q4_K - Medium | 512 | 0 | 488.779443 | 0.435489 |
| 3f2d538 | Intel(R) Data Center GPU Max 1100 | llama 70B Q4_K - Medium | 512 | 0 | 487.903189 | 1.091783 |
| f288ae1 | Intel(R) Data Center GPU Max 1100 | llama 7B Q4_K - Medium | 0 | 128 | 46.417299 | 0.2024 |
| 3f2d538 | Intel(R) Data Center GPU Max 1100 | llama 7B Q4_K - Medium | 0 | 128 | 46.535687 | 0.094065 |
| f288ae1 | Intel(R) Data Center GPU Max 1100 | llama 13B Q4_K - Medium | 0 | 128 | 29.541657 | 0.036522 |
| 3f2d538 | Intel(R) Data Center GPU Max 1100 | llama 13B Q4_K - Medium | 0 | 128 | 29.527599 | 0.066409 |
| f288ae1 | Intel(R) Data Center GPU Max 1100 | llama 70B Q4_K - Medium | 0 | 128 | 6.398397 | 0.00904 |
| 3f2d538 | Intel(R) Data Center GPU Max 1100 | llama 70B Q4_K - Medium | 0 | 128 | 6.397413 | 0.006652 |

@Alcpz (Collaborator, Author) commented Jul 8, 2024

@OuadiElfarouki @joeatodd @AidanBeltonS tagging for the discussion regarding the backend.

@airMeng @NeoZhangJianyu: What do you think of this? Is the change ok for the multi-device implementation?

I considered the alternative of re-adding a macro to enable mmvq, but a dynamic check seems simpler, as we already have plenty of macros.

@github-actions bot added labels on Jul 8, 2024: ggml (changes relating to the ggml tensor library for machine learning), SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language)
@@ -3658,6 +3658,10 @@ static void ggml_sycl_mul_mat(ggml_backend_sycl_context & ctx, const ggml_tensor
use_mul_mat_q = use_mul_mat_q && (src1->ne[1] <= MMQ_MAX_BATCH_SIZE);
#endif // SYCL_USE_XMX

// mmvq path is faster in the Nvidia backend but slower on the Intel backend
Collaborator commented on this line:

> // mmvq path is faster in the Nvidia backend but slower on the Intel backend

Suggest removing this comment:

  1. Such a detailed comment isn't needed for two lines of code; it's common practice to optimize by branching on the hardware type.

  2. We should avoid mentioning commercial brand names in code. Suggest using CUDA and SYCL instead where needed.

@Alcpz (Collaborator, Author) replied:

I disagree with 1. I agree that it makes sense to optimize for the hardware type, but the reasoning behind the change to the dispatch is not necessarily obvious to anyone who hasn't been actively developing the backend.

I agree with 2.

Collaborator replied:

OK! I see the update.

ggml/src/ggml-sycl.cpp (comment resolved)
@NeoZhangJianyu NeoZhangJianyu merged commit 5b0b8d8 into ggerganov:master Jul 9, 2024
53 checks passed
@Alcpz Alcpz deleted the Alcpz/sycl-reenable-mmvq branch July 9, 2024 14:04
3 participants