
A better packNibbles and mul_sum_i8_pairs_float implementation using AVX512 #1119

Merged 3 commits into master on Apr 23, 2023

Conversation

MeouSker77 (Contributor)

This uses only three instructions to implement packNibbles when AVX512 is available. (_mm256_cvtepi16_epi8 requires AVX512 support.)

sw (Contributor) commented Apr 22, 2023

With an AVX512 machine, you may want to look into using _mm256_dpbssd_epi32 in mul_sum_i8_pairs_float, that could give another speed boost. (Preprocessor condition: #if __AVXVNNIINT8__)

Rebasing/merging latest master should fix the failing checks.

MeouSker77 (Contributor, Author)

With an AVX512 machine, you may want to look into using _mm256_dpbssd_epi32 in mul_sum_i8_pairs_float, that could give another speed boost. (Preprocessor condition: #if __AVXVNNIINT8__)

Thank you very much for your suggestion!

MeouSker77 changed the title from "A better packNibbles implementation using AVX512" to "A better packNibbles and mul_sum_i8_pairs_float implementation using AVX512" on Apr 22, 2023
@sw sw requested a review from dfyz April 22, 2023 13:46
ggml.c Outdated
Comment on lines 508 to 512
#if __AVXVNNIINT8__
const __m256i zero = _mm256_setzero_si256();
const __m256i summed_pairs = _mm256_dpbssd_epi32(zero, x, y);
return _mm256_cvtepi32_ps(summed_pairs);
#else
dfyz (Collaborator) commented Apr 22, 2023

As far as I'm aware, there is no hardware out there supporting AVX-VNNI-INT8 yet, so I don't think it's a good idea to use _mm256_dpbssd_epi32() here (the code is under an #if, so there is no harm in merging this, but it isn't useful either).

What we can use is AVX-VNNI. It is present on already existing Intel CPUs starting from Alder Lake (12th gen) and includes _mm256_dpbusd_epi32(). The difference is the left operand should be unsigned, so you have to keep _mm256_sign_epi8(...) and only replace the _mm256_maddubs_epi16() + sum_i16_pairs_float() pair with _mm256_dpbusd_epi32().

Essentially, this would be a backport of what I did with AVX-512 (zmm registers) to AVX-VNNI (ymm registers), which was proposed by @ultoris here.

Maybe we should open a separate PR for the VNNI optimization? It might speed up the quantized dot product, which is on the hot code path, so it would be nice to get performance measurements independently of the packNibbles() optimization.

MeouSker77 (Contributor, Author) replied:

What we can use is AVX-VNNI. It is present on already existing Intel CPUs starting from Alder Lake (12th gen) and includes _mm256_dpbusd_epi32(). The difference is the left operand should be unsigned, so you have to keep _mm256_sign_epi8(...) and only replace the _mm256_maddubs_epi16() + sum_i16_pairs_float() pair with _mm256_dpbusd_epi32().

Good suggestion! I have changed mul_sum_i8_pairs_float to use AVX-VNNI. I also think packNibbles does not affect inference speed, so it should be fine to measure inference performance in this PR.

Comment on lines +526 to +530
#if __AVX512F__
const __m256i bytes_srli_4 = _mm256_srli_epi16(bytes, 4); // 0000_0000_abcd_0000
bytes = _mm256_or_si256(bytes, bytes_srli_4); // 0000_abcd_abcd_efgh
return _mm256_cvtepi16_epi8(bytes); // abcd_efgh
#else
dfyz (Collaborator) commented:

This looks great, thank you! I made a microbenchmark, and it shows a nice improvement on a Tiger Lake CPU:

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
BenchPackNibblesAvx2          583 ns          582 ns      1212627
BenchPackNibblesAvx512        514 ns          513 ns      1313028

I don't think packNibbles() is on the hot path during inference, so it will probably not affect the overall inference speed (I might be wrong here). However, it makes the code both more readable and more performant, and is worth merging.

@sw sw merged commit c9e2c26 into ggml-org:master Apr 23, 2023
@MeouSker77 MeouSker77 deleted the avx512-packNibbles branch April 23, 2023 10:16