A better packNibbles and mul_sum_i8_pairs_float implementation using AVX512 #1119

Conversation
With an AVX512 machine, you may want to look into using …

Rebasing/merging latest master should fix the failing checks.
Force-pushed from 015aeda to 81cb1ee
Thank you very much for your suggestion!
ggml.c (outdated):

#if __AVXVNNIINT8__
    const __m256i zero = _mm256_setzero_si256();
    const __m256i summed_pairs = _mm256_dpbssd_epi32(zero, x, y);
    return _mm256_cvtepi32_ps(summed_pairs);
#else
As far as I'm aware, there is no hardware out there supporting AVX-VNNI-INT8 yet, so I don't think it's a good idea to use _mm256_dpbssd_epi32() here (the code is under an #if, so there is no harm in merging this, but it isn't useful either).

What we can use is AVX-VNNI. It is present on already existing Intel CPUs starting from Alder Lake (12th gen) and includes _mm256_dpbusd_epi32(). The difference is that the left operand must be unsigned, so you have to keep _mm256_sign_epi8(...) and only replace the _mm256_maddubs_epi16() + sum_i16_pairs_float() pair with _mm256_dpbusd_epi32().

Essentially, this would be a backport of what I did with AVX-512 (zmm registers) to AVX-VNNI (ymm registers), which was proposed by @ultoris here.

Maybe we should open a separate PR for the VNNI optimization? It might speed up the quantized dot product, which is on the hot code path, so it would be nice to get performance measurements independently of the packNibbles() optimization.
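For concreteness, a minimal sketch of what that replacement could look like (untested; the __AVXVNNI__ guard and the exact helper shape are my assumptions, and the sign juggling mirrors the existing AVX2 path):

#include <immintrin.h>

#if __AVXVNNI__
// Multiply int8 pairs of x and y, accumulate groups of four products into
// 32-bit lanes, and convert to float. VPDPBUSD requires an unsigned left
// operand, so take |x| on the left and move x's sign onto y instead.
static inline __m256 mul_sum_i8_pairs_float(const __m256i x, const __m256i y) {
    const __m256i ax = _mm256_sign_epi8(x, x);  // |x|, usable as unsigned bytes
    const __m256i sy = _mm256_sign_epi8(y, x);  // y with the sign of x
    const __m256i summed_pairs =
        _mm256_dpbusd_epi32(_mm256_setzero_si256(), ax, sy);
    return _mm256_cvtepi32_ps(summed_pairs);
}
#endif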
> What we can use is AVX-VNNI. It is present on already existing Intel CPUs starting from Alder Lake (12th gen) and includes _mm256_dpbusd_epi32(). The difference is that the left operand must be unsigned, so you have to keep _mm256_sign_epi8(...) and only replace the _mm256_maddubs_epi16() + sum_i16_pairs_float() pair with _mm256_dpbusd_epi32().
Good suggestion! I have changed mul_sum_i8_pairs_float to use AVX-VNNI. I also think packNibbles does not affect the inference speed, so it should be fine to measure inference performance in this PR.
#if __AVX512F__
    const __m256i bytes_srli_4 = _mm256_srli_epi16(bytes, 4);  // 0000_0000_abcd_0000
    bytes = _mm256_or_si256(bytes, bytes_srli_4);              // 0000_abcd_abcd_efgh
    return _mm256_cvtepi16_epi8(bytes);                        // abcd_efgh
#else
This looks great, thank you! I made a microbenchmark, and it shows a nice improvement on a Tiger Lake CPU:
-----------------------------------------------------------------
Benchmark                      Time             CPU   Iterations
-----------------------------------------------------------------
BenchPackNibblesAvx2         583 ns          582 ns      1212627
BenchPackNibblesAvx512       514 ns          513 ns      1313028
I don't think packNibbles() is on the hot path during inference, so it will probably not affect the overall inference speed (might be wrong here). However, it makes the code both more readable and more performant, and is worth merging.
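For reference, here is a self-contained sketch of how the two branches could fit together; the AVX512 branch is taken from this PR's diff, while the fallback paraphrases the pre-existing AVX2 path from memory and may differ from the actual ggml.c in detail:

#include <immintrin.h>

// Pack the low nibbles of sixteen 16-bit lanes (0000_abcd_0000_efgh each)
// into sixteen bytes (abcd_efgh each).
static inline __m128i packNibbles(__m256i bytes) {
#if __AVX512F__
    // Three instructions: shift, or, truncating narrow (VPMOVWB).
    const __m256i bytes_srli_4 = _mm256_srli_epi16(bytes, 4);  // 0000_0000_abcd_0000
    bytes = _mm256_or_si256(bytes, bytes_srli_4);              // 0000_abcd_abcd_efgh
    return _mm256_cvtepi16_epi8(bytes);                        // abcd_efgh
#else
    // AVX2 fallback: move the high nibble down within each 16-bit lane,
    // then saturate-pack the 16-bit lanes into bytes.
    const __m256i lowByte = _mm256_set1_epi16(0xFF);
    __m256i high = _mm256_andnot_si256(lowByte, bytes);        // 0000_abcd_0000_0000
    const __m256i low = _mm256_and_si256(lowByte, bytes);      // 0000_0000_0000_efgh
    high = _mm256_srli_epi16(high, 4);                         // 0000_0000_abcd_0000
    bytes = _mm256_or_si256(low, high);                        // 0000_0000_abcd_efgh
    const __m128i r0 = _mm256_castsi256_si128(bytes);
    const __m128i r1 = _mm256_extracti128_si256(bytes, 1);
    return _mm_packus_epi16(r0, r1);
#endif
}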
Use only three instructions to implement packNibbles when AVX512 is available. (The _mm256_cvtepi16_epi8 intrinsic requires AVX512 support.)