
A better packNibbles and mul_sum_i8_pairs_float implementation using AVX512 #1119

Merged 3 commits into master on Apr 23, 2023

Conversation

MeouSker77 (Contributor)

This uses only three instructions to implement packNibbles when AVX512 is available. (_mm256_cvtepi16_epi8 requires AVX512 support.)

sw (Contributor) commented Apr 22, 2023

With an AVX512 machine, you may want to look into using _mm256_dpbssd_epi32 in mul_sum_i8_pairs_float, that could give another speed boost. (Preprocessor condition: #if __AVXVNNIINT8__)

Rebasing/merging latest master should fix the failing checks.

MeouSker77 (Contributor, Author)

With an AVX512 machine, you may want to look into using _mm256_dpbssd_epi32 in mul_sum_i8_pairs_float, that could give another speed boost. (Preprocessor condition: #if __AVXVNNIINT8__)

Thank you very much for your suggestion!

MeouSker77 changed the title from "A better packNibbles implementation using AVX512" to "A better packNibbles and mul_sum_i8_pairs_float implementation using AVX512" on Apr 22, 2023
@sw sw requested a review from dfyz April 22, 2023 13:46
ggml.c Outdated
Comment on lines 508 to 512
#if __AVXVNNIINT8__
const __m256i zero = _mm256_setzero_si256();
const __m256i summed_pairs = _mm256_dpbssd_epi32(zero, x, y);
return _mm256_cvtepi32_ps(summed_pairs);
#else
dfyz (Collaborator) commented Apr 22, 2023

As far as I'm aware, there is no hardware out there supporting AVX-VNNI-INT8 yet, so I don't think it's a good idea to use _mm256_dpbssd_epi32() here (the code is under an #if, so there is no harm in merging this, but it isn't useful either).

What we can use is AVX-VNNI. It is present on already existing Intel CPUs starting from Alder Lake (12th gen) and includes _mm256_dpbusd_epi32(). The difference is the left operand should be unsigned, so you have to keep _mm256_sign_epi8(...) and only replace the _mm256_maddubs_epi16() + sum_i16_pairs_float() pair with _mm256_dpbusd_epi32().

Essentially, this would be a backport of what I did with AVX-512 (zmm registers) to AVX-VNNI (ymm registers), which was proposed by @ultoris here.

Maybe we should open a separate PR for the VNNI optimization? It might speed up the quantized dot product, which is on the hot code path, so it would be nice to get performance measurements independently of the packNibbles() optimization.

MeouSker77 (Contributor, Author) replied:

What we can use is AVX-VNNI. It is present on already existing Intel CPUs starting from Alder Lake (12th gen) and includes _mm256_dpbusd_epi32(). The difference is the left operand should be unsigned, so you have to keep _mm256_sign_epi8(...) and only replace the _mm256_maddubs_epi16() + sum_i16_pairs_float() pair with _mm256_dpbusd_epi32().

Good suggestion! I have changed mul_sum_i8_pairs_float to use AVX-VNNI. I also think packNibbles does not affect inference speed, so it should be fine to measure inference performance in this PR.

Comment on lines +526 to +530
#if __AVX512F__
const __m256i bytes_srli_4 = _mm256_srli_epi16(bytes, 4); // 0000_0000_abcd_0000
bytes = _mm256_or_si256(bytes, bytes_srli_4); // 0000_abcd_abcd_efgh
return _mm256_cvtepi16_epi8(bytes); // abcd_efgh
#else
dfyz (Collaborator) commented:

This looks great, thank you! I made a microbenchmark, and it shows a nice improvement on a Tiger Lake CPU:

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
BenchPackNibblesAvx2          583 ns          582 ns      1212627
BenchPackNibblesAvx512        514 ns          513 ns      1313028

I don't think packNibbles() is on the hot path during inference, so it will probably not affect the overall inference speed (I might be wrong here). However, it makes the code both more readable and more performant, and is worth merging.

@sw sw merged commit c9e2c26 into ggml-org:master Apr 23, 2023
@MeouSker77 MeouSker77 deleted the avx512-packNibbles branch April 23, 2023 10:16