Add partial AVX512 Linux support for dot product on 4-bit quantized values #80

Closed · wants to merge 2 commits

Conversation

@Ameobea commented Mar 20, 2023

Changes

  • Update the Makefile to detect AVX512 support and add the relevant compiler flags when it is available
  • Add an AVX512 implementation based on the existing AVX2 one: the dot product is computed on one 32-value block of 4-bit quantized ints at a time (a rough sketch follows below)
  • Perform 8-bit -> 16-bit sign extension and multiply+add on 32 values at a time instead of 16
  • Use the built-in AVX512 horizontal reduce-add to get the sum at the end
  • Manually unroll the inner dot-product loop to reduce loop-counter overhead
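
To make the bullets above concrete, here is a minimal sketch of what such an AVX512 kernel can look like. This is not the PR's verbatim code: the block_q4_0 layout, QK, and the bytes_from_nibbles helper are assumptions modeled on the existing AVX2 path in ggml.c, the nibble-expansion order is only illustrative, and the manual unrolling mentioned above is omitted. For simplicity it also reduces each block to a scalar immediately, whereas the changes described above do the reduce-add once at the end.

```c
#include <immintrin.h>
#include <stdint.h>

#define QK 32  // values per quantized block (assumed, mirrors ggml's Q4_0)

// Assumed block layout, modeled on ggml's block_q4_0.
typedef struct {
    float   d;           // per-block scale factor
    uint8_t qs[QK / 2];  // 32 4-bit values, packed two per byte
} block_q4_0;

// Expand 16 packed bytes into 32 signed 8-bit values in [-8, 7].
// The expansion order only has to match between the two operands.
static inline __m256i bytes_from_nibbles(const uint8_t *qs) {
    const __m128i raw = _mm_loadu_si128((const __m128i *) qs);
    const __m128i lo  = _mm_and_si128(raw, _mm_set1_epi8(0x0F));
    const __m128i hi  = _mm_and_si128(_mm_srli_epi16(raw, 4), _mm_set1_epi8(0x0F));
    const __m256i u   = _mm256_set_m128i(hi, lo);
    return _mm256_sub_epi8(u, _mm256_set1_epi8(8));  // re-center around zero
}

// Dot product over n values (n a multiple of QK) of two Q4_0-quantized vectors.
float dot_q4_0_avx512(int n, const block_q4_0 *x, const block_q4_0 *y) {
    const int nb = n / QK;
    float sum = 0.0f;

    for (int i = 0; i < nb; ++i) {
        // 8-bit -> 16-bit sign extension, 32 lanes at once (AVX512BW).
        const __m512i vx = _mm512_cvtepi8_epi16(bytes_from_nibbles(x[i].qs));
        const __m512i vy = _mm512_cvtepi8_epi16(bytes_from_nibbles(y[i].qs));

        // Multiply 16-bit lanes and add adjacent pairs into 32-bit lanes.
        const __m512i prod = _mm512_madd_epi16(vx, vy);

        // Horizontal reduce-add of the 16 partial sums, then apply both scales.
        sum += x[i].d * y[i].d * (float) _mm512_reduce_add_epi32(prod);
    }
    return sum;
}
```

The point of the 512-bit path is that the sign extension and _mm512_madd_epi16 cover all 32 values of a block in a single instruction each, where the AVX2 version needs two passes of 16.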

Performance Impact

I'm seeing around a 10% speedup on the 4-bit quantized 7B model when running on my AMD 7950X.

Before:

main: mem per token = 14368644 bytes
main:     load time =   923.25 ms
main:   sample time =    85.94 ms
main:  predict time = 23502.37 ms / 92.17 ms per token
main:    total time = 24845.69 ms

After:

main: mem per token = 14368644 bytes
main:     load time =   928.89 ms
main:   sample time =    16.18 ms
main:  predict time =  5720.41 ms / 82.90 ms per token
main:    total time =  6982.89 ms

I was hoping for more, but some other things I tried, like converting the bytesFromNibbles function to operate on two blocks at a time using AVX512, were not successful.

Ameobea added 2 commits March 19, 2023 23:15
 * Update Makefile to detect AVX512 support and add compiler flags if it's available
 * Based on existing AVX2 implementation, dot product on one 32-value block of 4-bit quantized ints at a time
 * Perform 8 bit -> 16 bit sign extension and multiply+add on 32 values at a time instead of 16
 * Use built-in AVX512 horizontal reduce add to get sum at the end
 * Manual unrolling on inner dot product loop to reduce loop counter overhead
 * Add some extra AVX512 compiler flags if detected in makefile
@antimatter15 (Owner)

Looks great!

That said, I'm trying to minimize the number of deviations from the upstream https://github.com/ggerganov/llama.cpp repo, so upstream would be a more appropriate place for this PR!

@Ameobea (Author) commented Mar 20, 2023

Yeah I'll get a PR up there tomorrow as well.

In the meantime, if anyone else has access to a Linux machine with an AVX512-capable CPU, it would be great if they could test this to make sure it works on their setup as well.

@Ameobea (Author) commented Mar 20, 2023

Created ggml-org#320
