Add partial AVX512 Linux support for dot product on 4-bit quantized values #80

Closed · wants to merge 2 commits

Conversation

@Ameobea commented Mar 20, 2023

Changes

  • Update the Makefile to detect AVX512 support and add the relevant compiler flags when it is available
  • Add an AVX512 implementation based on the existing AVX2 one: the dot product is computed on one 32-value block of 4-bit quantized ints at a time (a rough sketch follows below)
  • Perform 8-bit -> 16-bit sign extension and multiply+add on 32 values at a time instead of 16
  • Use the built-in AVX512 horizontal reduce-add to get the sum at the end
  • Manually unroll the inner dot-product loop to reduce loop-counter overhead
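
To make the bullets above concrete, here is a minimal sketch of what such an AVX512 kernel can look like. This is not the PR's verbatim code: the block_q4_0 layout, QK, and the bytes_from_nibbles helper are assumptions modeled on the existing AVX2 path in ggml.c, the nibble-expansion order is only illustrative, and the manual unrolling mentioned above is omitted. For simplicity it also reduces each block to a scalar immediately, whereas the changes described above do the reduce-add once at the end.

```c
#include <immintrin.h>
#include <stdint.h>

#define QK 32  // values per quantized block (assumed, mirrors ggml's Q4_0)

// Assumed block layout, modeled on ggml's block_q4_0.
typedef struct {
    float   d;           // per-block scale factor
    uint8_t qs[QK / 2];  // 32 4-bit values, packed two per byte
} block_q4_0;

// Expand 16 packed bytes into 32 signed 8-bit values in [-8, 7].
// The expansion order only has to match between the two operands.
static inline __m256i bytes_from_nibbles(const uint8_t *qs) {
    const __m128i raw = _mm_loadu_si128((const __m128i *) qs);
    const __m128i lo  = _mm_and_si128(raw, _mm_set1_epi8(0x0F));
    const __m128i hi  = _mm_and_si128(_mm_srli_epi16(raw, 4), _mm_set1_epi8(0x0F));
    const __m256i u   = _mm256_set_m128i(hi, lo);
    return _mm256_sub_epi8(u, _mm256_set1_epi8(8));  // re-center around zero
}

// Dot product over n values (n a multiple of QK) of two Q4_0-quantized vectors.
float dot_q4_0_avx512(int n, const block_q4_0 *x, const block_q4_0 *y) {
    const int nb = n / QK;
    float sum = 0.0f;

    for (int i = 0; i < nb; ++i) {
        // 8-bit -> 16-bit sign extension, 32 lanes at once (AVX512BW).
        const __m512i vx = _mm512_cvtepi8_epi16(bytes_from_nibbles(x[i].qs));
        const __m512i vy = _mm512_cvtepi8_epi16(bytes_from_nibbles(y[i].qs));

        // Multiply 16-bit lanes and add adjacent pairs into 32-bit lanes.
        const __m512i prod = _mm512_madd_epi16(vx, vy);

        // Horizontal reduce-add of the 16 partial sums, then apply both scales.
        sum += x[i].d * y[i].d * (float) _mm512_reduce_add_epi32(prod);
    }
    return sum;
}
```

The point of the 512-bit path is that the sign extension and _mm512_madd_epi16 cover all 32 values of a block in a single instruction each, where the AVX2 version needs two passes of 16.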

Performance Impact

I'm seeing around a 10% speedup on the 4-bit quantized 7B model when running on my AMD 7950X.

Before:

main: mem per token = 14368644 bytes
main:     load time =   923.25 ms
main:   sample time =    85.94 ms
main:  predict time = 23502.37 ms / 92.17 ms per token
main:    total time = 24845.69 ms

After:

main: mem per token = 14368644 bytes
main:     load time =   928.89 ms
main:   sample time =    16.18 ms
main:  predict time =  5720.41 ms / 82.90 ms per token
main:    total time =  6982.89 ms

I was hoping for more, but some other things I tried, like converting the bytesFromNibbles function to operate on two blocks at a time using AVX512, were not successful.

Ameobea added 2 commits March 19, 2023 23:15
 * Update Makefile to detect AVX512 support and add compiler flags if it's available
 * Based on existing AVX2 implementation, dot product on one 32-value block of 4-bit quantized ints at a time
 * Perform 8 bit -> 16 bit sign extension and multiply+add on 32 values at a time instead of 16
 * Use built-in AVX512 horizontal reduce add to get sum at the end
 * Manual unrolling on inner dot product loop to reduce loop counter overhead
 * Add some extra AVX512 compiler flags if detected in makefile
@antimatter15 (Owner)

Looks great!

That said, I'm trying to minimize the number of deviations from the upstream https://github.com/ggerganov/llama.cpp repo, so upstream would be a more appropriate place for this PR!

@Ameobea (Author) commented Mar 20, 2023

Yeah I'll get a PR up there tomorrow as well.

In the meantime, if anyone else has access to a Linux machine with an AVX512-capable CPU, it would be great if they could test this to make sure it works on their setup as well.

@Ameobea (Author) commented Mar 20, 2023

Created ggml-org#320
