feat: tune llama metal backend performance #393
Conversation
@ggerganov I did an experimental integration with ggml's Metal backend and generated the numbers above. Does engine.cc look good to you overall? Let me know if there are any flags worth trying for better performance.
Do you have
Let me know if you have more questions.
We're tagging a submodule located at [...]. Seems the fp16 change might already be included?
Yes, seems to be included. On M2 Ultra, I get ~40 t/s with Code Llama 7B F16, so I was expecting at least ~20 t/s on M2 Max:

./bin/main -m ../models/codellama-7b/ggml-model-f16.gguf -p "# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:\n\n" -e -t 4 --temp -1 -n 128
system_info: n_threads = 4 / 24 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = -1.000000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0
# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:
### Complexity Analysis:
- Time Complexity: O(V^2), where V is the number of vertices.
- Space Complexity: O(V).
### Pseudocode:
```python
Dijkstra(Graph, source):
dist[source] = 0 // Distance from source to source
for each vertex v in Graph: // Initialization
if v != source:
dist[v] = INFINITY // Unknown distance function from source to v
prev[v] = UNDEFINED
llama_print_timings: load time = 540.25 ms
llama_print_timings: sample time = 81.64 ms / 128 runs ( 0.64 ms per token, 1567.76 tokens per second)
llama_print_timings: prompt eval time = 45.21 ms / 25 tokens ( 1.81 ms per token, 552.93 tokens per second)
llama_print_timings: eval time = 3180.65 ms / 127 runs ( 25.04 ms per token, 39.93 tokens per second)
llama_print_timings: total time = 3328.93 ms
ggml_metal_free: deallocating
Log end
# assuming `-ngl 1` should be added here
./bin/main -m ../models/codellama-model-f16.gguf -p "# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:\n\n" -e -t 4 --temp -1 -n 128 -ngl 1
system_info: n_threads = 4 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = -1.000000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0
# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:
### Complexity Analysis:
- Time Complexity: O(V^2), where V is the number of vertices.
- Space Complexity: O(V).
### Pseudocode:
```python
Dijkstra(Graph, source):
dist[source] = 0 // Distance from source to source
for each vertex v in Graph: // Initialization
if v != source:
dist[v] = INFINITY // Unknown distance function from source to v
prev[v] = UNDEFINED
llama_print_timings: load time = 997.82 ms
llama_print_timings: sample time = 82.49 ms / 128 runs ( 0.64 ms per token, 1551.74 tokens per second)
llama_print_timings: prompt eval time = 63.77 ms / 25 tokens ( 2.55 ms per token, 392.03 tokens per second)
llama_print_timings: eval time = 5228.16 ms / 127 runs ( 41.17 ms per token, 24.29 tokens per second)
llama_print_timings: total time = 5396.12 ms
ggml_metal_free: deallocating
Log end

The performance is indeed ~41 ms per token (~24 t/s). Let me compare engine.cc and example/main.cc in depth...
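For reference, the `-ngl 1` flag in example/main.cc maps to the `n_gpu_layers` field of the llama.cpp context parameters, so engine.cc can request the same Metal offload in code. The snippet below is a minimal sketch, assuming the C API of this llama.cpp version (`llama_context_params` with an `n_gpu_layers` field, `llama_load_model_from_file`, `llama_new_context_with_model`); it is not the actual engine.cc code, and the model path is just an example.

```cpp
#include "llama.h"
#include <cstdio>

int main() {
    llama_backend_init(false /* numa */);

    // Start from the defaults, then ask llama.cpp to offload layers to the GPU.
    // With the Metal backend, any n_gpu_layers > 0 moves evaluation onto the GPU.
    llama_context_params params = llama_context_default_params();
    params.n_gpu_layers = 1;  // equivalent to `-ngl 1` in example/main.cc

    llama_model * model = llama_load_model_from_file(
        "../models/codellama-7b/ggml-model-f16.gguf", params);
    if (!model) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    llama_context * ctx = llama_new_context_with_model(model, params);
    // ... tokenize the prompt and run eval with ctx ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```

In this version, a positive `n_gpu_layers` is what switches on the Metal path; the exact field and function names may differ in later llama.cpp revisions.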
I just realized that I updated the PR's llama.cpp version in commit 8d01604 without redoing the measurement. Here are the statistics for the latest version:

float16:
llama_print_timings: load time = 4544.11 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 464.48 ms / 29 tokens ( 16.02 ms per token, 62.44 tokens per second)
llama_print_timings: eval time = 1176.45 ms / 29 runs ( 40.57 ms per token, 24.65 tokens per second)
llama_print_timings: total time = 1642.56 ms

q8_0:
llama_print_timings: load time = 293.14 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 292.94 ms / 29 tokens ( 10.10 ms per token, 99.00 tokens per second)
llama_print_timings: eval time = 709.46 ms / 29 runs ( 24.46 ms per token, 40.88 tokens per second)
llama_print_timings: total time = 1003.70 ms

I am amazed by the 2x performance improvement. Thanks for pointing it out!
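In case it helps reproduce the q8_0 numbers above: the F16 GGUF can be converted through llama.cpp's quantization API. This is a minimal sketch, assuming the `llama_model_quantize` C API and the `LLAMA_FTYPE_MOSTLY_Q8_0` file type present in this llama.cpp version; the input/output paths are illustrative.

```cpp
#include "llama.h"
#include <cstdio>

int main() {
    // Start from the default quantization parameters and select q8_0.
    llama_model_quantize_params qparams = llama_model_quantize_default_params();
    qparams.ftype   = LLAMA_FTYPE_MOSTLY_Q8_0;
    qparams.nthread = 4;

    // Reads the F16 GGUF and writes a q8_0 GGUF next to it.
    const int rc = llama_model_quantize(
        "../models/codellama-7b/ggml-model-f16.gguf",
        "../models/codellama-7b/ggml-model-q8_0.gguf",
        &qparams);

    if (rc != 0) {
        fprintf(stderr, "quantization failed (%d)\n", rc);
        return 1;
    }
    return 0;
}
```

The `quantize` example tool shipped with llama.cpp wraps this same call, so either route should produce an equivalent q8_0 file.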
Chip: Apple M2 Max
Memory: 96 GB
All tests were done with TabbyML/CodeLlama-7B, using the same request for the float16 and q8_0 runs.