feat: tune llama metal backend performance #393
Conversation
@ggerganov I did an experimental integration with ggml's Metal backend and generated the numbers above. Does engine.cc look good to you overall? Let me know if there are any flags worth trying for better performance.
Do you have
Let me know if you have more questions.
We're tagging a submodule located at [...]. Seems the fp16 change might already be included?
Yes, seems to be included. On M2 Ultra, I get ~40 t/s with Code Llama 7B F16, so I was expecting at least ~20 t/s on M2 Max:

./bin/main -m ../models/codellama-7b/ggml-model-f16.gguf -p "# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:\n\n" -e -t 4 --temp -1 -n 128
system_info: n_threads = 4 / 24 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = -1.000000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0
# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:
### Complexity Analysis:
- Time Complexity: O(V^2), where V is the number of vertices.
- Space Complexity: O(V).
### Pseudocode:
```python
Dijkstra(Graph, source):
dist[source] = 0 // Distance from source to source
for each vertex v in Graph: // Initialization
if v != source:
dist[v] = INFINITY // Unknown distance function from source to v
prev[v] = UNDEFINED
llama_print_timings: load time = 540.25 ms
llama_print_timings: sample time = 81.64 ms / 128 runs ( 0.64 ms per token, 1567.76 tokens per second)
llama_print_timings: prompt eval time = 45.21 ms / 25 tokens ( 1.81 ms per token, 552.93 tokens per second)
llama_print_timings: eval time = 3180.65 ms / 127 runs ( 25.04 ms per token, 39.93 tokens per second)
llama_print_timings: total time = 3328.93 ms
ggml_metal_free: deallocating
Log end
# assuming `-ngl 1` should be added here
./bin/main -m ../models/codellama-model-f16.gguf -p "# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:\n\n" -e -t 4 --temp -1 -n 128 -ngl 1
system_info: n_threads = 4 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = -1.000000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0
# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:
### Complexity Analysis:
- Time Complexity: O(V^2), where V is the number of vertices.
- Space Complexity: O(V).
### Pseudocode:
```python
Dijkstra(Graph, source):
dist[source] = 0 // Distance from source to source
for each vertex v in Graph: // Initialization
if v != source:
dist[v] = INFINITY // Unknown distance function from source to v
prev[v] = UNDEFINED
llama_print_timings: load time = 997.82 ms
llama_print_timings: sample time = 82.49 ms / 128 runs ( 0.64 ms per token, 1551.74 tokens per second)
llama_print_timings: prompt eval time = 63.77 ms / 25 tokens ( 2.55 ms per token, 392.03 tokens per second)
llama_print_timings: eval time = 5228.16 ms / 127 runs ( 41.17 ms per token, 24.29 tokens per second)
llama_print_timings: total time = 5396.12 ms
ggml_metal_free: deallocating
Log end

The performance is indeed ~41 ms per token (~24 t/s). Let me compare engine.cc and example/main.cc in depth...
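For reference, the `-ngl 1` flag in example/main.cc maps to the `n_gpu_layers` field of the llama.cpp context parameters, so engine.cc can request the same Metal offload in code. The snippet below is a minimal sketch, assuming the C API of this llama.cpp version (`llama_context_params` with an `n_gpu_layers` field, `llama_load_model_from_file`, `llama_new_context_with_model`); it is not the actual engine.cc code, and the model path is just an example.

```cpp
#include "llama.h"
#include <cstdio>

int main() {
    llama_backend_init(false /* numa */);

    // Start from the defaults, then ask llama.cpp to offload layers to the GPU.
    // With the Metal backend, any n_gpu_layers > 0 moves evaluation onto the GPU.
    llama_context_params params = llama_context_default_params();
    params.n_gpu_layers = 1;  // equivalent to `-ngl 1` in example/main.cc

    llama_model * model = llama_load_model_from_file(
        "../models/codellama-7b/ggml-model-f16.gguf", params);
    if (!model) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    llama_context * ctx = llama_new_context_with_model(model, params);
    // ... tokenize the prompt and run eval with ctx ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```

In this version, a positive `n_gpu_layers` is what switches on the Metal path; the exact field and function names may differ in later llama.cpp revisions.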
I just realized that I updated the PR's llama.cpp version in commit 8d01604 without redoing the measurement. Here are the statistics for the latest version:

float16:
llama_print_timings: load time = 4544.11 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 464.48 ms / 29 tokens ( 16.02 ms per token, 62.44 tokens per second)
llama_print_timings: eval time = 1176.45 ms / 29 runs ( 40.57 ms per token, 24.65 tokens per second)
llama_print_timings: total time = 1642.56 ms

q8_0:
llama_print_timings: load time = 293.14 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 292.94 ms / 29 tokens ( 10.10 ms per token, 99.00 tokens per second)
llama_print_timings: eval time = 709.46 ms / 29 runs ( 24.46 ms per token, 40.88 tokens per second)
llama_print_timings: total time = 1003.70 ms

I am amazed by the 2x performance improvement. Thanks for pointing it out!
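In case it helps reproduce the q8_0 numbers above: the F16 GGUF can be converted through llama.cpp's quantization API. This is a minimal sketch, assuming the `llama_model_quantize` C API and the `LLAMA_FTYPE_MOSTLY_Q8_0` file type present in this llama.cpp version; the input/output paths are illustrative.

```cpp
#include "llama.h"
#include <cstdio>

int main() {
    // Start from the default quantization parameters and select q8_0.
    llama_model_quantize_params qparams = llama_model_quantize_default_params();
    qparams.ftype   = LLAMA_FTYPE_MOSTLY_Q8_0;
    qparams.nthread = 4;

    // Reads the F16 GGUF and writes a q8_0 GGUF next to it.
    const int rc = llama_model_quantize(
        "../models/codellama-7b/ggml-model-f16.gguf",
        "../models/codellama-7b/ggml-model-q8_0.gguf",
        &qparams);

    if (rc != 0) {
        fprintf(stderr, "quantization failed (%d)\n", rc);
        return 1;
    }
    return 0;
}
```

The `quantize` example tool shipped with llama.cpp wraps this same call, so either route should produce an equivalent q8_0 file.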
Chip: Apple M2 Max
Memory: 96 GB
All tests were done with TabbyML/CodeLlama-7B, using the same request for the float16 and q8_0 runs.