Performance of llama.cpp on Apple Silicon M-series #4167
Replies: 55 comments · 77 replies
-
M2 Mac Mini, 4+4 CPU, 10 GPU, 24 GB Memory (@QueryType) ✅
build: 8e672ef (1550)
-
M2 Max Studio, 8+4 CPU, 38 GPU ✅
build: 8e672ef (1550)
-
M2 Ultra, 16+8 CPU, 60 GPU (@crasm) ✅
build: 8e672ef (1550)
-
M3 Max (MBP 16), 12+4 CPU, 40 GPU (@ymcui) ✅
build: 55978ce (1555). Note: mostly similar to the results reported by @slaren, but for Q4_0 …
-
In the graph, why is PP t/s plotted against bandwidth and TG t/s plotted against GPU cores? Seems like GPU cores have more effect on PP t/s.
-
How about also sharing the largest model sizes and context lengths people can run with their amount of RAM? It's important to get the amount of RAM right when buying Apple computers because you can't upgrade later.
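As a rough rule of thumb (an estimate, not something measured in this thread): resident memory is approximately the quantized model size plus the KV cache, and macOS only exposes roughly two-thirds to three-quarters of unified memory to the GPU. A minimal sketch of the arithmetic, assuming a LLaMA 2 7B-class model (32 layers, 32 KV heads, head dim 128) with an fp16 KV cache:

```sh
# quantized weights: a 7B model at Q4_0 is roughly 3.9 GB
# fp16 KV cache bytes = n_layers * 2 (K and V) * n_ctx * n_kv_heads * head_dim * 2 bytes
echo $(( 32 * 2 * 4096 * 32 * 128 * 2 ))   # 2147483648 bytes = 2 GiB at 4096 context
# ~3.9 GB weights + ~2 GiB KV cache + overhead: comfortable on 16 GB, tight on 8 GB
```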
-
M2 Pro, 6+4 CPU, 16 GPU (@minosvasilias) ✅
build: e9c13ff (1560)
-
Would love to see how M1 Max and M1 Ultra fare given their high memory bandwidth.
-
M2 Max (MBP 16), 8+4 CPU, 38 GPU, 96 GB RAM (@MrSparc) ✅
build: e9c13ff (1560)
-
M1 Max (MBP 16) 8+2 CPU, 32 GPU, 64GB RAM (@CedricYauLBD) ✅
build: e9c13ff (1560). Note: M1 Max RAM bandwidth is 400 GB/s.
-
Look at what I started
-
M3 Pro (MBP 14), 5+6 CPU, 14 GPU (@paramaggarwal) ✅
build: e9c13ff (1560)
-
M2 Max (MBP 16), 38 GPU, 32 GB RAM ✅
build: 795cd5a (1493)
-
I'm looking at the summary plot of "PP performance vs GPU cores", and it appears to show that the original unquantised fp16 model always delivers higher performance than the quantized models.
-
M2 MacBook Air, 8‑core CPU, 8‑core GPU, 16GB RAM ✅
build: 8e672ef (1550)
-
Would it be possible for MBP owners to add performance stats on Mixtral 8x7B quantized models?
-
M1 Ultra, 16+4 core CPU, 64 core GPU, 128GB RAM ✅
build: 8e672efe (1550). I'll work on the quantized models soon; I'm having some challenges getting them working.
-
Major edit!
Xeon 3435X with 256 GB RAM and 2x 20 GB Nvidia RTX 4000, compiled with CUDA (toolkit 12.3), on WSL2 Ubuntu 22.04 (kernel 5.15.133.1, Windows driver 546.01).
There is one bench output that worries me: I use the low-core-count 20 GB VRAM RTX 4000 cards because I thought the combined VRAM would be important. I wonder why @marcingomulkiewicz's 5800X3D with an RTX 4090 shows roughly 10x the speed, albeit on Linux directly... @vitali-fridman Glad that my 2x RTX 4000 seem to be roughly on par with 2x 3090 when run directly on Ubuntu rather than via WSL! Thanks for any insights.
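For reference, a minimal sketch of how such a cuBLAS build and multi-GPU run is typically done with llama.cpp of that era (the flag names are assumptions based on the Makefile of the time, not the poster's exact commands):

```sh
# build with cuBLAS support (CUDA toolkit on PATH)
make LLAMA_CUBLAS=1 -j llama-bench

# run fully offloaded; with two visible GPUs the layers are split across both by default
CUDA_VISIBLE_DEVICES=0,1 ./llama-bench \
  -m models/llama-7b-v2/ggml-model-q4_0.gguf \
  -p 512 -n 128 -ngl 99
```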
-
Since people have already posted some non-Apple results, here is one more just for comparison. CPU: AMD 7840U, GPU: 780M, RAM: LPDDR5-7500.
build: 213d143, ROCm, compiled without shared memory because the latter affects generation very badly.
-
I have the same performance on an M3 Max as the OP when running the benchmark, but there seems to be a big gap between Llama and LLaVA performance. As the Llama and LLaVA 1.5 architectures don't differ much (LLaVA 1.5 = ViT + projection layer + Llama), I'm quite surprised. I'm sorry I can't give you any figures, as I don't think there's a benchmark script for multimodal in llama.cpp. Does anyone have the same feeling?
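Since there is no llama-bench equivalent for multimodal, a crude comparison is to time the llava-cli example directly. A sketch, assuming a LLaVA 1.5 GGUF plus its mmproj file (paths are illustrative):

```sh
# times image encoding (ViT + projector) plus generation; compare the per-stage
# timings printed at the end against a plain llama.cpp run of the same base model
time ./llava-cli \
  -m models/llava-1.5-7b/ggml-model-q4_0.gguf \
  --mmproj models/llava-1.5-7b/mmproj-model-f16.gguf \
  --image some-image.jpg \
  -p "Describe the image." \
  -ngl 99
```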
-
Hey, I know this is focused on a very specific benchmark, but I wanted to draw attention to #5617, as it is performance related (albeit only affecting the new IQ quant types) and I didn't want it getting lost in discussions forever.
-
Can somebody run the benchmark on an Apple M1 or M2 under Asahi Linux?
-
What data/prompts are used for this?
-
I have just tested the latest Apple M4 chip in the iPad Pro 2024 11-inch (256 GB). The main difference between the two versions of the M4 is the number of performance cores; the 1 TB/2 TB iPad Pro also has double the memory (16 GB).
The following is a quick benchmark of the M4 (iPad Pro 2024 256 GB), 3+6 CPU, 10 GPU:
- tinyllama 1.1b
- phi-2 2.7B
- mistral 7b: TBA
-
Can you explain how to do that?
-
Well, the HuggingFace repo says that Gemma-7B-it.gguf is 34.7 GB, so I haven't tried it because it looks obviously too big to run, but if you can buy a 64 GB machine, I'd recommend that you do.
On Sat, Jun 15, 2024 at 2:53 PM, Alptekin wrote:
@pudepiedj So you cannot run Gemma-7B-it.gguf on your M2 Max 12-core CPU, 38-core GPU? I am considering buying a similar config (with 64 GB RAM) on a Mac Studio, so I am curious. Thanks.
-
Where can I get the models listed in the command line?
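For reference, a sketch of how the LLaMA 7B v2 GGUF files used by the benchmark command are typically produced, assuming you have been granted access to the Llama 2 weights on Hugging Face (repo name and paths are illustrative):

```sh
# download the original weights (requires accepting Meta's license on Hugging Face)
huggingface-cli download meta-llama/Llama-2-7b-hf --local-dir models/llama-7b-v2

# convert to GGUF (F16), then quantize to Q8_0 and Q4_0
python3 convert.py models/llama-7b-v2 --outtype f16 --outfile models/llama-7b-v2/ggml-model-f16.gguf
./quantize models/llama-7b-v2/ggml-model-f16.gguf models/llama-7b-v2/ggml-model-q8_0.gguf q8_0
./quantize models/llama-7b-v2/ggml-model-f16.gguf models/llama-7b-v2/ggml-model-q4_0.gguf q4_0
```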
-
There won't be an M3 Ultra; you should remove it from the list.
-
M4 Pro, 10+4 CPU, 20 GPU, 24 GB Memory (@miccou) ✅
build: 8e672ef (1550)
-
Summary
[Summary table for LLaMA 7B: memory bandwidth [GB/s], GPU cores, and PP/TG throughput [t/s] columns per device; plots generated with plot.py]
Description
This is a collection of short llama.cpp benchmarks on various Apple Silicon hardware. It can be useful for comparing the performance that llama.cpp achieves across the M-series chips and will hopefully answer the questions of people wondering whether they should upgrade or not. Info is collected here just for Apple Silicon for simplicity; a similar collection for A-series chips is available here: #4508.
If you are a collaborator on the project and have an Apple Silicon device, please add your device, results, and optionally your username for the following command directly into this post (requires LLaMA 7B v2):
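A minimal sketch of the kind of llama-bench invocation used, assuming F16, Q8_0 and Q4_0 GGUF conversions of LLaMA 7B v2 under models/llama-7b-v2/ (the exact flags in the original post may differ):

```sh
# build llama.cpp (Metal is enabled by default on Apple Silicon) and run the benchmark
make -j llama-bench

# PP = prompt processing over 512 tokens, TG = generation of 128 tokens,
# fully offloaded to the GPU (-ngl 99), for three quantizations of LLaMA 7B v2
./llama-bench \
  -m models/llama-7b-v2/ggml-model-f16.gguf \
  -m models/llama-7b-v2/ggml-model-q8_0.gguf \
  -m models/llama-7b-v2/ggml-model-q4_0.gguf \
  -p 512 -n 128 -ngl 99
```

llama-bench prints a markdown table per model; the pp 512 and tg 128 rows are what the entries below report.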
PP means "prompt processing" (bs = 512), TG means "text generation" (bs = 1), t/s means "tokens per second".

M1 Pro, 8+2 CPU, 16 GPU (@ggerganov) ✅
build: 8e672ef (1550)
M2 Ultra, 16+8 CPU, 76 GPU (@ggerganov) ✅
build: 8e672ef (1550)
M3 Max (MBP 14), 12+4 CPU, 40 GPU (@slaren) ✅
build: d103d93 (1553)
Footnotes
1. https://en.wikipedia.org/wiki/Apple_M1#Variants
2. https://en.wikipedia.org/wiki/Apple_M2#Variants
3. https://en.wikipedia.org/wiki/Apple_M3#Variants
4. https://en.wikipedia.org/wiki/Apple_M4#Variants