Performance of llama.cpp on Apple Silicon M-series #4167
Replies: 61 comments 98 replies
-
M2 Mac Mini, 4+4 CPU, 10 GPU, 24 GB Memory (@QueryType) ✅
build: 8e672ef (1550)
-
M2 Max Studio, 8+4 CPU, 38 GPU ✅
build: 8e672ef (1550)
-
M2 Ultra, 16+8 CPU, 60 GPU (@crasm) ✅
build: 8e672ef (1550)
-
M3 Max (MBP 16), 12+4 CPU, 40 GPU (@ymcui) ✅
build: 55978ce (1555)
Short note: mostly similar to the one reported by @slaren. But for Q4_0
-
In the graph, why is PP t/s plotted against bandwidth and TG t/s plotted against GPU cores? Seems like GPU cores have more effect on PP t/s.
-
How about also sharing the largest model sizes and context lengths people can run with their amount of RAM? It's important to get the amount of RAM right when buying Apple computers because you can't upgrade later.
-
M2 Pro, 6+4 CPU, 16 GPU (@minosvasilias) ✅
build: e9c13ff (1560)
-
Would love to see how M1 Max and M1 Ultra fare given their high memory bandwidth.
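As a rough sketch of why memory bandwidth matters for this question: text generation at batch size 1 has to read approximately all model weights once per generated token, so bandwidth divided by model size gives an approximate ceiling on TG t/s. This is a back-of-the-envelope heuristic, not something measured in this thread. The function name and the 3.5 GB model size are illustrative assumptions; 400 GB/s is the M1 Max figure quoted elsewhere in this discussion.

```python
def tg_upper_bound(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Approximate ceiling on text-generation speed in tokens per second.

    Assumes every generated token reads all model weights once from memory
    and ignores compute, KV-cache traffic and other overheads.
    """
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Hypothetical inputs: 400 GB/s bandwidth, about 3.5 GB of quantized 7B weights.
print(f"{tg_upper_bound(400.0, 3.5):.1f} t/s ceiling")
```

Measured TG numbers should land below this ceiling. Prompt processing works on a whole batch of tokens at once, so it tends to be limited by compute rather than by this bound.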
-
M2 MAX (MBP 16) 8+4 CPU, 38 GPU, 96 GB RAM (@MrSparc) ✅
build: e9c13ff (1560)
-
M1 Max (MBP 16) 8+2 CPU, 32 GPU, 64GB RAM (@CedricYauLBD) ✅
build: e9c13ff (1560)
Note: M1 Max RAM bandwidth is 400GB/s
-
Look at what I started
-
M3 Pro (MBP 14), 5+6 CPU, 14 GPU (@paramaggarwal) ✅
build: e9c13ff (1560)
-
M2 MAX (MBP 16) 38 Core 32GB ✅
build: 795cd5a (1493)
-
I'm looking at the summary plot of "PP performance vs GPU cores" and see evidence that the original unquantized fp16 model always delivers more performance than the quantized models.
-
Hey, I know this is focused on a very specific benchmark, but I wanted to draw attention to #5617. It is performance related, albeit only affecting the new IQ quant types, and I didn't want it to get lost in the discussions forever.
-
Can somebody run the benchmark on an Apple M1 or M2 under Asahi Linux?
-
What data/prompts are used for this?
-
I have just tested the latest Apple M4 chip in the 11-inch iPad Pro 2024 (256GB). The main difference between the two versions of the M4 is the number of performance cores. The 1TB/2TB iPad Pro also has double the memory (16GB).
The following is a quick benchmark of the M4.
M4 (iPad Pro 2024 256GB), 3+6 CPU, 10 GPU
tinyllama 1.1b
phi-2 2.7B
TBA
mistral 7b
-
Can you explain how to do that?
-
Well, the HuggingFace repo says that Gemma-7B-it.gguf is 34.7GB, so I
haven't tried because it looks obviously too big to run, but if you can buy
a 64GB machine, I'd recommend that you do.
On Sat, Jun 15, 2024 at 2:53 PM Alptekin wrote:
> @pudepiedj So you cannot run Gemma-7B-it.gguf on your M2 Max 12-core CPU 38-core GPU? I am considering buying a similar config (with 64GB RAM) on a Mac Studio, so I am curious. Thanks.
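The "will it fit" reasoning above can be sketched as simple arithmetic: weights take roughly parameter count times bits per weight. The snippet below is a minimal estimate under stated assumptions, not llama.cpp's own accounting. The function names and the 8 GB headroom default are hypothetical. The bits-per-weight values for Q4_0 and Q8_0 follow from their block layout (32 weights plus one fp16 scale per block). Real GGUF files can differ because some tensors are stored at other precisions.

```python
# Bits per weight for a few GGUF storage types. Q4_0 and Q8_0 store blocks of
# 32 weights plus one fp16 scale, which gives 4.5 and 8.5 bits per weight.
BITS_PER_WEIGHT = {"F32": 32.0, "F16": 16.0, "Q8_0": 8.5, "Q4_0": 4.5}


def weights_size_gb(n_params: float, quant: str) -> float:
    """Estimate the size of the weights alone in GB (10^9 bytes)."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


def fits(n_params: float, quant: str, ram_gb: float, headroom_gb: float = 8.0) -> bool:
    """Rough check; headroom_gb (OS, KV cache, buffers) is an assumed value."""
    return weights_size_gb(n_params, quant) + headroom_gb <= ram_gb


for quant in ("F16", "Q8_0", "Q4_0"):
    print(f"7B  {quant}: {weights_size_gb(7e9, quant):.1f} GB")
print(f"70B Q4_0: {weights_size_gb(70e9, 'Q4_0'):.1f} GB, "
      f"fits in 64 GB: {fits(70e9, 'Q4_0', 64.0)}")
```

By the same arithmetic, a file of about 34.7 GB corresponds to roughly 8.7 billion weights at 32 bits each, which suggests (but does not confirm) that the file mentioned above is unquantized and that a quantized variant would be much smaller.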
-
Where can I get the models listed in the command line?
-
There won't be an M3 Ultra, you should remove that from the list
-
M4 Pro, 10+4 CPU, 20 GPU, 24 GB Memory (@miccou) ✅
build: 8e672ef (1550)
-
M4 (Mac Mini 2024), 4+6 CPU, 10 GPU, 32 GB Memory ✅
build: 8e672ef (1550)
-
M4 Max (MacBook Pro 16" 2024), 12+4 CPU, 40 GPU, 128 GB Memory ✅
build: 8e672ef (1550)
-
Has anyone tried the M4 Pro with 64GB? Is it possible to run a 70B model at a usable speed?
-
If I'm reading this correctly, the M3 Pro is slower than the M2 Pro?
-
M4 Pro, 8+4 CPU, 16 GPU, 24 GB Memory (MBP 14) ✅
build: 8e672ef (1550)
-
M4 Max (MacBook Pro 14" 2024), 12+4 CPU, 40 GPU, 128 GB Memory
build: 8e672ef (1550)
-
Summary
[Table: LLaMA 7B results per chip. Columns: memory bandwidth [GB/s], GPU cores, and six throughput columns in [t/s].]
plot.py
Description
This is a collection of short llama.cpp benchmarks on various Apple Silicon hardware. It can be useful to compare the performance that llama.cpp achieves across the M-series chips and hopefully answer questions of people wondering if they should upgrade or not. Collecting info here just for Apple Silicon for simplicity. Similar collection for A-series chips is available here: #4508
If you are a collaborator to the project and have an Apple Silicon device, please add your device, results and optionally username for the following command directly into this post (requires LLaMA 7B v2):
PP means "prompt processing" (bs = 512), TG means "text-generation" (bs = 1), t/s means "tokens per second"
Note that in this benchmark we are evaluating the performance against the same build 8e672ef (2023 Nov 13) in order to keep all performance factors even. Since then, there have been multiple improvements resulting in better absolute performance. As an example, here is how the same test compares against the build 86ed72d (2024 Nov 21) on M2 Ultra:
[Table: M2 Ultra results for the two builds. Columns: memory bandwidth [GB/s], GPU cores, and six throughput columns in [t/s].]
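For readers unfamiliar with the units, the t/s figures and the build-to-build comparison described above reduce to two small formulas. The sketch below is illustrative only: the function names are hypothetical and the timings are made-up inputs, not measurements from this thread or output from llama.cpp.

```python
def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Throughput in tokens per second (t/s)."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return n_tokens / elapsed_s


def speedup(new_tps: float, old_tps: float) -> float:
    """Ratio of two throughputs, e.g. a newer build against an older one."""
    return new_tps / old_tps


# Hypothetical timings, not measurements from this thread.
pp = tokens_per_second(512, 0.8)   # one prompt batch of 512 tokens
tg = tokens_per_second(128, 4.0)   # 128 tokens generated one at a time
print(f"PP {pp:.1f} t/s, TG {tg:.1f} t/s")
print(f"speedup of 150 t/s over 100 t/s: {speedup(150.0, 100.0):.2f}x")
```

Comparing ratios rather than absolute numbers is what makes results from different builds usable side by side, which is why the thread pins most entries to a single build.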
M1 Pro, 8+2 CPU, 16 GPU (@ggerganov) ✅
build: 8e672ef (1550)
M2 Ultra, 16+8 CPU, 76 GPU (@ggerganov) ✅
build: 8e672ef (1550)
M3 Max (MBP 14), 12+4 CPU, 40 GPU (@slaren) ✅
build: d103d93 (1553)
Footnotes
https://en.wikipedia.org/wiki/Apple_M1#Variants
https://en.wikipedia.org/wiki/Apple_M2#Variants
https://en.wikipedia.org/wiki/Apple_M3#Variants
https://en.wikipedia.org/wiki/Apple_M4#Variants