Performance of llama.cpp on Apple Silicon M-series #4167
Replies: 55 comments · 77 replies
-
M2 Mac Mini, 4+4 CPU, 10 GPU, 24 GB Memory (@QueryType) ✅
build: 8e672ef (1550)
-
M2 Max Studio, 8+4 CPU, 38 GPU ✅
build: 8e672ef (1550)
-
M2 Ultra, 16+8 CPU, 60 GPU (@crasm) ✅
build: 8e672ef (1550)
-
M3 Max (MBP 16), 12+4 CPU, 40 GPU (@ymcui) ✅
build: 55978ce (1555). Note: mostly similar to the results reported by @slaren, but for Q4_0 …
-
In the graph, why is PP t/s plotted against bandwidth and TG t/s plotted against GPU cores? Seems like GPU cores have more effect on PP t/s.
-
How about also sharing the largest model sizes and context lengths people can run with their amount of RAM? It's important to get the amount of RAM right when buying Apple computers because you can't upgrade later.
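As a rough rule of thumb (an estimate, not something measured in this thread): resident memory is approximately the quantized model size plus the KV cache, and macOS only exposes roughly two-thirds to three-quarters of unified memory to the GPU. A minimal sketch of the arithmetic, assuming a LLaMA 2 7B-class model (32 layers, 32 KV heads, head dim 128) with an fp16 KV cache:

```sh
# quantized weights: a 7B model at Q4_0 is roughly 3.9 GB
# fp16 KV cache bytes = n_layers * 2 (K and V) * n_ctx * n_kv_heads * head_dim * 2 bytes
echo $(( 32 * 2 * 4096 * 32 * 128 * 2 ))   # 2147483648 bytes = 2 GiB at 4096 context
# ~3.9 GB weights + ~2 GiB KV cache + overhead: comfortable on 16 GB, tight on 8 GB
```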
-
M2 Pro, 6+4 CPU, 16 GPU (@minosvasilias) ✅
build: e9c13ff (1560)
-
Would love to see how M1 Max and M1 Ultra fare given their high memory bandwidth.
-
M2 Max (MBP 16), 8+4 CPU, 38 GPU, 96 GB RAM (@MrSparc) ✅
build: e9c13ff (1560)
-
M1 Max (MBP 16) 8+2 CPU, 32 GPU, 64GB RAM (@CedricYauLBD) ✅
build: e9c13ff (1560). Note: M1 Max RAM bandwidth is 400 GB/s.
-
Look at what I started
-
M3 Pro (MBP 14), 5+6 CPU, 14 GPU (@paramaggarwal) ✅
build: e9c13ff (1560)
-
M2 Max (MBP 16), 38 GPU, 32 GB RAM ✅
build: 795cd5a (1493)
-
I'm looking at the summary plot of "PP performance vs GPU cores", and it appears to show that the original unquantised fp16 model always delivers higher performance than the quantized models.
-
M2 MacBook Air, 8‑core CPU, 8‑core GPU, 16GB RAM ✅
build: 8e672ef (1550)
-
Would it be possible for MBP owners to add performance stats on Mixtral 8x7B quantized models?
-
M1 Ultra, 16+4 core CPU, 64 core GPU, 128GB RAM ✅
build: 8e672efe (1550). I'll work on the quantized models soon; I'm having some challenges getting them working.
-
Major edit!
Xeon 3435X with 256 GB RAM and 2x 20 GB Nvidia RTX 4000, compiled with CUDA (toolkit 12.3), on WSL2 Ubuntu 22.04 (kernel 5.15.133.1, Windows driver 546.01).
There is one bench output that worries me: I use the low-core-count 20 GB VRAM RTX 4000 cards because I thought the combined VRAM would be important. I wonder why @marcingomulkiewicz's 5800X3D with an RTX 4090 shows roughly 10x the speed, albeit on Linux directly... @vitali-fridman Glad that my 2x RTX 4000 seem to be roughly on par with 2x 3090 when run directly on Ubuntu rather than via WSL! Thanks for any insights.
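For reference, a minimal sketch of how such a cuBLAS build and multi-GPU run is typically done with llama.cpp of that era (the flag names are assumptions based on the Makefile of the time, not the poster's exact commands):

```sh
# build with cuBLAS support (CUDA toolkit on PATH)
make LLAMA_CUBLAS=1 -j llama-bench

# run fully offloaded; with two visible GPUs the layers are split across both by default
CUDA_VISIBLE_DEVICES=0,1 ./llama-bench \
  -m models/llama-7b-v2/ggml-model-q4_0.gguf \
  -p 512 -n 128 -ngl 99
```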
-
Since people have already posted some non-Apple results, here is one more just for comparison. CPU: AMD 7840U, GPU: 780M, RAM: LPDDR5-7500.
build: 213d143, ROCm, compiled without shared memory because the latter affects generation very badly.
-
I have the same performance on an M3 Max as the OP when running the benchmark, but there seems to be a big gap between Llama and LLaVA performance. As the Llama and LLaVA 1.5 architectures don't differ much (LLaVA 1.5 = ViT + projection layer + Llama), I'm quite surprised. I'm sorry I can't give you any figures, as I don't think there's a benchmark script for multimodal in llama.cpp. Does anyone have the same feeling?
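Since there is no llama-bench equivalent for multimodal, a crude comparison is to time the llava-cli example directly. A sketch, assuming a LLaVA 1.5 GGUF plus its mmproj file (paths are illustrative):

```sh
# times image encoding (ViT + projector) plus generation; compare the per-stage
# timings printed at the end against a plain llama.cpp run of the same base model
time ./llava-cli \
  -m models/llava-1.5-7b/ggml-model-q4_0.gguf \
  --mmproj models/llava-1.5-7b/mmproj-model-f16.gguf \
  --image some-image.jpg \
  -p "Describe the image." \
  -ngl 99
```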
-
Hey, I know this is focused on a very specific benchmark, but I wanted to draw attention to #5617, as it is performance related (albeit only affecting the new IQ quant types) and I didn't want it getting lost in discussions forever.
-
Can somebody run the benchmark on an Apple M1 or M2 under Asahi Linux?
-
What data/prompts are used for this?
-
I have just tested the latest Apple M4 chip in the iPad Pro 2024 11-inch (256 GB). The main difference between the two versions of the M4 is the number of performance cores; the 1 TB/2 TB iPad Pro also has double the memory (16 GB).
The following is a quick benchmark of the M4 (iPad Pro 2024 256 GB), 3+6 CPU, 10 GPU:
- tinyllama 1.1b
- phi-2 2.7B
- mistral 7b: TBA
-
Can you explain how to do that?
-
Well, the HuggingFace repo says that Gemma-7B-it.gguf is 34.7 GB, so I haven't tried it because it looks obviously too big to run, but if you can buy a 64 GB machine, I'd recommend that you do.
On Sat, Jun 15, 2024 at 2:53 PM, Alptekin wrote:
@pudepiedj So you cannot run Gemma-7B-it.gguf on your M2 Max 12-core CPU, 38-core GPU? I am considering buying a similar config (with 64 GB RAM) on a Mac Studio, so I am curious. Thanks.
-
Where can I get the models listed in the command line?
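For reference, a sketch of how the LLaMA 7B v2 GGUF files used by the benchmark command are typically produced, assuming you have been granted access to the Llama 2 weights on Hugging Face (repo name and paths are illustrative):

```sh
# download the original weights (requires accepting Meta's license on Hugging Face)
huggingface-cli download meta-llama/Llama-2-7b-hf --local-dir models/llama-7b-v2

# convert to GGUF (F16), then quantize to Q8_0 and Q4_0
python3 convert.py models/llama-7b-v2 --outtype f16 --outfile models/llama-7b-v2/ggml-model-f16.gguf
./quantize models/llama-7b-v2/ggml-model-f16.gguf models/llama-7b-v2/ggml-model-q8_0.gguf q8_0
./quantize models/llama-7b-v2/ggml-model-f16.gguf models/llama-7b-v2/ggml-model-q4_0.gguf q4_0
```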
-
There won't be an M3 Ultra; you should remove it from the list.
-
M4 Pro, 10+4 CPU, 20 GPU, 24 GB Memory (@miccou) ✅
build: 8e672ef (1550)
-
Summary
[Summary table for LLaMA 7B: memory bandwidth [GB/s], GPU cores, and PP/TG throughput [t/s] columns per device; plots generated with plot.py]
Description
This is a collection of short llama.cpp benchmarks on various Apple Silicon hardware. It can be useful for comparing the performance that llama.cpp achieves across the M-series chips and will hopefully answer the questions of people wondering whether they should upgrade or not. Info is collected here just for Apple Silicon for simplicity; a similar collection for A-series chips is available here: #4508.
If you are a collaborator on the project and have an Apple Silicon device, please add your device, results, and optionally your username for the following command directly into this post (requires LLaMA 7B v2):
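A minimal sketch of the kind of llama-bench invocation used, assuming F16, Q8_0 and Q4_0 GGUF conversions of LLaMA 7B v2 under models/llama-7b-v2/ (the exact flags in the original post may differ):

```sh
# build llama.cpp (Metal is enabled by default on Apple Silicon) and run the benchmark
make -j llama-bench

# PP = prompt processing over 512 tokens, TG = generation of 128 tokens,
# fully offloaded to the GPU (-ngl 99), for three quantizations of LLaMA 7B v2
./llama-bench \
  -m models/llama-7b-v2/ggml-model-f16.gguf \
  -m models/llama-7b-v2/ggml-model-q8_0.gguf \
  -m models/llama-7b-v2/ggml-model-q4_0.gguf \
  -p 512 -n 128 -ngl 99
```

llama-bench prints a markdown table per model; the pp 512 and tg 128 rows are what the entries below report.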
PP means "prompt processing" (bs = 512), TG means "text generation" (bs = 1), t/s means "tokens per second".

M1 Pro, 8+2 CPU, 16 GPU (@ggerganov) ✅
build: 8e672ef (1550)
M2 Ultra, 16+8 CPU, 76 GPU (@ggerganov) ✅
build: 8e672ef (1550)
M3 Max (MBP 14), 12+4 CPU, 40 GPU (@slaren) ✅
build: d103d93 (1553)
Footnotes
1. https://en.wikipedia.org/wiki/Apple_M1#Variants
2. https://en.wikipedia.org/wiki/Apple_M2#Variants
3. https://en.wikipedia.org/wiki/Apple_M3#Variants
4. https://en.wikipedia.org/wiki/Apple_M4#Variants