-
From what I can see, this seems to be Apple's equivalent of PyTorch, and it is too high-level for what we need in ggml. However, the source code has a Metal backend, and we may be able to use it to learn how to better optimize our Metal kernels.
-
Performance-wise, I noticed here:
This was run on an M1 Ultra with the 7B-parameter Llama model (I assume Llama 2). According to llama.cpp's benchmark for the M1 Ultra with 48 GPU cores, we get 13.35 ms/t (74.93 t/s) for Q4_0 TG. I don't see any mention of quantization in their tutorial, so that's 39 ms/t unquantized vs. 13.35 ms/t at Q4_0 (assuming the same 48-core M1 Ultra).
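For reference, here is a quick sketch of how those two figures convert and compare (the 39 ms/t number is taken at face value from the MLX tutorial and assumed to be unquantized fp16):

```python
# Back-of-the-envelope comparison of the two numbers quoted above.
# 39 ms/token: MLX example figure (assumed unquantized fp16).
# 13.35 ms/token: llama.cpp Q4_0 TG on the M1 Ultra (48 GPU cores).

def tokens_per_second(ms_per_token: float) -> float:
    """Convert milliseconds per token to tokens per second."""
    return 1000.0 / ms_per_token

mlx_ms, llamacpp_ms = 39.0, 13.35
print(f"MLX (fp16, assumed): {tokens_per_second(mlx_ms):.2f} t/s")       # ~25.6 t/s
print(f"llama.cpp Q4_0:      {tokens_per_second(llamacpp_ms):.2f} t/s")  # ~74.9 t/s
print(f"speed ratio:         {mlx_ms / llamacpp_ms:.2f}x")               # ~2.9x
```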
-
FYI, MLX v0.0.9 also just added experimental GGUF file support (ml-explore/mlx#350).
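Based on that PR, loading appears to go through `mlx.core.load`, which dispatches on the file extension. A minimal sketch (the model path is a placeholder, and the exact return shape is my assumption, not verified against v0.0.9):

```python
# Sketch of reading a GGUF file with MLX, per the description in ml-explore/mlx#350.
# "model.gguf" is a placeholder path; the return value is assumed to be a dict
# mapping tensor names to mx.array objects.
import mlx.core as mx

weights = mx.load("model.gguf")  # dispatches on the .gguf extension
for name, tensor in list(weights.items())[:5]:
    print(name, tensor.shape, tensor.dtype)
```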
-
Now, six months on from the release of MLX, I'm curious to know: has MLX been beneficial to llama.cpp?
-
In particular, what about combining MLX and MPS?
-
I just stumbled upon this: https://github.com/ml-explore/mlx
"MLX is an array framework for machine learning on Apple silicon, brought to you by Apple machine learning research."
Can someone help me understand how this will affect llama.cpp and whisper.cpp?
It looks like their examples reference both of those projects.
Can we leverage this in our repos and make them even faster?
Best,
Adi