-
From what I can see, this seems to be Apple's equivalent of PyTorch, and it is too high-level for what we need in ggml. However, the source code has a Metal backend, and we may be able to use it to learn how to better optimize our Metal kernels.
-
Performance-wise, I noticed here:
This was run on an M1 Ultra with the 7B-parameter Llama model (I assume Llama 2). According to llama.cpp's benchmark for the M1 Ultra with 48 GPU cores, we get 13.35 ms/t (74.93 t/s) for Q4_0 TG. I don't see any mention of quantization in their tutorial, so that's 39 ms/t unquantized vs. 13.35 ms/t at Q4_0 (assuming the same 48-core M1 Ultra).
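For reference, here is a quick sketch of how those two figures convert and compare (the 39 ms/t number is taken at face value from the MLX tutorial and assumed to be unquantized fp16):

```python
# Back-of-the-envelope comparison of the two numbers quoted above.
# 39 ms/token: MLX example figure (assumed unquantized fp16).
# 13.35 ms/token: llama.cpp Q4_0 TG on the M1 Ultra (48 GPU cores).

def tokens_per_second(ms_per_token: float) -> float:
    """Convert milliseconds per token to tokens per second."""
    return 1000.0 / ms_per_token

mlx_ms, llamacpp_ms = 39.0, 13.35
print(f"MLX (fp16, assumed): {tokens_per_second(mlx_ms):.2f} t/s")       # ~25.6 t/s
print(f"llama.cpp Q4_0:      {tokens_per_second(llamacpp_ms):.2f} t/s")  # ~74.9 t/s
print(f"speed ratio:         {mlx_ms / llamacpp_ms:.2f}x")               # ~2.9x
```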
-
FYI, MLX v0.0.9 also just added experimental GGUF file support (ml-explore/mlx#350).
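Based on that PR, loading appears to go through `mlx.core.load`, which dispatches on the file extension. A minimal sketch (the model path is a placeholder, and the exact return shape is my assumption, not verified against v0.0.9):

```python
# Sketch of reading a GGUF file with MLX, per the description in ml-explore/mlx#350.
# "model.gguf" is a placeholder path; the return value is assumed to be a dict
# mapping tensor names to mx.array objects.
import mlx.core as mx

weights = mx.load("model.gguf")  # dispatches on the .gguf extension
for name, tensor in list(weights.items())[:5]:
    print(name, tensor.shape, tensor.dtype)
```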
-
Now, six months on from the release of MLX, I'm curious to know: has MLX been beneficial to llama.cpp?
-
In particular, what about combining MLX and MPS?
-
I just stumbled upon this: https://github.com/ml-explore/mlx
"MLX is an array framework for machine learning on Apple silicon, brought to you by Apple machine learning research."
Can someone help me understand how this will affect llama.cpp and whisper.cpp?
It looks like their examples reference both of those projects.
Can we leverage this in our repos and make them even faster?
Best,
Adi