llama : Metal inference #1642
Conversation
Ok, the Q4 mul mat kernel is next - very important to get this right. |
A bit of advice, when I made the kernel above, I ran the CPU-side script over a dozen times per change to the Metal code. I ran until I was confident I had found the maximum achievable bandwidth. Although this overestimates actual performance, it removes all noise, so you can focus on relative performance. "Does this change make it slightly faster or slightly slower?" Then it's very similar to training a neural network. Incrementally descend the performance slope until reaching whatever Metal shader works best for you. |
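The repeat-and-keep-the-best approach described above can be sketched as follows. This is a minimal illustration, not the script that was actually used: `best_bandwidth_gbps` is a hypothetical helper, and a CPU buffer copy stands in for the Metal kernel dispatch.

```python
import time


def best_bandwidth_gbps(run_once, bytes_moved, repeats=12):
    """Run the workload several times and report the highest bandwidth seen (GB/s).

    Keeping the fastest run overestimates typical performance, but it filters
    out noise, which makes small relative changes between variants visible.
    """
    best_seconds = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        best_seconds = min(best_seconds, time.perf_counter() - start)
    # Guard against a timer too coarse to resolve a very short run.
    return bytes_moved / max(best_seconds, 1e-9) / 1e9


if __name__ == "__main__":
    # Hypothetical stand-in workload: copy a 16 MiB buffer (read + write).
    src = bytearray(16 * 1024 * 1024)
    gbps = best_bandwidth_gbps(lambda: bytes(src), 2 * len(src))
    print(f"best observed copy bandwidth: {gbps:.1f} GB/s")
```

Comparing this single number before and after each shader change gives the "slightly faster or slightly slower" signal the comment describes.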
I'm considering purchasing the Mac Studio with the M2 Ultra (76-core, 192GB). I'm curious about its performance with a 65B 4-bit model. Could you provide some details? Does it run about the same as an A6000 (9~13 tokens/s with 65B 4-bit)? |
Wouldn't it be cheaper to just purchase access to GPT-4 through the OpenAI API? If the goal is the highest-quality LLM models available. |
I started working on my benchmark app. I’ll publish some alpha results when I get it setup to benchmark every quant and param value for a given model and put it in a table.
If you can get GPT-4 access. That said, gpt-3.5-turbo is still better than any local LLM and is much cheaper. Using GPU instances like runpod is also way cheaper for non-24/7 use vs building even a mid-level setup. I was looking at an AMD 6950 XT to mess with AMD support and decided that, since it's not officially supported by ROCm, I'll be using Azure AMD instances. I can get about 300 hours for the price of that card. Running your own hardware doesn't make sense from a cost perspective unless you're literally doing it 24/7. Even then I'm not sure what the break-even point is due to power bills. Of course, it's not like "makes sense from a cost perspective" is always a priority with hobbies. |
My benchmark app can go through some models in a directory, but eventually dies with an out-of-memory error. This appears to be an issue with llama-cpp-python. I don't think the CPU thread piece is working properly. I removed the prompt eval times from this, as they are much slower than what I get if I run llama.cpp directly. The eval times appear to be in line with running llama.cpp directly. I may switch to spawning llama.cpp directly, or I'll look into fixing llama-cpp-python. Not sure yet, but I'll have time next week to work on that and flesh this out more. Here's what I have from running it against a few 65b models: Racing Llama Benchmark System Information: Runs: 10 Assistant:Eval Tokens per second:
|
@soleblaze Wow, you have the top-end M2 Ultra, congrats. Do you know how it compares to something like an Nvidia 4090? Though maybe it can't even run on a 4090 because of the RAM requirements? |
$200 more for 5x less bandwidth. Not 5% less, 5x less.
|
I think that cost comparison is a bit misleading, considering you'd also need a motherboard that can handle the 4 cards, two power supplies, two electrical circuits, fast RAM, and a CPU that won't bottleneck 4 cards. I'm not sure where the multi-GPU support is on this and whether the cards would need to use the PCI bus to share a lot of data. That said, I would never argue that the M2 Ultra is the better buy for this use case. The main thing the M2 has going for it is power efficiency and a small form factor. I'm guessing that if llama.cpp gets the MFA stuff that philipturner is working on, it could hit 3080-4080 levels of performance. I should have my benchmark app to the point where it'd be useful to do a comparison when that happens. It would be nice to put a 2x 3090 box on that list. IMO that's the best performance per dollar, and I'm not sure a realistic home use would go over the 48GB of RAM. Plus you'd get NVLink support. |
Exactly. To buy into the CUDA ecosystem, you have to set up a Windows PC, with a massive box and 500 W power supply. I am all for using existing hardware, which I already own, to do the computations. Not for getting new hardware unless it can be built for free (my end goal with nanotech; build a personal supercluster).
You're implying that a 2-GPU system costs $6,000, factoring in the CPU and box? |
Does 4x GPUs really offer 4x the bandwidth? If I remember correctly, with multiple GPUs the inference speed does not seem to scale proportionally, although I haven't had the chance to test it (cc @JohannesGaessler) |
You can shard the feedforward and attention layers straightforwardly. The bottleneck could be the latency-bound process of broadcasting the result vector to the peers for the next feedforward.
Amdahl’s Law |
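A toy sketch of the sharding idea mentioned above, assuming a row-wise split of the weight matrix (one possible scheme; the thread does not specify one). `matvec` and `sharded_matvec` are hypothetical helpers, and the "devices" are simulated by a plain loop.

```python
def matvec(rows, x):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]


def sharded_matvec(weight, x, n_devices):
    """Split the weight rows across n_devices and gather the partial outputs.

    Each shard only needs its own rows plus the full input vector, so the
    per-device work is independent. The final gather stands in for the
    latency-bound step of sending the result to every peer.
    """
    rows_per_device = -(-len(weight) // n_devices)  # ceiling division
    partials = []
    for d in range(n_devices):
        shard = weight[d * rows_per_device:(d + 1) * rows_per_device]
        partials.append(matvec(shard, x))  # would run on device d
    return [y for part in partials for y in part]  # gather / broadcast


if __name__ == "__main__":
    W = [[1, 2], [3, 4], [5, 6], [7, 8]]
    print(sharded_matvec(W, [1, 1], n_devices=2))
```

The compute per device shrinks with the number of shards, while the gather step does not, which is where the Amdahl's Law remark comes in.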
Lol no. Well, I'm sure there's some boutique builders that have some offerings at that level. More to show how much lower that would cost. I don't think you start hitting the large non-GPU costs of a build until you go to 3 or 4 GPUs. I am really curious about the performance difference with multiple GPUs. It looks like runpod goes up to 8x 4090. I'll put it on my list to look at. Guessing runpod will be a good starting point to compare hardware performance differences. |
You do get 4x the bandwidth, the problem is actually utilizing it. For large tensors like the matrix multiplication the scaling should be roughly linear but the problem is that for all of the small tensors the overhead from moving data between GPUs is larger than just doing the calculation on a single GPU. So this limits how much of the program you can actually parallelize and the parallelization itself introduces overhead (currently the CUDA synchronization logic for multi GPU settings still needs a lot of optimization). One possible way to improve this would be to fuse tensors where applicable so that you have one large tensor instead of many small tensors which can then be handled more efficiently (this would be beneficial in general). There is also the issue that writing code that utilizes multiple GPUs simply takes more work to develop and maintain; right now only matrix multiplications using the weights can be parallelized with the CUDA implementation (~67% of the runtime). The matrix multiplications using the KV cache could also be parallelized for another ~20% of the runtime. |
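As a rough illustration of the percentages above, Amdahl's Law gives an upper bound on the multi-GPU speedup. This sketch treats the ~67% and ~20% figures from the comment as the parallelizable fractions and ignores all synchronization overhead, so real numbers would be lower.

```python
def amdahl_speedup(parallel_fraction, n_devices):
    """Upper bound on speedup when only part of the runtime is parallelized."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_devices)


if __name__ == "__main__":
    # Weight matmuls only (~67% of runtime) on 4 GPUs.
    print(f"{amdahl_speedup(0.67, 4):.2f}x")
    # Weight matmuls plus KV-cache matmuls (~87% of runtime) on 4 GPUs.
    print(f"{amdahl_speedup(0.87, 4):.2f}x")
```

Under these assumptions, 4 GPUs give at most about 2x with the current parallelization and just under 3x if the KV cache multiplications were parallelized as well.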
|
Since we're talking about performance and what the manufacturers' max specs are, rather than ask my questions here about what matters and what's worth benchmarking, I opened a discussion: #2038. Can y'all give me input on this? I think y'all have the most knowledge around what we actually care about, and I want to have a way for people to run useful benchmarks on their system vs trusting that the manufacturer maximums are achievable (for instance, I don't really trust that 800 GB/s is achievable on Apple silicon Ultra chips for a single workload) |
Does it really compare, though? You may have to add the performance of both the CPU and the Neural Engine of M2 Ultra systems together to benchmark bang-for-the-buck performance. I am also not sure about the 660.6/2 TFLOPS FP16 figure. The Wikipedia page (https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units) shows 82 TFLOPS at half precision. Maybe I am missing the source. Same with the M2 Ultra: the Wikipedia page (https://en.wikipedia.org/wiki/Apple_M2) shows 27.2 TFLOPS FP32 performance. Does that mean that FP16 performance is 2x27.2 TFLOPS? |
They cannot be used simultaneously without a hideous latency.
330 is for tensor cores. 82.5 is for shader cores.
That is theoretical max ALU. The SIMD MATMUL FMADD32 instruction performs 32 float ops in 18 cycles, while the 16-bit version takes 17 cycles. So max FP32 matmul is 24.2 TFLOPS, max FP16 matmul is 25.6 TFLOPS. |
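The two matmul figures follow from scaling the 27.2 TFLOPS peak by the instruction's cycle count. The sketch below assumes that 32 float ops would ideally take 16 cycles at peak ALU rate; that baseline is not stated in the comment, but it reproduces both quoted numbers.

```python
def matmul_tflops(peak_alu_tflops, ideal_cycles, actual_cycles):
    """Scale peak ALU throughput by the fraction of cycles doing useful work."""
    return peak_alu_tflops * ideal_cycles / actual_cycles


if __name__ == "__main__":
    PEAK = 27.2   # M2 Ultra theoretical max ALU, TFLOPS (from the thread)
    IDEAL = 16    # assumed cycles for 32 float ops at peak rate
    print(f"FP32 matmul: {matmul_tflops(PEAK, IDEAL, 18):.1f} TFLOPS")
    print(f"FP16 matmul: {matmul_tflops(PEAK, IDEAL, 17):.1f} TFLOPS")
```

So FP16 is only marginally faster than FP32 for matmul here, rather than twice as fast.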
Thanks for the clarification Philip. This is very useful.
Just one more follow-up question, if you don't mind. Did you mean that I can't run two different instances of llama.cpp, one on the M1 Metal GPU and another on the multithreaded M1 CPU, without incurring significant latency?
|
Okpatil4u, the ggml.ai homepage shows a screen recording of "Simultaneously running 4 instances of 13B LLaMA + Whisper Small on a single M1 Pro". I take this to mean that you can run multiple models at once. I assume there are complications to this though. |
It's actually quite efficient (theoretically) because you perform 4 batched inferences. The latency is the same as 1 inference, until the batch size becomes so large it's compute-bound. |
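A simple cost model shows why small batches are "free". It is a sketch under stated assumptions: the weights are read once per step regardless of batch size, each weight costs about 2 FLOPs per sequence, and KV-cache and activation traffic are ignored. The 800 GB/s and 25.6 TFLOPS values are the M2 Ultra figures quoted earlier in the thread, and 4.5 bits per weight is an assumed size for 4-bit quantization including scales.

```python
def step_seconds(n_params, bits_per_weight, batch, bandwidth_bytes_s, flops_s):
    """Estimated time for one decoding step over `batch` sequences."""
    weight_bytes = n_params * bits_per_weight / 8
    memory_time = weight_bytes / bandwidth_bytes_s   # weights streamed once per step
    compute_time = 2 * n_params * batch / flops_s    # grows linearly with batch
    return max(memory_time, compute_time)


if __name__ == "__main__":
    for batch in (1, 4, 16):
        t = step_seconds(13e9, 4.5, batch, 800e9, 25.6e12)
        print(f"batch {batch:2d}: {t * 1e3:.2f} ms per step")
```

With these numbers the step time is identical for batch sizes 1 and 4 and only starts to rise once the compute term overtakes the memory term.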
On a Mac, is processing of the user's prompt less efficient than on NVIDIA? For example, in summarization the text to be summarized is the prompt, and in my case (M2) it takes a long time to process, while people with NVIDIA cards don't have that problem. |
That's because Metal Performance Shaders has a very inefficient GEMM, which is a compute-bound operation. Token decoding (tokens/second) is a memory-bound operation. |
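The distinction can be made concrete with two back-of-the-envelope estimates. Both are rough sketches: decoding is bounded by how fast the weights can be streamed from memory, while prompt processing is bounded by GEMM throughput. The `gemm_efficiency` parameter and the 4.5 bits-per-weight figure are assumptions for illustration, not measured values.

```python
def max_decode_tokens_per_second(n_params, bits_per_weight, bandwidth_bytes_s):
    """Memory-bound limit: every generated token reads all weights once."""
    weight_bytes = n_params * bits_per_weight / 8
    return bandwidth_bytes_s / weight_bytes


def prompt_seconds(n_params, n_prompt_tokens, flops_s, gemm_efficiency):
    """Compute-bound estimate: ~2 FLOPs per weight per prompt token."""
    return 2 * n_params * n_prompt_tokens / (flops_s * gemm_efficiency)


if __name__ == "__main__":
    # 65B model, 800 GB/s: best-case decoding speed.
    print(f"{max_decode_tokens_per_second(65e9, 4.5, 800e9):.1f} tokens/s")
    # 2048-token prompt at 25.6 TFLOPS with an efficient vs inefficient GEMM.
    for eff in (0.8, 0.2):
        print(f"efficiency {eff:.0%}: {prompt_seconds(65e9, 2048, 25.6e12, eff):.1f} s")
```

A slow GEMM leaves the tokens/s figure untouched but multiplies the time spent before the first token appears, which matches the summarization experience described above.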
The ANE is designed for convolutions, so its GEMM throughput is ~25% of the advertised TFLOPS. On everything besides A-series chips, it's slower than the GPU. The solution is a more performant GPU GEMM library. |
Can we use CLBlast to speed up prompt ingestion? It apparently supports M1 and M2. |
Unlikely. The next performance jump will come from quantum matrix multiplication. |
CLBlast is slower than Metal Performance Shaders. Only able to reach 28% ALU utilization and unable to use half precision.
Yes, using quantum computers to multiply hermitian matrices and solve eigenvalue problems in under |
Add full GPU inference of LLaMA on Apple Silicon using Metal
Demo
M1 Pro + 7B LLaMA:
llama-metal-0.mp4
M2 Max + 7B LLaMA:
llama-metal-1-lq.mp4
M2 Max + 13B LLaMA:
llama-metal-13B-0-lq.mp4
M2 Max + 65B LLaMa:
llama-metal-65B-0-lq.mp4
Details
ggml API is extended in ggml-metal.h
… Q4_0, but all other quantizations can easily be added in the future
… mmap to avoid model data duplication in memory. Still, there are a few memory improvements that can be made in the future to reduce the memory usage when Metal is enabled
… ggml_graph_compute() and its purpose is to evaluate a ggml_cgraph on the GPU in a similar way
… qMatrix x Vector multiplication, which is normally needed for LLM text generation. For other tasks that involve Matrix x Matrix (for example prompt ingestion, perplexity computation, etc.) we don't have an efficient implementation yet, so we fall back to the CPU / ANE
ggml-metal.h, ggml-metal.m and ggml-metal.metal files are optional and all Metal-related code is contained within them. 3rd-party user apps can decide whether they want to include / modify / ignore them
Usage
Add LLAMA_METAL=1 to your make command or -DLLAMA_METAL=ON to your cmake command.
Add -ngl 1 to main command-line arguments to enable GPU inference
Implementation process of this PR (archive)
Export a ggml computation graph of a LLaMA model:
This creates the llama.ggml file which contains the computation graph
We will now load it with a separate tool and attempt to evaluate with Metal:
Implement the entire network layer by layer, comparing the CPU and GPU results
Optimize the kernels to achieve at the very least parity with CPU-only speed
Adjust dynamic shapes before evaluating the graph (i.e. n_past, N)
Simplify encoder dispatch code, reduce duplication
Add basic text-generation example
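The layer-by-layer comparison step in the list above could look like the following sketch. It is not the PR's actual tooling: `outputs_match` and `max_abs_diff` are hypothetical helpers, and the tolerances are placeholder values that would need tuning for half-precision kernels.

```python
import math


def max_abs_diff(cpu, gpu):
    """Largest element-wise absolute difference between two flat tensors."""
    return max((abs(c - g) for c, g in zip(cpu, gpu)), default=0.0)


def outputs_match(cpu, gpu, rel_tol=1e-3, abs_tol=1e-5):
    """True if every GPU value is within tolerance of the CPU reference."""
    if len(cpu) != len(gpu):
        return False
    return all(
        math.isclose(c, g, rel_tol=rel_tol, abs_tol=abs_tol)
        for c, g in zip(cpu, gpu)
    )


if __name__ == "__main__":
    cpu_out = [0.125, -1.5, 3.0]       # reference values from the CPU path
    gpu_out = [0.125, -1.5004, 3.0]    # values read back from the GPU path
    print(outputs_match(cpu_out, gpu_out), max_abs_diff(cpu_out, gpu_out))
```

Checking each layer's output against the CPU reference before moving to the next one keeps a kernel bug from hiding behind later layers.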
Robots
🤖 Generated by Copilot at 324e823
Summary
🍎📝🚀
This pull request adds Metal support for llama, a library for tensor manipulation and computation graph export/import. It introduces a new CMake option LLAMA_METAL and a new header file ggml-metal.h that enable GPU acceleration of llama expressions on Apple devices. It also improves the readability, consistency, and usability of the existing code and documentation, and adds some new features and examples. It fixes a bug in the main example program and adds a new metal example program that demonstrates how to evaluate a statically exported ggml computation graph with Metal.
Walkthrough
… main.cpp that used subtraction instead of addition to compute the sum of two numbers (link)
… --export to the example program main.cpp that allows exporting the computation graph to a file named llama.ggml (link, link, link)
… llama_eval_export that exports a static computation graph for a context of 511 and a batch size of 1 using llama_eval_internal (link, link)
… ggml_graph_import to parse the arguments of the tensor before creating it, and to handle different cases of view operations differently (link, link)
… ggml_nbytes to handle cases where the tensor is not contiguous in memory (link)
… ggml_scratch_save and ggml_scratch_load to the functions ggml_view_1d, ggml_view_2d, ggml_view_3d and ggml_view_4d to preserve the scratch memory state when creating a new tensor for the offset (link, link, link, link)
… ggml_set_name to the functions ggml_view_2d, ggml_view_3d and ggml_view_4d to assign a name to the result tensor for debugging purposes (link, link, link)
… ggml_set_name to the function llama_eval_internal to assign a name to the tensor Vcur for debugging purposes (link)
… cgraph_fname to the function llama_eval_internal that allows exporting the computation graph to a file if not null (link, link, link)
… eop to the function ggml_graph_import that stores the enum value of the operation code for convenience (link)
… const qualifier to the variables mean and x0 in the functions ggml_compute_forward_rms_norm_f32 and ggml_compute_forward_rope_f32 to indicate that they are not modified after initialization (link, link, link)
… ggml_nrows from int to int64_t to match the type of the ne field of the ggml_tensor struct (link)
… ggml_is_transposed and ggml_is_contiguous from static inline to public by adding them to the ggml.h header file (link, link)
… ggml_graph_export_leaf and ggml_graph_export_node to accommodate longer tensor names (link, link)
… ggml_graph_export that check the work buffer size of the computation graph, because they are not valid when exporting a graph with Metal support (link)
… ggml_graph_export for consistency (link)
… cur from the function llama_eval_internal because it is declared later in the same scope (link)
… inpL with cur in the function llama_eval_internal to reflect the previous changes in the tensor creation logic (link, link)
… llama_eval_internal for consistency (link)
… llama_eval_internal for readability (link)
… llama_model_load in the function llama_init to use multiple lines and indentation for readability (link)
… ggml_init and ggml_free in the ggml.h header file to use multiple lines and indentation for readability (link)
… llama_model_load_internal for readability (link)
… GGML_CUDA_SOURCES to GGML_SOURCES_CUDA to match the naming convention of other source variables in the CMake file (link, link)
… metal to the examples CMake file if LLAMA_METAL is enabled (link)