4-bit is 10x slower compared to fp16 LLaMa #82
I've done some preliminary debugging and found that
I have the same problem on my RTX 3080 (driver 530.41.03, CUDA 11.7). Performance of the 4-bit quantized models is very slow with large contexts. In fact, it seems to be much faster to unpack the layer weights on the fly and use a standard PyTorch matmul when sufficiently large matrices are involved. Here's a hackish implementation of QuantLinear that falls back to PyTorch when the context size becomes large: MasterTaffer/GPTQ-for-LLaMa@b46c976 On my test setup (4-bit 13B LLaMA, generating 20 tokens with 2000 context tokens), inference speed improves by roughly 5x.
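For illustration, here is a minimal sketch of that fallback idea (not the linked commit's code): it assumes the common GPTQ layout of eight 4-bit values packed into each int32 along the input dimension, with simplified per-column fp16 scales and zero points.

```python
import torch
import torch.nn.functional as F

def unpack_int4(qweight: torch.Tensor, scales: torch.Tensor, zeros: torch.Tensor) -> torch.Tensor:
    # qweight: (in_features // 8, out_features) int32, eight 4-bit values per word
    # scales, zeros: (out_features,) fp16 -- a simplified layout assumed for this sketch
    shifts = torch.arange(0, 32, 4, device=qweight.device, dtype=torch.int32)
    nibbles = (qweight.unsqueeze(1) >> shifts.view(1, -1, 1)) & 0xF   # (K//8, 8, N)
    w = nibbles.reshape(-1, qweight.shape[1]).to(scales.dtype)        # (K, N)
    return (w - zeros) * scales

def quant_linear_fallback(x, qweight, scales, zeros, bias=None):
    # Pay the unpack cost once, then let cuBLAS do a regular fp16 GEMM;
    # with enough rows in x this beats the per-element 4-bit kernel.
    weight = unpack_int4(qweight, scales, zeros)   # (in_features, out_features)
    return F.linear(x, weight.t(), bias)
```

The unpack is a fixed cost per forward pass, so it only pays off once the activation matrix is large enough (long contexts); for single-token decoding the original kernel remains the better path.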
@MasterTaffer I want to test your changes, but what exactly does this commit fix? I know the commit message says it's a groupsize fix, but there are no reported group-size bugs in the current repo. Did you run into something that others have not encountered?
@MasterTaffer @diegomontoya *edit: Don't quote me on that though; with some models it looks slightly higher, but for some reason I'm currently able to run a LLaMA 30B
Does this only apply when group size is enabled? I compared your change to baseline (as of the latest commit at the time of writing) and the time is pretty much the same. But I quantized with
What's your context size? The main improvement I'm seeing is the elimination of the delay before output starts being generated. With a lot of context tokens (like 1800) I usually see a 45-50 second delay on LLaMA 30B; with those changes, that drops to low single digits.
I'm working on a kernel implementation in Triton. My hope is to lean on Triton's ability to optimize for the hardware on the fly, as well as to implement the matmul kernel in a more cache-optimal way than the current CUDA kernel. So far I have a working kernel, though I haven't fully verified accuracy. Its performance curve is a lot better: not as good as FP16 PyTorch yet, but at least in the ballpark now, and it scales correctly with context length. I've included the code below. It's still WIP; I need to evaluate correctness more thoroughly. As of right now I'm seeing an absolute error of The one major snag I've hit is that Triton doesn't seem to have a way of expanding a tensor, i.e. something similar to PyTorch's
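As an illustration of the usual workaround for the missing expand (this is not the WIP kernel from the comment above, just a hypothetical minimal dequantization kernel): index arithmetic plus bit shifts stands in for torch.expand. It assumes eight 4-bit values per int32 along the input dimension, per-column fp16 scales/zeros that are already unpacked, and K and N that are multiples of the block sizes.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def dequant4_kernel(qweight_ptr, scales_ptr, zeros_ptr, out_ptr,
                    K, N, BLOCK_K: tl.constexpr, BLOCK_N: tl.constexpr):
    # qweight packs eight 4-bit values per int32 along K, shape (K // 8, N).
    # There is no tl.expand, but index arithmetic does the same job: all eight
    # rows that share a packed word compute the same address (offs_k // 8),
    # load it, and then shift out their own nibble.
    pid_k = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_k = pid_k * BLOCK_K + tl.arange(0, BLOCK_K)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)

    packed = tl.load(qweight_ptr + (offs_k[:, None] // 8) * N + offs_n[None, :])
    shift = (offs_k[:, None] % 8) * 4
    nibble = (packed >> shift) & 0xF

    scale = tl.load(scales_ptr + offs_n[None, :]).to(tl.float32)
    zero = tl.load(zeros_ptr + offs_n[None, :]).to(tl.float32)
    w = (nibble.to(tl.float32) - zero) * scale
    tl.store(out_ptr + offs_k[:, None] * N + offs_n[None, :], w.to(tl.float16))

def dequant4(qweight, scales, zeros, K, N, BLOCK_K=64, BLOCK_N=64):
    # Assumes K and N are multiples of the block sizes (no bounds masks above).
    out = torch.empty((K, N), dtype=torch.float16, device=qweight.device)
    grid = (K // BLOCK_K, N // BLOCK_N)
    dequant4_kernel[grid](qweight, scales, zeros, out, K, N,
                          BLOCK_K=BLOCK_K, BLOCK_N=BLOCK_N)
    return out
```

Every group of eight adjacent rows computes the same packed address, so the load is effectively broadcast and each row only shifts out its own nibble.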
@EyeDeck I'm not sure what you mean by context size. I'm using I quantized LLaMA-7B to 4 bits without group size ( For benchmarking, I modify

```python
import accelerate

model = accelerate.load_checkpoint_and_dispatch(
    model=model,
    checkpoint=checkpoint,
    device_map="auto",
    no_split_module_classes=["LlamaDecoderLayer"],
)
# if checkpoint.endswith('.safetensors'):
#     from safetensors.torch import load_file as safe_load
#     model.load_state_dict(safe_load(checkpoint))
# else:
#     model.load_state_dict(torch.load(checkpoint))
```

I then took @MasterTaffer's quant.py and measured generation speed and loading time. For each, I ran a command like:
It's very strange, isn't it? The results are very scattered; I can't tell whether anything has a performance benefit. I am noticing only ~25% CUDA utilization, which is disappointing. Maybe the HuggingFace implementation is inefficient? FWIW I'm on WSL2, using an RTX 4090 and PyTorch 2.0.0+cu117.
@sterlind
@sterlind What is your CPU? A 4090 needs at the very least a 12th-gen Intel or Zen 4 CPU to keep up with feeding the CUDA cores. 25% GPU utilization is way too low; it should be 80% or higher when generating tokens. To make the results deterministic, also try setting a fixed seed for inference in all tests.
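A minimal sketch of what fixing the seed for such tests looks like (generic PyTorch setup, not code from this thread):

```python
import random
import numpy as np
import torch

def seed_everything(seed: int = 1234) -> None:
    # Fix every RNG the sampling path might touch so repeated benchmark
    # runs generate the same tokens and the timings are comparable.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)           # CPU generator
    torch.cuda.manual_seed_all(seed)  # all CUDA device generators

seed_everything()
# ...then run the same prompt through generation for each kernel under test.
```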
@sterlind
Okay, I've published a more polished version of my Triton kernel: https://github.com/fpgaminer/GPTQ-triton The README on that repo has more detailed metrics, but the Triton kernel indeed performs 10x faster than the CUDA kernel at a context length of 2048. In fact, it's faster than the CUDA kernel at every context length, and it's almost on par with FP16. Memory usage is the same as the CUDA kernel. As for accuracy, it's exactly as accurate as the CUDA kernel across the wikitext2, PTB, and C4 validation sets. Currently the kernel only supports 4 bits and groupsize -1, and I've only tested the 7B LLaMA weights.
@fpgaminer Hypothetically the speedup should be even bigger with the >7B models, right? Have you had a chance to test with the 13B model, for example?
My CPU is an "old" Threadripper 1950X. Maybe I'm confused, but why should an older CPU struggle to feed the CUDA cores? Shouldn't it need to transfer very little to/from GPU memory once the model weights are loaded?
Almost all inference code is single-threaded, so it doesn't matter if you have 16 cores; it will only use one per GPU. Just monitor your CPU usage vs. GPU usage. If your CPU (the core running the Python inference loop) is at 100% and the GPU is at 25%, the bottleneck is the CPU: the GPU is waiting for more work while that core is maxed out. For reference, a 13900K has about 2x the single-core performance of a 1950X; after overclocking, likely 2.2x.
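One hypothetical way to watch this while a generation is running (assumes the psutil and pynvml packages are installed; not something used in this thread):

```python
import psutil   # pip install psutil
import pynvml   # pip install nvidia-ml-py3

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# A single CPU core pinned near 100% while GPU utilization sits around 25%
# is the signature of a CPU-bound, single-threaded inference loop.
for _ in range(30):
    per_core = psutil.cpu_percent(interval=1.0, percpu=True)  # blocks for 1 s
    gpu_util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
    print(f"busiest CPU core: {max(per_core):5.1f}%   GPU: {gpu_util:3d}%")
```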
I rewrote the current GPTQ kernel in Triton using your code. I actually saw a very large speedup.
@qwopqwop200 Does the triton branch require re-quantization? Switching from the cuda branch to the triton branch throws the following for my 30B test model:
The 30B model is 4-bit quantized using only
It probably needs re-quantization.
I was able to run my pytorch-branch-converted model on triton under ooba just fine, though I had to remove the options that are no longer used for triton.
@USBhost Are you getting degraded output quality on the triton branch? I am getting both a performance regression and a massive quality drop-off on the triton branch using re-quantized 30B models. Eval scores are normal, but the output diverges wildly from the cuda branch with the same temp/top_p/top_k/etc. config. Still trying to isolate the issue.
Can't say I have, but I am using 65B, so I don't know. I'm still trying to figure out why evaluating c4 etc. keeps changing from one day, or even half a day, to the next. Either I'm being a dummy and doing something wrong, or something else is happening.
I think what you're seeing may be completely normal for an untraced HuggingFace transformer, not really specific to this case. It's a composition of unfused PyTorch modules, after all, each one launching at least one kernel, with multiple modules per layer. Definitely feels like something PyTorch's new compiler might make a difference on, or the older JIT tracer. I wonder how well that interplays with the new Triton matmul kernel, though. Can it really fuse custom kernels? I would be surprised. On the other hand, if it can fuse the vanilla PyTorch stuff, that might by itself make a big difference.
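For reference, a minimal sketch of what trying PyTorch 2.0's compiler on the HF model might look like (untested in this thread; model and input_ids are placeholders):

```python
import torch

# torch.compile (PyTorch 2.0+) captures the module graph and fuses the
# unfused elementwise/PyTorch-level ops, cutting per-module launch overhead.
# Custom CUDA/Triton extension ops are treated as opaque calls (or cause a
# graph break), so the quantized matmul kernel itself is not fused.
compiled = torch.compile(model, mode="reduce-overhead")

with torch.inference_mode():
    logits = compiled(input_ids).logits   # one forward pass over the full context
```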
I guess nobody else has mentioned it here, but the current CUDA branch generates at a rate of around 9 seconds per token with a freshly requantized 4-bit LLaMA 30B on a 3090 and ~1850 context tokens. This commit 608f3ba (with an older equivalent quantization that works with it), all else equal, runs at around 5 tokens per second.
This is because the CUDA kernel has been changed to support act-order and groupsize at the same time. Because of this, we recommend Triton for now.
@fpgaminer Do you have an idea why your Triton implementation is better than the CUDA one?
I'm not terribly well versed in PTX, so I can't say for certain. Each instance of the CUDA kernel only calculates a The Triton kernel calculates a full To the best of my understanding: the CUDA kernel does less work per thread and launches more threads; the Triton kernel does more work per thread and launches fewer threads. The end result is that the CUDA kernel has to re-fetch data more often than the Triton kernel. This is fine when the data involved fits in the L2 cache, but when it doesn't, the Triton kernel dominates. This occurs in all cases where M > 1. The Triton kernel also auto-adapts the block sizes based on M, N, and K. In all cases the Triton kernel has performance competitive with PyTorch FP16. The one downside right now is that Triton doesn't have support for unpacking quantized data, so I have to do some hacks to get it to work. It works fine, but it isn't getting any of the bandwidth benefits it should. In theory, a set of rewritten CUDA kernels would handily beat it.
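For readers who haven't seen the pattern, here is a generic tiled Triton matmul in the style of the official tutorial (not the GPTQ-triton kernel itself): each program instance owns one BLOCK_M x BLOCK_N tile of C and sweeps over K, so every loaded tile of A and B is reused across a whole block of outputs instead of being re-fetched per output element.

```python
import triton
import triton.language as tl

@triton.jit
def tiled_matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                        stride_am, stride_ak,
                        stride_bk, stride_bn,
                        stride_cm, stride_cn,
                        BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Each program computes one BLOCK_M x BLOCK_N tile of C = A @ B.
    # Block sizes must be >= 16 for tl.dot.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        # Every element loaded here contributes to a whole block of outputs,
        # which is where the cache/bandwidth win over per-element kernels comes from.
        a = tl.load(a_ptrs, mask=(offs_m[:, None] < M) & ((offs_k[None, :] + k) < K), other=0.0)
        b = tl.load(b_ptrs, mask=((offs_k[:, None] + k) < K) & (offs_n[None, :] < N), other=0.0)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk

    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc.to(tl.float16), mask=(offs_m[:, None] < M) & (offs_n[None, :] < N))
```

A quantized variant adds the nibble unpacking inside the inner loop; the tiling and data-reuse argument stays the same.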
On my setup the stock 16-bit 7B LLaMa model runs at 0.6s per iteration with a 1x2048 input. The 4-bit quantized model runs at 8.3s per iteration. That makes the 4-bit version more than 10x slower than the non-quantized model. Is that normal?
Setup: GPTQ-for-LLaMa at 19c0535; RTX 3090; Driver Version 515.86.01; CUDA Version 11.7; environment and code listed below. The 7B model was quantized using c4 --wbits 4 --true-sequential --act-order.
Environment
Code