cuda: NaN perplexity with some models on some GPUs (Gemma, MPT) #5817
Comments
Does this issue only occur with a P40 in particular, or is that simply the only GPU that you used for testing?
I had trouble testing Gemma on any other GPUs that I have: although I can offload 25/33 layers of a Q4_0 llama 7b on my 4 GB GTX 970 with a batch size of 256, I cannot offload even 1/29 layers of Gemma 7b without an OOM. Also, this behavior does not seem to be deterministic: with the same settings on a perplexity run, the first chunk's ppl may be NaN or it may be something reasonable. I was also able to reproduce these NaNs on mpt-7b-chat quantized to Q4_0. The issue is present with an older quant of MPT and with 15499eb reverted, but it does not happen on Falcon AFAICT, so the common element is not the tied output and token embedding tensors. cc @slaren
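For context, a partial-offload perplexity run like the one described above would look roughly like this; the model and dataset paths are placeholders, not taken from the report:
./build/bin/perplexity -m llama-7b.Q4_0.gguf -f wikitext-2-raw/wiki.test.raw -ngl 25 -b 256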
I cannot reproduce the issue on my system. Does it still happen with the current master? #5853 fixed an issue that could cause intermittent failures.
I thought I was on the current master, but apparently I was still on c29af7e because of a stale ref. After manually deleting that ref I was able to successfully update to the current master.
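For anyone who hits the same thing, a minimal sketch of checking which commit is actually checked out and clearing a stale remote-tracking ref might look like the following; the exact ref name is hypothetical, not from the report:
git log -1 --oneline
git update-ref -d refs/remotes/origin/master
git fetch origin
git pull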
I'm making an issue for this to make sure it isn't forgotten about. I've been able to work around this, but it seems like a bug to me.
ref #5631 (comment)
Steps to Reproduce
./convert-hf-to-gguf.py gemma-7b --outfile gemma-7b.f16.gguf --outtype f16
cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo -DLLAMA_CUBLAS=ON
make -C build perplexity
Run the perplexity tool with -ngl 2 or above (see the example invocation below). The NaN typically shows up within the first few chunks, and there's no point in running it longer than that because the running average will stay NaN.
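The report does not include the exact perplexity command, but a plausible invocation, with the dataset path as a placeholder, would be:
./build/bin/perplexity -m gemma-7b.f16.gguf -f wikitext-2-raw/wiki.test.raw -ngl 2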
This also occurs with a model quantized to pure F16 from the official GGUF provided by Google.
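That requantization step would look something like this; the filenames are placeholders and the input is the official GGUF from Google:
./build/bin/quantize gemma-7b-google.gguf gemma-7b-requant.f16.gguf F16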
However, these NaNs do not occur with -ngl 1 or with --no-kv-offload, so it has something to do with offloading of the KV cache.
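As a workaround until this is fixed, either of these variants of the invocation above avoids the NaNs (again with placeholder paths):
./build/bin/perplexity -m gemma-7b.f16.gguf -f wikitext-2-raw/wiki.test.raw -ngl 1
./build/bin/perplexity -m gemma-7b.f16.gguf -f wikitext-2-raw/wiki.test.raw -ngl 2 --no-kv-offload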
cc @JohannesGaessler in case you haven't seen this yet.