cuda: NaN perplexity with some models on some GPUs (Gemma, MPT) #5817

Closed
cebtenzzre opened this issue Mar 1, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@cebtenzzre
Collaborator

I'm opening an issue for this so it isn't forgotten. I've been able to work around it, but it seems like a bug to me.

ref #5631 (comment)

Steps to Reproduce

  1. Download safetensors model from https://huggingface.co/google/gemma-7b
  2. Check out llama.cpp commit 15499eb (current master should reproduce this as well)
  3. ./convert-hf-to-gguf.py gemma-7b --outfile gemma-7b.f16.gguf --outtype f16
  4. cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo -DLLAMA_CUBLAS=ON
  5. make -C build perplexity
  6. Run perplexity on a Tesla P40. Use -ngl 2 or above.
$ CUDA_VISIBLE_DEVICES=0 build/bin/perplexity -f wiki.test.raw -c 2048 -m gemma-7b.f16.gguf -ngl 99
<snip>
perplexity: tokenizing the input ..
perplexity: tokenization took 974.102 ms
perplexity: calculating perplexity over 142 chunks, batch_size=512
perplexity: 6.52 seconds per pass - ETA 15.43 minutes
[1]nan,

There's no point in running it longer than that, because the running average will stay NaN.
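
NaN propagates through any further arithmetic, so a single NaN chunk poisons the mean over all 142 chunks - a quick sanity check, just illustrating float behavior (nothing llama.cpp-specific):

$ python3 -c 'print((float("nan") + 7.3) / 2)'
nan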

This also occurs with a model quantized to pure F16 from the official GGUF provided by Google.
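
For reference, a pure-F16 quant like that can be produced with something along these lines (my recollection of the quantize flags, so the exact invocation may differ; filenames are placeholders):

$ make -C build quantize
$ build/bin/quantize --pure gemma-7b-official.gguf gemma-7b-pure-f16.gguf f16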

However, these NaNs do not occur with -ngl 1 or with --no-kv-offload, so it has something to do with offloading of the KV cache.
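
For reference, the runs that come out clean are the same invocation as above with only the offload settings changed, e.g.:

$ CUDA_VISIBLE_DEVICES=0 build/bin/perplexity -f wiki.test.raw -c 2048 -m gemma-7b.f16.gguf -ngl 99 --no-kv-offload
$ CUDA_VISIBLE_DEVICES=0 build/bin/perplexity -f wiki.test.raw -c 2048 -m gemma-7b.f16.gguf -ngl 1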

cc @JohannesGaessler in case you haven't seen this yet.

@cebtenzzre cebtenzzre added the bug Something isn't working label Mar 1, 2024
@JohannesGaessler
Collaborator

Does this issue only occur with a P40 in particular or is that simply the only GPU that you used for testing?

@cebtenzzre
Collaborator Author

I had trouble testing Gemma on any other GPUs that I have: although I can offload 25/33 layers of a Q4_0 llama 7b on my 4 GB GTX 970 with a batch size of 256, I can't offload even 1/29 layers of gemma 7b without running out of memory.

Also, this behavior does not seem to be deterministic - with the same settings on a perplexity run, the first chunk's ppl may be NaN or it may be something reasonable.

I was also able to reproduce these NaNs on mpt-7b-chat quantized to Q4_0, even with -ngl 1, which allowed me to verify that my GTX 970 also reproduces this behavior - assuming it's not an unrelated issue.
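
Roughly, the MPT repro looked like this (paths and filenames are mine and approximate; the weights are mosaicml/mpt-7b-chat from Hugging Face):

$ ./convert-hf-to-gguf.py mpt-7b-chat --outfile mpt-7b-chat.f16.gguf --outtype f16
$ build/bin/quantize mpt-7b-chat.f16.gguf mpt-7b-chat.q4_0.gguf q4_0
$ CUDA_VISIBLE_DEVICES=0 build/bin/perplexity -f wiki.test.raw -c 2048 -m mpt-7b-chat.q4_0.gguf -ngl 1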

This issue is also present with an older MPT quant and with 15499eb reverted, but as far as I can tell it does not happen with Falcon, so the common element is not the tied output and token embedding tensors.

cc @slaren

@cebtenzzre cebtenzzre changed the title cuda: NaN perplexity with some models on some GPUs (Gemma + Tesla P40) cuda: NaN perplexity with some models on some GPUs (Gemma, MPT) Mar 3, 2024
@slaren
Collaborator

slaren commented Mar 3, 2024

I cannot reproduce the issue on my system. Does it still happen with the current master? #5853 fixed an issue that could cause intermittent failures.

@cebtenzzre
Collaborator Author

I thought I was on the current master, but apparently I was still on c29af7e, because of this:

error: cannot lock ref 'refs/remotes/upstream/ci/server/fix-slow-test': 'refs/remotes/upstream/ci' exists; cannot create 'refs/remotes/upstream/ci/server/fix-slow-test'

After manually deleting that ref I was able to successfully git pull, and now I can't reproduce the issue anymore. Likely fixed by #5853.
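
In case anyone else hits that ref error: deleting the stale remote-tracking ref is enough, something like this (ref name taken from the error above):

$ git update-ref -d refs/remotes/upstream/ci
$ git pull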
