cuda: NaN perplexity with some models on some GPUs (Gemma, MPT) #5817

Closed
cebtenzzre opened this issue Mar 1, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@cebtenzzre
Collaborator

I'm opening an issue for this so it isn't forgotten. I've been able to work around it, but it seems like a bug to me.

ref #5631 (comment)

Steps to Reproduce

  1. Download safetensors model from https://huggingface.co/google/gemma-7b
  2. Check out llama.cpp commit 15499eb (current master should reproduce this as well)
  3. ./convert-hf-to-gguf.py gemma-7b --outfile gemma-7b.f16.gguf --outtype f16
  4. cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo -DLLAMA_CUBLAS=ON
  5. make -C build perplexity
  6. Run perplexity on a Tesla P40. Use -ngl 2 or above.
$ CUDA_VISIBLE_DEVICES=0 build/bin/perplexity -f wiki.test.raw -c 2048 -m gemma-7b.f16.gguf -ngl 99
<snip>
perplexity: tokenizing the input ..
perplexity: tokenization took 974.102 ms
perplexity: calculating perplexity over 142 chunks, batch_size=512
perplexity: 6.52 seconds per pass - ETA 15.43 minutes
[1]nan,

There's no point in running it longer than that, because the running average will stay NaN.
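
NaN propagates through any further arithmetic, so a single NaN chunk poisons the mean over all 142 chunks - a quick sanity check, just illustrating float behavior (nothing llama.cpp-specific):

$ python3 -c 'print((float("nan") + 7.3) / 2)'
nan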

This also occurs with a model quantized to pure F16 from the official GGUF provided by Google.
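
For reference, a pure-F16 quant like that can be produced with something along these lines (my recollection of the quantize flags, so the exact invocation may differ; filenames are placeholders):

$ make -C build quantize
$ build/bin/quantize --pure gemma-7b-official.gguf gemma-7b-pure-f16.gguf f16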

However, these NaNs do not occur with -ngl 1 or with --no-kv-offload, so it has something to do with offloading of the KV cache.
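
For reference, the runs that come out clean are the same invocation as above with only the offload settings changed, e.g.:

$ CUDA_VISIBLE_DEVICES=0 build/bin/perplexity -f wiki.test.raw -c 2048 -m gemma-7b.f16.gguf -ngl 99 --no-kv-offload
$ CUDA_VISIBLE_DEVICES=0 build/bin/perplexity -f wiki.test.raw -c 2048 -m gemma-7b.f16.gguf -ngl 1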

cc @JohannesGaessler in case you haven't seen this yet.

@cebtenzzre cebtenzzre added the bug Something isn't working label Mar 1, 2024
@JohannesGaessler
Collaborator

Does this issue only occur with a P40 in particular or is that simply the only GPU that you used for testing?

@cebtenzzre
Collaborator Author

I had trouble testing Gemma on any other GPUs that I have: although I can offload 25/33 layers of a Q4_0 llama 7b on my 4 GB GTX 970 with a batch size of 256, I can't offload even 1/29 layers of gemma 7b without running out of memory.

Also, this behavior does not seem to be deterministic - with the same settings on a perplexity run, the first chunk's ppl may be NaN or it may be something reasonable.

I was also able to reproduce these NaNs on mpt-7b-chat quantized to Q4_0, even with -ngl 1, which allowed me to verify that my GTX 970 also reproduces this behavior - assuming it's not an unrelated issue.
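
Roughly, the MPT repro looked like this (paths and filenames are mine and approximate; the weights are mosaicml/mpt-7b-chat from Hugging Face):

$ ./convert-hf-to-gguf.py mpt-7b-chat --outfile mpt-7b-chat.f16.gguf --outtype f16
$ build/bin/quantize mpt-7b-chat.f16.gguf mpt-7b-chat.q4_0.gguf q4_0
$ CUDA_VISIBLE_DEVICES=0 build/bin/perplexity -f wiki.test.raw -c 2048 -m mpt-7b-chat.q4_0.gguf -ngl 1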

This issue is also present with an older MPT quant and with 15499eb reverted, but as far as I can tell it does not happen with Falcon, so the common element is not the tied output and token embedding tensors.

cc @slaren

@cebtenzzre cebtenzzre changed the title cuda: NaN perplexity with some models on some GPUs (Gemma + Tesla P40) cuda: NaN perplexity with some models on some GPUs (Gemma, MPT) Mar 3, 2024
@slaren
Collaborator

slaren commented Mar 3, 2024

I cannot reproduce the issue on my system. Does it still happen with the current master? #5853 fixed an issue that could cause intermittent failures.

@cebtenzzre
Collaborator Author

I thought I was on the current master, but apparently I was still on c29af7e, because of this:

error: cannot lock ref 'refs/remotes/upstream/ci/server/fix-slow-test': 'refs/remotes/upstream/ci' exists; cannot create 'refs/remotes/upstream/ci/server/fix-slow-test'

After manually deleting that ref I was able to successfully git pull, and now I can't reproduce the issue anymore. Likely fixed by #5853.
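
In case anyone else hits that ref error: deleting the stale remote-tracking ref is enough, something like this (ref name taken from the error above):

$ git update-ref -d refs/remotes/upstream/ci
$ git pull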
