Significantly different results (and WRONG) inference when GPU is enabled. #7048
Comments
Are you getting correct results when you use the llama.cpp binaries directly without any Python bindings? If not, are you getting correct results when you compile with LLAMA_CUDA_FORCE_MMQ?
Thanks for your response.
First I recompiled llama.cpp with the suggested flag LLAMA_CUDA_FORCE_MMQ, e.g.
LLAMA_CUDA=1 LLAMA_CUDA_FORCE_MMQ=1 make
Once it completed, I ran the binary as follows.
-------
When I do not have "-ngl 40" it seems to give the correct answer.
"
./main_gpu -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf \
-c 16192 -b 1024 -n 256 --keep 48 \
--repeat_penalty 5.0 --color -i \
-r "User:" -f prompts/chat-with-bob.txt
...
User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User:what is the capital city of France?
Bob: The Capital City Of Paris.<|im_end|>
"
However, when I ran it with -ngl 40, this is the response:
...
User:what is the capital city of France?
##################################################################################################################
##################################################################################################################
###################
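For reference, the GPU run presumably used the same command with only the offload flag added (a reconstruction from the thread, not the verbatim command):

# same invocation as above, but offloading 40 layers to the GPU
./main_gpu -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf \
  -c 16192 -b 1024 -n 256 --keep 48 \
  --repeat_penalty 5.0 --color -i \
  -r "User:" -f prompts/chat-with-bob.txt \
  -ngl 40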
Sorry, I wanted to check this issue but forgot. Did you download a ready-made GGUF file from Huggingface or did you convert it yourself? If it's the former, can you provide a link to the exact file you downloaded?
No problem and thanks for responding. I downloaded the model from Huggingface at
https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GGUF/blob/main/openhermes-2.5-mistral-7b.Q5_K_M.gguf
Huy
I cannot reproduce the issue on master. Can you re-download the model and check that this issue isn't due to a corrupted file?
Here is my git master state after a clean rebuild.
I also re-downloaded the model; its checksum matches my previously downloaded file:
f7faa7e315e81c3778aae55fcb5fc02c openhermes-2.5-mistral-7b.Q5_K_M_v1.gguf
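A sketch of the re-download and checksum verification (the URL is the one linked earlier in the thread, with /blob/ changed to /resolve/ for a direct download):

# re-download the GGUF directly from Hugging Face and verify its checksum
wget https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GGUF/resolve/main/openhermes-2.5-mistral-7b.Q5_K_M.gguf
md5sum openhermes-2.5-mistral-7b.Q5_K_M.gguf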
===========================================
Output with -ngl 16 (screenshot omitted); pretty much the same for -ngl 8.
If I remember correctly, the # output is effectively what you get when a NaN ends up in the results.
Also: with SXM your V100s are effectively NVLinked, right? Can you check results when compiling with […]?
Remade with the suggested flags. I think I already have CUDA 12.
Now for the run: same results.
When checking the NVCC version, your shell prefix is […], i.e. nvcc seems to come from a conda environment rather than the system CUDA install.
Also, I didn't mean to compile with both […] flags at the same time.
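A clean rebuild along those lines might look like this (a sketch; the second flag Johannes refers to is elided in the thread, so only the plain CUDA build is shown):

# start from a clean tree so stale objects don't mix build flags
make clean
# plain CUDA build, without LLAMA_CUDA_FORCE_MMQ
LLAMA_CUDA=1 make
# confirm which nvcc was picked up (conda can shadow the system CUDA)
which nvcc && nvcc --version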
Remade in the base conda environment (default nvcc). Same results.
Do you get any errors with compute-sanitizer?
No errors, and the compute-sanitizer didn't seem to help; however, it seems to work better if I use -ngl -1 instead of any specific values. Does that help?
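For reference, a run under NVIDIA's compute-sanitizer would look roughly like this (a sketch; the prompt is a placeholder, the model path is the one used earlier):

# memcheck is the default tool; it reports invalid memory accesses in CUDA kernels
compute-sanitizer --tool memcheck ./main_gpu \
  -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf \
  -p "What is the capital city of France?" -n 32 -ngl 40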
What driver version are you using? Run nvidia-smi.
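nvidia-smi prints the driver version in its header banner; the query form extracts it directly:

# full status table, driver version in the top banner
nvidia-smi
# or query just the driver version per GPU
nvidia-smi --query-gpu=driver_version --format=csv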
According to the HuggingFace repository, the model was made with llama.cpp revision […].
Here is my current repo state.
I recompiled with the right option to enable CUDA; same problem.
Experimenting with various -ngl values, keeping it below 20 seems to help for many models. At a certain point the output just flips from working to garbage. In this experiment, the model llama-2-13b/ggml-model-q5_K_M.bin works with -ngl at 22 or below. Here is an example of it working with the Python binding.
With 16 layers, only 2.2 GB x 2 of VRAM was used. At 23 layers, the answer came back as garbage with 3 GB x 2 of VRAM usage, which is way below the 16 GB x 2 available.
If the user goes beyond the supported value, shouldn't there be a warning or error? Also, is there a deterministic way of knowing which values of -ngl will work versus when it will return garbage?
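One way to probe the flip-over point deterministically is simply to sweep -ngl and inspect the output; a sketch (model path and prompt are the ones from earlier in the thread, used as placeholders):

# sweep offload layer counts to find where output flips to garbage
for n in 8 16 20 21 22 23 24 32 40; do
  echo "=== -ngl $n ==="
  ./main_gpu -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf \
    -p "What is the capital city of France?" -n 32 -ngl "$n"
done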
It should work with any value. You could try running the eval-callback example.
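The eval-callback example prints intermediate tensor values during evaluation, which makes it possible to see where CPU and GPU results start to diverge. A sketch of an invocation (assuming the binary name of the llama.cpp version current in this thread and the common llama.cpp options; the log redirection is an assumption):

# dump intermediate tensors with full offload and with CPU only, then compare the logs
./eval-callback -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf \
  -p "hello world" -n 1 -ngl 40 > eval-callback_gpu.log 2>&1
./eval-callback -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf \
  -p "hello world" -n 1 -ngl 0 > eval-callback_cpu.log 2>&1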
Attached: eval-callback_gpu_f16.log
Sorry, eval-callback was broken and the numbers are useless. Please try again with #7184 or after it is merged. |
Pulled from master and reran the eval-callback.
Additionally, here are the results for version 4f02636.
Can you share the full command line that you used to generate the eval-callback logs? What f16 model did you use?
My guess is that this is a hardware failure of some sort. Are you using a custom build for these V100s that might not provide enough power or cooling?
I highly doubt that insufficient power or cooling is the source. Mainly, that would imply a lot more randomness, whereas the failures are very deterministic at ~20 layers offloaded. As for cooling, the server is housed in a rack and air-conditioned. Let me try enabling ECC and send results.
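Enabling ECC and reading the error counters can be done with nvidia-smi (a sketch; ECC configuration changes only take effect after a reboot):

# enable ECC on all GPUs (persistent setting, applied after reboot)
sudo nvidia-smi -e 1
sudo reboot
# after reboot: dump ECC status and error counters
nvidia-smi -q -d ECC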
It's not likely to be an incompatibility with the GPU architecture; in fact, the ggml-ci tests every commit on master on a PCIe V100. Whatever the issue is, it seems to be specific to your system. I know that some people have been trying to use V100s in custom builds since they are relatively cheap when bought used, and if this is the case here, I think the most likely cause is some issue with the build.
We don't have anything "custom" that I am aware of. It's pretty much a standard server with 2 V100 GPUs. As for software, it is Ubuntu 22 LTS with pre-built drivers.
I'm also going to ask our IT folks to run a complete VRAM diagnostic.
My command line: […]
I figured that you used […]. With each matrix multiplication, results get progressively worse, until eventually it produces only #.
Maybe a C/C++ equivalent of this is missing somewhere: torch.cuda.synchronize()?
You can test for that by using the CUDA_LAUNCH_BLOCKING=1 environment variable.
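CUDA_LAUNCH_BLOCKING=1 makes every kernel launch synchronous, so if a synchronization were missing, the results would change; a sketch of the test (prompt is a placeholder):

# force synchronous kernel launches; if results change, a missing sync is likely
CUDA_LAUNCH_BLOCKING=1 ./main_gpu \
  -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf \
  -p "What is the capital city of France?" -n 32 -ngl 40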
Is there a way to force single-GPU usage? The PyTorch benchmark seems to run fine on one GPU but has issues when both GPUs are used.
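One standard way to restrict a CUDA program to a single GPU is the CUDA_VISIBLE_DEVICES environment variable (a sketch, not from the thread; prompt is a placeholder):

# expose only the first GPU to the process
CUDA_VISIBLE_DEVICES=0 ./main_gpu \
  -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf \
  -p "What is the capital city of France?" -n 32 -ngl 40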
Using the CUDA_LAUNCH_BLOCKING=1 env variable yielded the same results.
Thank you for your help. After running GPU VRAM tests, we found that there may indeed be hardware issues.
-------
I am running llama_cpp version 0.2.68 on Ubuntu 22.04 LTS in a conda environment. Attached are two Jupyter notebooks with ONLY one line changed (use CPU vs. GPU). As you can see, under the exact same environmental conditions, switching between CPU and GPU gives vastly different answers, and the GPU output is completely wrong. I would appreciate some pointers on how to debug this.
The only significant difference between the two files is this one-liner:
#n_gpu_layers=-1, # Uncomment to use GPU acceleration
The model used was openhermes-2.5-mistral-7b.Q5_K_M.gguf
mistral_llama_large-gpu.pdf
mistral_llama_large-cpu.pdf