Multi-GPU has been broken for me recently. ggml-cuda.cu:7068: invalid argument #3930

Closed
Ph0rk0z opened this issue Nov 3, 2023 · 18 comments

@Ph0rk0z

Ph0rk0z commented Nov 3, 2023

Prerequisites

Please answer the following questions for yourself before submitting an issue.

Git llama.cpp with the Python bindings.

Expected Behavior

Inference works like before.

Current Behavior

Inference fails and llama.cpp crashes.

Environment and Context

python 3.10 / cuda 11.8

Failure Information (for bugs)


llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.26 MB
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llm_load_tensors: mem required  =  140.89 MB
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 83/83 layers to GPU
llm_load_tensors: VRAM used: 39362.61 MB
....................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 1280.00 MB
llama_new_context_with_model: kv self size  = 1280.00 MB
llama_build_graph: non-view tensors processed: 1844/1844
llama_new_context_with_model: compute buffer total size = 574.63 MB
llama_new_context_with_model: VRAM scratch buffer: 568.00 MB
llama_new_context_with_model: total VRAM used: 41210.61 MB (model: 39362.61 MB, context: 1848.00 MB)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
2023-11-02 17:16:43 INFO:Loaded the model in 37.54 seconds.
Enabled NVLINK P2P 0->1
Enabled NVLINK P2P 1->0

CUDA error 1 at /home/supermicro/ai/llama-cpp-python-gguf-cuda/vendor/llama.cpp/ggml-cuda.cu:7068: invalid argument
current device: 1

Relevant Code

I have some printfs for NVLink, as you can see, so the line numbers are a little off, but here is the snippet that sets it off.


                // copy src0, src1 to device if necessary
                if (src1->backend == GGML_BACKEND_GPU && src1_is_contiguous) {
                    if (id != g_main_device) {
                        if (convert_src1_to_q8_1) {
                            char * src1_ddq_i_source = src1_ddq[g_main_device] + src1_ddq_i_offset;
                            CUDA_CHECK(cudaMemcpyAsync(src1_ddq_i, src1_ddq_i_source, src1_ncols*src1_padded_col_size*q8_1_ts/q8_1_bs,   // <-- the failing line
                                                    cudaMemcpyDeviceToDevice, stream));
                        } else {
                            float * src1_ddf_i_source = (float *) src1_extra->data_device[g_main_device];
                            src1_ddf_i_source += (i0*ne11 + src1_col_0) * ne10;
                            CUDA_CHECK(cudaMemcpyAsync(src1_ddf_i, src1_ddf_i_source, src1_ncols*ne10*sizeof(float),
                                                    cudaMemcpyDeviceToDevice, stream));
                        }
                    }

One of the arguments to cudaMemcpyAsync is invalid; I haven't checked yet which one. The day before, it was trying to allocate 5 TB of system RAM after loading the model, but subsequent commits fixed that up. I waited a little to see if the same would happen with this, since the code is so new, and I can't access GitHub from that machine, so I have to bring the logs over here.
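If it helps narrow it down, below is a rough debugging sketch of the kind of thing one could drop in just before the failing CUDA_CHECK to see which argument CUDA rejects; describe_ptr is an illustrative helper of mine, not llama.cpp code.

// Rough sketch (assumes CUDA 11+): print what the runtime thinks each pointer is.
#include <cstdio>
#include <cuda_runtime.h>

static void describe_ptr(const char * name, const void * ptr) {
    cudaPointerAttributes attr;
    cudaError_t err = cudaPointerGetAttributes(&attr, ptr);
    if (err != cudaSuccess) {
        fprintf(stderr, "%s = %p: cudaPointerGetAttributes failed: %s\n",
                name, ptr, cudaGetErrorString(err));
        (void) cudaGetLastError(); // clear the sticky error so later CUDA_CHECKs don't trip on it
        return;
    }
    fprintf(stderr, "%s = %p: type=%d device=%d\n", name, ptr, (int) attr.type, attr.device);
}

// Called just before the cudaMemcpyAsync above:
//   describe_ptr("src1_ddq_i",        src1_ddq_i);
//   describe_ptr("src1_ddq_i_source", src1_ddq_i_source);
//   fprintf(stderr, "bytes=%zu stream=%p\n",
//           (size_t) (src1_ncols*src1_padded_col_size*q8_1_ts/q8_1_bs), (void *) stream);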

It happens with both P40s and 3090s and is independent of whether I force MMQ or not.
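For reference, the "Enabled NVLINK P2P" lines in the log are from printfs I hacked in around the standard CUDA peer-access calls, roughly like the sketch below (illustrative only, not upstream llama.cpp code).

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative sketch: enable peer access between every device pair and log it.
static void enable_p2p_with_logging(int n_devices) {
    for (int i = 0; i < n_devices; ++i) {
        cudaSetDevice(i);
        for (int j = 0; j < n_devices; ++j) {
            if (i == j) continue;
            int can_access = 0;
            cudaDeviceCanAccessPeer(&can_access, i, j);
            if (!can_access) continue;
            cudaError_t err = cudaDeviceEnablePeerAccess(j, 0);
            if (err == cudaSuccess || err == cudaErrorPeerAccessAlreadyEnabled) {
                printf("Enabled NVLINK P2P %d->%d\n", i, j);
            }
        }
    }
}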

@neel-alex

I'm encountering the same issue: Llama 2 70B, 8-bit quantized, 2x A100. Compiled with:

make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=compute_80

Command:

./main -ngl 83 -m ../transformers_cache/llama-2-70b.Q8_0.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "Never gonna give"

Fails with:

CUDA error 1 at ggml-cuda.cu:7044: invalid argument
current device: 1

Whereas setting -ngl 0 and running it entirely on CPU runs fine (if slowly).

@young-developer
Contributor

I assume it may be related to my changes for CUDA memory pools. Once #3931 is merged, try recompiling with GGML_CUDA_FORCE_CUSTOM_MEMORY_POOL and double-check.
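For the Python bindings, passing the define through the CUDA flags should work, something like (untested on my side):

CMAKE_ARGS="-DLLAMA_CUBLAS=on -DCMAKE_CUDA_FLAGS='-DGGML_CUDA_FORCE_CUSTOM_MEMORY_POOL'" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir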

@ggerganov
Owner

@Ph0rk0z Can you bisect at which commit the failure occurs?

@sgoll

sgoll commented Nov 3, 2023

@ggerganov I am seeing the same error. git bisect reveals that commit d606905 (#3903) seems to be the culprit.

PS: As per #2470 (comment) I am compiling with LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0. But the same error happens without that option.

@Ph0rk0z
Author

Ph0rk0z commented Nov 3, 2023

Mine has been broken since: #2268

At first it would crash out when loading the model, the same way setting too high an n_batch does, i.e. trying to allocate massive amounts of system RAM. After the memory pool commits it gives the error above. The memory pool PR does not fix it, but at least it avoids the crash.

@yourbuddyconner

yourbuddyconner commented Nov 4, 2023

For what it's worth, I am seeing this in a fresh build of llama.cpp as well. I am building via the llama-cpp-python package.

(task, pid=12595) ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
(task, pid=12595) ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
(task, pid=12595) ggml_init_cublas: found 4 CUDA devices:
(task, pid=12595)   Device 0: Tesla T4, compute capability 7.5
(task, pid=12595)   Device 1: Tesla T4, compute capability 7.5
(task, pid=12595)   Device 2: Tesla T4, compute capability 7.5
(task, pid=12595)   Device 3: Tesla T4, compute capability 7.5
...
(task, pid=12595) CUDA error 1 at /tmp/pip-install-bxeyyykh/llama-cpp-python_262979da943c43fa9967b3c0a61f8580/vendor/llama.cpp/ggml-cuda.cu:7036: invalid argument
(task, pid=12595) current device: 1

@moatftw

moatftw commented Nov 4, 2023

Same error with CUDA 12.3:

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 4 CUDA devices:
Device 0: NVIDIA A10, compute capability 8.6
Device 1: NVIDIA A10, compute capability 8.6
Device 2: NVIDIA A10, compute capability 8.6
Device 3: NVIDIA A10, compute capability 8.6

...
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA A10) as main device
llm_load_tensors: mem required = 86.05 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 4807.06 MB
..................................................................................................
llama_new_context_with_model: n_ctx = 3900
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 487.50 MB
llama_new_context_with_model: kv self size = 487.50 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 282.00 MB
llama_new_context_with_model: VRAM scratch buffer: 275.37 MB
llama_new_context_with_model: total VRAM used: 5569.93 MB (model: 4807.06 MB, context: 762.87 MB)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |

CUDA error 1 at /tmp/pip-install-5bufkrrh/llama-cpp-python_9a816a9490ba42a78dfd85cdba57cabf/vendor/llama.cpp/ggml-cuda.cu:7036: invalid argument
current device: 1

@riley-access-labs

riley-access-labs commented Nov 4, 2023

Same error here with 2 x T4s using the Python package. It happened to me when redeploying my production Kubernetes environment. I had to quickly downgrade to 1 GPU to get the environment back up. I really do need this fixed ASAP as 1 GPU won't be able to handle load at peak times very well.

@young-developer
Contributor

young-developer commented Nov 4, 2023

Please test changes from #3931. CUDA pools are optional now.

@Ph0rk0z
Author

Ph0rk0z commented Nov 5, 2023

After reverting the CUDA pool stuff, it appears to be working again.

Ph0rk0z closed this as completed Nov 5, 2023
@RachelShalom

I am getting the same error. Should I install a specific version? I installed with:
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
....................................................................................................
llama_new_context_with_model: n_ctx = 3000
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 1500.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 10.02 MB
llama_new_context_with_model: VRAM scratch buffer: 3.40 MB
llama_new_context_with_model: total VRAM used: 3170.43 MB (model: 3167.03 MB, context: 3.40 MB)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
time took to retrive documents is 0.6446716785430908

CUDA error 1 at /tmp/pip-install-1ypw1658/llama-cpp-python_1c1bc0be5c7249408c254fa56f97252b/vendor/llama.cpp/ggml-cuda.cu:7036: invalid argument
current device: 1

@young-developer
Contributor

@RachelShalom Try retesting with the latest version.

@RachelShalom

I installed llama cpp a few hours ago and got this same error. I assume I installed the latest, unless the fix mentioned here is not in a release yet.

@ccbadd

ccbadd commented Nov 6, 2023

> I installed llama cpp a few hours ago and got this same error. I assume I installed the latest, unless the fix mentioned here is not in a release yet.

Did you install llama.cpp or llama-cpp-python? I really don't know how quickly llama.cpp changes propagate to llama-cpp-python.

@RachelShalom

Python, using this:
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python

I am using langchain to load the model. I updated langchain and now I have a new error:

CUDA error 222 at /tmp/pip-install-qcfy69x9/llama-cpp-python_d60a2a3fe09943d5b39a16dab77b98a7/vendor/llama.cpp/ggml-cuda.cu:7043: the provided PTX was compiled with an unsupported toolchain.
current device: 0

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jul_11_02:20:44_PDT_2023
Cuda compilation tools, release 12.2, V12.2.128
Build cuda_12.2.r12.2/compiler.33053471_0

@davidleo1984

I used llama-cpp-python with langchain, and got the same error:
I installed:
CMAKE_ARGS="-DLLAMA_CUBLAS=on -DCMAKE_CUDA_FLAGS='-DGGML_CUDA_FORCE_CUSTOM_MEMORY_POOL'" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
and I also upgraded langchain to 0.0.330

Here is the output:

"
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6
Device 1: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1

...

llm_load_tensors: ggml ctx size = 0.11 MB
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3060) as main device
llm_load_tensors: mem required = 172.97 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloaded 32/35 layers to GPU
llm_load_tensors: VRAM used: 3718.38 MB
..................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 256.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 7.18 MB
llama_new_context_with_model: VRAM scratch buffer: 0.55 MB
llama_new_context_with_model: total VRAM used: 3718.93 MB (model: 3718.38 MB, context: 0.55 MB)

CUDA error 1 at /tmp/pip-install-2o911nrr/llama-cpp-python_7b2f2508c89b451280d9116461f3c9cf/vendor/llama.cpp/ggml-cuda.cu:7036: invalid argument
current device: 1
"

I have two different cards and they worked well with the compiled llama.cpp, but I got this error when I tried llama-cpp-python. :(

@Ph0rk0z
Author

Ph0rk0z commented Nov 7, 2023

I'm using llama-cpp-python too, but I just git pull llama.cpp instead of using his cherry-picked revision. Sometimes that's good and sometimes that's bad.

@jezzarax

Same issue for me on a 2x A100 80GB PCIe setup with #3586. Running with CUDA_VISIBLE_DEVICES=1 works for models that fit. Building with LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0 doesn't help.
My setup works on #3901. I'll try to see if I can find the commit (e.g. #3903, as suspected in this thread) that breaks it.
