Multi-GPU has been broken for me recently. ggml-cuda.cu:7068: invalid argument #3930

Closed
Ph0rk0z opened this issue Nov 3, 2023 · 18 comments

@Ph0rk0z

Ph0rk0z commented Nov 3, 2023

Prerequisites

Please answer the following questions for yourself before submitting an issue.

Git llama.cpp with the Python bindings.

Expected Behavior

Inference works like before.

Current Behavior

Inference fails and llama.cpp crashes.

Environment and Context

python 3.10 / cuda 11.8

Failure Information (for bugs)


llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.26 MB
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llm_load_tensors: mem required  =  140.89 MB
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 83/83 layers to GPU
llm_load_tensors: VRAM used: 39362.61 MB
....................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 1280.00 MB
llama_new_context_with_model: kv self size  = 1280.00 MB
llama_build_graph: non-view tensors processed: 1844/1844
llama_new_context_with_model: compute buffer total size = 574.63 MB
llama_new_context_with_model: VRAM scratch buffer: 568.00 MB
llama_new_context_with_model: total VRAM used: 41210.61 MB (model: 39362.61 MB, context: 1848.00 MB)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
2023-11-02 17:16:43 INFO:Loaded the model in 37.54 seconds.
Enabled NVLINK P2P 0->1
Enabled NVLINK P2P 1->0

CUDA error 1 at /home/supermicro/ai/llama-cpp-python-gguf-cuda/vendor/llama.cpp/ggml-cuda.cu:7068: invalid argument
current device: 1

Relevant Code

I have some printfs for NVLink, as you can see, so the line numbers are a little off, but here is the snippet that sets it off.


                // copy src0, src1 to device if necessary
                if (src1->backend == GGML_BACKEND_GPU && src1_is_contiguous) {
                    if (id != g_main_device) {
                        if (convert_src1_to_q8_1) {
                            char * src1_ddq_i_source = src1_ddq[g_main_device] + src1_ddq_i_offset;
                            CUDA_CHECK(cudaMemcpyAsync(src1_ddq_i, src1_ddq_i_source, src1_ncols*src1_padded_col_size*q8_1_ts/q8_1_bs,   // <-- the failing line
                                                    cudaMemcpyDeviceToDevice, stream));
                        } else {
                            float * src1_ddf_i_source = (float *) src1_extra->data_device[g_main_device];
                            src1_ddf_i_source += (i0*ne11 + src1_col_0) * ne10;
                            CUDA_CHECK(cudaMemcpyAsync(src1_ddf_i, src1_ddf_i_source, src1_ncols*ne10*sizeof(float),
                                                    cudaMemcpyDeviceToDevice, stream));
                        }
                    }

One of the arguments to cudaMemcpyAsync is invalid; I haven't checked yet which one. The day before, it was trying to allocate 5 TB of system RAM after loading the model, but subsequent commits fixed that up. I waited a little to see if the same would happen with this, since the code is so new, and I can't access GitHub from that machine, so I have to bring the logs over here.
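If it helps narrow it down, below is a rough debugging sketch of the kind of thing one could drop in just before the failing CUDA_CHECK to see which argument CUDA rejects; describe_ptr is an illustrative helper of mine, not llama.cpp code.

// Rough sketch (assumes CUDA 11+): print what the runtime thinks each pointer is.
#include <cstdio>
#include <cuda_runtime.h>

static void describe_ptr(const char * name, const void * ptr) {
    cudaPointerAttributes attr;
    cudaError_t err = cudaPointerGetAttributes(&attr, ptr);
    if (err != cudaSuccess) {
        fprintf(stderr, "%s = %p: cudaPointerGetAttributes failed: %s\n",
                name, ptr, cudaGetErrorString(err));
        (void) cudaGetLastError(); // clear the sticky error so later CUDA_CHECKs don't trip on it
        return;
    }
    fprintf(stderr, "%s = %p: type=%d device=%d\n", name, ptr, (int) attr.type, attr.device);
}

// Called just before the cudaMemcpyAsync above:
//   describe_ptr("src1_ddq_i",        src1_ddq_i);
//   describe_ptr("src1_ddq_i_source", src1_ddq_i_source);
//   fprintf(stderr, "bytes=%zu stream=%p\n",
//           (size_t) (src1_ncols*src1_padded_col_size*q8_1_ts/q8_1_bs), (void *) stream);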

It happens with both P40s and 3090s and is independent of whether I force MMQ or not.
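For reference, the "Enabled NVLINK P2P" lines in the log are from printfs I hacked in around the standard CUDA peer-access calls, roughly like the sketch below (illustrative only, not upstream llama.cpp code).

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative sketch: enable peer access between every device pair and log it.
static void enable_p2p_with_logging(int n_devices) {
    for (int i = 0; i < n_devices; ++i) {
        cudaSetDevice(i);
        for (int j = 0; j < n_devices; ++j) {
            if (i == j) continue;
            int can_access = 0;
            cudaDeviceCanAccessPeer(&can_access, i, j);
            if (!can_access) continue;
            cudaError_t err = cudaDeviceEnablePeerAccess(j, 0);
            if (err == cudaSuccess || err == cudaErrorPeerAccessAlreadyEnabled) {
                printf("Enabled NVLINK P2P %d->%d\n", i, j);
            }
        }
    }
}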

@neel-alex

I'm encountering the same issue: Llama 2 70B, 8-bit quantized, 2x A100. Compiled with:

make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=compute_80

Command:

./main -ngl 83 -m ../transformers_cache/llama-2-70b.Q8_0.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "Never gonna give"

Fails with:

CUDA error 1 at ggml-cuda.cu:7044: invalid argument
current device: 1

Whereas setting -ngl 0 and running it entirely on CPU runs fine (if slowly).

@young-developer
Contributor

I assume it may be related to my changes for CUDA memory pools. Once #3931 is merged, try recompiling with GGML_CUDA_FORCE_CUSTOM_MEMORY_POOL and double-check.
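For the Python bindings, passing the define through the CUDA flags should work, something like (untested on my side):

CMAKE_ARGS="-DLLAMA_CUBLAS=on -DCMAKE_CUDA_FLAGS='-DGGML_CUDA_FORCE_CUSTOM_MEMORY_POOL'" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir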

@ggerganov
Owner

@Ph0rk0z Can you bisect at which commit the failure occurs?

@sgoll

sgoll commented Nov 3, 2023

@ggerganov I am seeing the same error. git bisect reveals that commit d606905 (#3903) seems to be the culprit.

PS: As per #2470 (comment) I am compiling with LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0. But the same error happens without that option.

@Ph0rk0z
Author

Ph0rk0z commented Nov 3, 2023

Mine has been broken since: #2268

At first it would crash out when loading the model, the same way setting too high an n_batch does, i.e. trying to allocate massive amounts of system RAM. After the memory pool commits it gives the error above. The memory pool PR does not fix it, but at least it avoids the crash.

@yourbuddyconner

yourbuddyconner commented Nov 4, 2023

For what it's worth, I am seeing this in a fresh build of llama.cpp as well. I am building via the llama-cpp-python package.

(task, pid=12595) ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
(task, pid=12595) ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
(task, pid=12595) ggml_init_cublas: found 4 CUDA devices:
(task, pid=12595)   Device 0: Tesla T4, compute capability 7.5
(task, pid=12595)   Device 1: Tesla T4, compute capability 7.5
(task, pid=12595)   Device 2: Tesla T4, compute capability 7.5
(task, pid=12595)   Device 3: Tesla T4, compute capability 7.5
...
(task, pid=12595) CUDA error 1 at /tmp/pip-install-bxeyyykh/llama-cpp-python_262979da943c43fa9967b3c0a61f8580/vendor/llama.cpp/ggml-cuda.cu:7036: invalid argument
(task, pid=12595) current device: 1

@moatftw

moatftw commented Nov 4, 2023

Same error with CUDA 12.3:

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 4 CUDA devices:
Device 0: NVIDIA A10, compute capability 8.6
Device 1: NVIDIA A10, compute capability 8.6
Device 2: NVIDIA A10, compute capability 8.6
Device 3: NVIDIA A10, compute capability 8.6

...
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA A10) as main device
llm_load_tensors: mem required = 86.05 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 4807.06 MB
..................................................................................................
llama_new_context_with_model: n_ctx = 3900
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 487.50 MB
llama_new_context_with_model: kv self size = 487.50 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 282.00 MB
llama_new_context_with_model: VRAM scratch buffer: 275.37 MB
llama_new_context_with_model: total VRAM used: 5569.93 MB (model: 4807.06 MB, context: 762.87 MB)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |

CUDA error 1 at /tmp/pip-install-5bufkrrh/llama-cpp-python_9a816a9490ba42a78dfd85cdba57cabf/vendor/llama.cpp/ggml-cuda.cu:7036: invalid argument
current device: 1

@riley-access-labs

riley-access-labs commented Nov 4, 2023

Same error here with 2 x T4s using the Python package. It happened to me when redeploying my production Kubernetes environment. I had to quickly downgrade to 1 GPU to get the environment back up. I really do need this fixed ASAP as 1 GPU won't be able to handle load at peak times very well.

@young-developer
Contributor

young-developer commented Nov 4, 2023

Please test changes from #3931. CUDA pools are optional now.

@Ph0rk0z
Author

Ph0rk0z commented Nov 5, 2023

After reverting the CUDA pool stuff, it appears to be working again.

Ph0rk0z closed this as completed Nov 5, 2023
@RachelShalom

I am getting the same error. Should I install a specific version? I installed with:
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
....................................................................................................
llama_new_context_with_model: n_ctx = 3000
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 1500.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 10.02 MB
llama_new_context_with_model: VRAM scratch buffer: 3.40 MB
llama_new_context_with_model: total VRAM used: 3170.43 MB (model: 3167.03 MB, context: 3.40 MB)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
time took to retrive documents is 0.6446716785430908

CUDA error 1 at /tmp/pip-install-1ypw1658/llama-cpp-python_1c1bc0be5c7249408c254fa56f97252b/vendor/llama.cpp/ggml-cuda.cu:7036: invalid argument
current device: 1

@young-developer
Contributor

@RachelShalom Try retesting with the latest version.

@RachelShalom

I installed llama cpp a few hours ago and got this same error. I assume I installed the latest, unless the fix mentioned here is not in a release yet.

@ccbadd

ccbadd commented Nov 6, 2023

> I installed llama cpp a few hours ago and got this same error. I assume I installed the latest, unless the fix mentioned here is not in a release yet.

Did you install llama.cpp or llama-cpp-python? I really don't know how quickly llama.cpp changes propagate to llama-cpp-python.

@RachelShalom

Python, using this:
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python

I am using langchain to load the model. I updated langchain and now I have a new error:

CUDA error 222 at /tmp/pip-install-qcfy69x9/llama-cpp-python_d60a2a3fe09943d5b39a16dab77b98a7/vendor/llama.cpp/ggml-cuda.cu:7043: the provided PTX was compiled with an unsupported toolchain.
current device: 0

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jul_11_02:20:44_PDT_2023
Cuda compilation tools, release 12.2, V12.2.128
Build cuda_12.2.r12.2/compiler.33053471_0

@davidleo1984

I used llama-cpp-python with langchain, and got the same error:
I installed:
CMAKE_ARGS="-DLLAMA_CUBLAS=on -DCMAKE_CUDA_FLAGS='-DGGML_CUDA_FORCE_CUSTOM_MEMORY_POOL'" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
and I also upgraded langchain to 0.0.330

Here is the output:

"
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6
Device 1: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1

...

llm_load_tensors: ggml ctx size = 0.11 MB
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3060) as main device
llm_load_tensors: mem required = 172.97 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloaded 32/35 layers to GPU
llm_load_tensors: VRAM used: 3718.38 MB
..................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 256.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 7.18 MB
llama_new_context_with_model: VRAM scratch buffer: 0.55 MB
llama_new_context_with_model: total VRAM used: 3718.93 MB (model: 3718.38 MB, context: 0.55 MB)

CUDA error 1 at /tmp/pip-install-2o911nrr/llama-cpp-python_7b2f2508c89b451280d9116461f3c9cf/vendor/llama.cpp/ggml-cuda.cu:7036: invalid argument
current device: 1
"

I have two different cards and they worked well with the compiled llama.cpp, but I got this error when I tried llama-cpp-python. :(

@Ph0rk0z
Author

Ph0rk0z commented Nov 7, 2023

I'm using llama-cpp-python too, but I just git pull llama.cpp instead of using his cherry-picked revision. Sometimes that's good and sometimes that's bad.

@jezzarax

Same issue for me on a 2x A100 80GB PCIe setup with #3586. Running with CUDA_VISIBLE_DEVICES=1 works for models that fit. Building with LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0 doesn't help.
My setup works on #3901. I'll try to see if I can find the commit (e.g. #3903, as suspected in this thread) that breaks it.
