Command R Plus crashed on large context (~40K) with CUDA #6948

Closed
TomoshibiAkira opened this issue Apr 27, 2024 · 9 comments

Comments

@TomoshibiAkira

I tested Command R Plus on 4 L20 cards with a maximum 64K context, with 64 layers offloaded to GPU (16 layers per card).
My prompt is relatively large, around 50K tokens. During the prefill phase, llama.cpp crashed at ~40K tokens.

Here's the error message:

CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_synchronize at /root/llama-cpp-python/vendor/llama.cpp/ggml-cuda.cu:2403
  cudaStreamSynchronize(cuda_ctx->stream())
GGML_ASSERT: /root/llama-cpp-python/vendor/llama.cpp/ggml-cuda.cu:60: !"CUDA error"
Aborted (core dumped)

I'm using @dranger003's Q6_K model with the perplexity-test fix from #6491 applied.
I also tested with a 32K context, and it works fine.

@TomoshibiAkira TomoshibiAkira changed the title Command R Plus crashed on large context (~40K) Command R Plus crashed on large context (~40K) with CUDA Apr 27, 2024
@dranger003
Contributor

dranger003 commented Apr 28, 2024

I can reproduce this issue. I don't think it's specific to this quant, but I'll test other ones.

$ ./build/bin/main -ngl 24 -c 65536 -f prompt.txt -m ggml-c4ai-command-r-plus-104b-q6_k.gguf
Log start
main: build = 2749 (928e0b70)
main: built with cc (GCC) 13.2.1 20240417 for x86_64-pc-linux-gnu
...
CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_synchronize at ~/src/ggerganov/llama.cpp/ggml-cuda.cu:2403
  cudaStreamSynchronize(cuda_ctx->stream())
GGML_ASSERT: ~/src/ggerganov/llama.cpp/ggml-cuda.cu:60: !"CUDA error"
ptrace: Operation not permitted.
No stack.
The program is not being run.
Aborted (core dumped)

EDIT: Same issue with other quants.

@dranger003
Contributor

I also tested with PR #6563 and it doesn't resolve this issue; I suspect it may be integer-overflow related.

@slaren
Collaborator

slaren commented Apr 28, 2024

You can run it under compute-sanitizer to find out which kernel is causing the illegal memory access.

@TomoshibiAkira
Author

TomoshibiAkira commented Apr 28, 2024

compute-sanitizer is way too slow, and since my instance is billed by time, I tried a debug build instead.
TL;DR: It seems to be either ggml_cuda_op_mul_mat_cublas or ggml_cuda_op_mul? Might not be integer-overflow related.

It might crash at

llama.cpp/ggml-cuda.cu

Lines 1259 to 1267 in 7bb36cc

CUBLAS_CHECK(
    cublasGemmEx(ctx.cublas_handle(id), CUBLAS_OP_T, CUBLAS_OP_N,
            row_diff, src1_ncols, ne10,
            &alpha_f16, src0_ptr,      CUDA_R_16F, ne00,
                        src1_ptr,      CUDA_R_16F, ne10,
            &beta_f16,  dst_f16.get(), CUDA_R_16F, ldc,
            CUBLAS_COMPUTE_16F,
            CUBLAS_GEMM_DEFAULT_TENSOR_OP));

CUDA error: CUBLAS_STATUS_EXECUTION_FAILED
  current device: 2, in function ggml_cuda_op_mul_mat_cublas at /root/llama-cpp-python/vendor/llama.cpp/ggml-cuda.cu:1261
  cublasGemmEx(ctx.cublas_handle(id), CUBLAS_OP_T, CUBLAS_OP_N, row_diff, src1_ncols, ne10, &alpha_f16, src0_ptr, CUDA_R_16F, ne00, src1_ptr, CUDA_R_16F, ne10, &beta_f16, dst_f16.get(), CUDA_R_16F, ldc, CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
GGML_ASSERT: /root/llama-cpp-python/vendor/llama.cpp/ggml-cuda.cu:60: !"CUDA error"

All parameters of this function use int64_t, so I don't think there's an overflow problem here.


It sometimes also crashed at

llama.cpp/ggml-cuda.cu

Lines 2297 to 2301 in 7bb36cc

cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
    fprintf(stderr, "%s: %s failed\n", __func__, ggml_op_desc(dst));
    CUDA_CHECK(err);
}

ggml_cuda_compute_forward: MUL failed
CUDA error: an illegal memory access was encountered
  current device: 2, in function ggml_cuda_compute_forward at /root/llama-cpp-python/vendor/llama.cpp/ggml-cuda.cu:2302
  err
GGML_ASSERT: /root/llama-cpp-python/vendor/llama.cpp/ggml-cuda.cu:60: !"CUDA error"

Digging into the actual implementation of the MUL op in binbcast.cu, there are a bunch of implicit downcasts from int64_t to int. I'm not sure whether these are intentional, so I went in and changed all the related variables to int64_t, but the problem persists.

BTW, I'm compiling with CUDA 11.8.

@slaren
Collaborator

slaren commented Apr 28, 2024

The kernel launches are async, so knowing which call returned the error is not very useful, because any previously launched kernel may have caused it. You could try setting the CUDA_LAUNCH_BLOCKING environment variable to avoid this, but even then it may not be clear which kernel caused the issue without looking at the call stack.

@TomoshibiAkira
Author

The kernel launches are async, so knowing which call returned the error is not very useful, because any previously launched kernel may have caused it. You could try setting the CUDA_LAUNCH_BLOCKING environment variable to avoid this, but even then it may not be clear which kernel caused the issue without looking at the call stack.

Got it. Would synchronizing manually inside ggml_backend_cuda_graph_compute work? That way we could ensure every op has finished before the next graph node is processed.

@slaren
Collaborator

slaren commented Apr 28, 2024

Setting CUDA_LAUNCH_BLOCKING makes all kernel launches synchronous, so that shouldn't be necessary.

@dranger003
Contributor

That took several hours... hopefully this is useful.

compute-sanitizer
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (32,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c03080 is out of bounds
=========     and is 5,026,885,761 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (33,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c03084 is out of bounds
=========     and is 5,026,885,765 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= [13 further records identical to the above follow, for threads (34,0,0) through (46,0,0) in block (48771,0,0); each is an invalid 4-byte global read at an address 4 bytes past the previous one, with the same host backtrace through ggml_cuda_op_soft_max]
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (47,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c030bc is out of bounds
=========     and is 5,026,885,821 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (48,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c030c0 is out of bounds
=========     and is 5,026,885,825 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (49,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c030c4 is out of bounds
=========     and is 5,026,885,829 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (50,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c030c8 is out of bounds
=========     and is 5,026,885,833 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (51,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c030cc is out of bounds
=========     and is 5,026,885,837 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (52,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c030d0 is out of bounds
=========     and is 5,026,885,841 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (53,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c030d4 is out of bounds
=========     and is 5,026,885,845 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (54,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c030d8 is out of bounds
=========     and is 5,026,885,849 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (55,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c030dc is out of bounds
=========     and is 5,026,885,853 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (56,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c030e0 is out of bounds
=========     and is 5,026,885,857 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (57,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c030e4 is out of bounds
=========     and is 5,026,885,861 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (58,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c030e8 is out of bounds
=========     and is 5,026,885,865 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (59,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c030ec is out of bounds
=========     and is 5,026,885,869 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (60,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c030f0 is out of bounds
=========     and is 5,026,885,873 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (61,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c030f4 is out of bounds
=========     and is 5,026,885,877 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (62,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c030f8 is out of bounds
=========     and is 5,026,885,881 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= ... (the same invalid 4-byte read in soft_max_f32 is reported for every subsequent thread of block (48771,0,0), with an identical host backtrace through ggml_cuda_op_soft_max -> ggml_backend_cuda_graph_compute -> llama_decode; remaining sanitizer output truncated)
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (81,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c03144 is out of bounds
=========     and is 5,026,885,957 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (82,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c03148 is out of bounds
=========     and is 5,026,885,961 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (83,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c0314c is out of bounds
=========     and is 5,026,885,965 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (84,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c03150 is out of bounds
=========     and is 5,026,885,969 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (85,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c03154 is out of bounds
=========     and is 5,026,885,973 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (86,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c03158 is out of bounds
=========     and is 5,026,885,977 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (87,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c0315c is out of bounds
=========     and is 5,026,885,981 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (88,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c03160 is out of bounds
=========     and is 5,026,885,985 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (89,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c03164 is out of bounds
=========     and is 5,026,885,989 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (90,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c03168 is out of bounds
=========     and is 5,026,885,993 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (91,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c0316c is out of bounds
=========     and is 5,026,885,997 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (92,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c03170 is out of bounds
=========     and is 5,026,886,001 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (93,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c03174 is out of bounds
=========     and is 5,026,886,005 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (94,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c03178 is out of bounds
=========     and is 5,026,886,009 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (95,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c0317c is out of bounds
=========     and is 5,026,886,013 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (96,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c03180 is out of bounds
=========     and is 5,026,886,017 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= [... identical "Invalid __global__ read of size 4 bytes" reports in soft_max_f32 repeat for threads 97 through 114 of block (48771,0,0), with the faulting address advancing 4 bytes per thread; remaining sanitizer output truncated ...]
=========     by thread (115,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c031cc is out of bounds
=========     and is 5,026,886,093 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (116,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c031d0 is out of bounds
=========     and is 5,026,886,097 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (117,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c031d4 is out of bounds
=========     and is 5,026,886,101 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (118,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c031d8 is out of bounds
=========     and is 5,026,886,105 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (119,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c031dc is out of bounds
=========     and is 5,026,886,109 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (120,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c031e0 is out of bounds
=========     and is 5,026,886,113 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (121,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c031e4 is out of bounds
=========     and is 5,026,886,117 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (122,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c031e8 is out of bounds
=========     and is 5,026,886,121 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (123,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c031ec is out of bounds
=========     and is 5,026,886,125 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (124,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c031f0 is out of bounds
=========     and is 5,026,886,129 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (125,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c031f4 is out of bounds
=========     and is 5,026,886,133 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (126,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c031f8 is out of bounds
=========     and is 5,026,886,137 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (127,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c031fc is out of bounds
=========     and is 5,026,886,141 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (0,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c03000 is out of bounds
=========     and is 5,026,885,633 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (1,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c03004 is out of bounds
=========     and is 5,026,885,637 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (2,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c03008 is out of bounds
=========     and is 5,026,885,641 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
========= Invalid __global__ read of size 4 bytes
=========     at void soft_max_f32<(bool)0, (int)0, (int)0>(const float *, const float *, const float *, float *, int, int, float, float, float, float, unsigned int)+0x540
=========     by thread (3,0,0) in block (48771,0,0)
=========     Address 0x78b7d8c0300c is out of bounds
=========     and is 5,026,885,645 bytes after the nearest allocation at 0x78b6aca00000 of size 8,388,608 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2efadf]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:ggml_cuda_op_soft_max(ggml_backend_cuda_context&, ggml_tensor*) [0x1a5d62]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x1b4927]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:ggml_backend_sched_graph_compute_async [0x1285c4]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:llama_decode [0x69d6c]
=========                in ~/src/ggerganov/llama.cpp/build/libllama.so
=========     Host Frame:main [0x14548]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========     Host Frame: [0x25ccf]
=========                in /usr/lib/libc.so.6
=========     Host Frame:__libc_start_main [0x25d89]
=========                in /usr/lib/libc.so.6
=========     Host Frame:_start [0x18344]
=========                in ~/src/ggerganov/llama.cpp/./build/bin/main
=========
CUDA error: unspecified launch failure
  current device: 0, in function ggml_backend_cuda_synchronize at ~/src/ggerganov/llama.cpp/ggml-cuda.cu:2403
  cudaStreamSynchronize(cuda_ctx->stream())
GGML_ASSERT: ~/src/ggerganov/llama.cpp/ggml-cuda.cu:60: !"CUDA error"
========= Error: process didn't terminate successfully
========= Target application returned an error
========= ERROR SUMMARY: 6849 errors
========= ERROR SUMMARY: 6749 errors were not printed. Use --print-limit option to adjust the number of printed errors

@dranger003
Contributor

These changes fix the issue; I've updated my branch from PR #6563:

diff --git a/ggml-cuda/softmax.cu b/ggml-cuda/softmax.cu
index 9bda18e..eb586dc 100644
--- a/ggml-cuda/softmax.cu
+++ b/ggml-cuda/softmax.cu
@@ -28,7 +28,7 @@ static __global__ void soft_max_f32(const float * x, const float * mask, const f
     extern __shared__ float data_soft_max_f32[];
     float * buf_iw = data_soft_max_f32; // shared memory buffer for inter-warp communication
     // shared memory buffer to cache values between iterations:
-    float * vals = vals_smem ? buf_iw + WARP_SIZE : dst + rowx*ncols;
+    float * vals = vals_smem ? buf_iw + WARP_SIZE : dst + (int64_t)rowx*ncols;

     float max_val = -INFINITY;

@@ -40,8 +40,8 @@ static __global__ void soft_max_f32(const float * x, const float * mask, const f
             break;
         }

-        const int ix = rowx*ncols + col;
-        const int iy = rowy*ncols + col;
+        const int64_t ix = (int64_t)rowx*ncols + col;
+        const int64_t iy = (int64_t)rowy*ncols + col;

         const float val = x[ix]*scale + (mask ? mask[iy] : 0.0f) + (pos ? slope*pos[col] : 0.0f);

@@ -109,12 +109,12 @@ static __global__ void soft_max_f32(const float * x, const float * mask, const f
             return;
         }

-        const int idst = rowx*ncols + col;
+        const int64_t idst = (int64_t)rowx*ncols + col;
         dst[idst] = vals[col] * inv_sum;
     }
 }
