
Fix more int overflow during quant (PPL/CUDA). #6563

Merged — 3 commits merged into ggerganov:master from ppl-int-overflow-fix on Apr 28, 2024

Conversation

dranger003 (Contributor) commented Apr 9, 2024

Running perplexity on Command-R+ using CUDA is currently broken without this commit (more info here: #6491 (comment)).
Although perplexity now works with all tested quants, I may have moved more vars to int64_t than strictly needed.

slaren (Collaborator) commented Apr 9, 2024

It would be good to have a set of tests in test-backend-ops that use very large tensors to check for overflows. That will also allow testing other backends. These tests will probably take too long to be enabled by default, but they can be left behind an #ifdef or command line parameter.
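Purely as an illustration of the failure mode such large-tensor tests would need to exercise (a standalone snippet, not the test-backend-ops API): once a tensor holds more than 2^31 elements, a 32-bit element index wraps around while an int64_t one does not.

    // Standalone sketch: demonstrates the index wrap that large-tensor tests would catch.
    #include <cstdint>
    #include <cstdio>

    int main() {
        const int64_t n_elements = 3LL*1024*1024*1024;  // ~3.2e9 elements, > INT32_MAX
        const int     i32 = (int)(n_elements - 1);      // wraps to a negative value
        const int64_t i64 = n_elements - 1;             // correct 64-bit index
        printf("last index as int: %d, as int64_t: %lld\n", i32, (long long)i64);
        return 0;
    }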

github-actions bot commented Apr 9, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 435 iterations 🚀

Details (performance-related PR only):
  • Concurrent users: 8, duration: 10m
  • HTTP request: avg=10823.95ms p(95)=29142.4ms fails=, finish reason: stop=380 truncated=55
  • Prompt processing (pp): avg=122.92tk/s p(95)=555.33tk/s
  • Token generation (tg): avg=26.03tk/s p(95)=38.23tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=ppl-int-overflow-fix commit=0258f9bd3ddbcbfafcfd8019e8902f4cecc9c276

Benchmark charts (llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 435 iterations; data not reproduced here):
  • prompt_tokens_seconds
  • predicted_tokens_seconds
  • kv_cache_usage_ratio
  • requests_processing

jxy (Contributor) commented Apr 10, 2024

Given blockIdx.x, blockDim.x, and threadIdx.x are all basically uint32_t, we could keep some of those as uint32_t and only cast them to uint64_t or int64_t when actually necessary.
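A rough sketch of that idea (a hypothetical dequantize-style kernel, not code from this PR): the per-thread index stays a plain 32-bit value, and only the flattened element offset, which scales with the tensor size, is widened before the multiply.

    // Hypothetical kernel: tid is small and can stay 32-bit; the global element index
    // is widened to int64_t because blockDim.x*blockIdx.x alone can exceed INT32_MAX
    // once a tensor has more than ~2^31 elements. k is assumed to be even.
    static __global__ void dequantize_bytes(const unsigned char * __restrict__ vx,
                                            float * __restrict__ y, const int64_t k) {
        const int     tid = threadIdx.x;                               // 0..blockDim.x-1
        const int64_t i   = 2*((int64_t)blockDim.x*blockIdx.x + tid);  // widen before the multiply

        if (i >= k) {
            return;
        }
        y[i + 0] = (float) vx[i + 0];  // placeholder "dequantize": byte -> float
        y[i + 1] = (float) vx[i + 1];
    }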

randoentity commented Apr 10, 2024

Edit: Ignore below. I was using the wrong environment. Retesting.
Edit 2: Either I'm doing something wrong with my environment or there's some regression, because I keep getting a segmentation fault in text-generation-webui where it was working before. Most likely the former. It's working fine with llama-cpp-python serving to SillyTavern, and no repetition issues that way either!

I'm still getting a segmentation fault when running inference on both the latest master and this branch. I've tried ggml-c4ai-command-r-plus-104b-iq3_xs.gguf and ggml-c4ai-command-r-plus-104b-iq4_xs.gguf (I know about gguf --merge). At an earlier commit inference did work, although the model went into a repetition loop (I've seen this mentioned on Reddit as well).
I'm running inference through text-generation-webui out of habit.
Sorry if I'm missing some key detail.

JohannesGaessler (Collaborator) commented, quoting jxy:

Given blockIdx.x, blockDim.x, and threadIdx.x are all basically uint32_t, we could keep some of those as uint32_t and only cast them to uint64_t or int64_t when actually necessary.

There are two disadvantages of 64-bit integers relative to 32-bit integers: they need two registers and they are slower. But for dequantize kernels I would intuitively assume that this is not going to matter, because you need very few registers and you're going to be heavily I/O bound anyway. So for simplicity I would say to just use 64 bits throughout unless someone can demonstrate that this actually makes a performance difference (I'm not seeing any performance difference on my RTX 3090; my other GPUs are currently busy).
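As a rough illustration of the "just use 64 bits throughout" option (a hypothetical kernel and host wrapper, not code from this PR), including the host-side round-up, which is another spot where a 32-bit element count can overflow:

    #include <cuda_runtime.h>

    // Every index and count is int64_t; dequantize-style kernels are memory-bound,
    // so the extra registers are unlikely to show up in benchmarks.
    static __global__ void scale_f32(float * __restrict__ x, const float v, const int64_t k) {
        const int64_t i = (int64_t)blockDim.x*blockIdx.x + threadIdx.x;
        if (i >= k) {
            return;
        }
        x[i] *= v;
    }

    static void scale_f32_cuda(float * x, const float v, const int64_t k, cudaStream_t stream) {
        const int64_t block_size = 256;
        // 64-bit round-up: with an int k, the addition itself could wrap near INT32_MAX
        const int64_t num_blocks = (k + block_size - 1) / block_size;
        scale_f32<<<(unsigned int)num_blocks, (unsigned int)block_size, 0, stream>>>(x, v, k);
    }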

dranger003 (Contributor, Author) commented:

Just saw @JohannesGaessler's comment (after I pushed the revert). I can revert the revert if that is decided to be the right approach.

JohannesGaessler (Collaborator) commented:

In this case I would personally prefer to just use 64-bit ints consistently, but ultimately I would say either way is fine. The biggest issue would have been the additional effort of actually changing the code, but that has already been done anyway.

JohannesGaessler (Collaborator) commented:

I completely forgot about this PR. @slaren even without the tests, do you think we should just merge it, given that it seems to fix the issue for at least one backend?

slaren (Collaborator) commented Apr 28, 2024

Yes absolutely, we should merge this now if it solves the immediate problem. The changes look good to me.

ggml-cuda.cu Outdated
@@ -1225,7 +1225,7 @@ static void ggml_cuda_op_mul_mat_cublas(

// the main device has a larger memory buffer to hold the results from all GPUs
// ldc == nrows of the matrix that cuBLAS writes into
-    int64_t ldc = id == ctx.device ? ne0 : row_diff;
+    int ldc = id == ctx.device ? ne0 : row_diff;
Collaborator:

Wait, why is this being changed? I thought the problem was that certain ints had too few bits for large models.

Collaborator:

Did you maybe, in response to one of my earlier comments, accidentally change more places than just the ones originally touched in this PR?

dranger003 (Contributor, Author):

@JohannesGaessler This one was reverted following an earlier comment questioning why it was changed in the first place. As previously mentioned, I have limited knowledge about these vars and rely on others' expertise for the review. And because of the large number of ints that were overflowing, I had to guess and change them in batches until all the crashes were fixed, so I most likely changed more than needed.

Collaborator:

It's fine to change more ints to int64_t than necessary. But this is a change where a value that was int64_t on master became int with your PR. I think this was done by accident when you reverted some of your other changes.

dranger003 (Contributor, Author):

That is because my previous PR was merged into master; this is a subsequent PR. I can revert them if needed.

dranger003 (Contributor, Author):

The revert is in a single commit (dranger003@9acb43d), so if these are all fine I can delete that one commit.

Collaborator:

Just delete the commit, I'd say. Using int64_t has no disadvantages other than maybe slightly worse performance, and I was not able to measure any performance difference whatsoever.

dranger003 (Contributor, Author):

I pushed a rebase to remove the revert commit.

Collaborator:

It seems this particular change is still there. Revert it and I'll merge.

Collaborator:

cublasGemmEx takes an int anyway, so this doesn't really matter. There is a 64-bit interface to cublas, but I don't think there are any cases where a single dimension is larger than 2^31-1.
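For illustration only (the helper name is hypothetical, not ggml or cuBLAS code): one way to keep the dimension math in 64 bits on the ggml side and still pass a checked 32-bit value to an int parameter such as ldc.

    #include <cassert>
    #include <cstdint>
    #include <limits>

    // Hypothetical helper: compute dimensions in int64_t, then verify the value fits
    // the 32-bit int that cublasGemmEx expects for its sizes and leading dimensions.
    static inline int narrow_to_int(int64_t v) {
        assert(v >= 0 && v <= std::numeric_limits<int>::max());
        return (int) v;
    }

    // sketch of use at the call site discussed above:
    //   const int ldc = narrow_to_int(id == ctx.device ? ne0 : row_diff);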

ggml-cuda.cu Outdated
Comment on lines 1709 to 1710
int i13 = blockIdx.x * blockDim.x + threadIdx.x;
int i12 = blockIdx.y * blockDim.y + threadIdx.y;
Collaborator:

Same question, why the int64_t -> int change?

@@ -5,16 +5,16 @@

template <int qk, int qr, dequantize_kernel_t dequantize_kernel, typename dst_t>
static __global__ void dequantize_block(const void * __restrict__ vx, dst_t * __restrict__ y, const int64_t k) {
-    const int64_t i = 2*(blockDim.x*blockIdx.x + threadIdx.x);
+    const int i = 2*(blockDim.x*blockIdx.x + threadIdx.x);
Collaborator:

Same question.

Comment on lines 320 to 323
-    const int64_t tid = threadIdx.x;
-    const int64_t ip = tid/32; // ip is 0 or 1
-    const int64_t il = tid - 32*ip; // 0...32
-    const int64_t is = 8*ip + il/16;
+    const int tid = threadIdx.x;
+    const int ip = tid/32; // ip is 0 or 1
+    const int il = tid - 32*ip; // 0...32
+    const int is = 8*ip + il/16;
Collaborator:

Same question.

Comment on lines 340 to 342
-    const int64_t tid = threadIdx.x;
-    const int64_t ip = tid/16; // 0 or 1
-    const int64_t il = tid - 16*ip; // 0...15
+    const int tid = threadIdx.x;
+    const int ip = tid/16; // 0 or 1
+    const int il = tid - 16*ip; // 0...15
Collaborator:

Same question.

-    const int64_t i = (int64_t)blockDim.x*blockIdx.x + threadIdx.x;
+    const int i = blockDim.x*blockIdx.x + threadIdx.x;
Collaborator:

Same question.

dranger003 (Contributor, Author):

These were all originally int and I reverted them to avoid changing more than needed.

Collaborator:

No they were not. Go to the "files changed" tab and look at the combined changes of all of your commits relative to master.

dranger003 (Contributor, Author):

I changed them in PR #6491.

JohannesGaessler merged commit e00b4a8 into ggerganov:master on Apr 28, 2024 (51 of 58 checks passed).
dranger003 (Contributor, Author) commented:

Closes #6948.

dranger003 deleted the ppl-int-overflow-fix branch on May 1, 2024, at 11:29.
nopperl pushed a commit to nopperl/llama.cpp that referenced this pull request May 5, 2024
* Fix more int overflow during quant.

* Fix some more int overflow in softmax.

* Revert back to int64_t.