
Encoder is broken when CUBLAS is ON #1688

Closed
bobqianic opened this issue Dec 26, 2023 · 23 comments
Labels
bug: Something isn't working
help wanted: Extra attention is needed
high priority: Very important issue

Comments

@bobqianic
Collaborator

bobqianic commented Dec 26, 2023

This occurs with the tiny, base, small, medium, and large models.
None of the models used are quantized.

    // dump the conv embedding tensor to JSON for offline comparison (requires <fstream> and <vector>)
    ggml_tensor * tensor = wctx.state->embd_conv;
    std::vector<float> tensor_data(ggml_nelements(tensor));
    ggml_backend_tensor_get(tensor, tensor_data.data(), 0, ggml_nbytes(tensor));
    std::ofstream outFile("encoder_embedding_conv.json");
    outFile << "[";
    for (uint64_t i = 0; i < tensor_data.size() - 1; i++) {
        outFile << tensor_data[i] << ", ";
    }
    outFile << tensor_data[tensor_data.size() - 1] << "]";
    outFile.close();
    return 0;

CUDA:

[screenshots of the CUDA output]

CPU:

[screenshots of the CPU output]

encoder_embedding_conv.zip

@bobqianic added the bug, help wanted, and high priority labels Dec 26, 2023
@bobqianic
Collaborator Author

bobqianic commented Dec 26, 2023

Regarding the encoder's final output: I compared the CUDA output with the CPU output.

This output reflects the behavior of one specific audio sample; different samples can behave differently. For instance, some audio files yield divergent results with the medium model, while others do so with the small model.

    // dump the final encoder embedding to JSON for offline comparison
    ggml_tensor * tensor = wctx->state->embd_enc;
    std::vector<float> tensor_data(ggml_nelements(tensor));
    ggml_backend_tensor_get(tensor, tensor_data.data(), 0, ggml_nbytes(tensor));
    std::ofstream outFile("encoder_embedding.json");
    outFile << "[";
    for (uint64_t i = 0; i < tensor_data.size() - 1; i++) {
        outFile << tensor_data[i] << ", ";
    }
    outFile << tensor_data[tensor_data.size() - 1] << "]";
    outFile.close();
    return 0;

Tiny:

[screenshot]

Base:

[screenshot]

Small:

[screenshot]

Medium:

[screenshot]

Large:

[screenshot]

@bobqianic
Collaborator Author

Do you know how we could fix this? @slaren

@slaren
Collaborator

slaren commented Dec 26, 2023

We would need to find the op that is producing wrong results in CUDA. The easiest way to do this is to use ggml_backend_compare_graph_backend to run the graph on both the CPU and CUDA backends at the same time and compare the results. test-backend-ops shows how to do this. If you already know or suspect which op may be the issue, you can add a test case in test-backend-ops to confirm it.
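For reference, a minimal sketch of that approach (backend_cuda and gf are assumed to already exist in the caller; the callback is invoked once per graph node, with t1 computed by the first backend and t2 by the second):

static bool compare_cb(int index, ggml_tensor * t1, ggml_tensor * t2, void * user_data) {
    // t1 is the node result from the backend under test (CUDA here),
    // t2 is the same node recomputed on the CPU backend;
    // compare them (e.g. against an NMSE threshold) and return true to keep going
    GGML_UNUSED(index);
    GGML_UNUSED(user_data);
    return true;
}

ggml_backend_t backend_cpu = ggml_backend_cpu_init();
ggml_backend_compare_graph_backend(backend_cuda, backend_cpu, gf, compare_cb, NULL);
ggml_backend_free(backend_cpu);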

@bobqianic
Collaborator Author

bobqianic commented Dec 26, 2023

Seems like we're dealing with a nightmare here. I'll do my best to pinpoint the operation at the heart of the issue : )

Wait a second, my mistake, I made an error when comparing the second graph.

(Edit: I think I know why comparing the second graph failed: a tensor was being computed twice, once by each backend, which broke the comparison. What I need to do is modify the graph to take a copy of that initial tensor instead.)

I don't know how to fix this : (

CPU backend vs. CUDA backend

Large-v2 (whisper_build_graph_conv)

[MUL_MAT] NMSE = 0.000001 [ADD] NMSE = 0.000001 [GELU] NMSE = 0.000001 [IM2COL] NMSE = 0.000001 [MUL_MAT] NMSE = 0.000007 [ADD] NMSE = 0.000006 [GELU] NMSE = 0.000010 FAIL

Large-v2 (whisper_build_graph_encoder)

[MUL_MAT] NMSE = 0.000000 [SCALE] NMSE = 0.000000 [SOFT_MAX] NMSE = 0.000008 [MUL_MAT] NMSE = 0.000000 [CPY] NMSE = 0.000000 [MUL_MAT] NMSE = 0.000000 [ADD] NMSE = 0.000000 [NORM] NMSE = 0.000000 [MUL] NMSE = 0.000000 [ADD] NMSE = 0.000000 [MUL_MAT] NMSE = 0.000000 [ADD] NMSE = 0.000000 [GELU] NMSE = 0.000001 [MUL_MAT] NMSE = 0.000001 [ADD] NMSE = 0.000001 [ADD] NMSE = 0.000000 [NORM] NMSE = 0.000001 [MUL] NMSE = 0.000001 [ADD] NMSE = 0.000001 [MUL_MAT] NMSE = 0.000001 [ADD] NMSE = 0.000001 [CPY] NMSE = 0.000001 [MUL_MAT] NMSE = 0.000001 [CPY] NMSE = 0.000001 [MUL_MAT] NMSE = 0.000001 [ADD] NMSE = 0.000001 [CPY] NMSE = 0.000001 [MUL_MAT] NMSE = 0.000001 [SCALE] NMSE = 0.000001 [SOFT_MAX] NMSE = 0.000007 [MUL_MAT] NMSE = 0.000001 [CPY] NMSE = 0.000001 [MUL_MAT] NMSE = 0.000002 [ADD] NMSE = 0.000002 [ADD] NMSE = 0.000000 [NORM] NMSE = 0.000001 [MUL] NMSE = 0.000001 [ADD] NMSE = 0.000001 [MUL_MAT] NMSE = 0.000000 [ADD] NMSE = 0.000000 [GELU] NMSE = 0.000001 [MUL_MAT] NMSE = 0.000003 [ADD] NMSE = 0.000003 [ADD] NMSE = 0.000000 [NORM] NMSE = 0.000001 [MUL] NMSE = 0.000001 [ADD] NMSE = 0.000001 [MUL_MAT] NMSE = 0.000002 [ADD] NMSE = 0.000002 [CPY] NMSE = 0.000002 [MUL_MAT] NMSE = 0.000001 [CPY] NMSE = 0.000001 [MUL_MAT] NMSE = 0.000001 [ADD] NMSE = 0.000001 [CPY] NMSE = 0.000001 [MUL_MAT] NMSE = 0.000001 [SCALE] NMSE = 0.000001 [SOFT_MAX] NMSE = 0.000011 [MUL_MAT] NMSE = 0.000002 [CPY] NMSE = 0.000002 [MUL_MAT] NMSE = 0.000003 [ADD] NMSE = 0.000002 [ADD] NMSE = 0.000000 [NORM] NMSE = 0.000001 [MUL] NMSE = 0.000002 [ADD] NMSE = 0.000001 [MUL_MAT] NMSE = 0.000000 [ADD] NMSE = 0.000000 [GELU] NMSE = 0.000001 [MUL_MAT] NMSE = 0.000004 [ADD] NMSE = 0.000004 [ADD] NMSE = 0.000000 [NORM] NMSE = 0.000001 [MUL] NMSE = 0.000002 [ADD] NMSE = 0.000002 [MUL_MAT] NMSE = 0.000002 [ADD] NMSE = 0.000002 [CPY] NMSE = 0.000002 [MUL_MAT] NMSE = 0.000001 [CPY] NMSE = 0.000001 [MUL_MAT] NMSE = 0.000001 [ADD] NMSE = 0.000000 [CPY] NMSE = 0.000000 [MUL_MAT] NMSE = 0.000001 [SCALE] NMSE = 0.000001 [SOFT_MAX] NMSE = 0.000004 [MUL_MAT] NMSE = 0.000001 [CPY] NMSE = 0.000001 [MUL_MAT] NMSE = 0.000001 [ADD] NMSE = 0.000001 [ADD] NMSE = 0.000000 [NORM] NMSE = 0.000001 [MUL] NMSE = 0.000002 [ADD] NMSE = 0.000001 [MUL_MAT] NMSE = 0.000000 [ADD] NMSE = 0.000000 [GELU] NMSE = 0.000001 [MUL_MAT] NMSE = 0.000005 [ADD] NMSE = 0.000005 [ADD] NMSE = 0.000000 [NORM] NMSE = 0.000002 [MUL] NMSE = 0.000002 [ADD] NMSE = 0.000001 [MUL_MAT] NMSE = 0.000002 [ADD] NMSE = 0.000002 [CPY] NMSE = 0.000002 [MUL_MAT] NMSE = 0.000001 [CPY] NMSE = 0.000001 [MUL_MAT] NMSE = 0.000001 [ADD] NMSE = 0.000001 [CPY] NMSE = 0.000001 [MUL_MAT] NMSE = 0.000001 [SCALE] NMSE = 0.000001 [SOFT_MAX] NMSE = 0.000005 [MUL_MAT] NMSE = 0.000001 [CPY] NMSE = 0.000001 [MUL_MAT] NMSE = 0.000002
 [ADD] NMSE = 0.000001 [ADD] NMSE = 0.000000 [NORM] NMSE = 0.000002 [MUL] NMSE = 0.000003 [ADD] NMSE = 0.000002 [MUL_MAT] NMSE = 0.000000 [ADD] NMSE = 0.000000 [GELU] NMSE = 0.000001 [MUL_MAT] NMSE = 0.000006 [ADD] NMSE = 0.000006 [ADD] NMSE = 0.000000 [NORM] NMSE = 0.000003 [MUL] NMSE = 0.000002 [ADD] NMSE = 0.000002 [MUL_MAT] NMSE = 0.000003 [ADD] NMSE = 0.000003 [CPY] NMSE = 0.000003 [MUL_MAT] NMSE = 0.000001 [CPY] NMSE = 0.000001 [MUL_MAT] NMSE = 0.000001 [ADD] NMSE = 0.000001 [CPY] NMSE = 0.000001 [MUL_MAT] NMSE = 0.000001 [SCALE] NMSE = 0.000001 [SOFT_MAX] NMSE = 0.000023 [MUL_MAT] NMSE = 0.000003 [CPY] NMSE = 0.000003 [MUL_MAT] NMSE = 0.000004 [ADD] NMSE = 0.000004 [ADD] NMSE = 0.000001 [NORM] NMSE = 0.000003 [MUL] NMSE = 0.000003 [ADD] NMSE = 0.000003 [MUL_MAT] NMSE = 0.000000 [ADD] NMSE = 0.000000 [GELU] NMSE = 0.000001
 [MUL_MAT] NMSE = 0.000008 [ADD] NMSE = 0.000008 [ADD] NMSE = 0.000001 [NORM] NMSE = 0.000004 [MUL] NMSE = 0.000003 [ADD] NMSE = 0.000003 [MUL_MAT] NMSE = 0.000003 [ADD] NMSE = 0.000003 [CPY] NMSE = 0.000003 [MUL_MAT] NMSE = 0.000002 [CPY] NMSE = 0.000002 [MUL_MAT] NMSE = 0.000001 [ADD] NMSE = 0.000001 [CPY] NMSE = 0.000001 [MUL_MAT] NMSE = 0.000001 [SCALE] NMSE = 0.000001 [SOFT_MAX] NMSE = 0.000010 [MUL_MAT] NMSE = 0.000002 [CPY] NMSE = 0.000002 [MUL_MAT] NMSE = 0.000003 [ADD] NMSE = 0.000003 [ADD] NMSE = 0.000001 [NORM] NMSE = 0.000004 [MUL] NMSE = 0.000004 [ADD] NMSE = 0.000003 [MUL_MAT] NMSE = 0.000000 [ADD] NMSE = 0.000000 [GELU] NMSE = 0.000001 [MUL_MAT] NMSE = 0.000009 [ADD] NMSE = 0.000009 [ADD] NMSE = 0.000001 [NORM] NMSE = 0.000004 [MUL] NMSE = 0.000003 [ADD] NMSE = 0.000003 [MUL_MAT] NMSE = 0.000003 [ADD] NMSE = 0.000003 [CPY] NMSE = 0.000003 [MUL_MAT] NMSE = 0.000001 [CPY] NMSE = 0.000001 [MUL_MAT] NMSE = 0.000001 [ADD] NMSE = 0.000001 [CPY] NMSE = 0.000001 [MUL_MAT] NMSE = 0.000001 [SCALE] NMSE = 0.000001 [SOFT_MAX] NMSE = 0.000018 [MUL_MAT] NMSE = 0.000002 [CPY] NMSE = 0.000002 [MUL_MAT] NMSE = 0.000002 [ADD] NMSE = 0.000002 [ADD] NMSE = 0.000001 [NORM] NMSE = 0.000004 [MUL] NMSE = 0.000005 [ADD] NMSE = 0.000004 [MUL_MAT] NMSE = 0.000000 [ADD] NMSE = 0.000000 [GELU] NMSE = 0.000002 [MUL_MAT] NMSE = 0.000010 [ADD] NMSE = 0.000010 [ADD] NMSE = 0.000001 [NORM] NMSE = 0.000005 [MUL] NMSE = 0.000004 [ADD] NMSE = 0.000004 [MUL_MAT] NMSE = 0.000003 [ADD] NMSE = 0.000003 [CPY] NMSE = 0.000003 [MUL_MAT] NMSE = 0.000002 [CPY] NMSE = 0.000002 [MUL_MAT] NMSE = 0.000001 [ADD] NMSE = 0.000001 [CPY] NMSE = 0.000001 [MUL_MAT] NMSE = 0.000002 [SCALE] NMSE = 0.000002 [SOFT_MAX] NMSE = 0.000008 [MUL_MAT] NMSE = 0.000002 [CPY] NMSE = 0.000002 [MUL_MAT] NMSE = 0.000003 [ADD] NMSE = 0.000003 [ADD] NMSE = 0.000001 [NORM] NMSE = 0.000005 [MUL] NMSE = 0.000006 [ADD] NMSE = 0.000005 [MUL_MAT] NMSE = 0.000000 [ADD] NMSE = 0.000000 [GELU] NMSE = 0.000002 [MUL_MAT] NMSE = 0.000011 [ADD] NMSE = 0.000011 [ADD] NMSE = 0.000001 [NORM] NMSE = 0.000006 [MUL] NMSE = 0.000004 [ADD] NMSE = 0.000004 [MUL_MAT] NMSE = 0.000004 [ADD] NMSE = 0.000004 [CPY] NMSE = 0.000004 [MUL_MAT] NMSE = 0.000002 [CPY] NMSE = 0.000002 [MUL_MAT] NMSE = 0.000001 [ADD] NMSE = 0.000001 [CPY] NMSE = 0.000001 [MUL_MAT] NMSE = 0.000001 [SCALE] NMSE = 0.000001 [SOFT_MAX] NMSE = 0.000010 [MUL_MAT] NMSE = 0.000002 [CPY] NMSE = 0.000002 [MUL_MAT] NMSE = 0.000003 [ADD] NMSE = 0.000003 [ADD] NMSE = 0.000001 [NORM] NMSE = 0.000006 [MUL] NMSE = 0.000007 [ADD] NMSE = 0.000005 [MUL_MAT] NMSE = 0.000000 [ADD] NMSE = 0.000000 [GELU] NMSE = 0.000003 [MUL_MAT] NMSE = 0.000012 [ADD] NMSE = 0.000013 [ADD] NMSE = 0.000001 [NORM] NMSE = 0.000007 [MUL] NMSE = 0.000005 [ADD] NMSE = 0.000005 [MUL_MAT] NMSE = 0.000005 [ADD] NMSE = 0.000005 [CPY] NMSE = 0.000005 [MUL_MAT] NMSE = 0.000002 [CPY] NMSE = 0.000002 [MUL_MAT] NMSE = 0.000001 [ADD] NMSE = 0.000001 [CPY] NMSE = 0.000001 [MUL_MAT] NMSE = 0.000002 [SCALE] NMSE = 0.000002 [SOFT_MAX] NMSE = 0.000012 [MUL_MAT] NMSE = 0.000003 [CPY] NMSE = 0.000003 [MUL_MAT] NMSE = 0.000004 [ADD] NMSE = 0.000004 [ADD] NMSE = 0.000001 [NORM] NMSE = 0.000007 [MUL] NMSE = 0.000008 [ADD] NMSE = 0.000006 [MUL_MAT] NMSE = 0.000000 [ADD] NMSE = 0.000000 [GELU] NMSE = 0.000003 [MUL_MAT] NMSE = 0.000013 [ADD] NMSE = 0.000014 [ADD] NMSE = 0.000001
 [NORM] NMSE = 0.000008 [MUL] NMSE = 0.000006 [ADD] NMSE = 0.000006 [MUL_MAT] NMSE = 0.000006 [ADD] NMSE = 0.000006 [CPY] NMSE = 0.000006 [MUL_MAT] NMSE = 0.000002 [CPY] NMSE = 0.000002 [MUL_MAT] NMSE = 0.000002 [ADD] NMSE = 0.000002 [CPY] NMSE = 0.000002 [MUL_MAT] NMSE = 0.000002 [SCALE] NMSE = 0.000002 [SOFT_MAX] NMSE = 0.000014 [MUL_MAT] NMSE = 0.000003 [CPY] NMSE = 0.000003 [MUL_MAT] NMSE = 0.000005 [ADD] NMSE = 0.000005 [ADD] NMSE = 0.000001 [NORM] NMSE = 0.000008 [MUL] NMSE = 0.000009 [ADD] NMSE = 0.000007 [MUL_MAT] NMSE = 0.000000 [ADD] NMSE = 0.000000 [GELU] NMSE = 0.000003 [MUL_MAT] NMSE = 0.000014 [ADD] NMSE = 0.000015 [ADD] NMSE = 0.000001 [NORM] NMSE = 0.000008 [MUL] NMSE = 0.000006 [ADD] NMSE = 0.000006 [MUL_MAT] NMSE = 0.000006 [ADD] NMSE = 0.000006 [CPY] NMSE = 0.000006 [MUL_MAT] NMSE = 0.000002 [CPY] NMSE = 0.000002 [MUL_MAT] NMSE = 0.000002 [ADD] NMSE = 0.000002 [CPY] NMSE = 0.000002 [MUL_MAT] NMSE = 0.000002 [SCALE] NMSE = 0.000002 [SOFT_MAX] NMSE = 0.000013 [MUL_MAT] NMSE = 0.000003 [CPY] NMSE = 0.000003 [MUL_MAT] NMSE = 0.000004 [ADD] NMSE = 0.000004 [ADD] NMSE = 0.000001 [NORM] NMSE = 0.000008 [MUL] NMSE = 0.000009 [ADD] NMSE = 0.000007 [MUL_MAT] NMSE = 0.000001 [ADD] NMSE = 0.000001 [GELU] NMSE = 0.000003 [MUL_MAT] NMSE = 0.000018 [ADD] NMSE = 0.000018 [ADD] NMSE = 0.000001 [NORM] NMSE = 0.000009 [MUL] NMSE = 0.000009 [ADD] NMSE = 0.000009 [MUL_MAT] NMSE = 0.000006 [ADD] NMSE = 0.000006 [CPY] NMSE = 0.000006 [MUL_MAT] NMSE = 0.000003 [CPY] NMSE = 0.000003 [MUL_MAT] NMSE = 0.000002 [ADD] NMSE = 0.000001 [CPY] NMSE = 0.000001 [MUL_MAT] NMSE = 0.000003 [SCALE] NMSE = 0.000003 [SOFT_MAX] NMSE = 0.000006 [MUL_MAT] NMSE = 0.000002 [CPY] NMSE = 0.000002 [MUL_MAT] NMSE = 0.000003 [ADD] NMSE = 0.000003 [ADD] NMSE = 0.000001 [NORM] NMSE = 0.000009 [MUL] NMSE = 0.000010 [ADD] NMSE = 0.000007 [MUL_MAT] NMSE = 0.000001 [ADD] NMSE = 0.000001 [GELU] NMSE = 0.000004 [MUL_MAT] NMSE = 0.000017 [ADD] NMSE = 0.000017 [ADD] NMSE = 0.000001 [NORM] NMSE = 0.000011 [MUL] NMSE = 0.000008 [ADD] NMSE = 0.000008 [MUL_MAT] NMSE = 0.000008 [ADD] NMSE = 0.000008 [CPY] NMSE = 0.000008 [MUL_MAT] NMSE = 0.000003 [CPY] NMSE = 0.000003 [MUL_MAT] NMSE = 0.000003 [ADD] NMSE = 0.000003 [CPY] NMSE = 0.000003 [MUL_MAT] NMSE = 0.000003 [SCALE] NMSE = 0.000003 [SOFT_MAX] NMSE = 0.000017 [MUL_MAT] NMSE = 0.000004 [CPY] NMSE = 0.000004 [MUL_MAT] NMSE = 0.000009 [ADD] NMSE = 0.000009 [ADD] NMSE = 0.000001 [NORM] NMSE = 0.000010 [MUL] NMSE = 0.000012 [ADD] NMSE = 0.000009 [MUL_MAT] NMSE = 0.000001 [ADD] NMSE = 0.000001 [GELU] NMSE = 0.000004 [MUL_MAT] NMSE = 0.000020 [ADD] NMSE = 0.000021 [ADD] NMSE = 0.000001 [NORM] NMSE = 0.000011 [MUL] NMSE = 0.000008 [ADD] NMSE = 0.000008 [MUL_MAT] NMSE = 0.000008 [ADD] NMSE = 0.000008 [CPY] NMSE = 0.000008 [MUL_MAT] NMSE = 0.000003 [CPY] NMSE = 0.000003 [MUL_MAT] NMSE = 0.000003 [ADD] NMSE = 0.000002 [CPY] NMSE = 0.000002 [MUL_MAT] NMSE = 0.000003 [SCALE] NMSE = 0.000003 [SOFT_MAX] NMSE = 0.000019 [MUL_MAT] NMSE = 0.000004 [CPY] NMSE = 0.000004 [MUL_MAT] NMSE = 0.000006 [ADD] NMSE = 0.000005 [ADD] NMSE = 0.000001 [NORM] NMSE = 0.000011 [MUL] NMSE = 0.000012 [ADD] NMSE = 0.000009 [MUL_MAT] NMSE = 0.000001 [ADD] NMSE = 0.000001 [GELU] NMSE = 0.000004 [MUL_MAT] NMSE = 0.000022 [ADD] NMSE = 0.000022 [ADD] NMSE = 0.000001 [NORM] NMSE = 0.000012 [MUL] NMSE = 0.000010 [ADD] NMSE = 0.000010 [MUL_MAT] NMSE = 0.000009 [ADD] NMSE = 0.000009 [CPY] NMSE = 0.000009 [MUL_MAT] NMSE = 0.000004 [CPY] NMSE = 0.000004 [MUL_MAT] NMSE = 0.000003 [ADD] NMSE = 0.000003 [CPY] NMSE = 0.000003 
[MUL_MAT] NMSE = 0.000005 [SCALE] NMSE = 0.000005 [SOFT_MAX] NMSE = 0.000016 [MUL_MAT] NMSE = 0.000004 [CPY] NMSE = 0.000004 [MUL_MAT] NMSE = 0.000009 [ADD] NMSE = 0.000008 [ADD] NMSE = 0.000001 [NORM] NMSE = 0.000012 [MUL] NMSE = 0.000013 [ADD] NMSE = 0.000010 [MUL_MAT] NMSE = 0.000001 [ADD] NMSE = 0.000001 [GELU] NMSE = 0.000005 [MUL_MAT] NMSE = 0.000023 [ADD] NMSE = 0.000024 [ADD] NMSE = 0.000001 [NORM] NMSE = 0.000013 [MUL] NMSE = 0.000010 [ADD] NMSE = 0.000010 [MUL_MAT] NMSE = 0.000008 [ADD] NMSE = 0.000008 [CPY] NMSE = 0.000008 [MUL_MAT] NMSE = 0.000004 [CPY] NMSE = 0.000004 [MUL_MAT] NMSE = 0.000003 [ADD] NMSE = 0.000003 [CPY] NMSE = 0.000003 [MUL_MAT] NMSE = 0.000005 [SCALE] NMSE = 0.000005 [SOFT_MAX] NMSE = 0.000015 [MUL_MAT] NMSE = 0.000005 [CPY] NMSE = 0.000005 [MUL_MAT] NMSE = 0.000007 [ADD] NMSE = 0.000007 [ADD] NMSE = 0.000001 [NORM] NMSE = 0.000013 [MUL] NMSE = 0.000015 [ADD] NMSE = 0.000011 [MUL_MAT] NMSE = 0.000001 [ADD] NMSE = 0.000001 [GELU] NMSE = 0.000006 [MUL_MAT] NMSE = 0.000026 [ADD] NMSE = 0.000027 [ADD] NMSE = 0.000001 [NORM] NMSE = 0.000014 [MUL] NMSE = 0.000010 [ADD] NMSE = 0.000010 [MUL_MAT] NMSE = 0.000011 [ADD] NMSE = 0.000011 [CPY] NMSE = 0.000011 [MUL_MAT] NMSE = 0.000004 [CPY] NMSE = 0.000004 [MUL_MAT] NMSE = 0.000003 [ADD] NMSE = 0.000003 [CPY] NMSE = 0.000003 [MUL_MAT] NMSE = 0.000004 [SCALE] NMSE = 0.000004 [SOFT_MAX] NMSE = 0.000020 [MUL_MAT] NMSE = 0.000005 [CPY] NMSE = 0.000005 [MUL_MAT] NMSE = 0.000010 [ADD] NMSE = 0.000009 [ADD] NMSE = 0.000001 [NORM] NMSE = 0.000013 [MUL] NMSE = 0.000015 [ADD] NMSE = 0.000011 [MUL_MAT] NMSE = 0.000001 [ADD] NMSE = 0.000001 [GELU] NMSE = 0.000006 [MUL_MAT] NMSE = 0.000031 [ADD] NMSE = 0.000032 [ADD] NMSE = 0.000001 [NORM] NMSE = 0.000014 [MUL] NMSE = 0.000013 [ADD] NMSE = 0.000013 [MUL_MAT] NMSE = 0.000012 [ADD] NMSE = 0.000012 [CPY] NMSE = 0.000012 [MUL_MAT] NMSE = 0.000004 [CPY] NMSE = 0.000004 [MUL_MAT] NMSE = 0.000003 [ADD] NMSE = 0.000003 [CPY] NMSE = 0.000003 [MUL_MAT] NMSE = 0.000004 [SCALE] NMSE = 0.000004 [SOFT_MAX] NMSE = 0.000013 [MUL_MAT] NMSE = 0.000005 [CPY] NMSE = 0.000005 [MUL_MAT] NMSE = 0.000008 [ADD] NMSE = 0.000008 [ADD] NMSE = 0.000001 [NORM] NMSE = 0.000014 [MUL] NMSE = 0.000016 [ADD] NMSE = 0.000012 [MUL_MAT] NMSE = 0.000001 [ADD] NMSE = 0.000001 [GELU] NMSE = 0.000007 [MUL_MAT] NMSE = 0.000034 [ADD] NMSE = 0.000035 [ADD] NMSE = 0.000001 [NORM] NMSE = 0.000015 [MUL] NMSE = 0.000013 [ADD] NMSE = 0.000013 [MUL_MAT] NMSE = 0.000014 [ADD] NMSE = 0.000014 [CPY] NMSE = 0.000014 [MUL_MAT] NMSE = 0.000005 [CPY] NMSE = 0.000005 [MUL_MAT] NMSE = 0.000004 [ADD] NMSE = 0.000004 [CPY] NMSE = 0.000004 [MUL_MAT] NMSE = 0.000004 [SCALE] NMSE = 0.000004 [SOFT_MAX] NMSE = 0.000023 [MUL_MAT] NMSE = 0.000005 [CPY] NMSE = 0.000005 [MUL_MAT] NMSE = 0.000009 [ADD] NMSE = 0.000008 [ADD] NMSE = 0.000001 [NORM] NMSE = 0.000014 [MUL] NMSE = 0.000016 [ADD] NMSE = 0.000012 [MUL_MAT] NMSE = 0.000001 [ADD] NMSE = 0.000001 [GELU] NMSE = 0.000008 [MUL_MAT] NMSE = 0.000034 [ADD] NMSE = 0.000035 [ADD] NMSE = 0.000001 [NORM] NMSE = 0.000015 [MUL] NMSE = 0.000014 [ADD] NMSE = 0.000014 [MUL_MAT] NMSE = 0.000014 [ADD] NMSE = 0.000014 [CPY] NMSE = 0.000014 [MUL_MAT] NMSE = 0.000005 [CPY] NMSE = 0.000005 [MUL_MAT] NMSE = 0.000005 [ADD] NMSE = 0.000005 [CPY] NMSE = 0.000005 [MUL_MAT] NMSE = 0.000005 [SCALE] NMSE = 0.000005 [SOFT_MAX] NMSE = 0.000023 [MUL_MAT] NMSE = 0.000006 [CPY] NMSE = 0.000006
 [MUL_MAT] NMSE = 0.000011 [ADD] NMSE = 0.000010 [ADD] NMSE = 0.000001 [NORM] NMSE = 0.000015 [MUL] NMSE = 0.000017 [ADD] NMSE = 0.000013 [MUL_MAT] NMSE = 0.000001 [ADD] NMSE = 0.000001 [GELU] NMSE = 0.000008 [MUL_MAT] NMSE = 0.000038 [ADD] NMSE = 0.000039 [ADD] NMSE = 0.000001 [NORM] NMSE = 0.000016 [MUL] NMSE = 0.000013 [ADD] NMSE = 0.000014 [MUL_MAT] NMSE = 0.000013 [ADD] NMSE = 0.000013 [CPY] NMSE = 0.000013 [MUL_MAT] NMSE = 0.000006 [CPY] NMSE = 0.000006 [MUL_MAT] NMSE = 0.000005 [ADD] NMSE = 0.000005 [CPY] NMSE = 0.000005 [MUL_MAT] NMSE = 0.000004 [SCALE] NMSE = 0.000004 [SOFT_MAX] NMSE = 0.000022 [MUL_MAT] NMSE = 0.000005 [CPY] NMSE = 0.000005 [MUL_MAT] NMSE = 0.000014 [ADD] NMSE = 0.000013 [ADD] NMSE = 0.000001 [NORM] NMSE = 0.000015 [MUL] NMSE = 0.000017 [ADD] NMSE = 0.000013 [MUL_MAT] NMSE = 0.000002 [ADD] NMSE = 0.000002 [GELU] NMSE = 0.000008 [MUL_MAT] NMSE = 0.000041 [ADD] NMSE = 0.000042 [ADD] NMSE = 0.000001 [NORM] NMSE = 0.000016 [MUL] NMSE = 0.000015 [ADD] NMSE = 0.000015 [MUL_MAT] NMSE = 0.000015 [ADD] NMSE = 0.000015 [CPY] NMSE = 0.000015 [MUL_MAT] NMSE = 0.000006 [CPY] NMSE = 0.000006 [MUL_MAT] NMSE = 0.000006 [ADD] NMSE = 0.000005 [CPY] NMSE = 0.000005 [MUL_MAT] NMSE = 0.000006 [SCALE] NMSE = 0.000006 [SOFT_MAX] NMSE = 0.000025 [MUL_MAT] NMSE = 0.000007 [CPY] NMSE = 0.000007 [MUL_MAT] NMSE = 0.000014 [ADD] NMSE = 0.000013 [ADD] NMSE = 0.000001 [NORM] NMSE = 0.000016 [MUL] NMSE = 0.000017 [ADD] NMSE = 0.000014 [MUL_MAT] NMSE = 0.000002 [ADD] NMSE = 0.000002 [GELU] NMSE = 0.000010 [MUL_MAT] NMSE = 0.000042 [ADD] NMSE = 0.000044 [ADD] NMSE = 0.000002 [NORM] NMSE = 0.000018 [MUL] NMSE = 0.000016 [ADD] NMSE = 0.000016 [MUL_MAT] NMSE = 0.000016 [ADD] NMSE = 0.000016 [CPY] NMSE = 0.000016 [MUL_MAT] NMSE = 0.000007 [CPY] NMSE = 0.000007 [MUL_MAT] NMSE = 0.000007 [ADD] NMSE = 0.000007 [CPY] NMSE = 0.000007 [MUL_MAT] NMSE = 0.000008 [SCALE] NMSE = 0.000008 [SOFT_MAX] NMSE = 0.000044 [MUL_MAT] NMSE = 0.000008 [CPY] NMSE = 0.000008 [MUL_MAT] NMSE = 0.000019 [ADD] NMSE = 0.000018 [ADD] NMSE = 0.000002 [NORM] NMSE = 0.000019 [MUL] NMSE = 0.000019 [ADD] NMSE = 0.000016 [MUL_MAT] NMSE = 0.000002 [ADD] NMSE = 0.000002 [GELU] NMSE = 0.000012 [MUL_MAT] NMSE = 0.000049 [ADD] NMSE = 0.000050 [ADD] NMSE = 0.000002 [NORM] NMSE = 0.000023 [MUL] NMSE = 0.000019 [ADD] NMSE = 0.000019 [MUL_MAT] NMSE = 0.000020 [ADD] NMSE = 0.000020 [CPY] NMSE = 0.000020 [MUL_MAT] NMSE = 0.000011 [CPY] NMSE = 0.000011 [MUL_MAT] NMSE = 0.000010 [ADD] NMSE = 0.000009 [CPY] NMSE = 0.000009 [MUL_MAT] NMSE = 0.000014 [SCALE] NMSE = 0.000014 [SOFT_MAX] NMSE = 0.000048 [MUL_MAT] NMSE = 0.000015 [CPY] NMSE = 0.000015 [MUL_MAT] NMSE = 0.000025 [ADD] NMSE = 0.000024 [ADD] NMSE = 0.000002 [NORM] NMSE = 0.000024 [MUL] NMSE = 0.000023 [ADD] NMSE = 0.000019 [MUL_MAT] NMSE = 0.000003 [ADD] NMSE = 0.000003 [GELU] NMSE = 0.000017 [MUL_MAT] NMSE = 0.000059 [ADD] NMSE = 0.000060 [ADD] NMSE = 0.000003 [NORM] NMSE = 0.000038 [MUL] NMSE = 0.000026 [ADD] NMSE = 0.000027 [MUL_MAT] NMSE = 0.000031 [ADD] NMSE = 0.000031 [CPY] NMSE = 0.000031 [MUL_MAT] NMSE = 0.000021 [CPY] NMSE = 0.000021 [MUL_MAT] NMSE = 0.000019 [ADD] NMSE = 0.000017 [CPY] NMSE = 0.000017 [MUL_MAT] NMSE = 0.000037 [SCALE] NMSE = 0.000037 [SOFT_MAX] NMSE = 0.000053 [MUL_MAT] NMSE = 0.000029 [CPY] NMSE = 0.000029 [MUL_MAT] NMSE = 0.000042 [ADD] NMSE = 0.000040 [ADD] NMSE = 0.000004
 [NORM] NMSE = 0.000042 [MUL] NMSE = 0.000034 [ADD] NMSE = 0.000029 [MUL_MAT] NMSE = 0.000004 [ADD] NMSE = 0.000004 [GELU] NMSE = 0.000033 [MUL_MAT] NMSE = 0.000142 [ADD] NMSE = 0.000144 [ADD] NMSE = 0.000007 [NORM] NMSE = 0.000079 [MUL] NMSE = 0.000047 [ADD] NMSE = 0.000047 [MUL_MAT] NMSE = 0.000045 [ADD] NMSE = 0.000045 [CPY] NMSE = 0.000045 [MUL_MAT] NMSE = 0.000042 [CPY] NMSE = 0.000042 [MUL_MAT] NMSE = 0.000037 [ADD] NMSE = 0.000035 [CPY] NMSE = 0.000035 [MUL_MAT] NMSE = 0.000078 [SCALE] NMSE = 0.000078 [SOFT_MAX] NMSE = 0.000097 [MUL_MAT] NMSE = 0.000049 [CPY] NMSE = 0.000049 [MUL_MAT] NMSE = 0.000082 [ADD] NMSE = 0.000080 [ADD] NMSE = 0.000008 [NORM] NMSE = 0.000075 [MUL] NMSE = 0.000056 [ADD] NMSE = 0.000048 [MUL_MAT] NMSE = 0.000006 [ADD] NMSE = 0.000006 [GELU] NMSE = 0.000090 [MUL_MAT] NMSE = 0.000488 [ADD] NMSE = 0.000495 [ADD] NMSE = 0.000025 [NORM] NMSE = 0.000067 [MUL] NMSE = 0.000051 [ADD] NMSE = 0.000052 [MUL_MAT] NMSE = 0.000064 [ADD] NMSE = 0.000064 [CPY] NMSE = 0.000064 [MUL_MAT] NMSE = 0.000039 [CPY] NMSE = 0.000039 [MUL_MAT] NMSE = 0.000034 [ADD] NMSE = 0.000032 [CPY] NMSE = 0.000032 [MUL_MAT] NMSE = 0.000061 [SCALE] NMSE = 0.000061 [SOFT_MAX] NMSE = 0.000104 [MUL_MAT] NMSE = 0.000054 [CPY] NMSE = 0.000054 [MUL_MAT] NMSE = 0.000080 [ADD] NMSE = 0.000077 [ADD] NMSE = 0.000025 [NORM] NMSE = 0.000068 [MUL] NMSE = 0.000060 [ADD] NMSE = 0.000052 [MUL_MAT] NMSE = 0.000007 [ADD] NMSE = 0.000007 [GELU] NMSE = 0.000127 [MUL_MAT] NMSE = 0.000938 [ADD] NMSE = 0.000940 [ADD] NMSE = 0.000063 [NORM] NMSE = 0.000343 [MUL] NMSE = 0.000152 [ADD] NMSE = 0.000154 [MUL_MAT] NMSE = 0.000130 [ADD] NMSE = 0.000130 [CPY] NMSE = 0.000130 [MUL_MAT] NMSE = 0.000169 [CPY] NMSE = 0.000169 [MUL_MAT] NMSE = 0.000119 [ADD] NMSE = 0.000114 [CPY] NMSE = 0.000114 [MUL_MAT] NMSE = 0.000312 [SCALE] NMSE = 0.000312 [SOFT_MAX] NMSE = 0.000324 [MUL_MAT] NMSE = 0.000217 [CPY] NMSE = 0.000217 [MUL_MAT] NMSE = 0.000305 [ADD] NMSE = 0.000293 [ADD] NMSE = 0.000064 [NORM] NMSE = 0.000380 [MUL] NMSE = 0.000357 [ADD] NMSE = 0.000316 [MUL_MAT] NMSE = 0.000055 [ADD] NMSE = 0.000054 [GELU] NMSE = 0.002536 [MUL_MAT] NMSE = 0.006002 [ADD] NMSE = 0.006002
 [ADD] NMSE = 0.002475 [NORM] NMSE = 0.000820 [MUL] NMSE = 0.000437 [ADD] NMSE = 0.000440 [MUL_MAT] NMSE = 0.000277 [ADD] NMSE = 0.000278 [CPY] NMSE = 0.000278 [MUL_MAT] NMSE = 0.000620 [CPY] NMSE = 0.000620 [MUL_MAT] NMSE = 0.000315 [ADD] NMSE = 0.000303 [CPY] NMSE = 0.000303 [MUL_MAT] NMSE = 0.001110 [SCALE] NMSE = 0.001110 [SOFT_MAX] NMSE = 0.000941 [MUL_MAT] NMSE = 0.000439 [CPY] NMSE = 0.000439 [MUL_MAT] NMSE = 0.000577 [ADD] NMSE = 0.000558 [ADD] NMSE = 0.002471 [NORM] NMSE = 0.000828 [MUL] NMSE = 0.001247 [ADD] NMSE = 0.001100 [MUL_MAT] NMSE = 0.000411 [ADD] NMSE = 0.000406 [GELU] NMSE = 0.008901 [MUL_MAT] NMSE = 0.010630 [ADD] NMSE = 0.010632 [ADD] NMSE = 0.004732 [NORM] NMSE = 0.001196 [MUL] NMSE = 0.000786 [ADD] NMSE = 0.000783 [MUL_MAT] NMSE = 0.000519 [ADD] NMSE = 0.000520 [CPY] NMSE = 0.000520 [MUL_MAT] NMSE = 0.001015 [CPY] NMSE = 0.001015 [MUL_MAT] NMSE = 0.000578 [ADD] NMSE = 0.000557 [CPY] NMSE = 0.000557 [MUL_MAT] NMSE = 0.001883 [SCALE] NMSE = 0.001883 [SOFT_MAX] NMSE = 0.001561 [MUL_MAT] NMSE = 0.000627 [CPY] NMSE = 0.000627 [MUL_MAT] NMSE = 0.000657 [ADD] NMSE = 0.000639 [ADD] NMSE = 0.004732 [NORM] NMSE = 0.001201 [MUL] NMSE = 0.001921 [ADD] NMSE = 0.001670 [MUL_MAT] NMSE = 0.000502 [ADD] NMSE = 0.000494 [GELU] NMSE = 0.009599 [MUL_MAT] NMSE = 0.012234 [ADD] NMSE = 0.012243 [ADD] NMSE = 0.006867 [NORM] NMSE = 0.001300 [MUL] NMSE = 0.000892 [ADD] NMSE = 0.000872 FAIL

Large-v2 (whisper_build_graph_cross)

OK

My modification:

I copied a section of code from test-backend-ops, pasted it into the whisper.cpp encoder, and then compiled and ran it.

static std::vector<float> tensor_to_float(const ggml_tensor * t) {
    std::vector<float> tv;
    tv.reserve(ggml_nelements(t));

    std::vector<uint8_t> buf(ggml_nbytes(t));
    ggml_backend_tensor_get(t, buf.data(), 0, ggml_nbytes(t));

    ggml_type_traits_t tt = ggml_internal_get_type_traits(t->type);
    size_t bs = ggml_blck_size(t->type);
    std::vector<float> vq(ggml_blck_size(t->type));
    bool quantized = ggml_is_quantized(t->type);

    // access elements by index to avoid gaps in views
    for (int64_t i3 = 0; i3 < t->ne[3]; i3++) {
        for (int64_t i2 = 0; i2 < t->ne[2]; i2++) {
            for (int64_t i1 = 0; i1 < t->ne[1]; i1++) {
                for (int64_t i0 = 0; i0 < t->ne[0]; i0 += bs) {
                    size_t i = i3*t->nb[3] + i2*t->nb[2] + i1*t->nb[1] + i0/bs*t->nb[0];
                    if (t->type == GGML_TYPE_F16) {
                        tv.push_back(ggml_fp16_to_fp32(*(ggml_fp16_t*)&buf[i]));
                    } else if (t->type == GGML_TYPE_F32) {
                        tv.push_back(*(float *) &buf[i]);
                    } else if (t->type == GGML_TYPE_I32) {
                        tv.push_back((float)*(int32_t *) &buf[i]);
                    } else if (quantized) {
                        tt.to_float(&buf[i], vq.data(), bs);
                        tv.insert(tv.end(), vq.begin(), vq.end());
                    } else {
                        GGML_ASSERT(false);
                    }
                }
            }
        }
    }

    return tv;
}

static bool isinf_or_max(float f) {
    return std::isinf(f) || f == FLT_MAX || f == -FLT_MAX;
}

static double nmse(const float * a, const float * b, size_t n) {
    double mse_a_b = 0.0;
    double mse_a_0 = 0.0;

    for (size_t i = 0; i < n; i++) {
        float a_i = a[i];
        float b_i = b[i];

        mse_a_b += (a_i - b_i) * (a_i - b_i);
        mse_a_0 += a_i * a_i;
    }

    return mse_a_b / mse_a_0;
}

static bool whisper_encode_internal(
        whisper_context & wctx,
          whisper_state & wstate,
              const int   mel_offset,
              const int   n_threads,
 whisper_abort_callback   abort_callback,
                   void * abort_callback_data) {
    const int64_t t_start_us = ggml_time_us();

    struct callback_userdata {
        bool   ok;
        double max_err;
    };

    callback_userdata ud {
            true,
            1e-7,
    };

    auto callback = [](int index, ggml_tensor * t1, ggml_tensor * t2, void * user_data) -> bool {
        callback_userdata * ud = (callback_userdata *) user_data;

        if (t1->op == GGML_OP_NONE) {
            // sentinels must be unchanged
            std::vector<uint8_t> t1_data(ggml_nbytes(t1));
            std::vector<uint8_t> t2_data(ggml_nbytes(t2));
            ggml_backend_tensor_get(t1, t1_data.data(), 0, ggml_nbytes(t1));
            ggml_backend_tensor_get(t2, t2_data.data(), 0, ggml_nbytes(t2));

            if (memcmp(t1_data.data(), t2_data.data(), ggml_nbytes(t1)) != 0) {
                printf("sentinel mismatch: %s ", t1->name);
                ud->ok = false;
                return true;
            }
        }

        std::vector<float> f1 = tensor_to_float(t1);
        std::vector<float> f2 = tensor_to_float(t2);

        for (size_t i = 0; i < f1.size(); i++) {
            // check for nans
            if (std::isnan(f1[i]) || std::isnan(f2[i])) {
                printf("[%s] NaN at index %zu (%f %f) ", ggml_op_desc(t1), i, f1[i], f2[i]);
                ud->ok = false;
                return true;
            }
            // check for infs: both must be inf of the same sign, or both must be finite
            if (isinf_or_max(f1[i]) || isinf_or_max(f2[i])) {
                if (isinf_or_max(f1[i]) && isinf_or_max(f2[i])) {
                    if (std::signbit(f1[i]) != std::signbit(f2[i])) {
                        printf("[%s] inf sign mismatch: %f %f ", ggml_op_desc(t1), f1[i], f2[i]);
                        ud->ok = false;
                        return true;
                    }
                } else {
                    printf("[%s] inf mismatch: %f %f ", ggml_op_desc(t1), f1[i], f2[i]);
                    ud->ok = false;
                    return true;
                }
            }
        }

        double err = nmse(f1.data(), f2.data(), f1.size());
        if (err > ud->max_err) {
            printf("[%s] NMSE = %f ", ggml_op_desc(t1), err);
            //for (int i = 0; i < f1.size(); i++) {
            //    printf("%5d %9.6f %9.6f, diff = %9.6f\n", i, f1[i], f2[i], f1[i] - f2[i]);
            //}
            //printf("\n");
            //exit(1);
            ud->ok = false;
        }
        return true;

        GGML_UNUSED(index);
    };

    ggml_backend_t backend_cpu = ggml_backend_cpu_init();

    // conv
    {
        auto & alloc = wstate.alloc_conv.alloc;

        ggml_allocr_reset(alloc);

        ggml_cgraph * gf = whisper_build_graph_conv(wctx, wstate, mel_offset);

        ggml_allocr_alloc_graph(alloc, gf);

        ud = {true, 1e-7};

        ggml_backend_compare_graph_backend(wstate.backend, backend_cpu, gf, callback, &ud);

        if (ud.ok) {
            printf("\033[1;32mOK\033[0m\n");
        } else {
            printf("\033[1;31mFAIL\033[0m\n");
        }

//        if (!whisper_encode_external(wstate)) {
//            ggml_graph_compute_helper(wstate.backend, gf, n_threads);
//        }
    }

    // encoder
    if (!whisper_encode_external(wstate)) {
        auto & alloc = wstate.alloc_encode.alloc;

        ggml_allocr_reset(alloc);

        ggml_cgraph * gf = whisper_build_graph_encoder(wctx, wstate);

        ggml_allocr_alloc_graph(alloc, gf);

        ud = {true, 1e-7};

        ggml_backend_compare_graph_backend(wstate.backend, backend_cpu, gf, callback, &ud);

        if (ud.ok) {
            printf("\033[1;32mOK\033[0m\n");
        } else {
            printf("\033[1;31mFAIL\033[0m\n");
        }

//        ggml_graph_compute_helper(wstate.backend, gf, n_threads);
    }

    // cross
    {
        auto & alloc = wstate.alloc_cross.alloc;

        ggml_allocr_reset(alloc);

        ggml_cgraph * gf = whisper_build_graph_cross(wctx, wstate);

        ggml_allocr_alloc_graph(alloc, gf);

        ud = {true, 1e-7};

        ggml_backend_compare_graph_backend(wstate.backend, backend_cpu, gf, callback, &ud);

        if (ud.ok) {
            printf("\033[1;32mOK\033[0m\n");
        } else {
            printf("\033[1;31mFAIL\033[0m\n");
        }

        return 0; // early exit: comparison only, the normal encode path below is skipped

//        ggml_graph_compute_helper(wstate.backend, gf, n_threads);
    }

    wstate.t_encode_us += ggml_time_us() - t_start_us;
    wstate.n_encode++;

    return !(abort_callback && abort_callback(abort_callback_data));
}

@bobqianic changed the title from "Encoder Convolution is broken when CUBLAS is ON" to "Encoder is broken when CUBLAS is ON" Dec 26, 2023
@ggerganov
Owner

Which NVIDIA card are you using? This seems like an issue that occurs only on very old hardware (CC <= 6). It's hard to fix because I don't have the means to reproduce it.

@bobqianic
Collaborator Author

this seems like an issue that occurs only on very old hardware

I’m currently using an RTX 3060, which is still fairly recent.

@bobqianic
Collaborator Author

By the way, do you have any idea how to get my test code running properly? As far as I know, GGML_OP_CONT is just meant to copy data. In my test it's the first operation in the second graph and shows an NMSE of 1, indicating that one of the backends received a zero tensor, which isn't right. I've also tested with the -ng flag as a sanity check, since that compares the CPU backend against itself. Interestingly, I get consistent results for the first graph, but not for the second or third.

@slaren
Collaborator

slaren commented Dec 26, 2023

The test code looks good. It's not impossible that the issue is in the ggml_cont, but to verify that you can add a test case to test-backend-ops with the exact same parameters (copy the tensor types, dimensions and strides in nb).
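For illustration, a rough sketch of what such a test case could look like, modeled on the test_case pattern in tests/test-backend-ops.cpp; the shapes and strides below are placeholders rather than the real encoder values, and the case would then be registered wherever the other test cases are constructed:

// hypothetical test: ggml_cont on a non-contiguous view (placeholder shapes)
struct test_cont_view : public test_case {
    ggml_tensor * build_graph(ggml_context * ctx) override {
        ggml_tensor * src  = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, 64, 64, 4);
        // take a view of half the rows of each plane, keeping the source strides,
        // so the input to ggml_cont is non-contiguous
        ggml_tensor * view = ggml_view_3d(ctx, src, 64, 32, 4, src->nb[1], src->nb[2], 0);
        return ggml_cont(ctx, view);
    }
};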

@bobqianic
Collaborator Author

The test code looks good.

Ah, now I understand why I'm encountering issues when comparing the second and third graphs.

In whisper_build_graph_encoder, if I replace

    struct ggml_tensor * cur = ggml_view_tensor(ctx0, wstate.embd_conv);

with

    struct ggml_tensor * cur = wstate.embd_conv;

and in whisper_build_graph_cross, if I replace

    struct ggml_tensor * cur = ggml_view_tensor(ctx0, wstate.embd_enc);

with

    struct ggml_tensor * cur = wstate.embd_enc;

then it works as expected with ggml_backend_compare_graph_backend, and the -ng (no GPU) flag yields identical results.

@slaren
Collaborator

slaren commented Dec 27, 2023

Ah I see, I didn't understand the issue. I think that views of external tensors were not being copied properly to the other backend. This should fix it:

diff --git a/ggml-backend.c b/ggml-backend.c
index 526ce732..e9cfffbe 100644
--- a/ggml-backend.c
+++ b/ggml-backend.c
@@ -1312,6 +1312,7 @@ static void graph_init_tensor(struct ggml_hash_set hash_set, struct ggml_tensor

     struct ggml_tensor * dst = node_copies[id];
     if (dst->view_src != NULL) {
+        graph_init_tensor(hash_set, node_copies, node_init, src->view_src);
         ggml_backend_view_init(dst->view_src->buffer, dst);
     }
     else {

@bobqianic
Collaborator Author

bobqianic commented Dec 27, 2023

This output reflects the behavior of one specific audio sample; different samples can behave differently. For instance, some audio files yield divergent results with the medium model, while others do so with the small model.

My hypothesis is that there must have been a relatively large error occurring somewhere, and these errors keep accumulating as the computation goes on, leading to problems in the results. This also explains why, with different audio files, only specific models encounter issues. It's because different models have different weights and embeddings, so the errors might cancel each other out during the computation process.
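A standalone toy illustration of this hypothesis (not whisper code): truncating a running value to a 10-bit mantissa after every step mimics fp16 rounding, and the relative error tends to drift upward with depth, much like the per-op NMSE climbing through the encoder layers in the logs above.

#include <cstdio>
#include <cstring>
#include <cstdint>
#include <cmath>

// keep 10 of the 23 float32 mantissa bits, roughly what fp16 storage allows
static float trunc_mantissa(float x) {
    uint32_t b;
    std::memcpy(&b, &x, sizeof b);
    b &= 0xFFFFE000u;
    std::memcpy(&x, &b, sizeof b);
    return x;
}

int main() {
    double exact = 1.0;   // reference recurrence in high precision
    float  noisy = 1.0f;  // same recurrence with fp16-like rounding each step
    for (int layer = 1; layer <= 32; layer++) {
        exact = exact * 1.001 + 0.1;                  // stand-in for one layer's arithmetic
        noisy = trunc_mantissa(noisy * 1.001f + 0.1f);
        printf("layer %2d: rel err = %.3e\n", layer, std::fabs(noisy - exact) / exact);
    }
    return 0;
}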

@slaren
Collaborator

slaren commented Dec 27, 2023

It might be because matrix multiplications are performed in FP16. You can force FP32 by using ggml_mul_mat_set_prec. This should work to modify the entire graph:

for (int i = 0; i < gf->n_nodes; i++) {
    if (gf->nodes[i]->op == GGML_OP_MUL_MAT) ggml_mul_mat_set_prec(gf->nodes[i], GGML_PREC_F32);
}

@bobqianic
Collaborator Author

bobqianic commented Dec 27, 2023

Have I made a mistake here? This seems to have worsened the NMSE.

Before:

[MUL_MAT] NMSE = 0.000001 [ADD] NMSE = 0.000001 [GELU] NMSE = 0.000001 [IM2COL] NMSE = 0.000001 [MUL_MAT] NMSE = 0.000007 [ADD] NMSE = 0.000006 [GELU] NMSE = 0.000010 FAIL

After:

[MUL_MAT] NMSE = 5.193738 [ADD] NMSE = 3.548122 [GELU] NMSE = 5.239518 [IM2COL] NMSE = 5.234486 [MUL_MAT] NMSE = 0.935680 [ADD] NMSE = 0.933797 [GELU] NMSE = 0.925861 FAIL

    // conv
    {
        auto & alloc = wstate.alloc_conv.alloc;

        ggml_allocr_reset(alloc);

        ggml_cgraph * gf = whisper_build_graph_conv(wctx, wstate, mel_offset);

        for (int i = 0; i < gf->n_nodes; i++) {
            if (gf->nodes[i]->op == GGML_OP_MUL_MAT) ggml_mul_mat_set_prec(gf->nodes[i], GGML_PREC_F32);
        }

        ggml_allocr_alloc_graph(alloc, gf);

        ud = {true, 1e-7};

        ggml_backend_compare_graph_backend(wstate.backend, backend_cpu, gf, callback, &ud);

        if (ud.ok) {
            printf("\033[1;32mOK\033[0m\n\n");
        } else {
            printf("\033[1;31mFAIL\033[0m\n\n");
        }

//        if (!whisper_encode_external(wstate)) {
//            ggml_graph_compute_helper(wstate.backend, gf, n_threads);
//        }
    }

@slaren
Collaborator

slaren commented Dec 27, 2023

That's odd, maybe there is a bug with GGML_PREC_F32.

@ggerganov
Owner

@bobqianic Maybe it would be better if you open a PR with the changes that you have made and steps to reproduce.

Also, try the latest sync #1691 that will be merged soon and see if the issue still persists there.

@bobqianic
Collaborator Author

bobqianic commented Dec 27, 2023

That's odd, maybe there is a bug with GGML_PREC_F32.

Even if ggml_mul_mat_set_prec functions properly, it won't change the precision used in GGML_OP_MUL_MAT with the CUDA backend. Given that src0 typically acts as the weight and is in FP16 format, it will automatically perform the matrix multiplication in FP16.

whisper.cpp/ggml-cuda.cu

Lines 7396 to 7397 in 37a709f

if (compute_capability >= CC_VOLTA && (src0->type == GGML_TYPE_F16 || ggml_is_quantized(src0->type)) && ggml_is_contiguous(src0) && row_diff == src0->ne[1] && dst->op_params[0] == GGML_PREC_DEFAULT) {
    // convert src0 and src1 to fp16, multiply as fp16, convert dst to fp32

@bobqianic
Collaborator Author

@bobqianic Maybe it would be better if you open a PR with the changes that you have made and steps to reproduce.

Also, try the latest sync #1691 that will be merged soon and see of the issues still persist there

OK

@slaren
Collaborator

slaren commented Dec 27, 2023

Given that src0 typically acts as the weight and is in FP16 format, it will automatically perform matrix multiplication using FP16 precision.

There is a check for dst->op_params[0] == GGML_PREC_DEFAULT that should prevent that.

@bobqianic
Collaborator Author

bobqianic commented Dec 27, 2023

Given that src0 typically acts as the weight and is in FP16 format, it will automatically perform matrix multiplication using FP16 precision.

There is a check for dst->op_params[0] == GGML_PREC_DEFAULT that should prevent that.

src0->type == GGML_TYPE_F16 ||

If src0->type == GGML_TYPE_F16 evaluates to true, short-circuit evaluation comes into play. Therefore, dst->op_params[0] == GGML_PREC_DEFAULT will never be evaluated at all. So, it won't interfere with the evaluation.

@slaren
Collaborator

slaren commented Dec 27, 2023

All of these need to be true to use FP16 matrix multiplication:

  • compute_capability >= CC_VOLTA
  • (src0->type == GGML_TYPE_F16 || ggml_is_quantized(src0->type))
  • ggml_is_contiguous(src0)
  • row_diff == src0->ne[1]
  • dst->op_params[0] == GGML_PREC_DEFAULT

Note that the || is inside a parenthesis.
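Spelled out with one condition per line (the same expression as the quoted check, only reformatted and bound to an illustrative use_fp16 name):

// the FP16 path is taken only when every line below is true;
// the || merely groups the two alternatives for src0's type
const bool use_fp16 =
    compute_capability >= CC_VOLTA &&
    (src0->type == GGML_TYPE_F16 || ggml_is_quantized(src0->type)) &&
    ggml_is_contiguous(src0) &&
    row_diff == src0->ne[1] &&
    dst->op_params[0] == GGML_PREC_DEFAULT;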

@bobqianic
Collaborator Author

By the way, do you have any idea why using the -ng (no GPU) flag still results in some GPU activity? I've been monitoring it with GPU-Z and noticed a few activities. I'm planning to use Nsight to identify which kernel is being launched.

It's not just me. Several other users have noticed the same thing, which is quite odd. I've been able to replicate this issue on my machine as well.

#1587 (comment)

#1587 (comment)

What?
[screenshot]

As far as I know, the -ng flag is supposed to create a CPU backend, which theoretically means there should be no GPU activity.

whisper.cpp/whisper.cpp

Lines 1056 to 1089 in 37a709f

static ggml_backend_t whisper_backend_init(const whisper_context_params & params) {
    ggml_backend_t backend_gpu = NULL;

    // initialize the backends
#ifdef GGML_USE_CUBLAS
    if (params.use_gpu && ggml_cublas_loaded()) {
        WHISPER_LOG_INFO("%s: using CUDA backend\n", __func__);
        backend_gpu = ggml_backend_cuda_init(0);
        if (!backend_gpu) {
            WHISPER_LOG_ERROR("%s: ggml_backend_cuda_init() failed\n", __func__);
        }
    }
#endif

#ifdef GGML_USE_METAL
    if (params.use_gpu) {
        WHISPER_LOG_INFO("%s: using Metal backend\n", __func__);
        ggml_metal_log_set_callback(whisper_log_callback_default, nullptr);
        backend_gpu = ggml_backend_metal_init();
        if (!backend_gpu) {
            WHISPER_LOG_ERROR("%s: ggml_backend_metal_init() failed\n", __func__);
        } else if (!ggml_backend_metal_supports_family(backend_gpu, 7)) {
            WHISPER_LOG_ERROR("%s: Metal GPU does not support family 7 - falling back to CPU\n", __func__);
            ggml_backend_free(backend_gpu);
            backend_gpu = NULL;
        }
    }
#endif

    if (backend_gpu) {
        return backend_gpu;
    }
    return ggml_backend_cpu_init();
}

[screenshot]

@slaren
Collaborator

slaren commented Dec 27, 2023

The CUDA backend is always used automatically with large matrix multiplications. At the moment, the only way to disable it completely is to build without CUDA.

whisper.cpp/ggml-cuda.cu

Lines 8243 to 8246 in 37a709f

return (src0->type == GGML_TYPE_F32 || src0->type == GGML_TYPE_F16 || ggml_is_quantized(src0->type)) &&
       src1->type == GGML_TYPE_F32 &&
        dst->type == GGML_TYPE_F32 &&
       (ne0 >= 32 && ne1 >= 32 && ne10 >= 32);

@ggerganov
Owner

At the moment, the only way to disable it completely is to build without CUDA.

There is also the option to run with CUDA_VISIBLE_DEVICES=-1:

whisper.cpp/ggml-cuda.cu

Lines 6643 to 6647 in 37a709f

if (cudaGetDeviceCount(&g_device_count) != cudaSuccess) {
    initialized = true;
    g_cublas_loaded = false;
    return;
}
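With no visible device, cudaGetDeviceCount fails, g_cublas_loaded stays false, and everything runs on the CPU. For example (the model and sample paths are the usual whisper.cpp examples; adjust to your setup):

CUDA_VISIBLE_DEVICES=-1 ./main -m models/ggml-base.en.bin -f samples/jfk.wav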

@bobqianic bobqianic linked a pull request Jan 6, 2024 that will close this issue