Encoder is broken when CUBLAS is ON #1688
Regarding the encoder's final output: I compared the CUDA output with the CPU output. This output reflects the behavior of a specific audio sample, and it's important to note that various audio samples can exhibit different behaviors — for instance, some audio files yield distinct results only with specific models. I dumped the embedding with the following code:

```cpp
ggml_tensor * tensor = wctx->state->embd_enc;
std::vector<float> tensor_data(ggml_nelements(tensor));
ggml_backend_tensor_get(tensor, tensor_data.data(), 0, ggml_nbytes(tensor));

std::ofstream outFile("encoder_embedding.json");
outFile << "[";
for (uint64_t i = 0; i < tensor_data.size() - 1; i++) {
    outFile << tensor_data[i] << ", ";
}
outFile << tensor_data[tensor_data.size() - 1] << "]";
outFile.close();
return 0;
```

Tiny: Base: Small: Medium: Large: (per-model comparison attachments not reproduced here)
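To quantify the difference between two such dumps (one from a CPU build, one from a CUDA build), a small standalone checker along these lines can be used — a sketch only, with both file names as placeholders:

```cpp
#include <cstdio>
#include <fstream>
#include <vector>

// Reads a dump in the "[v, v, ..., v]" format written by the snippet above.
static std::vector<float> load_dump(const char * path) {
    std::vector<float> v;
    std::ifstream in(path);
    char c;
    in >> c; // consume '['
    float x;
    while (in >> x) {
        v.push_back(x);
        in >> c; // consume ',' or the closing ']'
    }
    return v;
}

int main() {
    // placeholder file names - adjust to wherever the two dumps were written
    const std::vector<float> a = load_dump("encoder_embedding_cpu.json");
    const std::vector<float> b = load_dump("encoder_embedding_cuda.json");
    if (a.empty() || a.size() != b.size()) {
        fprintf(stderr, "dump size mismatch (%zu vs %zu)\n", a.size(), b.size());
        return 1;
    }
    // normalized mean squared error: sum((a-b)^2) / sum(a^2)
    double num = 0.0;
    double den = 0.0;
    for (size_t i = 0; i < a.size(); i++) {
        num += ((double)a[i] - b[i]) * ((double)a[i] - b[i]);
        den += (double)a[i] * a[i];
    }
    printf("NMSE = %g\n", num / den);
    return 0;
}
```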
Do you know how we could fix this? @slaren
We would need to find the op that is producing wrong results in CUDA. The easiest way to do this is by using `ggml_backend_compare_graph_backend`.
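For reference, this is a minimal sketch of how the comparison is typically wired up, assuming a built graph `gf` and a GPU backend handle `backend_gpu` (names here are placeholders; the signatures match `ggml-backend.h` from around this period and may differ in other versions):

```cpp
// Invoked after each graph node has been evaluated on both backends:
// t1 holds the result from the first backend, t2 from the second.
// Returning true continues the comparison with the next node.
static bool compare_node(int node_index, ggml_tensor * t1, ggml_tensor * t2, void * user_data) {
    printf("node %d: %s\n", node_index, ggml_op_desc(t1));
    (void) t2; (void) user_data;
    return true;
}

// ... then, wherever the graph is normally computed:
ggml_backend_t backend_cpu = ggml_backend_cpu_init();
ggml_backend_compare_graph_backend(backend_gpu, backend_cpu, gf, compare_node, /*user_data=*/NULL);
ggml_backend_free(backend_cpu);
```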
Seems like we're dealing with a nightmare here. I'll do my best to pinpoint the operation at the heart of the issue : )
CPU backend vs CUDA backend:
My modification: I copied a section of code from `test-backend-ops`:

```cpp
static std::vector<float> tensor_to_float(const ggml_tensor * t) {
    std::vector<float> tv;
    tv.reserve(ggml_nelements(t));

    std::vector<uint8_t> buf(ggml_nbytes(t));
    ggml_backend_tensor_get(t, buf.data(), 0, ggml_nbytes(t));

    ggml_type_traits_t tt = ggml_internal_get_type_traits(t->type);
    size_t bs = ggml_blck_size(t->type);
    std::vector<float> vq(ggml_blck_size(t->type));
    bool quantized = ggml_is_quantized(t->type);

    // access elements by index to avoid gaps in views
    for (int64_t i3 = 0; i3 < t->ne[3]; i3++) {
        for (int64_t i2 = 0; i2 < t->ne[2]; i2++) {
            for (int64_t i1 = 0; i1 < t->ne[1]; i1++) {
                for (int64_t i0 = 0; i0 < t->ne[0]; i0 += bs) {
                    size_t i = i3*t->nb[3] + i2*t->nb[2] + i1*t->nb[1] + i0/bs*t->nb[0];
                    if (t->type == GGML_TYPE_F16) {
                        tv.push_back(ggml_fp16_to_fp32(*(ggml_fp16_t*)&buf[i]));
                    } else if (t->type == GGML_TYPE_F32) {
                        tv.push_back(*(float *) &buf[i]);
                    } else if (t->type == GGML_TYPE_I32) {
                        tv.push_back((float)*(int32_t *) &buf[i]);
                    } else if (quantized) {
                        tt.to_float(&buf[i], vq.data(), bs);
                        tv.insert(tv.end(), vq.begin(), vq.end());
                    } else {
                        GGML_ASSERT(false);
                    }
                }
            }
        }
    }

    return tv;
}
static bool isinf_or_max(float f) {
    return std::isinf(f) || f == FLT_MAX || f == -FLT_MAX;
}
static double nmse(const float * a, const float * b, size_t n) {
    double mse_a_b = 0.0;
    double mse_a_0 = 0.0;

    for (size_t i = 0; i < n; i++) {
        float a_i = a[i];
        float b_i = b[i];

        mse_a_b += (a_i - b_i) * (a_i - b_i);
        mse_a_0 += a_i * a_i;
    }

    return mse_a_b / mse_a_0;
}
static bool whisper_encode_internal(
        whisper_context & wctx,
        whisper_state & wstate,
        const int mel_offset,
        const int n_threads,
        whisper_abort_callback abort_callback,
        void * abort_callback_data) {
    const int64_t t_start_us = ggml_time_us();

    struct callback_userdata {
        bool   ok;
        double max_err;
    };

    callback_userdata ud {
        true,
        1e-7,
    };

    auto callback = [](int index, ggml_tensor * t1, ggml_tensor * t2, void * user_data) -> bool {
        callback_userdata * ud = (callback_userdata *) user_data;

        if (t1->op == GGML_OP_NONE) {
            // sentinels must be unchanged
            std::vector<uint8_t> t1_data(ggml_nbytes(t1));
            std::vector<uint8_t> t2_data(ggml_nbytes(t2));
            ggml_backend_tensor_get(t1, t1_data.data(), 0, ggml_nbytes(t1));
            ggml_backend_tensor_get(t2, t2_data.data(), 0, ggml_nbytes(t2));

            if (memcmp(t1_data.data(), t2_data.data(), ggml_nbytes(t1)) != 0) {
                printf("sentinel mismatch: %s ", t1->name);
                ud->ok = false;
                return true;
            }
        }

        std::vector<float> f1 = tensor_to_float(t1);
        std::vector<float> f2 = tensor_to_float(t2);

        for (size_t i = 0; i < f1.size(); i++) {
            // check for nans
            if (std::isnan(f1[i]) || std::isnan(f2[i])) {
                printf("[%s] NaN at index %zu (%f %f) ", ggml_op_desc(t1), i, f1[i], f2[i]);
                ud->ok = false;
                return true;
            }
            // check for infs: both must be inf of the same sign, or both must be finite
            if (isinf_or_max(f1[i]) || isinf_or_max(f2[i])) {
                if (isinf_or_max(f1[i]) && isinf_or_max(f2[i])) {
                    if (std::signbit(f1[i]) != std::signbit(f2[i])) {
                        printf("[%s] inf sign mismatch: %f %f ", ggml_op_desc(t1), f1[i], f2[i]);
                        ud->ok = false;
                        return true;
                    }
                } else {
                    printf("[%s] inf mismatch: %f %f ", ggml_op_desc(t1), f1[i], f2[i]);
                    ud->ok = false;
                    return true;
                }
            }
        }

        double err = nmse(f1.data(), f2.data(), f1.size());
        if (err > ud->max_err) {
            printf("[%s] NMSE = %f ", ggml_op_desc(t1), err);
            //for (int i = 0; i < f1.size(); i++) {
            //    printf("%5d %9.6f %9.6f, diff = %9.6f\n", i, f1[i], f2[i], f1[i] - f2[i]);
            //}
            //printf("\n");
            //exit(1);
            ud->ok = false;
        }

        return true;

        GGML_UNUSED(index);
    };

    ggml_backend_t backend_cpu = ggml_backend_cpu_init();

    // conv
    {
        auto & alloc = wstate.alloc_conv.alloc;

        ggml_allocr_reset(alloc);

        ggml_cgraph * gf = whisper_build_graph_conv(wctx, wstate, mel_offset);

        ggml_allocr_alloc_graph(alloc, gf);

        ud = {true, 1e-7};
        ggml_backend_compare_graph_backend(wstate.backend, backend_cpu, gf, callback, &ud);
        if (ud.ok) {
            printf("\033[1;32mOK\033[0m\n");
        } else {
            printf("\033[1;31mFAIL\033[0m\n");
        }

        // if (!whisper_encode_external(wstate)) {
        //     ggml_graph_compute_helper(wstate.backend, gf, n_threads);
        // }
    }

    // encoder
    if (!whisper_encode_external(wstate)) {
        auto & alloc = wstate.alloc_encode.alloc;

        ggml_allocr_reset(alloc);

        ggml_cgraph * gf = whisper_build_graph_encoder(wctx, wstate);

        ggml_allocr_alloc_graph(alloc, gf);

        ud = {true, 1e-7};
        ggml_backend_compare_graph_backend(wstate.backend, backend_cpu, gf, callback, &ud);
        if (ud.ok) {
            printf("\033[1;32mOK\033[0m\n");
        } else {
            printf("\033[1;31mFAIL\033[0m\n");
        }

        // ggml_graph_compute_helper(wstate.backend, gf, n_threads);
    }

    // cross
    {
        auto & alloc = wstate.alloc_cross.alloc;

        ggml_allocr_reset(alloc);

        ggml_cgraph * gf = whisper_build_graph_cross(wctx, wstate);

        ggml_allocr_alloc_graph(alloc, gf);

        ud = {true, 1e-7};
        ggml_backend_compare_graph_backend(wstate.backend, backend_cpu, gf, callback, &ud);
        if (ud.ok) {
            printf("\033[1;32mOK\033[0m\n");
        } else {
            printf("\033[1;31mFAIL\033[0m\n");
        }

        return 0;

        // ggml_graph_compute_helper(wstate.backend, gf, n_threads);
    }

    wstate.t_encode_us += ggml_time_us() - t_start_us;
    wstate.n_encode++;

    return !(abort_callback && abort_callback(abort_callback_data));
}
```
Which NVIDIA card are you using? This seems like an issue that occurs only on very old hardware (CC <= 6). It's hard to fix because I don't have the means to reproduce it.
I’m currently using an RTX 3060, which is still fairly recent. |
By the way, do you have any idea how to get my test code running properly? As far as I know, the …
The test code looks good. It's not impossible that the issue is in the …
Ah, now I understand why I'm encountering issues when comparing the second and third graphs.

Line 1785 in 37a709f: if I remove this code and replace it with `struct ggml_tensor * cur = wstate.embd_conv;`, and line 2034 in 37a709f: if I remove this code and replace it with `struct ggml_tensor * cur = wstate.embd_enc;`, then it functions as anticipated, especially when I employ `ggml_backend_compare_graph_backend`.
Ah I see, I didn't understand the issue. I think that views of external tensors were not being copied properly to the other backend. This should fix it:

```diff
diff --git a/ggml-backend.c b/ggml-backend.c
index 526ce732..e9cfffbe 100644
--- a/ggml-backend.c
+++ b/ggml-backend.c
@@ -1312,6 +1312,7 @@ static void graph_init_tensor(struct ggml_hash_set hash_set, struct ggml_tensor
     struct ggml_tensor * dst = node_copies[id];
     if (dst->view_src != NULL) {
+        graph_init_tensor(hash_set, node_copies, node_init, src->view_src);
         ggml_backend_view_init(dst->view_src->buffer, dst);
     }
     else {
```
My hypothesis is that there must be a relatively large error occurring somewhere, and these errors keep accumulating as the computation goes on, leading to problems in the results. This also explains why, with different audio files, only specific models encounter issues: different models have different weights and embeddings, so the errors might cancel each other out during the computation process.
It might be because matrix multiplications are performed in FP16. You can force FP32 by using:

```cpp
for (int i = 0; i < gf->n_nodes; i++) {
    if (gf->nodes[i]->op == GGML_OP_MUL_MAT) ggml_mul_mat_set_prec(gf->nodes[i], GGML_PREC_F32);
}
```
Have I made a mistake here? This seems to have worsened the NMSE.

Before:

After:
```cpp
// conv
{
    auto & alloc = wstate.alloc_conv.alloc;

    ggml_allocr_reset(alloc);

    ggml_cgraph * gf = whisper_build_graph_conv(wctx, wstate, mel_offset);

    for (int i = 0; i < gf->n_nodes; i++) {
        if (gf->nodes[i]->op == GGML_OP_MUL_MAT) ggml_mul_mat_set_prec(gf->nodes[i], GGML_PREC_F32);
    }

    ggml_allocr_alloc_graph(alloc, gf);

    ud = {true, 1e-7};
    ggml_backend_compare_graph_backend(wstate.backend, backend_cpu, gf, callback, &ud);
    if (ud.ok) {
        printf("\033[1;32mOK\033[0m\n\n");
    } else {
        printf("\033[1;31mFAIL\033[0m\n\n");
    }

    // if (!whisper_encode_external(wstate)) {
    //     ggml_graph_compute_helper(wstate.backend, gf, n_threads);
    // }
}
```
That's odd, maybe there is a bug with …
@bobqianic Maybe it would be better if you open a PR with the changes that you have made and steps to reproduce. Also, try the latest sync #1691 that will be merged soon and see if the issues still persist there.
Lines 7396 to 7397 in 37a709f
OK |
There is a check for …

All of these need to be true to use FP16 matrix multiplication: …

Note that the …
By the way, do you have any idea why using the …? It's not just me: several other users have noticed the same thing, which is quite odd, and I've been able to replicate this issue on my machine as well. As far as I know, the …

Lines 1056 to 1089 in 37a709f
The CUDA backend is always used automatically with large matrix multiplications. At the moment, the only way to disable it completely is to build without CUDA.

Lines 8243 to 8246 in 37a709f
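For context, that check looked roughly like the following; this is an illustrative reconstruction of `ggml_cuda_can_mul_mat` from memory, not an exact quote of the lines referenced above, so the thresholds and types may differ at this commit:

```cpp
// Illustrative: with a CUBLAS build, ggml offloads a mul_mat to CUDA whenever
// the operand types match and the matrices are large enough - no runtime flag
// is consulted here, which is why it can only be disabled at build time.
static bool ggml_cuda_can_mul_mat(const struct ggml_tensor * src0, const struct ggml_tensor * src1, struct ggml_tensor * dst) {
    const int64_t ne10 = src1->ne[0];
    const int64_t ne0  = dst->ne[0];
    const int64_t ne1  = dst->ne[1];

    return (src0->type == GGML_TYPE_F32 || src0->type == GGML_TYPE_F16 || ggml_is_quantized(src0->type)) &&
           src1->type == GGML_TYPE_F32 &&
            dst->type == GGML_TYPE_F32 &&
           (ne0 >= 32 && ne1 >= 32 && ne10 >= 32);
}
```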
There is also the option to run with …

Lines 6643 to 6647 in 37a709f
This occurs when using the tiny, small, base, medium, and large models. All models used are not quantized.
CUDA:
CPU:
encoder_embedding_conv.zip