fix(fluids): Update tests for fixed ts_type alpha #1681
The startup logic for `ts_type: alpha` was fixed in petsc/petsc!7816, so the reference solutions needed to be fixed as well.
Much appreciated!
hmm, CUDA is unhappy on Noether
I have no idea why CUDA is failing, but everybody else is happy... It's definitely related to ts_type alpha, because all the failed tests are with ts_type alpha. But those tests didn't need to have their values updated to pass on CPU (and ROCm evidently).
I'm doing a pull and rebuild just in case, but I don't think that will change things. Edit: No dice, same issue.
I'm going back to the original alpha fix commit to see if it's maybe something unrelated in PETSc's CUDA implementation. If that's not the case, I'm really confused.
Yeah, this seems strange unless it's exposing some bad assumption somewhere in one of the CUDA impl files 🤷
Yep, going back to 3f1b2ee8 works fine. Time for
@jrwrigh Is the commit wrong? I can't see an error.
I ran a
As for what is wrong about the commit, I don't know; I'm not familiar with the code and how it interfaces with the fluids example or HONEE.
I sorta suspect that the commit in question exposes some oops elsewhere for CUDA vectors, as the logic there looks correct.
Jed had a theory of maybe a kernel synchronicity issue? We are only seeing an issue when writing out to binary, right?
Not just when writing/reading to binary. Every fluids test reads binary files, but only some of the tests that use ts_type: alpha fail.
The fact it's only alpha failing makes me think it's something to do with the alpha restart, which does several
And those are somehow async copies involving offloading to the host, which isn't great
Are we setting the DM vec type on every DM?
Yes, indirectly via i.e. we do it once for the primary DM, then
And is the TSAlpha grabbing all work vectors from the DM?
It's using
Clarification: Not every test using
Ok, so in gdb I did a backtrace for every call to
Here are those backtraces, in case that's of interest.
It doesn't help, but that block is more complicated than it needs to be:
$ git diff
diff --git a/src/vec/vec/impls/seq/cupm/vecseqcupm_impl.hpp b/src/vec/vec/impls/seq/cupm/vecseqcupm_impl.hpp
index 719ad054f24..682554196fe 100644
--- a/src/vec/vec/impls/seq/cupm/vecseqcupm_impl.hpp
+++ b/src/vec/vec/impls/seq/cupm/vecseqcupm_impl.hpp
@@ -1394,26 +1394,15 @@ inline PetscErrorCode VecSeq_CUPM<T>::CopyAsync(Vec xin, Vec yout, PetscDeviceCo
// translate from PetscOffloadMask to cupmMemcpyKind
PetscCall(PetscDeviceContextGetOptionalNullContext_Internal(&dctx));
switch (const auto ymask = yout->offloadmask) {
- case PETSC_OFFLOAD_UNALLOCATED: {
- PetscBool yiscupm;
-
- PetscCall(PetscObjectTypeCompareAny(PetscObjectCast(yout), &yiscupm, VECSEQCUPM(), VECMPICUPM(), ""));
- if (yiscupm) {
- mode = PetscOffloadDevice(xmask) ? cupmMemcpyDeviceToDevice : cupmMemcpyHostToHost;
- break;
- }
- } // fall-through if unallocated and not cupm
-#if PETSC_CPP_VERSION >= 17
- [[fallthrough]];
-#endif
+ case PETSC_OFFLOAD_UNALLOCATED:
case PETSC_OFFLOAD_CPU: {
PetscBool yiscupm;
PetscCall(PetscObjectTypeCompareAny(PetscObjectCast(yout), &yiscupm, VECSEQCUPM(), VECMPICUPM(), ""));
if (yiscupm) {
- mode = PetscOffloadHost(xmask) ? cupmMemcpyHostToDevice : cupmMemcpyDeviceToDevice;
+ mode = PetscOffloadDevice(xmask) ? cupmMemcpyDeviceToDevice : cupmMemcpyHostToDevice;
} else {
- mode = PetscOffloadHost(xmask) ? cupmMemcpyHostToHost : cupmMemcpyDeviceToHost;
+ mode = PetscOffloadDevice(xmask) ? cupmMemcpyDeviceToHost : cupmMemcpyHostToHost;
}
break;
}
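The simplified branch in the diff above can be read as a two-input decision table. Below is a minimal sketch in plain C, not PETSc code: the enum and function names are hypothetical stand-ins for `PetscOffloadMask`, `cupmMemcpyKind`, and `PetscOffloadDevice`, and treating a GPU or BOTH mask as "on device" is an assumption. It only covers the case shown in the diff, where the destination mask is unallocated or CPU.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for PetscOffloadMask (values are illustrative only). */
typedef enum { MASK_UNALLOCATED, MASK_CPU, MASK_GPU, MASK_BOTH } OffloadMask;

/* Hypothetical stand-in for cupmMemcpyKind. */
typedef enum { COPY_HOST_TO_HOST, COPY_HOST_TO_DEVICE, COPY_DEVICE_TO_HOST, COPY_DEVICE_TO_DEVICE } CopyKind;

/* Assumption: the source counts as "on device" when its mask is GPU or BOTH. */
static bool source_on_device(OffloadMask xmask) { return xmask == MASK_GPU || xmask == MASK_BOTH; }

/* Pick the copy direction when the destination is unallocated or host-resident.
   dest_is_device_type plays the role of `yiscupm` in the diff. */
CopyKind pick_copy_kind(OffloadMask xmask, bool dest_is_device_type) {
  assert(xmask <= MASK_BOTH);
  if (dest_is_device_type) {
    return source_on_device(xmask) ? COPY_DEVICE_TO_DEVICE : COPY_HOST_TO_DEVICE;
  }
  return source_on_device(xmask) ? COPY_DEVICE_TO_HOST : COPY_HOST_TO_HOST;
}
```

Written this way, each of the four (source location, destination type) pairs maps to exactly one copy kind, which makes it easier to spot a case where the mask and the actual data location disagree.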
With this diff (we should maybe apply it here):
$ git diff
diff --git a/examples/fluids/src/misc.c b/examples/fluids/src/misc.c
index 4510be1f..ef754c77 100644
--- a/examples/fluids/src/misc.c
+++ b/examples/fluids/src/misc.c
@@ -142,7 +142,7 @@ PetscErrorCode LoadFluidsBinaryVec(MPI_Comm comm, PetscViewer viewer, Vec Q, Pet
PetscErrorCode RegressionTest(AppCtx app_ctx, Vec Q) {
Vec Qref;
PetscViewer viewer;
- PetscReal error, Qrefnorm;
+ PetscReal error, Q_norm, Q_ref_norm, Q_err_norm;
MPI_Comm comm = PetscObjectComm((PetscObject)Q);
PetscFunctionBeginUser;
@@ -153,14 +153,17 @@ PetscErrorCode RegressionTest(AppCtx app_ctx, Vec Q) {
PetscCall(LoadFluidsBinaryVec(comm, viewer, Qref, NULL, NULL));
// Compute error with respect to reference solution
+ PetscCall(VecNorm(Q, NORM_MAX, &Q_norm));
+ PetscCall(VecNorm(Qref, NORM_MAX, &Q_ref_norm));
PetscCall(VecAXPY(Q, -1.0, Qref));
- PetscCall(VecNorm(Qref, NORM_MAX, &Qrefnorm));
- PetscCall(VecScale(Q, 1. / Qrefnorm));
+ PetscCall(VecNorm(Qref, NORM_MAX, &Q_err_norm));
+ PetscCall(VecScale(Q, 1. / Q_err_norm));
PetscCall(VecNorm(Q, NORM_MAX, &error));
// Check error
if (error > app_ctx->test_tol) {
- PetscCall(PetscPrintf(PETSC_COMM_WORLD, "Test failed with error norm %g\n", (double)error));
+ PetscCall(PetscPrintf(PETSC_COMM_WORLD, "Test failed with error norm %g\nReference solution norm %g; Computed solution norm %g\n", (double)error,
+ (double)Q_ref_norm, (double)Q_norm));
}
// Cleanup
I get this, which looks fishy
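For reference, the check in `RegressionTest` boils down to a max-norm relative error, the largest entry of |Q - Qref| divided by the largest entry of |Qref|, compared against `test_tol`. The sketch below restates that arithmetic on plain arrays, assuming nothing about PETSc; `relative_max_error` and `abs_value` are hypothetical names, not functions from the fluids example.

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value without pulling in the math library. */
static double abs_value(double x) { return x < 0.0 ? -x : x; }

/* Max-norm relative error: max_i |q[i] - q_ref[i]| / max_i |q_ref[i]|. */
double relative_max_error(const double *q, const double *q_ref, size_t n) {
  double err_norm = 0.0, ref_norm = 0.0;

  assert(n > 0);
  for (size_t i = 0; i < n; i++) {
    double diff = abs_value(q[i] - q_ref[i]);
    double ref  = abs_value(q_ref[i]);
    if (diff > err_norm) err_norm = diff;
    if (ref > ref_norm) ref_norm = ref;
  }
  assert(ref_norm > 0.0); /* an all-zero reference would make the ratio meaningless */
  return err_norm / ref_norm;
}
```

Printing the norms of the computed and reference solutions next to this ratio, as the diff does, helps tell a small discrepancy apart from a solution that is wrong everywhere.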
force-pushed from 2c6bb1e to 8535f0a
The error appears to be strictly in the 5th component?
That's expected. The advection-diffusion problems are set up so the 5th component ("energy") is the transported scalar. The other 4 are always constant.
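One way to confirm where the error lives is to compute the max-norm error per component rather than over the whole vector. The sketch below assumes the state is stored with components interleaved per node (5 values per node here); that layout and the names `component_max_error` and `abs_diff` are assumptions for illustration, not taken from the fluids example.

```c
#include <assert.h>
#include <stddef.h>

/* |a - b| without pulling in the math library. */
static double abs_diff(double a, double b) { return a > b ? a - b : b - a; }

/* For each component c, store max over nodes of |q - q_ref| in err[c].
   Assumes entry (node i, component c) lives at index i * num_comp + c. */
void component_max_error(const double *q, const double *q_ref, size_t num_nodes, size_t num_comp, double *err) {
  assert(num_comp > 0);
  for (size_t c = 0; c < num_comp; c++) err[c] = 0.0;
  for (size_t i = 0; i < num_nodes; i++) {
    for (size_t c = 0; c < num_comp; c++) {
      double d = abs_diff(q[i * num_comp + c], q_ref[i * num_comp + c]);
      if (d > err[c]) err[c] = d;
    }
  }
}
```

For an advection-diffusion run, only the last entry of `err` should be nonzero if the other four components really stay constant.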
We should add an issue to undo our reversion of the offending commit when the bug is patched, but we can merge this now
Not sure why the GitLab CI is saying the pipeline is still running, but all the jobs passed.
Bad script, no cookies for it