MOM6 config working with gnu, but not intel compile #1461
Comments
I've run your experiment (provided separately) with Intel and GNU, debug and repro, but all appear to be running without any problems. Could this possibly have already been fixed? What is the hash of your MOM6? I am using 7883f63 (from 6 August). |
I was finally able to replicate something here, though not exactly what you are seeing. These are from Intel 18 "debug":
I am using the GFDL branches of the packages, plus whatever flags are in the mkmf templates, so my setup is probably slightly different from what you are running. It took quite a few timesteps for a problem to emerge (about 48 here). The error here is coming from Icepack, so perhaps this is where the problem is arising. It seemed to happen at the 4-hour mark, although I don't see anything significant about that timestep. (Coupling steps are 1 hr.) Will keep looking. Edit: Not exactly sure why, but I now get your error in regrid_edge_values.F90. The earlier failure could have been an unrelated initialization issue in Icepack which I can no longer replicate. |
Somehow this line has resulted in a div-by-zero: The zero is from I don't know how it happened, but somehow
While I could just add an additional check for |
I see that the line that is crashing is using the 2018_answers version of the code. There is a newer option within the code (generally set by selecting Basically, instead of naively doing Gaussian elimination for matrix solvers or just the standard Thomas algorithm for solving tridiagonal matrices, the newer versions expand the expressions algebraically to replace differences of numbers that might be very close together with other positive definite expressions, effectively replacing This raises several questions for me: |
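To illustrate the kind of rewrite described in the comment above, here is a minimal Python sketch. It is not the MOM6 code: the function names and the particular expression (`1 - a/(a+b)` versus the algebraically identical `b/(a+b)`) are assumptions chosen only to show how a difference of nearly equal numbers can round to exactly zero, while an equivalent form without the subtraction stays positive.

```python
def remaining_fraction_naive(a, b):
    """Hypothetical example: computes 1 - a/(a+b) by direct subtraction.

    When b is tiny relative to a, a/(a+b) rounds to 1.0 and the
    difference collapses to exactly 0.0.
    """
    return 1.0 - a / (a + b)


def remaining_fraction_robust(a, b):
    """Algebraically identical form, b/(a+b), with no subtraction.

    For a > 0 and b > 0 the result is strictly positive unless it underflows.
    """
    return b / (a + b)


if __name__ == "__main__":
    a, b = 1.0, 1.0e-20
    naive = remaining_fraction_naive(a, b)
    robust = remaining_fraction_robust(a, b)
    print("naive :", naive)
    print("robust:", robust)
    # Using each value as a denominator shows why the naive form is fragile.
    try:
        print("1/naive :", 1.0 / naive)
    except ZeroDivisionError:
        print("1/naive : division by zero")
    print("1/robust:", 1.0 / robust)
```

In Python the division by the naive value raises `ZeroDivisionError`; in compiled Fortran the same situation would typically surface as a floating-point exception or an infinity, depending on the compiler flags.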
@Hallberg-NOAA I don't think this is happening with |
The error appears to be coming from the ALE sponge, not the ALE code. The issue seems to be that Although the In this case, GCC is probably just re-using the old value, which is why it still works, whereas Intel is allocating a new part of the heap, with random model-crashing values. In any case, we have to assume that the array is uninitialized, and we either need to re-initialize these values, or we need to avoid re-allocating the arrays inside of the function. Having said all that, even if I re-initialize this array, I see the icepack error later in the run (at ~hour 12) so we are not yet through this problem. |
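The first remedy mentioned above (re-initializing a freshly allocated work array before use) can be sketched with a toy model. This is not the ALE sponge code: `allocate` is a hypothetical stand-in for an allocator that returns memory with arbitrary contents, and the loop that skips the last element is an invented bug used only to show what happens when part of the array is never written.

```python
import random


def allocate(n, rng):
    """Toy allocator (hypothetical): returns n slots with arbitrary contents,
    mimicking freshly allocated, uninitialized memory."""
    return [rng.uniform(-1.0e30, 1.0e30) for _ in range(n)]


def column_sum_unsafe(values, rng):
    """Sums a work array that is only partly written: the result depends on
    whatever the allocator happened to return."""
    work = allocate(len(values), rng)
    for k in range(len(values) - 1):  # invented bug: last slot never set
        work[k] = values[k]
    return sum(work)


def column_sum_safe(values, rng):
    """Same partial write, but the work array is explicitly zeroed first,
    so the result no longer depends on the allocator."""
    work = allocate(len(values), rng)
    for k in range(len(work)):
        work[k] = 0.0  # explicit re-initialization after allocation
    for k in range(len(values) - 1):
        work[k] = values[k]
    return sum(work)


if __name__ == "__main__":
    rng = random.Random(0)
    data = [1.0, 2.0, 3.0]
    print("unsafe:", column_sum_unsafe(data, rng))  # arbitrary value
    print("safe  :", column_sum_safe(data, rng))
```

The safe variant gives the same answer regardless of what the allocator returns, which is the property needed for results to agree across compilers.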
Following up on this issue:
I have looked further into this error, and it seems that the SIS2 field I do not believe that this is due to icepack, but is rather due to SIS2. I believe something is happening in this code block: I am not very familiar with SIS2, but I will keep looking and see if anything makes sense here. There is a lot of pointer manipulation going on, so it's possible that something is not being set properly. |
I've looked at this more closely. This seems to be a bug in SIS2. The
If I just add this:
then the model appears to run (so far at least; up to day 2) although I am not sure if this is the preferred solution. We should talk more on Monday. |
This commit appears to have fixed the issue: NOAA-GFDL/SIS2@fa3f6b1 (This was in PR NOAA-GFDL/SIS2#179) |
I feel comfortable that all of the issues here have been resolved, so I will close the issue. |
I am running a regional MOM6 configuration that includes sponges (u, v, tracer) and ice. Currently, the simulation fails if the code is compiled with intel but runs (very slowly!) if compiled with gnu.
Backtrace for intel compile, REPRO mode:
Backtrace for intel compile, DEBUG mode:
or