
Bug fixed 1d evp #568

Merged (41 commits) on Aug 5, 2021

Conversation

@TillRasmussen (Contributor) commented Mar 2, 2021

For detailed information about submitting Pull Requests (PRs) to the CICE-Consortium,
please refer to: https://github.com/CICE-Consortium/About-Us/wiki/Resource-Index#information-for-developers

PR checklist

  • Short (1 sentence) summary of your PR:
    Removes all bad departure point bugs
    Copied from @srethmeier
    The issue with departure errors, resulting from default initialization of full land blocks in the gathering part of the EVP 1D implementation. To solve this, and to avoid gathering and scattering constant arrays, the HTE and HTN arrays (including ghost cells) are now, for the EVP 1D implementation, allocated and given their values during initialization
    The issue with tests not being bit-for-bit comparable when compiled with debug/-O0 optimization, resulting from the order of computations for computing tinyarea not being identical between the EVP 2D implementation and the EVP 1D implementation. This was solved by making the EVP 1D computation order identical to the EVP 2D computation order
    The issue with tests not being bit-for-bit comparable when compiled with debug/-O0 optimization, resulting from icetmask not being a subset of iceumask, as initially assumed in the EVP 1D implementation. This was solved by also adding a skiptcell logical array for skipping stress calculations in the EVP 1D implementation (see the sketch below)
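    A minimal, self-contained sketch of the skiptcell idea (illustrative only; the actual loop structure in ice_dyn_evp_1d.F90 differs, and the mask values here are made up):

      program skiptcell_sketch
        implicit none
        integer, parameter :: nactive = 8
        logical :: skiptcell(nactive)
        integer :: iw
        ! .true. marks 1d points with no T-cell ice (icetmask not set)
        skiptcell = (/ .false., .true., .false., .false., &
                       .true., .true., .false., .false. /)
        do iw = 1, nactive
           if (skiptcell(iw)) cycle   ! skip the stress computation here
           print *, 'stress computed for 1d index', iw
        enddo
      end program skiptcell_sketch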

Then we made some modifications to the OpenMP parts of the EVP 1D implementation. The core part (stress and stepu) was implemented to be NUMA-aware for performance. This was originally done differently for the interfacing part of the implementation, but it is now aligned throughout the EVP 1D implementation so that all of it is NUMA-aware.
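To illustrate the NUMA-aware pattern mentioned above, here is a minimal first-touch sketch (names and sizes are illustrative, not the CICE code): the loop that initializes an array uses the same OpenMP schedule as the loop that later computes on it, so each thread's pages are allocated on its own NUMA node.

  program first_touch_sketch
    implicit none
    integer, parameter :: n = 1000000
    real(kind=8), allocatable :: stressp_1d(:)
    integer :: iw
    allocate(stressp_1d(n))
    ! First touch: each thread initializes the index range it will
    ! later compute on, placing pages on its own NUMA node.
    !$OMP PARALLEL DO SCHEDULE(static)
    do iw = 1, n
       stressp_1d(iw) = 0.0d0
    enddo
    !$OMP END PARALLEL DO
    ! A compute loop with the same schedule touches the same pages locally.
    !$OMP PARALLEL DO SCHEDULE(static)
    do iw = 1, n
       stressp_1d(iw) = stressp_1d(iw) + 1.0d0
    enddo
    !$OMP END PARALLEL DO
    print *, 'done:', stressp_1d(1)
  end program first_touch_sketch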

As for namelist changes, the only change made is the renaming of kevp_kernel to evp_algorithm and changing it from an integer to a string. EVP 2D is now enabled by setting evp_algorithm = 'standard_2d' instead of kevp_kernel = 0, and EVP 1D by setting evp_algorithm = 'shared_mem_1d' instead of kevp_kernel = 102. In connection with this, the option set_nml.evp1d was also added. Documentation has also been updated to reflect this modification.
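For example, switching the solver would look like this in the ice_in namelist (the group name dynamics_nml is an assumption; the evp_algorithm values are the ones introduced by this PR):

  &dynamics_nml
    kdyn          = 1                 ! EVP dynamics
    evp_algorithm = 'shared_mem_1d'   ! 1D solver (was kevp_kernel = 102)
  ! evp_algorithm = 'standard_2d'     ! 2D solver (was kevp_kernel = 0)
  /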

Finally, we did a full cleanup of ice_dyn_evp_1d.F90 and some cleanup of ice_dyn_evp.F90. This has mainly included:

Fixing indentations
Aligning variable names with the rest of the code base
Removing old code blocks that were used during development but are no longer needed

This implementation should work for revised evp and "traditional" evp. The former should still be tested.

  • Developer(s):
    @TillRasmussen @srethmeier (Stefan Rethmeier, DMI)
  • Suggest PR reviewers from list in the column to the right.
  • @eclare108213 @apcraig @phil-blain @mhrib
  • Please copy the PR test results link or provide a summary of testing completed below.
    Removes the bad departure point bug.
    Failed test: "restart gbox128 4x2". This test runs but fails to restart exactly.
    ./results.csh | grep FAIL
    FAIL freya_intel_restart_gbox128_4x2_kevp102_short test
    FAIL freya_intel_logbfb_gx3_1x1x50x58x4_diag1_droundrobin_kevp102_maskhalo_reprosum_thread bfbcomp freya_intel_logbfb_gx3_4x2x25x29x4_diag1_dslenderX2_reprosum different-data
    FAIL freya_intel_logbfb_gx3_4x1x25x116x1_diag1_dslenderX1_kevp102_maskhalo_reprosum_thread bfbcomp freya_intel_logbfb_gx3_4x2x25x29x4_diag1_dslenderX2_reprosum different-data
    FAIL freya_intel_logbfb_gx3_1x20x5x29x80_diag1_dsectrobin_kevp102_reprosum_short bfbcomp freya_intel_logbfb_gx3_4x2x25x29x4_diag1_dslenderX2_reprosum different-data
    FAIL freya_intel_logbfb_gx3_8x2x8x10x20_diag1_droundrobin_kevp102_reprosum bfbcomp freya_intel_logbfb_gx3_4x2x25x29x4_diag1_dslenderX2_reprosum different-data
    FAIL freya_intel_logbfb_gx3_6x2x50x58x1_diag1_droundrobin_kevp102_reprosum bfbcomp freya_intel_logbfb_gx3_4x2x25x29x4_diag1_dslenderX2_reprosum different-data
    FAIL freya_intel_logbfb_gx3_6x2x4x29x18_diag1_dspacecurve_kevp102_maskhalo_reprosum bfbcomp freya_intel_logbfb_gx3_4x2x25x29x4_diag1_dslenderX2_reprosum different-data
    FAIL - test failed
    7 of 390 tests FAILED
  • How much do the PR code changes differ from the unmodified code?
    • [ ] bit for bit
    • [ ] different at roundoff level
    • [x] more substantial
  • Does this PR create or have dependencies on Icepack or any other models?
    • [ ] Yes
    • [x] No
  • Does this PR add any new test cases?
    • [ ] Yes
    • [x] No
  • Is the documentation being updated? ("Documentation" includes information on the wiki or in the .rst files from doc/source/, which are used to create the online technical docs at https://readthedocs.org/projects/cice-consortium-cice/. A test build of the technical docs will be performed as part of the PR testing.)
    • [ ] Yes
    • [x] No, does the documentation need to be updated at a later time?
      • [x] Yes
      • [ ] No
  • Please provide any additional information or relevant details below:
    This is a bug fix. The 1d evp solver speeds up the evp solver substantially but is, as expected, not bit-for-bit reproducible. If it is chosen, the solver will use OMP only. The bug was in the gather method when masked blocks were neighbours of an ocean point. When derivatives were found, they got the special value, which by default is a very big number. Velocities and stresses are set to zero, whereas HTE and HTN store a global 2d array.
    There are alternative choices to fix the bug:
    a/ fill the halo zones in the gather routine with values of neighbouring blocks.
    b/ merge the conversion from 2d to 1d and the gather routine.
    A bit of cleaning should be done.
    There should be a test including 1d evp.

srethmeier and others added 3 commits March 1, 2021 14:03
…r task to avoid gathering constant arrays when running EVP 1D implementation. This solves issue with departure errors resulting from erroneous initialization of full land blocks in EVP 1D gathering. Correct initialization of remaining variables for full land blocks in EVP 1D gathering.
…ge of file. Machine 'freya' was double in file.
@apcraig (Contributor) commented Mar 2, 2021

I can run a test suite on this before we merge with the 1d evp turned on. Let me know when you think it's ready. It seems there are at least a couple things still to do,

  • decide if we are happy with the approach
  • clean up
  • update documentation
  • add test case

@TillRasmussen (Contributor, Author)

I updated with the output from the full test suite. It shows (as expected) that binary identical results should not be expected.
Jacob and Mads wrote a report which should become an article about this. This showed that bfb should only be expected when low optimization flags were chosen.

@TillRasmussen (Contributor, Author) commented Mar 3, 2021

Note that all tests (except one) fail when compared to
freya_intel_logbfb_gx3_4x2x25x29x4_diag1_dslenderX2_reprosum (this is with kevp_kernel=0).
Restart files are binary identical when compared with
freya_intel_logbfb_gx3_4x2x25x29x4_diag1_dslenderX2_kevp102_reprosum.test0-kevp102/
(kevp_kernel=102).
Conclusion: when comparing against kevp_kernel=102, the only remaining issue is
"restart gbox128 4x2". This test runs but fails to restart exactly.
@apcraig It would be nice if you could run a test.

@eclare108213 (Contributor) left a review:

This looks pretty straightforward. Are there also documentation changes needed? Suggestions for code modifications below. I'm a bit confused about the state of the tests. Are you saying that they should all not be BFB, and there's only one test that doesn't complete and still needs a closer look? It would be a good idea to run the QC tests comparing standard 2D evp with the 1D version (or did we already do that?). Thanks for figuring out the issues here.

! Initialize global primary grid lengths array with ghost cells from
! global primary grid lengths array

subroutine primary_grid_lengths_global_ext(ARRAY_O, ARRAY_I)
@eclare108213 (Contributor):
It makes more sense to me for this routine to be put in the boundary modules, since it fills the ghost cells. This is all done on master task so MPI doesn't matter, and yet it would need to be put in both mpi and serial directories... That's the down side of how the code is currently structured, but I think we should stick with it for this, otherwise things can get really confusing.

call gather_global_ext(G_stress12_2, I_stress12_2, master_task, distrb_info)
call gather_global_ext(G_stress12_3, I_stress12_3, master_task, distrb_info)
call gather_global_ext(G_stress12_4, I_stress12_4, master_task, distrb_info)
call gather_global_ext(G_icetmask, I_icetmask, master_task, distrb_info, 0 )
@eclare108213 (Contributor):
It's curious that icetmask has 0 at the end of the argument list and iceumask has .false.! I see that is how they are defined. No need to fix it here...
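A toy sketch of why the two fill values differ (this is an assumption about the shape of the overloaded gather_global_ext interface, not the actual ice_gather_scatter source): each specific routine takes a fill value matching its array type, hence the integer 0 for icetmask and the logical .false. for iceumask.

  module gather_sketch
    implicit none
    interface gather_global_ext
      module procedure gather_int, gather_log
    end interface
  contains
    ! Integer arrays (e.g. icetmask) take an integer fill value ...
    subroutine gather_int(dst, src, fill)
      integer, intent(out) :: dst(:)
      integer, intent(in)  :: src(:), fill
      dst = fill; dst(1:size(src)) = src
    end subroutine
    ! ... while logical arrays (e.g. iceumask) take a logical one.
    subroutine gather_log(dst, src, fill)
      logical, intent(out) :: dst(:)
      logical, intent(in)  :: src(:), fill
      dst = fill; dst(1:size(src)) = src
    end subroutine
  end module gather_sketch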

@@ -314,6 +315,7 @@ subroutine input_data
ndtd = 1 ! dynamic time steps per thermodynamic time step
ndte = 120 ! subcycles per dynamics timestep: ndte=dt_dyn/dte
kevp_kernel = 0 ! EVP kernel (0 = 2D, >0: 1D. Only ver. 2 is implemented yet)
pgl_global_ext = .false. ! if true, init primary grid lebgths (global ext.)
@eclare108213 (Contributor):
spell lengths

@@ -314,6 +315,7 @@ subroutine input_data
ndtd = 1 ! dynamic time steps per thermodynamic time step
ndte = 120 ! subcycles per dynamics timestep: ndte=dt_dyn/dte
kevp_kernel = 0 ! EVP kernel (0 = 2D, >0: 1D. Only ver. 2 is implemented yet)
@eclare108213 (Contributor):
The comment here is kind of confusing. Does kevp_kernel=102 mean 1D, version 02? Do we have a version 1? Should we simplify the kevp_kernel choices? I remember that we chose 102 because it wasn't really ready, so this was a way to 'hide' it from users, or at least make it less likely they'd set it up.

@TillRasmussen (Contributor, Author) replied:
Originally, several versions of the 1d evp were created. Version 1 included most of the refactoring. Version 2 moved derived grid parameters of HTE and HTN to the 1d solver as scalars.
Version 3 changed some of the variables from real*8 to real*4. The influence of this was limited, especially when compared to the accuracy of the iteration. In the end only v2 was implemented. I think that kevp_kernel could be included in kdyn as option 4; I have not thought this through.

@mhrib (Contributor) commented Mar 6, 2021:

Right.
102 = "our version 2".
Version 1 and version 2 gave identical results (maybe except for really aggressive flags, I do not recall exactly), but v2 only takes about half the memory and is a bit faster. For conservative flags we were also able to produce BFB results.
That's not the case for v3, where many internal variables were calculated as real*4. But it was again faster and took up even less memory. As I recall, the uncertainties calculated using uvel, vvel were less than or comparable to the uncertainties obtained across different computational architectures.
I think v3 is an interesting exercise and worth considering for special cases to reduce memory load (and gain speed as well).

@TillRasmussen (Contributor, Author)

Sorry about the confusion. I may have confused myself a bit as well.
Main conclusions:
We have removed all bad departure point bugs.
If compiler flags are set to standard (see machine freya), then all tests run to the end, but there are numerical differences when testing kevp=102.
If the compile flag “-no-vec” is added, all tests are successful. It was not expected that numerical differences would be observed when testing with kevp=102, but this means that the order of calculations matters to the result.
We will test the significance of these numerical differences by running a 1-year simulation and comparing concentration and thickness between the tests.

This test suite was used:

./cice.setup --suite first_suite,base_suite,travis_suite,decomp_suite,reprosum_suite,quick_suite --mach freya --env intel --testid test8-kevp102-0 --bgen baseline8

The following results were achieved.

EVP-2d
379 measured results of 379 total results
377 of 379 tests PASSED
2 of 379 tests PENDING
0 of 379 tests MISSING data
0 of 379 tests FAILED
The two pending tests ask for more processors per node than our HPC system has (36), so they never start.

EVP-1d

Here we have 7 unsuccessful tests. They all complete their runs but fail the comparison with other runs:

FAIL freya_intel_restart_gbox128_4x2_short test
FAIL freya_intel_restart_gx3_1x1x50x58x4_droundrobin_thread bfbcomp freya_intel_restart_gx3_4x2x25x29x4_dslenderX2 different-data
FAIL freya_intel_restart_gx3_4x1x25x116x1_dslenderX1_thread bfbcomp freya_intel_restart_gx3_4x2x25x29x4_dslenderX2 different-data
FAIL freya_intel_restart_gx3_1x20x5x29x80_dsectrobin_short bfbcomp freya_intel_restart_gx3_4x2x25x29x4_dslenderX2 different-data
FAIL freya_intel_restart_gx3_1x4x25x29x16_droundrobin bfbcomp freya_intel_restart_gx3_4x2x25x29x4_dslenderX2 different-data
FAIL freya_intel_logbfb_gx3_1x20x5x29x80_diag1_dsectrobin_reprosum_short bfbcomp freya_intel_logbfb_gx3_4x2x25x29x4_diag1_dslenderX2_reprosum different-data
FAIL freya_intel_smoke_gx3_2x1_run2day_thread bfbcomp freya_intel_smoke_gx3_1x2_run2day different-data

All 7 failures are gone if the compile flag “-no-vec” is turned on.

Debug
We found two bugs when adding the debug flag to the compilation:

1/ spacecurves. This was found in #560 as well

2/ The second is the call tt = mod(ftime/secday, dayyr). This occurs during initialization, where I expect dayyr to be 0 (I have not checked), which would result in a division by zero.

forrtl: error (65): floating invalid
Image PC Routine Line
cice 0000000000B5BAB1 ice_forcing_mp_in 1242 ice_forcing.F90
cice 0000000000C159D8 ice_forcing_bgc_m 184 ice_forcing_bgc.F90
cice 00000000011AC027 ice_init_column_m 876 ice_init_column.F90
cice 000000000041118C cice_initmod_mp_i 439 CICE_InitMod.F90
cice 000000000040BC9C cice_initmod_mp_c 174 CICE_InitMod.F90
cice 000000000040ABAB cice_initmod_mp_c 52 CICE_InitMod.F90
cice 000000000040A391 MAIN__ 43
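If dayyr really is zero at that point, a guard like the following would avoid the floating invalid (a sketch only; the variable names come from the call above, the guard itself is not in the code):

  program dayyr_guard_sketch
    implicit none
    real(kind=8) :: tt, ftime, secday, dayyr
    ftime = 86400.0d0; secday = 86400.0d0; dayyr = 0.0d0
    if (dayyr > 0.0d0) then
       tt = mod(ftime/secday, dayyr)
    else
       tt = 0.0d0   ! or abort with a clear message instead
    endif
    print *, 'tt =', tt
  end program dayyr_guard_sketch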

@eclare108213 (Contributor)

See comments in #575 re kevp_kernel namelist value

… to MPI and serial implementations of ice_boundary modules (CICE-Consortium#568 (comment)). Please note duplication of subroutine.
@apcraig (Contributor) commented Mar 14, 2021

Thanks for the recent updates. I think the next things to do are to get rid of "102" and switch kevp_kernel=1 for the 1d implementation (0 = standard 2d). This is as suggested in #575. Actually, what I propose is that we change kevp_kernel to a string called evp_algorithm (or something like it) and have the valid values be "standard_2d" and "shared_memory_1d" (or something like that). We want to move away from numbers for namelist entries when we can. kevp_kernel is not really in use yet, so I think it's safe to do that.

Then, we'll want to add a test to turn on the 1d solver.

We can do those things as a separate PR and I can take it on if you like. I think we want to do it before the release though.

@TillRasmussen (Contributor, Author) commented Mar 14, 2021

I can implement the evp_algorithm part with the suggested changes. One question, not directly related, but while I am at it: is the revised evp included in the eap solver?
If it is only valid for evp, then I might want to include the revised evp in the evp_algorithm namelist.

@apcraig (Contributor) commented Mar 14, 2021

I think it depends whether revised evp is its own standalone flag relative to other flags. Could we have evp with/without revised evp, with/without the 1d solver? If so, it might be nice to keep them as separate flags; otherwise we have to deal with multiple combinations of things being set by one flag.

It might be nice to create a tree of the possible dynamics solver options in the documentation, something like (this is proto code only, not necessarily complete/correct). This assumes it adds to the understanding of what things go together in terms of setting up the dynamics. Just an idea.

  • kdyn=0 (none)
    • ...
  • kdyn=1 (evp)
    • revised_evp=.true.
      • evp_algorithm="standard_2d"
      • evp_algorithm="shared_memory_1d"
    • revised_evp=.false.
      • evp_algorithm="standard_2d"
      • evp_algorithm="shared_memory_1d"
  • kdyn=2 (eap)
    • ...
  • kdyn=3 (implicit)
    • algo_nonlin='picard'
      • precond='pgmres'
    • ...

@eclare108213 (Contributor)

If I remember correctly, revp was set up so that it could be used with eap, but I don't think that was ever actually implemented and tested. So maybe it wouldn't work. At any rate, revp seems like a fundamentally different approach than the 1D kernel -- it's more about how the subcycling iteration is handled, while kevp is more about the vectorization, right? I'd keep them separate. I like @apcraig's outline of dynamics options for the documentation. This outline should be essentially what's coded in the verbose diagnostics.

Clean up. Changed namelist parameter kevp to evp_algorithm
@apcraig (Contributor) commented Jul 19, 2021

I will do a QC test of 1d vs 2d with standard optimization and report back.

Looking back at the PR, these modifications primarily fix issues in the 1d evp that prevented identical results with 2d (at reduced optimization). That includes at least a fix to the tiny area computation, addition of grid_lengths_global_ext, and addition of skiptcell logic. In addition, the namelist was changed so we now have strings instead of integers to set evp_algorithm, and a bunch of other formatting (i.e. indentation) was cleaned up. There may have been a few other changes, maybe @srethmeier, @mhrib, or @TillRasmussen can comment.

The issues related to 1x1 and 2x2 blocks, calendar, and spacecurve were addressed elsewhere.

I agree there is still an outstanding issue about what parts of the dynamics work with each other (i.e. revp, eap plans, etc). I think it should also include overall discussion of control, namelist, and implementation of the dynamics. I have created a new issue, #619 that can serve as a general place holder for those issues. This is something that was raised recently in our monthly meetings as well.

@apcraig (Contributor) commented Jul 20, 2021

The qc test with the 1d evp fails at the end of the 4th year with zap snow temperature errors,

  (zap_snow_temperature)Tmin:  -100.000000000000
  (zap_snow_temperature)Tmax:  4.486434370863951E-006
  (zap_snow_temperature)zqsn:  -182365602.580354
  (zap_snow_temperature)zap_snow_temperature: temperature out of bounds!
  (zap_snow_temperature)k:           1
  (zap_snow_temperature)zTsn:  -103.894744835945
  (zap_snow_temperature)Tmin:  -100.000000000000
  (zap_snow_temperature)Tmax:  5.419524611890995E-006
  (zap_snow_temperature)zqsn:  -182424769.766085

The baseline 2d runs fine. This suggests there still may be some issues with the 1d implementation. I actually ran it twice, once with 9x4 pes (threaded) and another time with 36x1 (not threaded) and both failed at the same time in the same way. I guess that means it's a robust problem.

I still suggest we merge this but then create a new issue noting the new problem discovered.

@srethmeier (Contributor) commented Jul 20, 2021

> [quoting @apcraig's summary comment above in full]

@apcraig, I believe that actually summarizes all major things covered in this PR. We mainly fixed three issues:

  1. The issue with departure errors, resulting from default initialization of full land blocks in the gathering part of the EVP 1D implementation. To solve this, and to avoid gathering and scattering constant arrays, the HTE and HTN arrays (including ghost cells) are now, for the EVP 1D implementation, allocated and given their values during initialization
  2. The issue with tests not being bit-for-bit comparable when compiled with debug/-O0 optimization, resulting from the order of computations for computing tinyarea not being identical between the EVP 2D implementation and the EVP 1D implementation. This was solved by making the EVP 1D computation order identical to the EVP 2D computation order
  3. The issue with tests not being bit-for-bit comparable when compiled with debug/-O0 optimization, resulting from icetmask not being a subset of iceumask, as initially assumed in the EVP 1D implementation. This was solved by also adding a skiptcell logical array for skipping stress calculations in the EVP 1D implementation

Then we made some modifications to the OpenMP parts of the EVP 1D implementation. The core part (stress and stepu) was implemented to be NUMA-aware for performance. This was originally done differently for the interfacing part of the implementation, but it is now aligned throughout the EVP 1D implementation so that all of it is NUMA-aware.

As for namelist changes, the only change made is the renaming of kevp_kernel to evp_algorithm and changing it from an integer to a string. EVP 2D is now enabled by setting evp_algorithm = 'standard_2d' instead of kevp_kernel = 0, and EVP 1D by setting evp_algorithm = 'shared_mem_1d' instead of kevp_kernel = 102. In connection with this, the option set_nml.evp1d was also added. Documentation has also been updated to reflect this modification.

Finally, we did a full cleanup of ice_dyn_evp_1d.F90 and some cleanup of ice_dyn_evp.F90. This has mainly included:

  • Fixing indentations
  • Aligning variable names with the rest of the code base
  • Removing old code blocks that were used during development but are no longer needed

Maybe @TillRasmussen can update the PR checklist with this, so that it is at the top of the PR? I can't modify it.

@srethmeier (Contributor)

> The qc test with the 1d evp fails at the end of the 4th year with zap snow temperature errors (log above). The baseline 2d runs fine. This suggests there still may be some issues with the 1d implementation. I actually ran it twice, once with 9x4 pes (threaded) and another time with 36x1 (not threaded), and both failed at the same time in the same way. I guess that means it's a robust problem.

Okay, looks like there still is something we need to look at here. I'll try and dig into it next week.

> I still suggest we merge this but then create a new issue noting the new problem discovered.

Sounds good to me. Whatever you prefer and find to be the best approach.

@srethmeier (Contributor) commented Jul 20, 2021

> The rEVP question needs a github issue for follow up -- can this vectorization technique be run for that case, and if so, does it work in the current code? I think there's already an issue for "kernelizing" EAP.

Not completely sure what this entails? Is it that the combination of Revised EVP and EVP 1D has not been tested?

@TillRasmussen (Contributor, Author)

Revised evp should work


@eclare108213 (Contributor)

> The rEVP question needs a github issue for follow up -- can this vectorization technique be run for that case, and if so, does it work in the current code? I think there's already an issue for "kernelizing" EAP.
>
> Not completely sure what this entails? Is it that the combination of Revised EVP and EVP 1D has not been tested?

Yes, if rEVP and 1D can be run together, then test to make sure it works. Otherwise put an abort in the code in case they're both turned on.
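A sketch of such an abort check (in the real code this would live in input_data in ice_init.F90 and use abort_ice; this standalone version uses stop and hard-codes the flags for illustration):

  program revp_1d_guard_sketch
    implicit none
    logical :: revised_evp
    character(len=32) :: evp_algorithm
    revised_evp   = .true.
    evp_algorithm = 'shared_mem_1d'
    ! abort if the untested combination is requested
    if (revised_evp .and. trim(evp_algorithm) == 'shared_mem_1d') then
       stop 'ERROR: revised EVP untested with evp_algorithm = shared_mem_1d'
    endif
  end program revp_1d_guard_sketch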

Thanks for summarizing all the changes in the PR. I'd like to better understand which, if any, change the answers when 1D EVP is turned off. @apcraig's testing showed regression failures for alt04, 1x1 and 2x2 blocks, and unit tests, plus some that timed out due to the low optimization level. Please confirm that you're confident that all of these failures are expected, and that none of them affect the 2D EVP results. My preference would be to find and fix the problem causing snow temperatures to explode, but I'm fine with merging this PR and then fixing that problem. Thanks everyone -- 1D EVP has been a huge effort!

@apcraig (Contributor) commented Jul 20, 2021

Regarding the testing and failed tests. All of the 2d tests are bit-for-bit with this PR. This PR has no impact on the standard 2d results which are most of our test suite.

The failed regression tests in the 1d (with this PR) vs 2d (with master) with debug on are also all expected. It's also only the regression (comparison with baseline) part of the tests that fails; the tests themselves pass. The alt04 test fails because it has the 1d evp on by default and we have changed answers. The 1x1 and 2x2 tests fail because my sandboxes did not include the recent fixes. The unit tests fail because there continue to be a couple of nagging (but still bit-for-bit) issues in the regression testing on unit tests. The tests that timed out could be rerun with longer submission times, and I will try to do some of that today to see.

The aborted QC test case is unfortunate. It could be that the roundoff errors introduced by the 1d evp have changed the state of the solution such that an abort is created. I don't think this is the most likely case, because then we'd be encountering this with 2d cases as well. It could be that there are still some differences in 1d and 2d that would appear if we ran both cases with debug on for 5 years; we really only tested bit-for-bit with debug on for days or months. However, if all we are doing is introducing a roundoff difference, why is the model aborting? It should be more robust than that. It's surprising to me, with all the testing that we've done, that the QC evp 1d run aborted.

@apcraig (Contributor) commented Jul 20, 2021

I submitted most of the 1d tests that timed out for intel and gnu with more time and they all pass including regression vs 2d. So I think all the 1d vs 2d debug results are accounted for now with no outstanding issues. The only known issue is the QC failure after 4 years (out of a 5 year run).

@TillRasmussen (Contributor, Author)

> I think I have something working. I will need to test more. It turns out we need to add a skiptmask like there is a skipumask. We need to be able to separate logic for doing stress and stepu computations.
>
> Given that Till is away for several weeks, would it make sense for me to create a new PR with the latest updates including everything in this PR, to delete this PR, and for us to merge a new PR. Alternatively (and my preference) would be that I PR to Till's branch, that gets merged into this PR, and that we merge this PR to the Consortium. @srethmeier, are you able to merge a PR to Till's branch if I create it? I know you are also busy and about to go on vacation. Let me know what would work best. My preference is to get this merged once it's working rather than wait a month when everyone is back.

@apcraig Thank you for implementing this. I agree that a skiptmask is needed.

@TillRasmussen (Contributor, Author)

> [quoting @apcraig's summary comment above in full]

Done

@TillRasmussen (Contributor, Author) commented Aug 1, 2021

I can confirm that some tests fail due to wall time on the DMI system as well. All were fixed by increasing wall time.
I would also recommend that this merge is accepted. The only remaining issue is the QC test, which we will look at. We should open an issue that explains this; I think that would be clearer than continuing this pull request.

  • This should not change the 2d solution, and it did not do so before the changes by @apcraig. I will rerun the test with the new changes.
  • The difference between the 1d and the 2d solution is only in the subcycling loop. The point is to avoid memory usage from calling the two functions multiple times and to avoid repeated MPI communications, both of which scale poorly. The conversion between 1d and 2d has some overhead, so the benefit may not be that clear in a test domain like gx3, which is relatively small and may not be memory bound the same way as larger domains. This conversion is only needed as long as CICE is a 2d model. (See the outline sketch after this list.)
  • A scientific improvement would be that it should be feasible to increase the number of subcycling iterations, as recent literature has suggested, to 500+ at limited cost.
  • EAP is not currently implemented. An implementation would reuse the conversion from 2d to 1d; the functions within would need to be rewritten to 1d.
  • ESMF now enables separation of different modules (ocean, sea ice, atmosphere, ...) onto different nodes with different hybrid versions of OMP+MPI. This should enable running CICE on its own node and thereby allow CICE to run on OMP only. I have not tested this. Further modularization/splitting of CICE could increase the usage of this. Testing of the latter is not planned.
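The outline sketch referenced in the list above: convert once, subcycle without MPI, convert back (routine names are illustrative, not the actual ice_dyn_evp_1d.F90 interfaces):

  program evp1d_outline_sketch
    implicit none
    integer, parameter :: ndte = 120   ! subcycles per dynamics step
    integer :: ksub
    call copyin_2d_to_1d()   ! gather + 2d->1d conversion, once
    do ksub = 1, ndte        ! subcycling loop: no MPI inside
       call stress_1d()      ! OpenMP-parallel stress update
       call stepu_1d()       ! OpenMP-parallel momentum update
    enddo
    call copyout_1d_to_2d()  ! 1d->2d conversion + scatter, once
  contains
    subroutine copyin_2d_to_1d()
    end subroutine
    subroutine stress_1d()
    end subroutine
    subroutine stepu_1d()
    end subroutine
    subroutine copyout_1d_to_2d()
    end subroutine
  end program evp1d_outline_sketch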

@apcraig (Contributor) commented Aug 4, 2021

Does anyone object to merging this at this point? We should create a new issue as part of the merge to track outstanding issues.

@eclare108213 (Contributor)

I'm okay with merging this, although the QC failure is a nagging issue. Does it mean that the code may abort for any user trying to run with 1Devp, or is that just a failure in the QC software?

@apcraig (Contributor) commented Aug 4, 2021

The QC failure is an abort in the ice model, nothing related to the QC test as far as I can tell. It took 4 years to hit it. If we believe the 2d implementation is robust, then this suggests there may still be a bug in the 1d implementation that's possibly triggered only on timescales of months to years. All our bit-for-bit testing with debug flags has been shorter than a year. We could run QC with debug flags on with 1d and 2d and see what happens. My concern is the time to completion in that case with reduced optimization, running 1 degree for 5 years. But it's probably worth doing. I propose we add that idea to the issue that we'll create.

@eclare108213 (Contributor)

I guess you can't restart the QC from shortly before the crash with debug flags on, and expect it to crash again. Would that be worth a try?

@apcraig (Contributor) commented Aug 4, 2021

@eclare108213, we could try something like that, but no guarantees it'll help. There are probably a number of steps to try to isolate the problem. The first thing I might try is to turn debug on with 1d and 2d and run 5 years, 1 year at a time with restarts, just to see if the models are bit-for-bit throughout and to see if the 1d with debug also fails. Then we might try running evp1d with optimization but threading off. This might provide insight into a possible OpenMP issue. Depending on what we learn there, we might create a case with a restart just before the failure and then start debugging the actual abort.

@eclare108213 (Contributor)

That's quite a debugging project. It needs to be done, but I don't think it needs to be done before this particular PR is merged.

@TillRasmussen (Contributor, Author)

I will try to rerun the qc test and output more diagnostics.

@eclare108213 (Contributor)

@dabail10 Is this the type of error you get for CFL violations in CESM?

  [zap_snow_temperature log quoted above]

I'm wondering if the 1D evp QC test just happens to be hitting one of these, and the 2D case barely misses it. (The "incremental" part of incremental remap assumes that ice moves no farther than 1 grid cell in 1 time step.) If restarting the 1D case with a reduced timestep runs through this point, CFL could be the culprit -- there might not be anything wrong with the 1D evp implementation at all, just unlucky.

@apcraig Here's a suggestion for the warning system. Is it possible to have it call print_state for the failing grid cell, just for the aborts that are from solutions becoming physically unreasonable, like this one? (See the sketch below.)
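A sketch of what that hook might look like in the driver code (print_state(plabel, i, j, iblk) exists in ice_diagnostics; wiring it to the failing cell before the abort is the suggestion, and the surrounding pattern here is an assumption, not existing code):

  ! sketch: dump the offending cell's state before aborting
  if (icepack_warnings_aborted(subname)) then
     call print_state('state at abort: '//subname, i, j, iblk)
     call abort_ice(error_message=subname, file=__FILE__, line=__LINE__)
  endif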

@apcraig (Contributor) commented Aug 5, 2021

It could be a CFL issue, but if it were a matter of being "unlucky", I think we'd be seeing it more often in the 2d. I think it's unlikely that the 2d is quite robust for lots of configurations while the 1d just gets unlucky on the first long run we do.

@eclare108213, I like your idea of trying to leverage print_state more. I have created an issue, #622.

@apcraig apcraig mentioned this pull request Aug 5, 2021
@apcraig apcraig merged commit 0ccdea1 into CICE-Consortium:master Aug 5, 2021
@apcraig (Contributor) commented Aug 5, 2021

See #623 for followup discussion.

@TillRasmussen TillRasmussen mentioned this pull request Nov 8, 2023
Successfully merging this pull request may close these issues:

evp kernel version 2 testing and validation