
Bug fixed 1d evp #568

Merged (41 commits) on Aug 5, 2021

Conversation

@TillRasmussen (Contributor) commented Mar 2, 2021

For detailed information about submitting Pull Requests (PRs) to the CICE-Consortium,
please refer to: https://github.com/CICE-Consortium/About-Us/wiki/Resource-Index#information-for-developers

PR checklist

  • Short (1 sentence) summary of your PR:
    Removes all bad departure point bugs
    Copied from @srethmeier
    The issue with departure errors, resulting from default initialization of full land blocks in the gathering part of the EVP 1D implementation. To solve this, and to avoid gathering and scattering constant arrays, the HTE and HTN arrays (including ghost cells) are now, for the EVP 1D implementation, allocated and given their values during initialization
    The issue with tests not being bit-for-bit comparable when compiled with debug/-O0 optimization, resulting from the order of computations for computing tinyarea not being identical between the EVP 2D implementation and the EVP 1D implementation. This was solved by making the EVP 1D computation order identical to the EVP 2D computation order
    The issue with tests not being bit-for-bit comparable when compiled with debug/-O0 optimization, resulting from icetmask not being a subset of iceumask, as initially assumed in the EVP 1D implementation. This was solved by also adding a skiptcell logical array for skipping stress calculations in the EVP 1D implementation (see the sketch below)
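    A minimal, self-contained sketch of the skiptcell idea (illustrative only; the actual loop structure in ice_dyn_evp_1d.F90 differs, and the mask values here are made up):

      program skiptcell_sketch
        implicit none
        integer, parameter :: nactive = 8
        logical :: skiptcell(nactive)
        integer :: iw
        ! .true. marks 1d points with no T-cell ice (icetmask not set)
        skiptcell = (/ .false., .true., .false., .false., &
                       .true., .true., .false., .false. /)
        do iw = 1, nactive
           if (skiptcell(iw)) cycle   ! skip the stress computation here
           print *, 'stress computed for 1d index', iw
        enddo
      end program skiptcell_sketch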

Then we made some modifications to the OpenMP parts of the EVP 1D implementation. The core part (stress and stepu) was implemented to be NUMA-aware for performance. This was originally done differently for the interfacing part of the implementation, but it is now aligned throughout the EVP 1D implementation so that all of it is NUMA-aware.
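To illustrate the NUMA-aware pattern mentioned above, here is a minimal first-touch sketch (names and sizes are illustrative, not the CICE code): the loop that initializes an array uses the same OpenMP schedule as the loop that later computes on it, so each thread's pages are allocated on its own NUMA node.

  program first_touch_sketch
    implicit none
    integer, parameter :: n = 1000000
    real(kind=8), allocatable :: stressp_1d(:)
    integer :: iw
    allocate(stressp_1d(n))
    ! First touch: each thread initializes the index range it will
    ! later compute on, placing pages on its own NUMA node.
    !$OMP PARALLEL DO SCHEDULE(static)
    do iw = 1, n
       stressp_1d(iw) = 0.0d0
    enddo
    !$OMP END PARALLEL DO
    ! A compute loop with the same schedule touches the same pages locally.
    !$OMP PARALLEL DO SCHEDULE(static)
    do iw = 1, n
       stressp_1d(iw) = stressp_1d(iw) + 1.0d0
    enddo
    !$OMP END PARALLEL DO
    print *, 'done:', stressp_1d(1)
  end program first_touch_sketch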

As for namelist changes, the only change made is the renaming of kevp_kernel to evp_algorithm and changing it from an integer to a string. EVP 2D is now enabled by setting evp_algorithm = 'standard_2d' instead of kevp_kernel = 0, and EVP 1D by setting evp_algorithm = 'shared_mem_1d' instead of kevp_kernel = 102. In connection with this, the option set_nml.evp1d was also added. Documentation has also been updated to reflect this modification.
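For example, switching the solver would look like this in the ice_in namelist (the group name dynamics_nml is an assumption; the evp_algorithm values are the ones introduced by this PR):

  &dynamics_nml
    kdyn          = 1                 ! EVP dynamics
    evp_algorithm = 'shared_mem_1d'   ! 1D solver (was kevp_kernel = 102)
  ! evp_algorithm = 'standard_2d'     ! 2D solver (was kevp_kernel = 0)
  /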

Finally, we did a full cleanup of ice_dyn_evp_1d.F90 and some cleanup of ice_dyn_evp.F90. This has mainly included:

Fixing indentations
Aligning variable names with the rest of the code base
Removing old code blocks that were used during development but are no longer needed

This implementation should work for revised evp and "traditional" evp. The former should still be tested.

  • Developer(s):
    @TillRasmussen @srethmeier (Stefan Rethmeier, DMI)
  • Suggest PR reviewers from list in the column to the right.
  • @eclare108213 @apcraig @phil-blain @mhrib
  • Please copy the PR test results link or provide a summary of testing completed below.
    Removes the bad departure point bug.
    Failed test: "restart gbox128 4x2". This test runs but fails to restart exactly.
    ./results.csh | grep FAIL
    FAIL freya_intel_restart_gbox128_4x2_kevp102_short test
    FAIL freya_intel_logbfb_gx3_1x1x50x58x4_diag1_droundrobin_kevp102_maskhalo_reprosum_thread bfbcomp freya_intel_logbfb_gx3_4x2x25x29x4_diag1_dslenderX2_reprosum different-data
    FAIL freya_intel_logbfb_gx3_4x1x25x116x1_diag1_dslenderX1_kevp102_maskhalo_reprosum_thread bfbcomp freya_intel_logbfb_gx3_4x2x25x29x4_diag1_dslenderX2_reprosum different-data
    FAIL freya_intel_logbfb_gx3_1x20x5x29x80_diag1_dsectrobin_kevp102_reprosum_short bfbcomp freya_intel_logbfb_gx3_4x2x25x29x4_diag1_dslenderX2_reprosum different-data
    FAIL freya_intel_logbfb_gx3_8x2x8x10x20_diag1_droundrobin_kevp102_reprosum bfbcomp freya_intel_logbfb_gx3_4x2x25x29x4_diag1_dslenderX2_reprosum different-data
    FAIL freya_intel_logbfb_gx3_6x2x50x58x1_diag1_droundrobin_kevp102_reprosum bfbcomp freya_intel_logbfb_gx3_4x2x25x29x4_diag1_dslenderX2_reprosum different-data
    FAIL freya_intel_logbfb_gx3_6x2x4x29x18_diag1_dspacecurve_kevp102_maskhalo_reprosum bfbcomp freya_intel_logbfb_gx3_4x2x25x29x4_diag1_dslenderX2_reprosum different-data
    FAIL - test failed
    7 of 390 tests FAILED
  • How much do the PR code changes differ from the unmodified code?
    • [ ] bit for bit
    • [ ] different at roundoff level
    • [x] more substantial
  • Does this PR create or have dependencies on Icepack or any other models?
    • [ ] Yes
    • [x] No
  • Does this PR add any new test cases?
    • [ ] Yes
    • [x] No
  • Is the documentation being updated? ("Documentation" includes information on the wiki or in the .rst files from doc/source/, which are used to create the online technical docs at https://readthedocs.org/projects/cice-consortium-cice/. A test build of the technical docs will be performed as part of the PR testing.)
    • [ ] Yes
    • [x] No, does the documentation need to be updated at a later time?
      • [x] Yes
      • [ ] No
  • Please provide any additional information or relevant details below:
    This is a bug fix. The 1d evp solver speeds up the evp solver substantially but is, as expected, not bit-for-bit reproducible. If it is chosen, the solver will use OMP only. The bug was in the gather method when masked blocks were neighbours of an ocean point. When derivatives were found, they got the special value, which by default is a very big number. Velocities and stresses are set to zero, whereas HTE and HTN store a global 2d array.
    There are alternative choices to fix the bug:
    a/ fill the halo zones in the gather routine with values of neighbouring blocks.
    b/ merge the conversion from 2d to 1d and the gather routine.
    A bit of cleaning should be done.
    There should be a test including 1d evp.

srethmeier and others added 3 commits March 1, 2021 14:03
…r task to avoid gathering constant arrays when running EVP 1D implementation. This solves issue with departure errors resulting from erroneous initialization of full land blocks in EVP 1D gathering. Correct initialization of remaining variables for full land blocks in EVP 1D gathering.
…ge of file. Machine 'freya' was double in file.
@apcraig (Contributor) commented Mar 2, 2021

I can run a test suite on this before we merge with the 1d evp turned on. Let me know when you think it's ready. It seems there are at least a couple things still to do,

  • decide if we are happy with the approach
  • clean up
  • update documentation
  • add test case

@TillRasmussen (Contributor, Author)

I updated with the output from the full test suite. It shows (as expected) that binary identical results should not be expected.
Jacob and Mads wrote a report which should become an article about this. This showed that bfb should only be expected when low optimization flags were chosen.

@TillRasmussen (Contributor, Author) commented Mar 3, 2021

Note that all tests (except one) fail when compared to
freya_intel_logbfb_gx3_4x2x25x29x4_diag1_dslenderX2_reprosum (this is with kevp_kernel=0).
Restart files are binary identical when compared with
freya_intel_logbfb_gx3_4x2x25x29x4_diag1_dslenderX2_kevp102_reprosum.test0-kevp102/
(kevp_kernel=102).
Conclusion: when comparing against kevp_kernel=102, the only remaining issue is
"restart gbox128 4x2". This test runs but fails to restart exactly.
@apcraig It would be nice if you could run a test.

@eclare108213 (Contributor) left a review:

This looks pretty straightforward. Are there also documentation changes needed? Suggestions for code modifications below. I'm a bit confused about the state of the tests. Are you saying that they should all not be BFB, and there's only one test that doesn't complete and still needs a closer look? It would be a good idea to run the QC tests comparing standard 2D evp with the 1D version (or did we already do that?). Thanks for figuring out the issues here.

! Initialize global primary grid lengths array with ghost cells from
! global primary grid lengths array

subroutine primary_grid_lengths_global_ext(ARRAY_O, ARRAY_I)
@eclare108213 (Contributor):
It makes more sense to me for this routine to be put in the boundary modules, since it fills the ghost cells. This is all done on master task so MPI doesn't matter, and yet it would need to be put in both mpi and serial directories... That's the down side of how the code is currently structured, but I think we should stick with it for this, otherwise things can get really confusing.

call gather_global_ext(G_stress12_2, I_stress12_2, master_task, distrb_info)
call gather_global_ext(G_stress12_3, I_stress12_3, master_task, distrb_info)
call gather_global_ext(G_stress12_4, I_stress12_4, master_task, distrb_info)
call gather_global_ext(G_icetmask, I_icetmask, master_task, distrb_info, 0 )
@eclare108213 (Contributor):
It's curious that icetmask has 0 at the end of the argument list and iceumask has .false.! I see that is how they are defined. No need to fix it here...
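A toy sketch of why the two fill values differ (this is an assumption about the shape of the overloaded gather_global_ext interface, not the actual ice_gather_scatter source): each specific routine takes a fill value matching its array type, hence the integer 0 for icetmask and the logical .false. for iceumask.

  module gather_sketch
    implicit none
    interface gather_global_ext
      module procedure gather_int, gather_log
    end interface
  contains
    ! Integer arrays (e.g. icetmask) take an integer fill value ...
    subroutine gather_int(dst, src, fill)
      integer, intent(out) :: dst(:)
      integer, intent(in)  :: src(:), fill
      dst = fill; dst(1:size(src)) = src
    end subroutine
    ! ... while logical arrays (e.g. iceumask) take a logical one.
    subroutine gather_log(dst, src, fill)
      logical, intent(out) :: dst(:)
      logical, intent(in)  :: src(:), fill
      dst = fill; dst(1:size(src)) = src
    end subroutine
  end module gather_sketch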

@@ -314,6 +315,7 @@ subroutine input_data
ndtd = 1 ! dynamic time steps per thermodynamic time step
ndte = 120 ! subcycles per dynamics timestep: ndte=dt_dyn/dte
kevp_kernel = 0 ! EVP kernel (0 = 2D, >0: 1D. Only ver. 2 is implemented yet)
pgl_global_ext = .false. ! if true, init primary grid lebgths (global ext.)
@eclare108213 (Contributor):
spell lengths

@@ -314,6 +315,7 @@ subroutine input_data
ndtd = 1 ! dynamic time steps per thermodynamic time step
ndte = 120 ! subcycles per dynamics timestep: ndte=dt_dyn/dte
kevp_kernel = 0 ! EVP kernel (0 = 2D, >0: 1D. Only ver. 2 is implemented yet)
@eclare108213 (Contributor):
The comment here is kind of confusing. Does kevp_kernel=102 mean 1D, version 02? Do we have a version 1? Should we simplify the kevp_kernel choices? I remember that we chose 102 because it wasn't really ready, so this was a way to 'hide' it from users, or at least make it less likely they'd set it up.

@TillRasmussen (Contributor, Author) replied:
Originally, several versions of the 1d evp were created. Version 1 included most of the refactoring. Version 2 moved derived grid parameters of HTE and HTN to the 1d solver as scalars.
Version 3 changed some of the variables from real*8 to real*4. The influence of this was limited, especially when compared to the accuracy of the iteration. In the end only v2 was implemented. I think that kevp_kernel could be included in kdyn as option 4; I have not thought this through.

@mhrib (Contributor) commented Mar 6, 2021:

Right.
102 = "our version 2".
Version 1 and version 2 gave identical results (maybe except for really aggressive flags, I do not recall exactly), but v2 only takes about half the memory and is a bit faster. For conservative flags we were also able to produce BFB results.
That's not the case for v3, where many internal variables were calculated as real*4. But it was again faster and took up even less memory. As I recall, the uncertainties calculated using uvel, vvel were less than or comparable to the uncertainties obtained across different computational architectures.
I think v3 is an interesting exercise and worth considering for special cases to reduce memory load (and gain speed as well).

@TillRasmussen (Contributor, Author)

Sorry about the confusion. I may have confused myself a bit as well.
Main conclusions:
We have removed all bad departure point bugs.
If compiler flags are set to standard (see machine freya), then all tests run to the end, but there are numerical differences when testing kevp=102.
If the compile flag “-no-vec” is added, all tests are successful. It was not expected that numerical differences would be observed when testing with kevp=102, but this means that the order of calculations matters to the result.
We will test the significance of these numerical differences by running a 1-year simulation and comparing concentration and thickness between the tests.

This test suite was used:

./cice.setup --suite first_suite,base_suite,travis_suite,decomp_suite,reprosum_suite,quick_suite --mach freya --env intel --testid test8-kevp102-0 --bgen baseline8

The following results were achieved.

EVP-2d
379 measured results of 379 total results
377 of 379 tests PASSED
2 of 379 tests PENDING
0 of 379 tests MISSING data
0 of 379 tests FAILED
The two pending tests ask for more processors per node than our HPC system has (36), so they never start.

EVP-1d

Here we have 7 unsuccessful tests. They all complete their runs but fail the comparison with other runs:

FAIL freya_intel_restart_gbox128_4x2_short test
FAIL freya_intel_restart_gx3_1x1x50x58x4_droundrobin_thread bfbcomp freya_intel_restart_gx3_4x2x25x29x4_dslenderX2 different-data
FAIL freya_intel_restart_gx3_4x1x25x116x1_dslenderX1_thread bfbcomp freya_intel_restart_gx3_4x2x25x29x4_dslenderX2 different-data
FAIL freya_intel_restart_gx3_1x20x5x29x80_dsectrobin_short bfbcomp freya_intel_restart_gx3_4x2x25x29x4_dslenderX2 different-data
FAIL freya_intel_restart_gx3_1x4x25x29x16_droundrobin bfbcomp freya_intel_restart_gx3_4x2x25x29x4_dslenderX2 different-data
FAIL freya_intel_logbfb_gx3_1x20x5x29x80_diag1_dsectrobin_reprosum_short bfbcomp freya_intel_logbfb_gx3_4x2x25x29x4_diag1_dslenderX2_reprosum different-data
FAIL freya_intel_smoke_gx3_2x1_run2day_thread bfbcomp freya_intel_smoke_gx3_1x2_run2day different-data

All 7 failures are gone if the compile flag “-no-vec” is turned on.

Debug
We found two bugs when adding the debug flag to the compilation:

1/ spacecurves. This was found in #560 as well

2/ The second is the call tt = mod(ftime/secday, dayyr). This occurs during initialization, where I expect dayyr to be 0 (I have not checked), which would result in a division by zero.

forrtl: error (65): floating invalid
Image PC Routine Line
cice 0000000000B5BAB1 ice_forcing_mp_in 1242 ice_forcing.F90
cice 0000000000C159D8 ice_forcing_bgc_m 184 ice_forcing_bgc.F90
cice 00000000011AC027 ice_init_column_m 876 ice_init_column.F90
cice 000000000041118C cice_initmod_mp_i 439 CICE_InitMod.F90
cice 000000000040BC9C cice_initmod_mp_c 174 CICE_InitMod.F90
cice 000000000040ABAB cice_initmod_mp_c 52 CICE_InitMod.F90
cice 000000000040A391 MAIN__ 43
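If dayyr really is zero at that point, a guard like the following would avoid the floating invalid (a sketch only; the variable names come from the call above, the guard itself is not in the code):

  program dayyr_guard_sketch
    implicit none
    real(kind=8) :: tt, ftime, secday, dayyr
    ftime = 86400.0d0; secday = 86400.0d0; dayyr = 0.0d0
    if (dayyr > 0.0d0) then
       tt = mod(ftime/secday, dayyr)
    else
       tt = 0.0d0   ! or abort with a clear message instead
    endif
    print *, 'tt =', tt
  end program dayyr_guard_sketch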

@eclare108213 (Contributor)

See comments in #575 re kevp_kernel namelist value

… to MPI and serial implementations of ice_boundary modules (CICE-Consortium#568 (comment)). Please note duplication of subroutine.
@apcraig (Contributor) commented Mar 14, 2021

Thanks for the recent updates. I think the next things to do are to get rid of "102" and switch kevp_kernel=1 for the 1d implementation (0 = standard 2d). This is as suggested in #575. Actually, what I propose is that we change kevp_kernel to a string called evp_algorithm (or something like it) and have the valid values be "standard_2d" and "shared_memory_1d" (or something like that). We want to move away from numbers for namelist entries when we can. kevp_kernel is not really in use yet, so I think it's safe to do that.

Then, we'll want to add a test to turn on the 1d solver.

We can do those things as a separate PR and I can take it on if you like. I think we want to do it before the release though.

@TillRasmussen (Contributor, Author) commented Mar 14, 2021

I can implement the evp_algorithm part with the suggested changes. One question, not directly related, but while I am at it: is the revised evp included in the eap solver?
If it is only valid for evp, then I might want to include the revised evp in the evp_algorithm namelist.

@apcraig (Contributor) commented Mar 14, 2021

I think it depends whether revised evp is its own standalone flag relative to other flags. Could we have evp with/without revised evp, with/without the 1d solver? If so, it might be nice to keep them as separate flags; otherwise we have to deal with multiple combinations of things being set by one flag.

It might be nice to create a tree of the possible dynamics solver options in the documentation, something like (this is proto code only, not necessarily complete/correct). This assumes it adds to the understanding of what things go together in terms of setting up the dynamics. Just an idea.

  • kdyn=0 (none)
    • ...
  • kdyn=1 (evp)
    • revised_evp=.true.
      • evp_algorithm="standard_2d"
      • evp_algorithm="shared_memory_1d"
    • revised_evp=.false.
      • evp_algorithm="standard_2d"
      • evp_algorithm="shared_memory_1d"
  • kdyn=2 (eap)
    • ...
  • kdyn=3 (implicit)
    • algo_nonlin='picard'
      • precond='pgmres'
    • ...

@eclare108213 (Contributor)

If I remember correctly, revp was set up so that it could be used with eap, but I don't think that was ever actually implemented and tested. So maybe it wouldn't work. At any rate, revp seems like a fundamentally different approach than the 1D kernel -- it's more about how the subcycling iteration is handled, while kevp is more about the vectorization, right? I'd keep them separate. I like @apcraig's outline of dynamics options for the documentation. This outline should be essentially what's coded in the verbose diagnostics.

Clean up. Changed namelist parameter kevp to evp_algorithm
@apcraig (Contributor) commented Jul 19, 2021

I will do a QC test of 1d vs 2d with standard optimization and report back.

Looking back at the PR, these modifications primarily fix issues in the 1d evp that prevented identical results with 2d (at reduced optimization). That includes at least a fix to the tiny area computation, addition of grid_lengths_global_ext, and addition of skiptcell logic. In addition, the namelist was changed so we now have strings instead of integers to set evp_algorithm, and a bunch of other formatting (i.e. indentation) was cleaned up. There may have been a few other changes, maybe @srethmeier, @mhrib, or @TillRasmussen can comment.

The issues related to 1x1 and 2x2 blocks, calendar, and spacecurve were addressed elsewhere.

I agree there is still an outstanding issue about what parts of the dynamics work with each other (i.e. revp, eap plans, etc). I think it should also include overall discussion of control, namelist, and implementation of the dynamics. I have created a new issue, #619 that can serve as a general place holder for those issues. This is something that was raised recently in our monthly meetings as well.

@apcraig (Contributor) commented Jul 20, 2021

The qc test with the 1d evp fails at the end of the 4th year with zap snow temperature errors,

  (zap_snow_temperature)Tmin:  -100.000000000000
  (zap_snow_temperature)Tmax:  4.486434370863951E-006
  (zap_snow_temperature)zqsn:  -182365602.580354
  (zap_snow_temperature)zap_snow_temperature: temperature out of bounds!
  (zap_snow_temperature)k:           1
  (zap_snow_temperature)zTsn:  -103.894744835945
  (zap_snow_temperature)Tmin:  -100.000000000000
  (zap_snow_temperature)Tmax:  5.419524611890995E-006
  (zap_snow_temperature)zqsn:  -182424769.766085

The baseline 2d runs fine. This suggests there still may be some issues with the 1d implementation. I actually ran it twice, once with 9x4 pes (threaded) and another time with 36x1 (not threaded) and both failed at the same time in the same way. I guess that means it's a robust problem.

I still suggest we merge this but then create a new issue noting the new problem discovered.

@srethmeier (Contributor) commented Jul 20, 2021

> [quoting @apcraig's summary comment above in full]

@apcraig, I believe that actually summarizes all major things covered in this PR. We mainly fixed three issues:

  1. The issue with departure errors, resulting from default initialization of full land blocks in the gathering part of the EVP 1D implementation. To solve this, and to avoid gathering and scattering constant arrays, the HTE and HTN arrays (including ghost cells) are now, for the EVP 1D implementation, allocated and given their values during initialization
  2. The issue with tests not being bit-for-bit comparable when compiled with debug/-O0 optimization, resulting from the order of computations for computing tinyarea not being identical between the EVP 2D implementation and the EVP 1D implementation. This was solved by making the EVP 1D computation order identical to the EVP 2D computation order
  3. The issue with tests not being bit-for-bit comparable when compiled with debug/-O0 optimization, resulting from icetmask not being a subset of iceumask, as initially assumed in the EVP 1D implementation. This was solved by also adding a skiptcell logical array for skipping stress calculations in the EVP 1D implementation

Then we made some modifications to the OpenMP parts of the EVP 1D implementation. The core part (stress and stepu) was implemented to be NUMA-aware for performance. This was originally done differently for the interfacing part of the implementation, but it is now aligned throughout the EVP 1D implementation so that all of it is NUMA-aware.

As for namelist changes, the only change made is the renaming of kevp_kernel to evp_algorithm and changing it from an integer to a string. EVP 2D is now enabled by setting evp_algorithm = 'standard_2d' instead of kevp_kernel = 0, and EVP 1D by setting evp_algorithm = 'shared_mem_1d' instead of kevp_kernel = 102. In connection with this, the option set_nml.evp1d was also added. Documentation has also been updated to reflect this modification.

Finally, we did a full cleanup of ice_dyn_evp_1d.F90 and some cleanup of ice_dyn_evp.F90. This has mainly included:

  • Fixing indentations
  • Aligning variable names with the rest of the code base
  • Removing old code blocks that were used during development but are no longer needed

Maybe @TillRasmussen can update the PR checklist with this, so that it is at the top of the PR? I can't modify it.

@srethmeier (Contributor)

> The qc test with the 1d evp fails at the end of the 4th year with zap snow temperature errors (log above). The baseline 2d runs fine. This suggests there still may be some issues with the 1d implementation. I actually ran it twice, once with 9x4 pes (threaded) and another time with 36x1 (not threaded), and both failed at the same time in the same way. I guess that means it's a robust problem.

Okay, looks like there still is something we need to look at here. I'll try and dig into it next week.

> I still suggest we merge this but then create a new issue noting the new problem discovered.

Sounds good to me. Whatever you prefer and find to be the best approach.

@srethmeier (Contributor) commented Jul 20, 2021

> The rEVP question needs a github issue for follow up -- can this vectorization technique be run for that case, and if so, does it work in the current code? I think there's already an issue for "kernelizing" EAP.

Not completely sure what this entails? Is it that the combination of Revised EVP and EVP 1D has not been tested?

@TillRasmussen (Contributor, Author)

Revised evp should work


@eclare108213 (Contributor)

> The rEVP question needs a github issue for follow up -- can this vectorization technique be run for that case, and if so, does it work in the current code? I think there's already an issue for "kernelizing" EAP.
>
> Not completely sure what this entails? Is it that the combination of Revised EVP and EVP 1D has not been tested?

Yes, if rEVP and 1D can be run together, then test to make sure it works. Otherwise put an abort in the code in case they're both turned on.
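A sketch of such an abort check (in the real code this would live in input_data in ice_init.F90 and use abort_ice; this standalone version uses stop and hard-codes the flags for illustration):

  program revp_1d_guard_sketch
    implicit none
    logical :: revised_evp
    character(len=32) :: evp_algorithm
    revised_evp   = .true.
    evp_algorithm = 'shared_mem_1d'
    ! abort if the untested combination is requested
    if (revised_evp .and. trim(evp_algorithm) == 'shared_mem_1d') then
       stop 'ERROR: revised EVP untested with evp_algorithm = shared_mem_1d'
    endif
  end program revp_1d_guard_sketch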

Thanks for summarizing all the changes in the PR. I'd like to better understand which, if any, change the answers when 1D EVP is turned off. @apcraig's testing showed regression failures for alt04, 1x1 and 2x2 blocks, and unit tests, plus some that timed out due to the low optimization level. Please confirm that you're confident that all of these failures are expected, and that none of them affect the 2D EVP results. My preference would be to find and fix the problem causing snow temperatures to explode, but I'm fine with merging this PR and then fixing that problem. Thanks everyone -- 1D EVP has been a huge effort!

@apcraig (Contributor) commented Jul 20, 2021

Regarding the testing and failed tests. All of the 2d tests are bit-for-bit with this PR. This PR has no impact on the standard 2d results which are most of our test suite.

The failed regression tests in the 1d (with this PR) vs 2d (with master) with debug on are also all expected. It's also only the regression (comparison with baseline) part of the tests that fails; the tests themselves pass. The alt04 test fails because it has the 1d evp on by default and we have changed answers. The 1x1 and 2x2 tests fail because my sandboxes did not include the recent fixes. The unit tests fail because there continue to be a couple of nagging (but still bit-for-bit) issues in the regression testing on unit tests. The tests that timed out could be rerun with longer submission times, and I will try to do some of that today to see.

The aborted QC test case is unfortunate. It could be that the roundoff errors introduced by the 1d evp have changed the state of the solution such that an abort is created. I don't think this is the most likely case, because then we'd be encountering this with 2d cases as well. It could be that there are still some differences in 1d and 2d that would appear if we ran both cases with debug on for 5 years; we really only tested bit-for-bit with debug on for days or months. However, if all we are doing is introducing a roundoff difference, why is the model aborting? It should be more robust than that. It's surprising to me, with all the testing that we've done, that the QC evp 1d run aborted.

@apcraig (Contributor) commented Jul 20, 2021

I submitted most of the 1d tests that timed out for intel and gnu with more time and they all pass including regression vs 2d. So I think all the 1d vs 2d debug results are accounted for now with no outstanding issues. The only known issue is the QC failure after 4 years (out of a 5 year run).

@TillRasmussen (Contributor, Author)

> I think I have something working. I will need to test more. It turns out we need to add a skiptmask like there is a skipumask. We need to be able to separate logic for doing stress and stepu computations.
>
> Given that Till is away for several weeks, would it make sense for me to create a new PR with the latest updates including everything in this PR, to delete this PR, and for us to merge a new PR. Alternatively (and my preference) would be that I PR to Till's branch, that gets merged into this PR, and that we merge this PR to the Consortium. @srethmeier, are you able to merge a PR to Till's branch if I create it? I know you are also busy and about to go on vacation. Let me know what would work best. My preference is to get this merged once it's working rather than wait a month when everyone is back.

@apcraig Thank you for implementing this. I agree that a skiptmask is needed.

@TillRasmussen (Contributor, Author)

> [quoting @apcraig's summary comment above in full]

Done

@TillRasmussen (Contributor, Author) commented Aug 1, 2021

I can confirm that some tests fail due to wall time on the DMI system as well. All were fixed by increasing wall time.
I would also recommend that this merge is accepted. The only remaining issue is the QC test, which we will look at. We should open an issue that explains this; I think that would be clearer than continuing this pull request.

  • This should not change the 2d solution, and it did not do so before the changes by @apcraig. I will rerun the test with the new changes.
  • The difference between the 1d and the 2d solution is only in the subcycling loop. The point is to avoid memory usage from calling the two functions multiple times and to avoid repeated MPI communications, both of which scale poorly. The conversion between 1d and 2d has some overhead, so the benefit may not be that clear in a test domain like gx3, which is relatively small and may not be memory bound the same way as larger domains. This conversion is only needed as long as CICE is a 2d model. (See the outline sketch after this list.)
  • A scientific improvement would be that it should be feasible to increase the number of subcycling iterations, as recent literature has suggested, to 500+ at limited cost.
  • EAP is not currently implemented. An implementation would reuse the conversion from 2d to 1d; the functions within would need to be rewritten to 1d.
  • ESMF now enables separation of different modules (ocean, sea ice, atmosphere, ...) onto different nodes with different hybrid versions of OMP+MPI. This should enable running CICE on its own node and thereby allow CICE to run on OMP only. I have not tested this. Further modularization/splitting of CICE could increase the usage of this. Testing of the latter is not planned.
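The outline sketch referenced in the list above: convert once, subcycle without MPI, convert back (routine names are illustrative, not the actual ice_dyn_evp_1d.F90 interfaces):

  program evp1d_outline_sketch
    implicit none
    integer, parameter :: ndte = 120   ! subcycles per dynamics step
    integer :: ksub
    call copyin_2d_to_1d()   ! gather + 2d->1d conversion, once
    do ksub = 1, ndte        ! subcycling loop: no MPI inside
       call stress_1d()      ! OpenMP-parallel stress update
       call stepu_1d()       ! OpenMP-parallel momentum update
    enddo
    call copyout_1d_to_2d()  ! 1d->2d conversion + scatter, once
  contains
    subroutine copyin_2d_to_1d()
    end subroutine
    subroutine stress_1d()
    end subroutine
    subroutine stepu_1d()
    end subroutine
    subroutine copyout_1d_to_2d()
    end subroutine
  end program evp1d_outline_sketch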

@apcraig (Contributor) commented Aug 4, 2021

Does anyone object to merging this at this point? We should create a new issue as part of the merge to track outstanding issues.

@eclare108213 (Contributor)

I'm okay with merging this, although the QC failure is a nagging issue. Does it mean that the code may abort for any user trying to run with 1Devp, or is that just a failure in the QC software?

@apcraig (Contributor) commented Aug 4, 2021

The QC failure is an abort in the ice model, nothing related to the QC test as far as I can tell. It took 4 years to hit it. If we believe the 2d implementation is robust, then this suggests there may still be a bug in the 1d implementation that's possibly triggered only on timescales of months to years. All our bit-for-bit testing with debug flags has been shorter than a year. We could run QC with debug flags on with 1d and 2d and see what happens. My concern is the time to completion in that case with reduced optimization, running 1 degree for 5 years. But it's probably worth doing. I propose we add that idea to the issue that we'll create.

@eclare108213 (Contributor)

I guess you can't restart the QC from shortly before the crash with debug flags on, and expect it to crash again. Would that be worth a try?

@apcraig (Contributor) commented Aug 4, 2021

@eclare108213, we could try something like that, but no guarantees it'll help. There are probably a number of steps to try to isolate the problem. The first thing I might try is to turn debug on with 1d and 2d and run 5 years, 1 year at a time with restarts, just to see if the models are bit-for-bit throughout and to see if the 1d with debug also fails. Then we might try running evp1d with optimization but threading off. This might provide insight into a possible OpenMP issue. Depending on what we learn there, we might create a case with a restart just before the failure and then start debugging the actual abort.

@eclare108213 (Contributor)

That's quite a debugging project. It needs to be done, but I don't think it needs to be done before this particular PR is merged.

@TillRasmussen (Contributor, Author)

I will try to rerun the qc test and output more diagnostics.

@eclare108213 (Contributor)

@dabail10 Is this the type of error you get for CFL violations in CESM?

  [zap_snow_temperature log quoted above]

I'm wondering if the 1D evp QC test just happens to be hitting one of these, and the 2D case barely misses it. (The "incremental" part of incremental remap assumes that ice moves no farther than 1 grid cell in 1 time step.) If restarting the 1D case with a reduced timestep runs through this point, CFL could be the culprit -- there might not be anything wrong with the 1D evp implementation at all, just unlucky.

@apcraig Here's a suggestion for the warning system. Is it possible to have it call print_state for the failing grid cell, just for the aborts that are from solutions becoming physically unreasonable, like this one? (See the sketch below.)
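A sketch of what that hook might look like in the driver code (print_state(plabel, i, j, iblk) exists in ice_diagnostics; wiring it to the failing cell before the abort is the suggestion, and the surrounding pattern here is an assumption, not existing code):

  ! sketch: dump the offending cell's state before aborting
  if (icepack_warnings_aborted(subname)) then
     call print_state('state at abort: '//subname, i, j, iblk)
     call abort_ice(error_message=subname, file=__FILE__, line=__LINE__)
  endif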

@apcraig (Contributor) commented Aug 5, 2021

It could be a CFL issue, but if it were a matter of being "unlucky", I think we'd be seeing it more often in the 2d. I think it's unlikely that the 2d is quite robust for lots of configurations while the 1d just gets unlucky on the first long run we do.

@eclare108213, I like your idea of trying to leverage print_state more. I have created an issue, #622.

@apcraig apcraig mentioned this pull request Aug 5, 2021
@apcraig apcraig merged commit 0ccdea1 into CICE-Consortium:master Aug 5, 2021
@apcraig (Contributor) commented Aug 5, 2021

See #623 for followup discussion.

@TillRasmussen TillRasmussen mentioned this pull request Nov 8, 2023
Successfully merging this pull request may close these issues:

evp kernel version 2 testing and validation