VP exact restart and other nonBFB problems #518

Closed
apcraig opened this issue Sep 25, 2020 · 7 comments
apcraig (Contributor) commented Sep 25, 2020

See also #491.

I ran a decomp test suite with evp, eap, and vp-picard. I did not test vp-anderson, as that is not ready to use out of the box. Below are the results. evp and eap pass all tests with the various decomps. However, vp-picard is more of a mixed bag: some configurations don't run at all, and some don't restart exactly. The test suite runs the same tests, changing only the dynamics option. Looking at one of the failed runs, I see

(abort_ice)ABORTED: 
(abort_ice) error = (horizontal_remap)ERROR: bad departure points
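
For context, this abort comes from the incremental remapping transport scheme, which requires each backward-trajectory departure point to stay within roughly one grid cell of its starting point; larger displacements usually mean the velocity field coming out of the (possibly non-converged) solver is unphysical. A minimal sketch of that kind of check, with illustrative names (dpx, dpy, dxu, dyu) that are assumptions rather than the actual code in CICE's horizontal_remap:

    ! Illustrative sketch only -- the real check lives in CICE's
    ! horizontal_remap; all names here are assumptions.
    if (abs(dpx) > dxu(i,j,iblk) .or. abs(dpy) > dyu(i,j,iblk)) then
       ! Departure point left the neighborhood of its cell: the velocity
       ! is unphysical, so abort rather than remap garbage.
       call abort_ice(subname//' ERROR: bad departure points')
    endif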

PASS cheyenne_intel_restart_gx3_4x2x25x29x4_dslenderX2 run 10.46 2.72 5.35
PASS cheyenne_intel_restart_gx3_4x2x25x29x4_dslenderX2 test 
PASS cheyenne_intel_restart_gx3_1x1x50x58x4_droundrobin_thread run 38.01 8.49 21.03
PASS cheyenne_intel_restart_gx3_1x1x50x58x4_droundrobin_thread test 
PASS cheyenne_intel_restart_gx3_4x1x25x116x1_dslenderX1_thread run 12.50 2.70 6.91
PASS cheyenne_intel_restart_gx3_4x1x25x116x1_dslenderX1_thread test 
PASS cheyenne_intel_restart_gx3_6x2x4x29x18_dspacecurve run 12.03 4.06 4.37
PASS cheyenne_intel_restart_gx3_6x2x4x29x18_dspacecurve test 
PASS cheyenne_intel_restart_gx3_8x2x8x10x20_droundrobin run 7.95 2.58 2.51
PASS cheyenne_intel_restart_gx3_8x2x8x10x20_droundrobin test 
PASS cheyenne_intel_restart_gx3_6x2x50x58x1_droundrobin run 11.20 2.46 6.08
PASS cheyenne_intel_restart_gx3_6x2x50x58x1_droundrobin test 
PASS cheyenne_intel_restart_gx3_4x2x19x19x10_droundrobin run 11.25 3.11 4.81
PASS cheyenne_intel_restart_gx3_4x2x19x19x10_droundrobin test 
PASS cheyenne_intel_restart_gx3_1x20x5x29x80_dsectrobin_short run 30.55 13.72 7.23
PASS cheyenne_intel_restart_gx3_1x20x5x29x80_dsectrobin_short test 
PASS cheyenne_intel_restart_gx3_16x2x5x10x20_drakeX2 run 4.87 1.67 1.78
PASS cheyenne_intel_restart_gx3_16x2x5x10x20_drakeX2 test 
PASS cheyenne_intel_restart_gx3_8x2x8x10x20_droundrobin_maskhalo run 7.32 2.31 2.42
PASS cheyenne_intel_restart_gx3_8x2x8x10x20_droundrobin_maskhalo test 
PASS cheyenne_intel_restart_gx3_1x4x25x29x16_droundrobin run 28.46 9.31 17.47
PASS cheyenne_intel_restart_gx3_1x4x25x29x16_droundrobin test 

PASS cheyenne_intel_restart_gx3_4x2x25x29x4_dslenderX2_dyneap run 32.14 24.53 5.23
PASS cheyenne_intel_restart_gx3_4x2x25x29x4_dslenderX2_dyneap test 
PASS cheyenne_intel_restart_gx3_1x1x50x58x4_droundrobin_dyneap_thread run 110.39 80.04 21.20
PASS cheyenne_intel_restart_gx3_1x1x50x58x4_droundrobin_dyneap_thread test 
PASS cheyenne_intel_restart_gx3_4x1x25x116x1_dslenderX1_dyneap_thread run 36.50 26.83 6.77
PASS cheyenne_intel_restart_gx3_4x1x25x116x1_dslenderX1_dyneap_thread test 
PASS cheyenne_intel_restart_gx3_6x2x4x29x18_dspacecurve_dyneap run 46.75 38.86 4.30
PASS cheyenne_intel_restart_gx3_6x2x4x29x18_dspacecurve_dyneap test 
PASS cheyenne_intel_restart_gx3_8x2x8x10x20_droundrobin_dyneap run 24.54 19.22 2.75
PASS cheyenne_intel_restart_gx3_8x2x8x10x20_droundrobin_dyneap test 
PASS cheyenne_intel_restart_gx3_6x2x50x58x1_droundrobin_dyneap run 33.14 24.53 5.95
PASS cheyenne_intel_restart_gx3_6x2x50x58x1_droundrobin_dyneap test 
PASS cheyenne_intel_restart_gx3_4x2x19x19x10_droundrobin_dyneap run 36.26 28.27 4.64
PASS cheyenne_intel_restart_gx3_4x2x19x19x10_droundrobin_dyneap test 
PASS cheyenne_intel_restart_gx3_1x20x5x29x80_dsectrobin_dyneap_short run 116.90 100.69 7.68
PASS cheyenne_intel_restart_gx3_1x20x5x29x80_dsectrobin_dyneap_short test 
PASS cheyenne_intel_restart_gx3_16x2x5x10x20_drakeX2_dyneap run 14.25 11.05 1.84
PASS cheyenne_intel_restart_gx3_16x2x5x10x20_drakeX2_dyneap test 
PASS cheyenne_intel_restart_gx3_8x2x8x10x20_droundrobin_dyneap_maskhalo run 23.86 18.80 2.37
PASS cheyenne_intel_restart_gx3_8x2x8x10x20_droundrobin_dyneap_maskhalo test 
PASS cheyenne_intel_restart_gx3_1x4x25x29x16_droundrobin_dyneap run 102.08 82.70 38.28
PASS cheyenne_intel_restart_gx3_1x4x25x29x16_droundrobin_dyneap test 

PASS cheyenne_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard run 15.13 7.46 5.30
FAIL cheyenne_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard test 
PASS cheyenne_intel_restart_gx3_1x1x50x58x4_droundrobin_dynpicard_thread run 47.71 18.42 20.95
PASS cheyenne_intel_restart_gx3_1x1x50x58x4_droundrobin_dynpicard_thread test 
PASS cheyenne_intel_restart_gx3_4x1x25x116x1_dslenderX1_dynpicard_thread run 15.70 6.00 6.92
PASS cheyenne_intel_restart_gx3_4x1x25x116x1_dslenderX1_dynpicard_thread test 
FAIL cheyenne_intel_restart_gx3_6x2x4x29x18_dspacecurve_dynpicard run
FAIL cheyenne_intel_restart_gx3_6x2x4x29x18_dspacecurve_dynpicard test 
FAIL cheyenne_intel_restart_gx3_8x2x8x10x20_droundrobin_dynpicard run
FAIL cheyenne_intel_restart_gx3_8x2x8x10x20_droundrobin_dynpicard test 
PASS cheyenne_intel_restart_gx3_6x2x50x58x1_droundrobin_dynpicard run 14.13 8.56 6.04
PASS cheyenne_intel_restart_gx3_6x2x50x58x1_droundrobin_dynpicard test 
FAIL cheyenne_intel_restart_gx3_4x2x19x19x10_droundrobin_dynpicard run 
FAIL cheyenne_intel_restart_gx3_4x2x19x19x10_droundrobin_dynpicard test 
FAIL cheyenne_intel_restart_gx3_1x20x5x29x80_dsectrobin_dynpicard_short run
FAIL cheyenne_intel_restart_gx3_1x20x5x29x80_dsectrobin_dynpicard_short test 
PASS cheyenne_intel_restart_gx3_16x2x5x10x20_drakeX2_dynpicard run 15.44 12.28 2.09
FAIL cheyenne_intel_restart_gx3_16x2x5x10x20_drakeX2_dynpicard test 
FAIL cheyenne_intel_restart_gx3_8x2x8x10x20_droundrobin_dynpicard_maskhalo run
FAIL cheyenne_intel_restart_gx3_8x2x8x10x20_droundrobin_dynpicard_maskhalo test 
FAIL cheyenne_intel_restart_gx3_1x4x25x29x16_droundrobin_dynpicard run
FAIL cheyenne_intel_restart_gx3_1x4x25x29x16_droundrobin_dynpicard test 
phil-blain (Member) commented:

/cc @JFLemieux73

phil-blain (Member) commented:

I will try to take a look at some of those this week.

phil-blain (Member) commented Jun 15, 2021

I'm getting back to this now (!). I ran the decomp_suite with VP dynamics (adding -s dynpicard to cice.setup). I can reproduce the failures. This is what I'm planning to do:

  • Investigate the crashes (some tests segfault because of NaN initialization; see the sketch after this list)
  • Investigate the "bad departure points" occurrences
  • Investigate the inexact restarts

and hopefully fix all of the above.
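
For context, the "NaN initialization" crashes refer to debug builds that pre-fill uninitialized reals with signaling NaNs (e.g. Intel's -init=snan,arrays), so any use of an unset variable is caught instead of silently propagating. A self-contained Fortran sketch of the technique (illustrative only, not CICE code):

    program nan_init_demo
       use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_signaling_nan, ieee_is_nan
       implicit none
       real :: work(4)
       ! Mimic what the debug flag does: pre-fill "uninitialized" storage with NaNs
       work = ieee_value(work(1), ieee_signaling_nan)
       ! Only partially initialize, as a buggy kernel might
       work(1:2) = 0.0
       ! Any reduction touching the unset elements is now visibly poisoned
       print *, 'result is NaN: ', ieee_is_nan(sum(work))
    end program nan_init_demo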

phil-blain (Member) commented:

I'll document the failures in phil-blain#39

apcraig mentioned this issue Jan 19, 2022
apcraig (Contributor, Author) commented Mar 5, 2022

The new omp_suite also shows non-BFB behavior with the dynpicard setting:

FAIL cheyenne_intel_smoke_gx3_18x1_cmplogrest_dynpicard_reprosum_run10day_thread bfbcomp cheyenne_intel_smoke_gx3_6x4_dynpicard_reprosum_run10day different-data

A different decomposition produces different answers; this case also happens to exercise OpenMP.

phil-blain (Member) commented:

Hi Tony, I did some more runs of the decomp suite on the latest main (see phil-blain#39 (comment)) and it seems all restart-related failures are gone. I think your OpenMP fixes in ice_dyn_vp in #680 were the key. Thanks a lot for that. As I wrote in #491, I had not tested the new solver under OpenMP, so thanks for starting that effort.

The decomp suite still has several different-data failures when run with dynpicard. I'll investigate these. I noticed that there are no MPI-only cases compared against other MPI-only cases, so maybe I'll start that way just to make sure.

In the meantime, maybe we could retitle this issue, since there are no restart issues anymore?

eclare108213 changed the title from "VP exact restart problems" to "VP exact restart and other nonBFB problems" on May 14, 2022
eclare108213 (Contributor) commented:

I retitled it. Let me know if you'd prefer a different title. e

phil-blain added a commit to phil-blain/CICE that referenced this issue Aug 25, 2022
The 'pgmres' subroutine implements a separate GMRES solver and is used
as a preconditioner for the FGMRES linear solver. Since it is only a
preconditioner, it was decided to skip the halo updates after computing
the matrix-vector product (in 'matvec'), for efficiency.

This leads to non-reproducibility, since the content of the non-updated
halos depends on the block / MPI distribution.

Add the required halo updates, but only perform them when we are
explicitly asking for bit-for-bit global sums, i.e. when 'bfbflag' is

Adjust the interfaces of 'pgmres' and 'precondition' (from which
'pgmres' is called) to accept 'halo_info_mask', since it is needed for
masked updates.

Closes CICE-Consortium#518
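
A minimal sketch of the guarded halo update described in the commit message, assuming CICE's generic ice_HaloUpdate interface and the existing bfbflag and maskhalo_dyn options; workspace_x/workspace_y stand in for the matvec result, and the exact diff may differ:

    ! Sketch only: ice_HaloUpdate and the field_loc/field_type constants
    ! are CICE's; the surrounding control flow is an approximation.
    if (bfbflag /= 'not') then
       ! Halo values feed the global sums, so make them
       ! decomposition-independent before pgmres uses them.
       if (maskhalo_dyn) then
          call ice_HaloUpdate(workspace_x, halo_info_mask, field_loc_NEcorner, field_type_vector)
          call ice_HaloUpdate(workspace_y, halo_info_mask, field_loc_NEcorner, field_type_vector)
       else
          call ice_HaloUpdate(workspace_x, halo_info, field_loc_NEcorner, field_type_vector)
          call ice_HaloUpdate(workspace_y, halo_info, field_loc_NEcorner, field_type_vector)
       endif
    endif

When bfbflag is 'not', the halos stay stale and pgmres remains a cheap, decomposition-dependent preconditioner; that trade-off is acceptable there because only reproducibility, not solver correctness, is affected.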
phil-blain added a commit to phil-blain/CICE that referenced this issue Oct 5, 2022

phil-blain added a commit to phil-blain/CICE that referenced this issue Oct 17, 2022