Append .NO suffix to the mediator restart for the C48mx500 test case on Orion/Hercules and WCOSS #2769

Closed
guillaumevernieres opened this issue Jul 16, 2024 · 14 comments
Labels: bug

@guillaumevernieres
Contributor

What is wrong?

The mediator restart on glopara Hera for the C48mx500 test case was suffixed with .NO to ensure that the model would not make use of it.
The same thing needs to be done on Orion and WCOSS.

What should have happened?

On Orion, in the glopara directory

/work/noaa/global/glopara/data/ICSDIR/C48mx500/gdas.20210324/06/model_data/med/restart/

there are 2 files: 20210324.090000.ufs.cpld.cpl.r.nc and 20210324.090000.ufs.cpld.cpl.r.nc.NO
20210324.090000.ufs.cpld.cpl.r.nc needs to be deleted.
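
A minimal sketch of that cleanup, assuming Python 3 is available on the machine and the caller has write permission in the glopara directory. The helper name `retire_unsuffixed_restart` and its `dry_run` flag are hypothetical and not part of global-workflow. It removes the unsuffixed file only when the `.NO` copy is already in place, and by default it only reports what it would do.

```python
from pathlib import Path


def retire_unsuffixed_restart(restart_dir, name, dry_run=True):
    """Remove `name` from `restart_dir` only if `name + '.NO'` is already there.

    Returns "absent", "no-backup", "would-remove", or "removed".
    Hypothetical helper for illustration; not part of global-workflow.
    """
    active = Path(restart_dir) / name
    parked = Path(restart_dir) / (name + ".NO")
    if not active.exists():
        return "absent"          # nothing to clean up
    if not parked.exists():
        return "no-backup"       # refuse to delete the only copy
    if dry_run:
        return "would-remove"    # report only, touch nothing
    active.unlink()
    return "removed"


if __name__ == "__main__":
    # Orion path quoted in this issue; dry run, so nothing is deleted.
    restart_dir = (
        "/work/noaa/global/glopara/data/ICSDIR/C48mx500/"
        "gdas.20210324/06/model_data/med/restart"
    )
    print(retire_unsuffixed_restart(restart_dir, "20210324.090000.ufs.cpld.cpl.r.nc"))
```

Passing `dry_run=False` performs the actual deletion; the same call could be repeated against the equivalent directories on Hercules and WCOSS2.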

What machines are impacted?

WCOSS2, Orion, Hercules

Steps to reproduce

N/A

Additional information

N/A

Do you have a proposed solution?

No response

@guillaumevernieres added the bug and triage labels Jul 16, 2024

@RussTreadon-NOAA
Contributor

File /work/noaa/global/glopara/data/ICSDIR/C48mx500/gdas.20210324/06/model_data/med/restart/20210324.090000.ufs.cpld.cpl.r.nc still exists.

C48mx500_3DVarAOWCDA fails on Hercules due to the presence of this file.

@KateFriedman-NOAA or @WalterKolczynski-NOAA : Can file 20210324.090000.ufs.cpld.cpl.r.nc be removed from v?

@guillaumevernieres
Contributor Author

> File /work/noaa/global/glopara/data/ICSDIR/C48mx500/gdas.20210324/06/model_data/med/restart/20210324.090000.ufs.cpld.cpl.r.nc still exists.
>
> C48mx500_3DVarAOWCDA fails on Hercules due to the presence of this file.
>
> @KateFriedman-NOAA or @WalterKolczynski-NOAA : Can file 20210324.090000.ufs.cpld.cpl.r.nc be removed from v?

FYI @RussTreadon-NOAA , you won't be able to successfully run the marine DA tasks on Hercules/Orion. Some of the changes that bring us closer to that goal are here: #2749
but even then, it requires manually adjusting a YAML file to point to our experimental obs.

@WalterKolczynski-NOAA removed the triage label Jul 23, 2024
@KateFriedman-NOAA
Member

@guillaumevernieres Question about the mediator restart file 20210324.090000.ufs.cpld.cpl.r.nc for the C48mx500_3DVarAOWCDA CI test: I am testing a revamped staging job and it's failing because that file is missing (it has been renamed with ".NO" at the end). Should I disable the mediator for now, or is a change coming to resolve this?

See lines 108-110 in this snippet from my new staging job yaml file for what the job is looking for when DO_OCN=YES:

 22 {% set r_prefix = model_start_date_current_cycle | to_YMD + "." + model_start_date_current_cycle | strftime("%H") + "0000" %}
...
 94 {% if DO_OCN %}
 95 ocean:
 96     mkdir:
 97         - "{{ COMOUT_OCEAN_RESTART_PREV }}"
 98     copy:
 99         - ["{{ ICSDIR }}/{{ COMOUT_OCEAN_RESTART_PREV | relpath(ROTDIR) }}/{{ r_prefix }}.MOM.res.nc", "{{ COMOUT_OCEAN_RESTART_PREV }}"]
100         {% if OCNRES == "025" %}
101             {% for nn in range(1, 3) %}
102         - ["{{ ICSDIR }}/{{ COMOUT_OCEAN_RESTART_PREV | relpath(ROTDIR) }}/{{ r_prefix }}.MOM.res_{{ nn }}.nc", "{{ COMOUT_OCEAN_RESTART_PREV }}"]
103             {% endfor %}
104         {% endif %}
105         {% if REPLAY_ICS == "YES" %}
106         - ["{{ ICSDIR }}/{{ COMOUT_OCEAN_ANALYSIS | relpath(ROTDIR) }}/{{ r_prefix }}.mom6_perturbation.nc", "{{ COMOUT_OCEAN_ANALYSIS }}/mom6_increment.nc"]
107         {% endif %}
108         {% if EXP_WARM_START == True %}
109         - ["{{ ICSDIR }}/{{ COMOUT_MED_RESTART_PREV | relpath(ROTDIR) }}/{{ r_prefix }}.ufs.cpld.cpl.r.nc", "{{ COMOUT_MED_RESTART_PREV }}"]
110         {% endif %}
111 {% endif %}
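
To make the file name on line 109 concrete, here is a small Python sketch of what the `r_prefix` expression on line 22 appears to produce. It assumes that the `to_YMD` filter formats a date as `YYYYMMDD`, which is consistent with the file names in this issue but is not shown in the snippet, and that the model start date is 2021-03-24 09Z. The function names are hypothetical.

```python
from datetime import datetime


def restart_prefix(model_start):
    """Mimic line 22 of the snippet: YYYYMMDD + "." + HH + "0000".

    Assumes `to_YMD` is equivalent to strftime("%Y%m%d").
    """
    return model_start.strftime("%Y%m%d") + "." + model_start.strftime("%H") + "0000"


def mediator_restart_name(model_start):
    """File name the staging job looks for on line 109."""
    return restart_prefix(model_start) + ".ufs.cpld.cpl.r.nc"


print(mediator_restart_name(datetime(2021, 3, 24, 9)))
# -> 20210324.090000.ufs.cpld.cpl.r.nc
```

Under those assumptions the job looks for exactly the unsuffixed name, so the `.NO` copy is never matched.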

@guillaumevernieres
Contributor Author

@KateFriedman-NOAA , the mediator file should be optional, and I assume your refactoring should keep that functionality. Is there an option to not abort when syncing the file handler?
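
As an illustration of the behavior I have in mind, here is a hedged sketch in plain Python. It is not the actual file handler used by the staging job; `stage_files` and its `optional` argument are hypothetical names. Required files abort the job when missing, while optional ones are skipped and reported.

```python
import shutil
from pathlib import Path


def stage_files(pairs, optional=()):
    """Copy each (source, destination_dir) pair.

    A missing source listed in `optional` is skipped; any other missing
    source raises FileNotFoundError. Returns the list of skipped sources.
    Hypothetical helper, not the workflow's real file handler.
    """
    optional = {str(Path(p)) for p in optional}
    skipped = []
    for src, dest_dir in pairs:
        src = Path(src)
        if not src.is_file():
            if str(src) in optional:
                skipped.append(str(src))   # tolerate the missing optional file
                continue
            raise FileNotFoundError(f"required file missing: {src}")
        Path(dest_dir).mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest_dir)
    return skipped
```

With the mediator restart listed in `optional`, a warm start without that file would proceed, while a missing `MOM.res.nc` would still stop the job.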

@KateFriedman-NOAA
Member

@guillaumevernieres Optional, got it, thanks!

@RussTreadon-NOAA
Contributor

The WCDA g-w CI test failed on WCOSS2 (Cactus) during the gdasmarinebmat job with the traceback

nid003046.cactus.wcoss2.ncep.noaa.gov 0:  MOM_in domain decomposition
whalo =    2, ehalo =    2, shalo =    2, nhalo =    2
  X-AXIS =    9   9   9   9
  Y-AXIS =    5   4   4   4
nid003046.cactus.wcoss2.ncep.noaa.gov 0: NOTE from PE     0: MOM_restart: MOM run restarted using : INPUT/MOM.res.nc
nid003046.cactus.wcoss2.ncep.noaa.gov 0:
FATAL from PE     0: NetCDF: Variable not found: variable_att_exists: file:INPUT/MOM.res.nc- variable:

nid003046.cactus.wcoss2.ncep.noaa.gov 0:
FATAL from PE     0: NetCDF: Variable not found: variable_att_exists: file:INPUT/MOM.res.nc- variable:

nid003046.cactus.wcoss2.ncep.noaa.gov 0: Image              PC                Routine            Line        Source
libifcoremt.so.5   000014CBAC47FD4A  tracebackqq_          Unknown  Unknown
libsoca.so         000014CBCA84BBBE  mpp_mod_mp_mpp_er          72  mpp_util_mpi.inc
libsoca.so         000014CBCABCAD52  fms_io_utils_mod_         190  fms_io_utils.F90
libsoca.so         000014CBCA76F443  netcdf_io_mod_mp_         381  netcdf_io.F90
libsoca.so         000014CBCA76F4E5  netcdf_io_mod_mp_         465  netcdf_io.F90
libsoca.so         000014CBCA7A04E8  netcdf_io_mod_mp_        1187  netcdf_io.F90
libsoca.so         000014CBCB7F2528  mom_io_infra_mp_g         530  MOM_io_infra.F90
libsoca.so         000014CBCB0A64AF  mom_io_file_mp_ge        1230  MOM_io_file.F90
libsoca.so         000014CBCB0C1CA7  mom_restart_mp_re        1633  MOM_restart.F90
libsoca.so         000014CBCB1CCA91  mom_state_initial         538  MOM_state_initialization.F90
libsoca.so         000014CBCAD05B7B  mom_mp_initialize        2961  MOM.F90
libsoca.so         000014CBCA65096E  Unknown               Unknown  Unknown

I am running g-w built from g-w PR #2833. The AERO and UFSDA g-w CI run to completion. WCDA aborts as shown above. The log file with the failure is /lfs/h2/emc/da/noscrub/russ.treadon/COMROOT/prwcda/logs/2021032418/gdasmarinebmat.log on Cactus.

Two questions

  1. Is the gdasmarinebmat failure I am seeing related to this issue or should a new issue be opened?
  2. Have we successfully run WCDA g-w CI on Cactus using the current head, 336b78a, of g-w develop?

@AndrewEichmann-NOAA
Contributor

Regarding question 1, the marine bmat task recently had updates (for refactoring) in global-workflow, and I am encountering some bugs that are flushed out when trying to run with an ensemble (on Hera), but it's not clear to me why what you're seeing here would be confined to WCOSS.

@CatherineThomas-NOAA
Contributor

@AndrewEichmann-NOAA: Do you think this could be related to Issue #2797? The MOM_input file was updated, but only for the high resolution. Does it need to be updated for lower res as well? Counterpoint to this is that it should fail on Hera as well and the WCDA test passed for PR 2751.

@RussTreadon-NOAA
Contributor

RussTreadon-NOAA commented Aug 15, 2024

@AndrewEichmann-NOAA , I have only run g-w CI on WCOSS2 (Cactus). I do not know if g-w WCDA CI runs on other machines. I found that env/WCOSS2.env does not contain entries for marine jobs. I added these entries in PR #2833. The fact that these entries are not in develop env/WCOSS2.env makes me wonder if we are ready to run g-w WCDA CI on WCOSS2.

@AndrewEichmann-NOAA
Contributor

@CatherineThomas-NOAA @RussTreadon-NOAA I'll have to dig deeper into this but I have been running the WCDA CI on Hera successfully, though it's possible that updating will catch something

@RussTreadon-NOAA
Contributor

@AndrewEichmann-NOAA , g-w WCDA CI works on Hera. I set it up this morning. All jobs ran to completion.

Hera(hfe05):/scratch1/NCEPDEV/stmp2/Russ.Treadon/EXPDIR/prwcda$ rocotostat -d prwcda.db -w prwcda.xml -c all -s
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202103241200        Done    Aug 15 2024 13:36:19    Aug 15 2024 13:50:24
202103241800        Done    Aug 15 2024 13:36:19    Aug 15 2024 14:50:22

ci/cases/pr/C48mx500_3DVarAOWCDA.yaml from develop at 336b78a has

skip_ci_on_hosts:
  - wcoss2
  - gaea
  - orion
  - hercules

I should not try running g-w WCDA CI on WCOSS2. I should stick to Hera.
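
For reference, the check implied by `skip_ci_on_hosts` can be sketched as below. How the CI scripts actually consume this list is not shown in this thread, so the function name `case_runs_on` and the case-insensitive comparison are assumptions; the host list is copied from the YAML above.

```python
# Host list copied from ci/cases/pr/C48mx500_3DVarAOWCDA.yaml at 336b78a.
SKIP_CI_ON_HOSTS = ["wcoss2", "gaea", "orion", "hercules"]


def case_runs_on(host, skip_hosts=SKIP_CI_ON_HOSTS):
    """Return True when the case is not skipped on `host`.

    Case-insensitive matching is an assumption made for this sketch.
    """
    return host.lower() not in {h.lower() for h in skip_hosts}


for host in ("hera", "wcoss2", "hercules"):
    print(host, case_runs_on(host))
# hera True
# wcoss2 False
# hercules False
```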

@CatherineThomas-NOAA
Contributor

@RussTreadon-NOAA @AndrewEichmann-NOAA
The last that I heard about the WCDA test on WCOSS2 was that the C++ issue was resolved and that there was a push to get all the needed files on the machine. That conversation predates the discovery of the problems with the v17 cycling prototypes which took most of @guillaumevernieres's attention before he went on leave. I don't think everything's been sorted yet.

@KateFriedman-NOAA
Member

The original request in this issue has been completed. Please open new issues to address any related needs discussed above. Closing as complete.
