
Using the latest executables in an older simulation #72

Open
mauricehuguenin opened this issue Oct 19, 2021 · 38 comments

Comments

@mauricehuguenin

mauricehuguenin commented Oct 19, 2021

We (Ryan, me) would like to run some perturbation experiments in ACCESS-OM2-025 using the new perturbations code in libaccessom2 using atmosphere/forcing.json. We would like to branch these simulations off the 650-year 025deg_jra55_ryf spin-up at /g/data/ik11/outputs/access-om2-025/025deg_jra55_ryf9091_gadi.

However, this spin-up was performed with old executables (see https://github.com/rmholmes/025deg_jra55_ryf/blob/ryf9091_gadi/config.yaml) that do not contain the new libaccessom2 perturbations code. Unfortunately, it looks like the new executables (libaccessom2 hash a227a61) are not backwards compatible with the config files from the old spin-up. Specifically, we get the error:
assertion failed: accessom2_sync_config incompatible config between atm and ice: num_atm_to_ice_fields
which seems to be linked to ice/input_ice.nml, which now requires the exchanged fields to be specified (e.g. through the fields_from_atm input), so the number of fields no longer matches.

@nichannah @aekiss do you have any suggestions on the best approach to pursue in order to get this working? We would really like to avoid doing another spin-up given the cost and time involved.

One approach might be to create new executables, based on those used for the spin-up, that include only the new libaccessom2 perturbation code. Another might be to update the config files as much as possible (still using JRA55 v1.3) while keeping the old restarts, and then check that nothing material to the solution has changed.
Any suggestions would be really helpful.
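
For reference, a rough sketch of what a perturbation entry in atmosphere/forcing.json might look like (the field name, file paths and exact keys are illustrative assumptions; check the libaccessom2 documentation for the precise schema):

{
  "filename": "INPUT/RYF.u_10.1990_1991.nc",
  "fieldname": "uas_10m",
  "cname": "uwnd_ai",
  "perturbations": [
    {
      "type": "offset",
      "dimension": "spatiotemporal",
      "value": "INPUT/uwnd_perturbation.nc",
      "calendar": "experiment"
    }
  ]
}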

@aekiss

aekiss commented Oct 21, 2021

Hmm, yes, the latest libaccessom2 is set up for JRA55-do 1.4, which has separate solid and liquid runoff and is incompatible with JRA55-do 1.3. We may need to set up a JRA55-do 1.3 branch of libaccessom2 and cherry-pick the perturbation code changes.

@nichannah does that sound possible?

@aekiss

aekiss commented Oct 21, 2021

@mauricehuguenin your executables are really old - they use libaccessom2 1bb8904 from 10 Dec 2019.

There have been a lot of commits since then, so applying the perturbation code changes could be tricky, but that's really a question for @nichannah.

It looks like the most recent commit supporting JRA55-do v1.3 was f6cf437 from 16 Apr 2020, so that might make a better starting point.

JRA55-do 1.4 support was merged into master at 4198e15 but it looks like this branch also included some unrelated commits.

See https://github.com/COSIMA/libaccessom2/network
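
A rough sketch of the branch-and-cherry-pick approach mentioned above, assuming f6cf437 as the base (the branch name is arbitrary and the perturbation commit hashes still need to be identified):

git clone https://github.com/COSIMA/libaccessom2.git
cd libaccessom2
# branch off the last JRA55-do v1.3-compatible commit
git checkout -b jra55v13-perturbations f6cf437
# apply only the perturbation-related commits (hashes to be identified)
git cherry-pick <perturbation-commit-hashes>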

@rmholmes

Thanks @aekiss! Yes the 025deg_jra55_ryf9091_gadi spin-up was started at the end of December 2019, soon after Gadi came online. It would be a pity not to continue to use it given the resources that went into it.

@mauricehuguenin a good starting point might be to try using the f6cf437 libaccessom2 commit to extend the control run. If that works, then we can think about building the more recent perturbations code into that.

@mauricehuguenin

I fetched the 16 April commit COSIMA/025deg_jra55_ryf@2eb6a35, which has changes to atmosphere/forcing.json, config.yaml and ice/input_ice.nml. I then changed to the latest a227a61 executables, as those have the additive forcing functions.

Extending the spin-up with the 2eb6a35c commit works fine; with the latest executables, however, I get this abort message:

MPI_ABORT was invoked on rank 1550 in communicator MPI_COMM_WORLD
with errorcode 1.

Do the latest .exe files require the licalvf input files? These are currently not in my atmosphere/forcing.json file from the 2eb6a35c commit.

@rmholmes

rmholmes commented Oct 26, 2021

@mauricehuguenin I presume your run is the one at /home/561/mv7494/access-om2/025deg_jra55_ryf_ENSOWind/? If so, the error looks like Invalid restart_format: nc. This seems to be a CICE error associated with the ice restarts (in https://github.com/COSIMA/cice5/blob/master/io_pio/ice_restart.F90). Something to do with the parallel IO changes?

However, in looking around I also noticed that there are many differences between your configs and the ones used for the spin-up (e.g. /g/data/ik11/outputs/access-om2-025/025deg_jra55_ryf9091_gadi/, or equivalently https://github.com/rmholmes/025deg_jra55_ryf/tree/ryf9091_gadi). E.g. you're using input_236a3011 rather than input_20200530 (although this may not make any difference). To me the best approach would be to start with the configs at https://github.com/rmholmes/025deg_jra55_ryf/tree/ryf9091_gadi and update only what we need to. In this case, that means the changes to atmosphere/forcing.json and ice/input_ice.nml in COSIMA/025deg_jra55_ryf@2eb6a35 (and the executables, of course).

@mauricehuguenin

I agree that this is the way to go. I made the following changes to Ryan's 025deg_jra55_ryf/ryf9091_gadi spin-up:

In atmosphere/forcing.json:

+      "cname": "runof_ai",
+       "domain": "land"

In config.yaml, the latest executables:

+      exe: /g/data/ik11/inputs/access-om2/bin/yatm_a227a61.exe
+      exe: /g/data/ik11/inputs/access-om2/bin/fms_ACCESS-OM_af3a94d_libaccessom2_a227a61.x
+      exe: /g/data/ik11/inputs/access-om2/bin/cice_auscom_1440x1080_480p_2572851_libaccessom2_a227a61.exe

In /ice/input_ice.nml:

+    fields_from_atm = 'swfld_i', 'lwfld_i', 'rain_i', 'snow_i', 'press_i', 'runof_i', 'tair_i', 'qair_i', 'uwnd_i', 'vwnd_i'
+    fields_to_ocn = 'strsu_io', 'strsv_io', 'rain_io', 'snow_io', 'stflx_io', 'htflx_io', 'swflx_io', 'qflux_io', 'shflx_io', 'lwflx_io', 'runof_io', 'press_io', 'aice_io', 'melt_io', 'form_io'
+    fields_from_ocn = 'sst_i', 'sss_i', 'ssu_i', 'ssv_i', 'sslx_i', 'ssly_i', 'pfmice_i'
+/

With these changes I run into the Invalid restart_format: nc abort. @aekiss do you know what might be happening here? Is it something to do with the parallel IO mentioned by Ryan above (#72 (comment))?

@aidanheerdegen

@rmholmes If you want to keep this spin-up, would an alternative option be to spin off a new control run with the updated forcing (just use the ocean temperature/salt state as the initial conditions), while continuing the control you have for, say, a decade? Then compare the two and see whether you're happy that they're broadly similar, or, if they are different, that the differences are what you'd expect. Or does this not really work as a strategy?

@rmholmes

@aidanheerdegen that is another option, although changing forcing mid-way through a run is not very clean. If the differences between v1.3 and v1.4 are not significant it may not make a big difference.

@nichannah - it would be great to get your opinion on whether minor tweaks to the code to make it backwards compatible are feasible.

@aekiss

aekiss commented Oct 27, 2021

The default restart format was changed to pio in recent executables.
You could try setting restart_format = 'nc' in &setup_nml in ice/cice_in.nml.
This will disable parallel IO, but that's less important at 0.25°.
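
For reference, a minimal sketch of where this setting lives in ice/cice_in.nml (other entries in the group are omitted):

&setup_nml
    restart_format = 'nc'    ! serial netCDF restarts; 'pio' enables parallel IO
/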

@mauricehuguenin

Thanks Andrew, that option is already set in ice/cice_in.nml, so it might be something else that is causing it.

@aekiss

aekiss commented Oct 27, 2021

Ah ok, that may be the problem - have you tried restart_format = 'pio'?

@aekiss

aekiss commented Oct 27, 2021

FYI I'm in the process of updating the model executables. This will include a fix to a bug in libaccessom2 a227a61.

@mauricehuguenin

It works! I extended the spin-up by two years and the output is what I expected.

I switched to restart_format = 'pio' in ice/cice_in.nml and also replaced the #Collation and #Misc flags in the config.yaml file with those of the latest COSIMA/025deg_jra55_ryf@2b2be7b commit to avoid segmentation fault errors.

@aekiss

aekiss commented Oct 28, 2021

@mauricehuguenin I've put the latest executables here. It might be good to use these instead as they include a fix to a rounding error bug in libaccessom2. But they are completely untested so I'd be interested to hear if you have any issues with them.

/g/data/ik11/inputs/access-om2/bin/yatm_0ab7295.exe
/g/data/ik11/inputs/access-om2/bin/fms_ACCESS-OM-BGC_6256fdc_libaccessom2_0ab7295.x
/g/data/ik11/inputs/access-om2/bin/cice_auscom_360x300_24p_2572851_libaccessom2_0ab7295.exe
/g/data/ik11/inputs/access-om2/bin/cice_auscom_3600x2700_722p_2572851_libaccessom2_0ab7295.exe
/g/data/ik11/inputs/access-om2/bin/cice_auscom_18x15.3600x2700_1682p_2572851_libaccessom2_0ab7295.exe
/g/data/ik11/inputs/access-om2/bin/cice_auscom_1440x1080_480p_2572851_libaccessom2_0ab7295.exe

@mauricehuguenin

I can confirm that these latest executables work with no issues. 👍

@rmholmes

@mauricehuguenin if this is working - can you close the issue?

@mpudig

mpudig commented Mar 10, 2022

Hi - Ryan and I are attempting to run an RYF9091 ACCESS-OM2-01 simulation that supports relative humidity forcing and the perturbations code. We have used the same executables (but for 1/10-deg) posted above by @aekiss and are restarting from /g/data/ik11/outputs/access-om2-01/01deg_jra55v13_ryf9091/restart995.

The model is crashing because of what we believe to be a parallel I/O problem in the CICE outputs. The error logs are spitting out, among other things, the following:

ibhdf5.so 000014827787BD11 H5D__layout_oh_cr Unknown Unknown
libhdf5.so 0000148277870EEF H5D__create Unknown Unknown
libhdf5.so.103.1. 000014827787D455 Unknown Unknown Unknown
libhdf5.so 0000148277974C3B H5O_obj_create Unknown Unknown
libhdf5.so.103.1. 0000148277938445 Unknown Unknown Unknown
libhdf5.so.103.1. 0000148277909FB2 Unknown Unknown Unknown
libhdf5.so 000014827790ABA0 H5G_traverse Unknown Unknown
libhdf5.so.103.1. 0000148277934D73 Unknown Unknown Unknown
libhdf5.so 0000148277939B72 H5L_link_object Unknown Unknown
libhdf5.so 000014827786E574 H5D__create_named Unknown Unknown
libhdf5.so 0000148277849473 H5Dcreate2 Unknown Unknown
libnetcdf.so.18.0 000014827B26BBBD Unknown Unknown Unknown
libnetcdf.so.18.0 000014827B26D099 Unknown Unknown Unknown
libnetcdf.so 000014827B26D854 nc4_rec_write_met Unknown Unknown
libnetcdf.so.18.0 000014827B26FADF Unknown Unknown Unknown
libnetcdf.so 000014827B27061D nc4_enddef_netcdf Unknown Unknown
libnetcdf.so.18.0 000014827B270180 Unknown Unknown Unknown
libnetcdf.so 000014827B27009D NC4__enddef Unknown Unknown
libnetcdf.so 000014827B2193EB nc_enddef Unknown Unknown
cice_auscom_3600x 000000000093A87F Unknown Unknown Unknown
cice_auscom_3600x 00000000006ADFAC ice_history_write 947 ice_history_write.f90
cice_auscom_3600x 000000000066699F ice_history_mp_ac 2023 ice_history.f90
cice_auscom_3600x 00000000004165C5 cice_runmod_mp_ci 411 CICE_RunMod.f90
cice_auscom_3600x 0000000000411212 MAIN__ 70 CICE.f90
cice_auscom_3600x 00000000004111A2 Unknown Unknown Unknown
libc-2.28.so 0000148279999493 __libc_start_main Unknown Unknown
cice_auscom_3600x 00000000004110AE Unknown Unknown Unknown

The model was crashing at the end of the first month when certain icefields_nml fields in cice_in.nml were set to 'm', and crashing at the end of the first day when set to 'd', so we are fairly confident the issue is coming from CICE.

It would be great if someone could have a look at this to see what is going wrong. My files are at /home/561/mp2135/access-om2/01deg_jra55_ryf_cont/ and all my changes have been pushed here: https://github.com/mpudig/01deg_jra55_ryf/tree/v13_rcpcont.

Thanks!

@aidanheerdegen

The stack traces point to different builds. I don't know if that is relevant, but if they're built against different MPI and/or PIO/netCDF/HDF5 libraries it might be problematic:

34 0x0000000000933ade pioc_change_def()  /home/156/aek156/github/COSIMA/access-om2-new/src/cice5/ParallelIO/src/clib/pioc_support.c:2985
35 0x00000000006ae0ec ice_history_write_mp_ice_write_hist_.V()  /home/156/aek156/github/COSIMA/access-om2/src/cice5/build_auscom_3600x2700_722p/ice_history_write.f90:947

So specifically /home/156/aek156/github/COSIMA/access-om2-new/ and /home/156/aek156/github/COSIMA/access-om2

@aekiss

aekiss commented Mar 10, 2022

Does this configuration work with other executables?

@russfiedler

I also note that the status of the various pio_... calls is hardly ever checked before the pio_enddef call that finally fails. Naughty programmers!
Anyway, it's dying in ROMIO.

@mpudig

mpudig commented Mar 10, 2022

@aekiss, yes, we ran it with

/g/data/ik11/inputs/access-om2/bin/yatm_a227a61.exe
/g/data/ik11/inputs/access-om2/bin/fms_ACCESS-OM_af3a94d_libaccessom2_a227a61.x
/g/data/ik11/inputs/access-om2/bin/cice_auscom_3600x2700_722p_2572851_libaccessom2_a227a61.exe

originally and it crashed at the end of the first month as well.

@aekiss

aekiss commented Mar 10, 2022

Are these the versions you need to use, or would something else be more suitable? If so, I could try compiling that.

@mpudig

mpudig commented Mar 10, 2022

Those are the latest ones we have used (https://github.com/mpudig/01deg_jra55_ryf/blob/v13_rcpcont/config.yaml) and the issue is still occurring with them.

I should maybe add too that these executables worked when running the 1/4-degree configuration (with relative humidity forcing)!

@russfiedler

You haven't set the correct switches for using PIO in the mpirun command in config.yaml, e.g.

mpirun: --mca io ompio --mca io_ompio_num_aggregators 1

Also, you want to set UCX_LOG_LEVEL. See, for example:

/g/data/ik11/outputs/access-om2-01/01deg_jra55v140_iaf_cycle4/output830/config.yaml
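
A minimal sketch of how those two settings might look in config.yaml (the UCX_LOG_LEVEL value shown is an assumption; copy the exact settings from the output830 config.yaml referenced above):

mpirun: --mca io ompio --mca io_ompio_num_aggregators 1
env:
    UCX_LOG_LEVEL: 'error'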

@rmholmes

Ah thanks @russfiedler, that indeed looks promising. @mpudig can you try again, including all the options between # Misc and userscripts in the config.yaml that Russ listed above? Don't add the userscripts, because those aren't currently in your config directory.

@rmholmes

Also - I guess it would be best to remove the specification of openmpi/4.0.1 as that could clash with the versions used for compilation?

@russfiedler

russfiedler commented Mar 10, 2022

There could be other things in the cice_in.nml file that might need checking for PIO use.
You're probably right about that openmpi version. I'm not sure why it's there or what its effect is.

@aidanheerdegen

Specifying modules like that overrides the automatic discovery using ldd, which is what mpirun does too, I believe. Yes, it is best not to do that and just let it find the right one to use.

@rmholmes

I've compared the cice_in.nml files (see /scratch/e14/rmh561/diff_cice_in.nml). The only differences I see that could be relevant are the history_chunksize ones - do these need to be specified for the parallel I/O?

@aekiss

aekiss commented Mar 10, 2022

Yes, Nic added these for parallel IO.
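
For illustration only, a sketch of the kind of entries meant (the namelist group and chunk sizes below are assumptions; copy the exact values from the reference 0.1° config rather than from here):

&setup_nml
    history_chunksize_x = 3600    ! assumed value, check the reference config
    history_chunksize_y = 2700    ! assumed value, check the reference config
/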

@aekiss

aekiss commented Mar 10, 2022

It might be worth comparing your whole config with https://github.com/COSIMA/01deg_jra55_ryf to see if there's anything else amiss.

mpudig pushed a commit to mpudig/01deg_jra55_ryf that referenced this issue Mar 10, 2022
@mpudig

mpudig commented Mar 14, 2022

Hi, thanks all for your comments the other day. Implementing Russ's suggestion of including mpirun: --mca io ompio --mca io_ompio_num_aggregators 1 in config.yaml and Ryan's of adding history_chunksize to cice_in.nml fixed the original issue: the model ran successfully past month 1 and completed a full 3-month simulation.

However, the output has us slightly troubled. Comparing to the ik11 run over the same period (/g/data/ik11/outputs/access-om2-01/01deg_jra55v13_ryf9091/output996), there seem to be some physical differences in sea ice and some other variables. I'm attaching plots of the global average salt in my run and the ik11 run, as well as the difference in sea ice concentration between my run and the ik11 run. There is systematically more sea ice in my run than in the ik11 run. My run is sitting at /scratch/e14/mp2135/access-om2/archive/01deg_jra55_ryf_cont/.

[Attached figures: compare_sea_ice_conc (sea ice concentration difference), comparing_salt (global average salt)]

We can't see any major changes in the ice configs between our run and the ik11 run. However, there are lots of changes in the CICE executable between commits 2572851 and d3e8bdf, which seem mostly to do with parallel I/O and WOMBAT. Do you think the (small) changes we are seeing are plausible given these executable changes, or has something gone awry?

@mauricehuguenin

mauricehuguenin commented Mar 15, 2022

I can see that the run on ik11 uses additional input for mom:

input:
          - /g/data/ik11/inputs/access-om2/input_08022019/mom_01deg
          - /g/data/x77/amh157/passive/passive4

Matt is running without these passive fields from /g/data/x77. Is this input maybe causing the difference in the global fields? Unfortunately, I am not a member of x77 and cannot have a look at the fields.

@rmholmes

@mauricehuguenin that's just a passive tracer that Andy had included in the original control run. It won't influence the physics.

@aekiss

aekiss commented Mar 16, 2022

Hmm, that seems surprising to me. Have you carefully checked all your .nml files?
nmltab can make this easier: https://github.com/aekiss/nmltab

@aekiss

aekiss commented Mar 17, 2022

You're using all the same input files, right?

@mpudig

mpudig commented Mar 17, 2022

One difference is that we use RYF.r_10.1990_1991.nc instead of RYF.q_10.1990_1991.nc as an atmospheric input field. But since no perturbation has been applied this shouldn't change things substantially. I think @rmholmes has tested this pretty extensively.

There are a few differences between some .nml files. I assume they're mostly because of various updates since the ik11 simulation was run (but maybe not...?):

In ocean/input.nml:

  • The ik11 run has max_axes = 100 under &diag_manager_nml, whereas my run doesn't.

In ice/input_ice.nml

  • My run has fields_from_atm, fields_to_ocn and fields_from_ocn options, whereas the ik11 run doesn't.

In ice/cice_in.nml

  • My run has istep0 = 0, whereas the ik11 run has istep0 = 6454080. (Does this seem strange?!)
  • My run has runtype = 'initial', whereas the ik11 run has runtype = 'continue'.
  • My run has restart = .false., whereas the ik11 run has restart = .true..
  • My run has restart_format = 'pio', whereas the ik11 run has restart_format = 'nc'.
  • My run has history_chunksize_x and _y (per Ryan's comment above).

@rmholmes

In addition to Matt's comments above, yes we're using the same inputs (input_08022019).
