Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AWICM3 changes to coupling, restart file writing and DMOC #15

Merged
merged 29 commits into from
Oct 6, 2020

Conversation

JanStreffing
Copy link
Collaborator

Testing for now, dont merge yet.

@JanStreffing
Copy link
Collaborator Author

I'm a little unsure as to what happened in src/gen_modules_diag.F90. I think we can just take the master branch here. Comments?

@JanStreffing
Copy link
Collaborator Author

We are not bitwise identical:

 AssertionError: One or several tests have failed.
Variable: a_ice, current_value: 0.31723560353922026, master_value: 0.3172354666705257
FAIL!!!: DIFFERENCE: 1.3686869454465622e-07. For a_ice we got 0.31723560353922026. Got 0.3172354666705257 instead.
Variable: salt, current_value: 23.873993684996698, master_value: 23.87399368211628
FAIL!!!: DIFFERENCE: 2.880419458506367e-09. For salt we got 23.873993684996698. Got 23.87399368211628 instead.
Variable: sst, current_value: 8.52788833586562, master_value: 8.527885124059662
FAIL!!!: DIFFERENCE: 3.211805958969194e-06. For sst we got 8.52788833586562. Got 8.527885124059662 instead.
Variable: temp, current_value: 3.7243866981614353, master_value: 3.7243864330046867
FAIL!!!: DIFFERENCE: 2.6515674855787097e-07. For temp we got 3.7243866981614353. Got 3.7243864330046867 instead.
Variable: u, current_value: -0.0013590466314604363, master_value: -0.0013590246454750715
FAIL!!!: DIFFERENCE: -2.198598536475721e-08. For u we got -0.0013590466314604363. Got -0.0013590246454750715 instead.
Variable: v, current_value: 0.0008864110589454028, master_value: 0.000886436018956168
FAIL!!!: DIFFERENCE: -2.4960010765163153e-08. For v we got 0.0008864110589454028. Got 0.000886436018956168 instead.

Any idea which parts of the merge would cause this?

@koldunovn
Copy link
Member

@JanStreffing Last time I changed the values for pi-mesh check is when new advection schemes were merged. Although they are technically the same, there are some differences related to treatment of boundaries, so results of the check were different. Make sure you have everything consistent in your config folder (although seems that you did not change anything there).

@JanStreffing
Copy link
Collaborator Author

@koldunovn Does that mean also the master branch before the merge would not complete the tests?
I typically don't touch the config folder since I use the esm-tools which come with their own config folder.

src/ice_thermo_oce.F90 Outdated Show resolved Hide resolved
@koldunovn
Copy link
Member

@koldunovn Does that mean also the master branch before the merge would not complete the tests?
I typically don't touch the config folder since I use the esm-tools which come with their own config folder.

No the master if fine :) https://github.com/FESOM/fesom2/runs/1116986615

@koldunovn
Copy link
Member

I see nice green check mark next to the PR 👍 🍺

@JanStreffing
Copy link
Collaborator Author

Last question before pulling the trigger, does the automated test also cover restarting? I'm asking since I made some changes to the way restart files are written.

@koldunovn
Copy link
Member

Last question before pulling the trigger, does the automated test also cover restarting? I'm asking since I made some changes to the way restart files are written.

Unfortunatelly they don't include it, that's in plans. I can test it if you wish manually.

@JanStreffing
Copy link
Collaborator Author

Hey Nikolay,
that would be great, ty. Also a huge praise for the auto-testing capabilities btw. I'm loving it!
Cheers, Jan

@koldunovn
Copy link
Member

Will try to do it tommorow and let you know. Anything in particular I should pay attention to?

@JanStreffing
Copy link
Collaborator Author

No, it's should just work bit-identical on ollie or mistral. On juwels it fixes the chunking in the POSIX layer, making the IO speed go from 6.46 MiB/s to 1480.39 MiB/s according to darshan. More info on the darshan-logs in the old gitlab issue #19

I think the NetCDF installations on other HPCs are somehow configured to do this automatically, while on juwels we had to do this manually and correctly in FESOM2. My only worry is that the options I set in the code now might conflict with the preconfigured options of the netcdf installation on other machines. Hopefully not.

Good night!

@koldunovn
Copy link
Member

Actually @hegish works on the async output now in #14 But my understanding is that the restarts will be the next thing to touch, so if something is wrong with your implementation, Jan will find out :)

@koldunovn
Copy link
Member

I have tested restarts, and model can run from restart created by version of the model from master. But there are still a lot of

size:         3140 lev:            1 gather_nod:    0.0000000000000000     
 size:         3140 lev:            1 nf_put_var:    0.0000000000000000     
 size:         3140 lev:            2 gather_nod:    0.0000000000000000     
 size:         3140 lev:            2 nf_put_var:    0.0000000000000000  

In the output, would be nice to get rid of them :)

@JanStreffing
Copy link
Collaborator Author

Good to know. I'll hide those behind the debug output switch and then merge.

@koldunovn
Copy link
Member

I think @dsidoren should merge :)

@JanStreffing
Copy link
Collaborator Author

You may want to change some permissions around then, cause after the test it's giving me the option to merge it myself.

@koldunovn
Copy link
Member

Yes, I forgot to add branch rules, should be blocked now :)

src/io_restart.F90 Outdated Show resolved Hide resolved
@JanStreffing
Copy link
Collaborator Author

I resolved the conflict in io_meandata by preserving @hegish io-improvements.

@koldunovn
Copy link
Member

koldunovn commented Oct 5, 2020

Currently fails on ollie and mistral with:

initializing I/O file for sst
 associating mean I/O file /scratch/a/a270088/output_mistral_sanity/sst.fesom.1948.nc
 sst: current mean I/O counter =            1
 writing mean record for sst; rec. count =            1
 initializing I/O file for a_ice
 associating mean I/O file /scratch/a/a270088/output_mistral_sanity/a_ice.fesom.1948.nc
 a_ice: current mean I/O counter =            1
 writing mean record for a_ice; rec. count =            1
 initializing I/O file for temp
 associating mean I/O file /scratch/a/a270088/output_mistral_sanity/temp.fesom.1948.nc
 temp: current mean I/O counter =            1
 writing mean record for temp; rec. count =            1
 initializing I/O file for salt
 associating mean I/O file /scratch/a/a270088/output_mistral_sanity/salt.fesom.1948.nc
 salt: current mean I/O counter =            1
 writing mean record for salt; rec. count =            1
 Do output (netCDF, restart) ...
 associating restart file /scratch/a/a270088/output_mistral_sanity/fesom.1948.oce.restart.nc
 initializing restart file /scratch/a/a270088/output_mistral_sanity/fesom.1948.oce.restart.nc
 current restart counter =            1
 error counter=          22
 Error: NetCDF: Invalid argument
 Run finished unexpectedly!

No problems on juwels.

do not check for errors if NF_CHUNKED does not exist in the netcdf library, otherwise crashes on several machines (i.e. ollie)
@koldunovn
Copy link
Member

Now it works on all three machines.

@dsidoren dsidoren merged commit 9bf1558 into master Oct 6, 2020
@JanStreffing JanStreffing deleted the juwels_chunk_io branch January 25, 2021 15:26
wiltonloch pushed a commit to wiltonloch/fesom2 that referenced this pull request Feb 1, 2023
AWICM3 changes to coupling, restart file writing and DMOC
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants