MPI timeout for some izumi_nag tests reading in datm forcing files in NUOPC cap #1317

Closed
ekluzek opened this issue Mar 30, 2021 · 3 comments

ekluzek commented Mar 30, 2021

Brief summary of bug

Two NUOPC tests on izumi consistently time out in MPI during the initial read of the datm forcing files.

General bug information

CTSM version you are using: ctsm5.1.dev030 (or what will become that tag)

Does this bug cause significantly incorrect results in the model's science? No

Configurations affected:

SMS_Ld5_D_P48x1_Vnuopc.f10_f10_mg37.IHistClm51Bgc.izumi_nag.clm-decStart
SMS_P48x1_D_Ld5_Vnuopc.f10_f10_mg37.I2000Clm50Cn.izumi_nag.clm-default

Details of bug

At initialization, the model reads in the time coordinate for all of the forcing files. It gets through the majority of them successfully, but times out near the end.

It's not clear to me how these two tests differ from other tests that run successfully. Many tests on izumi use CRU forcing, but some at this same resolution use GSWP3.

Important details of your setup / configuration so we can reproduce the bug

See this comment in PR #1309

#1309 (comment)

Important output or errors that show the problem

atm.log file:

(shr_stream_readTCoord) opening stream filename = /project/tss/atm_forcing.datm7.GSWP3.0.5d.v1.c170516/TPHWL/clmforc.GSWP3.c2011.0.5x0.5.TPQWL.1901-02.nc
(shr_stream_readTCoord) closing stream filename = /project/tss/atm_forcing.datm7.GSWP3.0.5d.v1.c170516/TPHWL/clmforc.GSWP3.c2011.0.5x0.5.TPQWL.1901-02.nc
(shr_stream_readTCoord) opening stream filename = /project/tss/atm_forcing.datm7.GSWP3.0.5d.v1.c170516/TPHWL/clmforc.GSWP3.c2011.0.5x0.5.TPQWL.1901-03.nc
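
The excerpt above ends with a file that was opened but never closed, which is where the run appears to stall. A minimal sketch for finding such files in a longer log is below. It assumes every relevant line follows the `(shr_stream_readTCoord) opening/closing stream filename = ...` pattern shown in the excerpt; the function name `unclosed_streams` is hypothetical and not part of CTSM or CDEPS.

```python
import re

# Pattern taken from the atm.log excerpt above; this is an assumption,
# and other log formats would need a different expression.
_STREAM_LINE = re.compile(
    r"\(shr_stream_readTCoord\)\s+(opening|closing)\s+stream filename\s*=\s*(\S+)"
)


def unclosed_streams(lines):
    """Return stream filenames that were opened but never closed, in open order."""
    still_open = []
    for line in lines:
        match = _STREAM_LINE.search(line)
        if match is None:
            continue  # not a stream open/close message
        action, filename = match.groups()
        if action == "opening":
            still_open.append(filename)
        elif filename in still_open:
            still_open.remove(filename)
    return still_open
```

For example, `unclosed_streams(open(path_to_atm_log))` would list any file that was still being read when the job stopped, where `path_to_atm_log` is whatever the atm log file is called in the run directory.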

cesm.log file:

[47] proc= 47 clump no = 1 clump id= 48 beg patch = 6064 end patch = 6183 total patches per clump = 120
[47] proc= 47 clump no = 1 clump id= 48 beg cohort = 249 end cohort = 253 total cohorts per clump = 5
[mpiexec@i037.unified.ucar.edu] HYDT_bscd_pbs_wait_for_completion (tools/bootstrap/external/pbs_wait.c:67): tm_poll(obit_event) failed with TM error 17002
[mpiexec@i037.unified.ucar.edu] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion

@ekluzek ekluzek added the bug something is working incorrectly label Mar 30, 2021
@ekluzek ekluzek self-assigned this Mar 30, 2021

ekluzek commented Apr 9, 2021

It looks like there are some FATES tests that show this same problem. @rgknox @glemieux


ekluzek commented Jun 11, 2021

The BGC test worked for me in the upcoming ctsm5.1.dev044.


ekluzek commented Nov 17, 2021

This seems to be working fine in what will be ctsm5.1.dev062.
