ACCESS-OM2 crashes reading atmosphere/input.nml #274

Open
aekiss opened this issue Jun 18, 2020 · 2 comments

Comments

@aekiss
Contributor

aekiss commented Jun 18, 2020

On rare occasions ACCESS-OM2 crashes with

forrtl: severe (24): end-of-file during read, unit -129, file /scratch/x77/aek156/access-om2/work/01deg_jra55v140_iaf/atmosphere/input.nml
Image              PC                Routine            Line        Source
fms_ACCESS-OM_08c  0000000002EC7F4B  Unknown               Unknown  Unknown
fms_ACCESS-OM_08c  0000000002F0571E  Unknown               Unknown  Unknown
fms_ACCESS-OM_08c  000000000040FBA7  MAIN__.V                  183  ocean_solo.F90
fms_ACCESS-OM_08c  000000000040F922  Unknown               Unknown  Unknown
libc-2.28.so       0000149968F1D873  __libc_start_main     Unknown  Unknown
fms_ACCESS-OM_08c  000000000040F82E  Unknown               Unknown  Unknown

For an example, see
/scratch/x77/aek156/access-om2/archive/01deg_jra55v140_iaf/*_logs/*7187716*

I think @penguian mentioned he had the same issue last week.

The crash is not reproducible: sweeping and resubmitting fixes the problem, so I've modified resub.sh to include atmosphere/input.nml.
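
A minimal sketch of how a resubmission script could recognise this particular crash before sweeping and resubmitting. This is not the actual content of resub.sh; the function name `is_input_nml_crash` and the log path in the usage comment are hypothetical. It only assumes the error text quoted above appears in the log being scanned.

```shell
#!/bin/sh
# Hypothetical helper, NOT the real resub.sh logic.
# Returns 0 if the given log contains the Fortran end-of-file error
# for atmosphere/input.nml quoted in this issue, non-zero otherwise.
is_input_nml_crash() {
    log="$1"
    # A missing log cannot match.
    [ -f "$log" ] || return 1
    grep -q 'end-of-file during read.*atmosphere/input\.nml' "$log"
}

# Example usage (placeholder path):
# if is_input_nml_crash path/to/stderr.log; then
#     echo "transient input.nml crash: sweep and resubmit"
# fi
```

Matching on both the error text and the file name keeps the check from firing on unrelated end-of-file errors.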

This is a weird issue: MOM is looking for input.nml in atmosphere, not ocean.
The control directory has no input.nml under atmosphere, but in a crashed run work/atmosphere/input.nml exists (and is empty):

$ ls -l work/atmosphere/
total 152
-rw-r--r-- 1 aek156 x77    348 Jun 17 23:46 atm.nml
-rw-r--r-- 1 aek156 x77   2189 Jun 17 23:46 forcing.json
drwxr-s--- 2 aek156 x77 131072 Jun 17 23:46 INPUT
-rw-r----- 1 aek156 x77      0 Jun 17 23:47 input.nml
drwxr-s--- 2 aek156 x77  16384 Jun 17 23:46 log
lrwxrwxrwx 1 aek156 x77     51 Jun 17 23:46 yatm_a6e5d87.exe -> /g/data/ik11/inputs/access-om2/bin/yatm_a6e5d87.exe

In a normal run, by contrast, work/atmosphere/input.nml doesn't exist:

$ ls -l work/atmosphere/
total 276
-rw-r--r-- 1 aek156 x77    348 Jun 18 07:21 atm.nml
-rw-r----- 1 aek156 x77     65 Jun 18 07:22 debug.root.02
-rw-r--r-- 1 aek156 x77   2189 Jun 18 07:21 forcing.json
drwxr-s--- 2 aek156 x77 131072 Jun 18 07:21 INPUT
drwxr-s--- 2 aek156 x77  16384 Jun 18 07:22 log
-rw-r----- 1 aek156 x77 119840 Jun 18 07:22 nout.000000
lrwxrwxrwx 1 aek156 x77     51 Jun 18 07:21 yatm_a6e5d87.exe -> /g/data/ik11/inputs/access-om2/bin/yatm_a6e5d87.exe
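
The difference between the two listings can be checked mechanically. Below is a small sketch, assuming the only symptom of interest is a zero-length `atmosphere/input.nml` under the work directory; the function name `has_stray_input_nml` is hypothetical and not part of payu or ACCESS-OM2.

```shell
#!/bin/sh
# Hypothetical check: returns 0 if <workdir>/atmosphere/input.nml exists
# and is empty (the crashed-run state above), non-zero otherwise.
has_stray_input_nml() {
    f="$1/atmosphere/input.nml"
    # -e: the file exists; ! -s: its size is zero.
    [ -e "$f" ] && [ ! -s "$f" ]
}

# Example usage:
# if has_stray_input_nml work; then
#     echo "empty atmosphere/input.nml found in work directory"
# fi
```

This only reports the symptom; it says nothing about what created the empty file.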

Also, I'm not sure whether it's relevant, but
/scratch/x77/aek156/access-om2/archive/01deg_jra55v140_iaf/pbs_logs/01deg_jra55_iaf.e7187716
contains

[gadi-cpu-clx-0675.gadi.nci.org.au:03917] PMIX ERROR: UNREACHABLE in file ../../../../../../../../../../opal/mca/pmix/pmix3x/pmix/src/mca/ptl/tcp/ptl_tcp_component.c at line 1700
[gadi-cpu-clx-0675.gadi.nci.org.au:03917] PMIX ERROR: UNREACHABLE in file ../../../../../../../../../../opal/mca/pmix/pmix3x/pmix/src/mca/ptl/tcp/ptl_tcp_component.c at line 1744
[gadi-cpu-clx-0675.gadi.nci.org.au:03917] PMIX ERROR: UNREACHABLE in file ../../../../../../../../../../opal/mca/pmix/pmix3x/pmix/src/mca/ptl/tcp/ptl_tcp_component.c at line 1700
[gadi-cpu-clx-0675.gadi.nci.org.au:03917] PMIX ERROR: UNREACHABLE in file ../../../../../../../../../../opal/mca/pmix/pmix3x/pmix/src/mca/ptl/tcp/ptl_tcp_component.c at line 1744
[gadi-cpu-clx-0652.gadi.nci.org.au:56339] PMIX ERROR: UNREACHABLE in file ../../../../../../../../../../opal/mca/pmix/pmix3x/pmix/src/mca/ptl/tcp/ptl_tcp_component.c at line 1700
[gadi-cpu-clx-0652.gadi.nci.org.au:56339] PMIX ERROR: UNREACHABLE in file ../../../../../../../../../../opal/mca/pmix/pmix3x/pmix/src/mca/ptl/tcp/ptl_tcp_component.c at line 1744
[gadi-cpu-clx-0652.gadi.nci.org.au:56339] PMIX ERROR: UNREACHABLE in file ../../../../../../../../../../opal/mca/pmix/pmix3x/pmix/src/mca/ptl/tcp/ptl_tcp_component.c at line 1700
[gadi-cpu-clx-0652.gadi.nci.org.au:56339] PMIX ERROR: UNREACHABLE in file ../../../../../../../../../../opal/mca/pmix/pmix3x/pmix/src/mca/ptl/tcp/ptl_tcp_component.c at line 1744
[gadi-cpu-clx-0652.gadi.nci.org.au:56339] PMIX ERROR: UNREACHABLE in file ../../../../../../../../../../opal/mca/pmix/pmix3x/pmix/src/mca/ptl/tcp/ptl_tcp_component.c at line 1700
[gadi-cpu-clx-0652.gadi.nci.org.au:56339] PMIX ERROR: UNREACHABLE in file ../../../../../../../../../../opal/mca/pmix/pmix3x/pmix/src/mca/ptl/tcp/ptl_tcp_component.c at line 1744
[gadi-cpu-clx-0652.gadi.nci.org.au:56339] PMIX ERROR: UNREACHABLE in file ../../../../../../../../../../opal/mca/pmix/pmix3x/pmix/src/mca/ptl/tcp/ptl_tcp_component.c at line 1700
[gadi-cpu-clx-0652.gadi.nci.org.au:56339] PMIX ERROR: UNREACHABLE in file ../../../../../../../../../../opal/mca/pmix/pmix3x/pmix/src/mca/ptl/tcp/ptl_tcp_component.c at line 1744
[gadi-cpu-clx-0040.gadi.nci.org.au:19442] [[32047,0],41] ORTE_ERROR_LOG: Not found in file ../../../../orte/mca/grpcomm/base/grpcomm_base_stubs.c at line 354
[gadi-cpu-clx-0040.gadi.nci.org.au:19442] [[32047,0],41] ORTE_ERROR_LOG: Not found in file ../../../../orte/mca/grpcomm/base/grpcomm_base_stubs.c at line 278
[gadi-cpu-clx-0040.gadi.nci.org.au:19442] [[32047,0],41] ORTE_ERROR_LOG: Not found in file ../../../../../orte/mca/grpcomm/direct/grpcomm_direct.c at line 187

@aidanheerdegen
Collaborator

Weird. That is here:

https://github.com/mom-ocean/MOM5/blob/master/src/accessom_coupler/ocean_solo.F90#L182-L183

which suggests that the ocean model thinks its working directory is work/atmosphere. I wonder how that can happen. Unfortunately we don't capture the payu run command line, which would help confirm that nothing odd happened there. Extremely unlikely, but worth ruling out.

@aekiss
Contributor Author

aekiss commented Jun 18, 2020

It's also odd that input.nml actually exists in atmosphere (albeit as an empty file).

aekiss added a commit to COSIMA/01deg_jra55_iaf that referenced this issue Jun 18, 2020
aekiss added a commit to COSIMA/01deg_jra55_ryf that referenced this issue Jun 18, 2020
aekiss added a commit to COSIMA/025deg_jra55_iaf that referenced this issue Jun 18, 2020
aekiss added a commit to COSIMA/025deg_jra55_ryf that referenced this issue Jun 18, 2020
aekiss added a commit to COSIMA/1deg_jra55_iaf that referenced this issue Jun 18, 2020
aekiss added a commit to COSIMA/1deg_jra55_ryf that referenced this issue Jun 18, 2020