
Hanging on Cheyenne ... #190

Closed
uturuncoglu opened this issue Sep 10, 2020 · 63 comments

Comments

@uturuncoglu
Collaborator

@climbfuji @ligiabernardet I am having trouble with the model on Cheyenne: it hangs while reading static input files such as global_shdmin.0.144x0.144.grb for resolutions > C96. This was also the case with the new buildlib, so I don't think it is related to the build. Have you ever experienced the same problem? This was also reported previously in #184 (comment). Do we need to increase the resources used by the model? For example, C192 hangs/fails without any particular error, and I am using the following configuration options,

ntiles = 6
layout = 4, 6
write_groups: 1
write_tasks_per_group: 36

and a total of 180 processors.
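
For reference, a quick way to sanity-check the PE budget implied by those options; the arithmetic follows from the settings above, and the run-directory file names (input.nml, model_configure) are the usual UFS ones, assumed here rather than quoted from this thread:

```
# PE budget for the layout above:
#   compute tasks = ntiles * layout_x * layout_y         = 6 * 4 * 6 = 144
#   write tasks   = write_groups * write_tasks_per_group = 1 * 36    = 36
#   total         = 144 + 36                              = 180
# Check what actually got picked up in the run directory (file names assumed):
grep -E 'ntiles|layout' input.nml
grep -E 'write_groups|write_tasks_per_group' model_configure
```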

uturuncoglu changed the title from "Hanging on Orion ..." to "Hanging on Cheyenne ..." on Sep 10, 2020
@ligiabernardet
Collaborator

@uturuncoglu I have not received any reports of the model hanging. @llpcarson Any insight wrt hanging on Cheyenne?

@uturuncoglu
Collaborator Author

@ligiabernardet It is strange. I updated buildlib and I am still waiting to resolve this issue. Let me know if you see a similar issue.

@llpcarson
Collaborator

llpcarson commented Sep 10, 2020 via email

@ligiabernardet
Collaborator

@ufuk We are waiting on a PR of the updated build so we can merge it onto the release/public-v1 branch and conduct tests.

@uturuncoglu
Collaborator Author

@llpcarson Those options are consistent. Anyway, I'll make a PR soon and you can test it. All these strange things happen in my account; maybe something is wrong there. Let's see what you find in your tests.

@uturuncoglu
Collaborator Author

@ligiabernardet I created the PR at the app level.

@ligiabernardet
Collaborator

@uturuncoglu Does it hang all the time or occasionally?

@uturuncoglu
Collaborator Author

@ligiabernardet In my recent test all resolutions failed in the same way except the C96 ones.

@ligiabernardet
Collaborator

@uturuncoglu Here is a suggestion from @climbfuji: Are we using threading? If yes: Can we test compiling without OpenMP, or even easier, run with one OpenMP thread only, and see if this solves the problem?

@ligiabernardet
Collaborator

@llpcarson is running some tests on Cheyenne. Laurie, let us know what you find out.

@uturuncoglu
Collaborator Author

@climbfuji we are not using threading, at least for the following test:

/glade/scratch/turuncu/SMS_Lh3.C192.GFSv15p2.cheyenne_intel.20200909_155451_1sg3p2

and it still hangs/fails when reading the file.

@ligiabernardet thanks. I hope I am the only one having this issue.

@llpcarson
Collaborator

llpcarson commented Sep 11, 2020 via email

@uturuncoglu
Collaborator Author

@llpcarson That is great! Yes, it would be great if you ran the full test suite. Once you run it (if you run without specifying a compiler, such as --xml-compiler intel, it will run both the Intel and GNU tests), please let me know the directory and I can double-check the results. Thanks for your help.
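
For context, here is a sketch of that kind of full-suite invocation; the testlist path, workflow name, and walltime mirror the example given later in this thread and may need adjusting for your own checkout and scratch space:

```
# Run the full MRW App regression suite on Cheyenne from cime/scripts;
# omitting --xml-compiler runs both the Intel and GNU tests
CIME_MODEL=ufs ./create_test \
    --xml-testlist ../../src/model/FV3/cime/cime_config/testlist.xml \
    --xml-machine cheyenne \
    --workflow ufs-mrweather_wo_post \
    -j 4 --walltime 03:00:00
# add --xml-compiler intel (or gnu) to restrict the run to a single compiler
```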

@llpcarson
Collaborator

llpcarson commented Sep 11, 2020 via email

@uturuncoglu
Collaborator Author

@llpcarson I was having problems with C768 on Cheyenne as well. Is this on Cheyenne? Probably a couple of them will pass and a couple will fail. We might need to increase the allocated resources for C768 because it is not stable at this point. What do you think @GeorgeGayno-NOAA?

@GeorgeGayno-NOAA
Collaborator

@llpcarson I was having problems with C768 on Cheyenne as well. Is this on Cheyenne? Probably a couple of them will pass and a couple will fail. We might need to increase the allocated resources for C768 because it is not stable at this point. What do you think @GeorgeGayno-NOAA?

Is the model hanging or chgres_cube? I am more familiar with the latter.

@uturuncoglu
Collaborator Author

@GeorgeGayno-NOAA I think CHGRES is failing for C768. We are using 6 nodes with 6 cores per node, as you suggested. It runs in some cases and fails in others, so not every C768 test is failing.

@llpcarson
Collaborator

llpcarson commented Sep 11, 2020 via email

@uturuncoglu
Collaborator Author

@llpcarson You could still check the run/INPUT folder of one of the cases to see the CHGRES-generated files. If they are there, the model will pick them up and run. I hope it won't hang. What about other resolutions? Did you see any hanging issue with the model?
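
A minimal way to do that check, assuming the standard chgres_cube product names for the global grid (gfs_ctrl.nc, gfs_data.tile*.nc, sfc_data.tile*.nc) and a placeholder case path:

```
# List the chgres_cube products the forecast expects in run/INPUT (case path is a placeholder)
ls -l /glade/scratch/<user>/<case>/run/INPUT/gfs_ctrl.nc \
      /glade/scratch/<user>/<case>/run/INPUT/gfs_data.tile?.nc \
      /glade/scratch/<user>/<case>/run/INPUT/sfc_data.tile?.nc
```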

@GeorgeGayno-NOAA
Collaborator

On Cheyenne: Yes, chgres_cube is failing (seg-fault) for some of the C768 cases (but not all). The model/forecast jobs (the ones that had a successful chgres_cube) are still waiting in the queue. Laurie


Are the failures happening with a certain input data, like grib2 or nemsio?

@uturuncoglu
Collaborator Author

@GeorgeGayno-NOAA The default input type is GRIB2 and the test suite uses that one.

@llpcarson
Collaborator

llpcarson commented Sep 11, 2020 via email

@uturuncoglu
Collaborator Author

@llpcarson I could not find /glade/scratch/carson/ufs/. Is that path correct? Yes, the error is strange: it shows that the file is missing or corrupted, but all the cases use the same file. Did you also run the GNU tests?

@llpcarson
Collaborator

llpcarson commented Sep 11, 2020 via email

@climbfuji
Collaborator

climbfuji commented Sep 11, 2020 via email

@llpcarson
Collaborator

llpcarson commented Sep 11, 2020 via email

@uturuncoglu
Collaborator Author

@llpcarson _D is debug mode. I am not sure which options are changed, but if you need it I could check.

@uturuncoglu
Collaborator Author

Yes, I ran both Gnu and Intel. Each had failure and success for chgres_cube. Here's one of the run-dirs with a failure: /glade/scratch/carson/ufs/SMS_Lh3_D.C768.GFSv15p2.cheyenne_gnu.G.20200911_091828_ou9in9/run Does the _D part refer to a debug-mode compile? (just curious)

Yes, I checked your directory and it seems there is no build error, but all the C768 tests failed due to the failure in CHGRES.

@uturuncoglu
Collaborator Author

@llpcarson Yes, in some cases if you run again, CHGRES processes without any problem. I am not sure, but it could be related to the node allocation on Cheyenne. It might be nice to check on the other platforms.

@ligiabernardet
Collaborator

@llpcarson do you have a failed chgres_cube case that we can use to check a) the namelist and b) the GRIB2 inventory ./chgres.inv?
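
A sketch of what that check could look like; fort.41 is the namelist chgres_cube reads, while the case paths and GRIB2 input file name below are placeholders rather than names taken from this thread:

```
# Compare the chgres_cube namelists of a failing and a passing case (paths are placeholders)
diff <failing_case>/run/fort.41 <passing_case>/run/fort.41
# Rebuild the GRIB2 inventory by hand and compare it with the chgres.inv the failing run wrote;
# wgrib2 with only a file argument prints the inventory to stdout (input file name is a placeholder)
wgrib2 <failing_case>/run/INPUT/atm.input.grb2 > chgres.inv.manual
diff chgres.inv.manual <failing_case>/run/chgres.inv
```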

@uturuncoglu I confirm that we also have a failure of chgres_cube C768 on Orion.
Results from other platforms:

  • Jet: C768 passed (only 1 run tested - it takes more than a day in the queue due to reservations, so hard to do many runs)
  • Hera: C768 passed (only 1 run tested)
  • Orion: 1/19 tests that are part of RT crashed on chgres_cube (C768 RT failed on orion #194)
  • Stampede: waiting RT results from @climbfuji
  • Gaea: waiting C768 results from @climbfuji

@llpcarson
Collaborator

llpcarson commented Sep 17, 2020 via email

@uturuncoglu
Collaborator Author

@ligiabernardet Thanks for the update. I am not sure where the source of the problem is. Since we also have cases that run, I suspect chgres; it might have a memory leak or something similar.

@ligiabernardet
Collaborator

@GeorgeGayno-NOAA @climbfuji @uturuncoglu @arunchawla-NOAA @climbfuji @rsdunlapiv @llpcarson Do we have any other hypothesis or idea of what to try to get chgres_cube to work in CIME consistently?

Cheyenne: Occasional crashes of chgres_cube C768 when reading GRB2 file
Jet: C768 passed (only 1 run tested - it takes more than a day in the queue due to reservations, so hard to do many runs)
Hera: C768 passed (only 1 run tested)
Orion: 1/19 tests that are part of RT crashed on chgres_cube (#194)
Stampede: waiting RT results
Gaea: waiting C768 results

@GeorgeGayno-NOAA
Collaborator

GeorgeGayno-NOAA commented Sep 17, 2020

@GeorgeGayno-NOAA @climbfuji @uturuncoglu @arunchawla-NOAA @climbfuji @rsdunlapiv @llpcarson Do we have any other hypothesis or idea of what to try to get chgres_cube to work in CIME consistently?

Cheyenne: Occasional crashes of chgres_cube C768 when reading GRB2 file
Jet: C768 passed (only 1 run tested - it takes more than a day in the queue due to reservations, so hard to do many runs)
Hera: C768 passed (only 1 run tested)
Orion: 1/19 tests that are part of RT crashed on chgres_cube (#194)
Stampede: waiting RT results
Gaea: waiting C768 results

Is it always failing at the same spot (model_grid.F90 line 640)? And the failure occurs randomly? Do the RT tests run in sequence or simultaneously?

@climbfuji
Collaborator

@ligiabernardet @GeorgeGayno-NOAA @climbfuji @uturuncoglu @arunchawla-NOAA @jedwards4b @rsdunlapiv @llpcarson

I just got a successful run of all regression tests on Cheyenne with Intel. This is what I did:

  • reverted from Intel 19.0.5 to Intel 18.0.5 (because we did not see any problems on Hera and Jet, which use Intel 18)
  • removed a whole bunch of stuff from the CIME UFS config that was there for no reason

PRs:

@climbfuji
Collaborator

Ok, here we go ... just got one failure with Intel 18.0.5 on Cheyenne in my second round of tests (when running both Intel and GNU tests with the same command). Super annoying. Will see how the rest works out.

@climbfuji
Collaborator

@uturuncoglu is there a way to force the tests to run serially, i.e. only one regression test running at a time?

@climbfuji
Collaborator

@uturuncoglu another question, how do I change the default MPI job size for chgres in cime? I want the regression tests to run on a different number of nodes with a different number of tasks per node, still 36 tasks in total for C768. Thanks ...

@climbfuji
Collaborator

@jedwards4b Ufuk seems to be out or busy today, can you answer my basic questions in #190 (comment) and #190 (comment) by any chance? Thanks ...

@mvertens
Collaborator

mvertens commented Sep 18, 2020 via email

@jedwards4b
Collaborator

Default job size and pelayout are set in config_workflow.xml

You can run an individual test by naming that test on the create_test command line, for example:
./create_test SMS_Lh3_D.C768.GFSv15p2.cheyenne_intel --workflow ufs-mrweather_wo_post

@uturuncoglu
Collaborator Author

@climbfuji I am back. I am not sure about that. The tests are just built and submitted to the queue. Maybe the submission scripts could be sourced to create dependencies between them, but I am not sure that is currently supported, and if one of them fails the dependent jobs would also be killed. @jedwards4b is there any way to run the tests one by one?

@uturuncoglu
Collaborator Author

@climbfuji you can change the number of processors for CHGRES with

./xmlchange task_count=144 --subgroup case.chgres

but don't forget to run ./preview_namelists afterwards.
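
Putting those two steps together as a short sketch (the 144 is just the example value from above; the --subgroup option on xmlquery is assumed to mirror the one on xmlchange):

```
# From the case directory: resize the chgres job, then regenerate namelists and run scripts
./xmlchange task_count=144 --subgroup case.chgres
./preview_namelists
# Confirm the new value took effect (assumes xmlquery accepts --subgroup like xmlchange does)
./xmlquery task_count --subgroup case.chgres
```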

@uturuncoglu
Collaborator Author

@climbfuji BTW, I don't think the compiler version is the problem. We are seeing a similar issue on different platforms. Again, it could be a memory leak or something else in the CHGRES part. Did you have a chance to look at it?

@climbfuji
Collaborator

climbfuji commented Sep 18, 2020

@climbfuji BTW, I don't think the compiler version is the problem. We are seeing a similar issue on different platforms. Again, it could be a memory leak or something else in the CHGRES part. Did you have a chance to look at it?

I am looking, yes. The Intel 18 test is because we don't see any issues on Stampede, Hera, and Jet (all Intel 18), but we do on Orion (Intel 19). I just got another successful pass through the regression tests on Cheyenne with Intel 18, using 3*12 tasks (3 nodes with 12 tasks per node) for chgres_cube.exe instead of 6*6. I'll repeat this a few times to check that this is not yet another red herring.

@uturuncoglu
Collaborator Author

@climbfuji But if I understand correctly, you also got an error with Intel 18 on Cheyenne. So we still have a problem here, and it could lead to a failure in the MRW App. If you need to change from 6*6 to 3*12, it is very easy and I could make the required changes under buildnml.

@climbfuji
Collaborator

@climbfuji but if I understand correctly you got error with Intel 18 on Cheyenne also. So, we have still problem in here and it could lead to a failure in MR app. If you need to change from 6_6 to 3_12 it is very easy and I could make the required changes under buildnml.

This formatting in GitHub ... I know, I changed buildnml to use 3 times 12 instead of 6 times 6 tasks.

@climbfuji
Collaborator

climbfuji commented Sep 18, 2020

Update 20200918: OK, I got one successful pass through all Intel tests with this configuration. Will launch a few more over the weekend with both Intel and GNU.

Update 20200919: I got one more successful pass through all Intel tests with this configuration.

Update 20200919 II: I got another successful pass through all Intel tests with this configuration. I will do one more; if that one passes as well, I recommend merging (and retagging, where necessary) NOAA-EMC/NCEPLIBS-external#70, ESMCI/cime#3713, and ufs-community/ufs-weather-model#204 with its dependencies.

Update 20200920: a fourth successful pass, this time running Intel and GNU at the same time on Cheyenne. We should definitely merge all of the above PRs. In the long run, we need to find out if, and if so why, chgres_cube.exe produces more segfaults with Intel 19 than with Intel 18. There is a multitude of possibilities, ranging from different memory mapping strategies, other optimizations, and bugs in chgres_cube.exe or the underlying GRIB2 I/O layer, to bugs in the compiler. I have heard several times that Intel 19 (in particular Intel 19.0.x) caused problems. @uturuncoglu @panll @ligiabernardet @GeorgeGayno-NOAA

@GeorgeGayno-NOAA
Collaborator

There's a set of cases on cheyenne here: /glade/scratch/carson/ufs/mrw.test/stack/
Fails: SMS_Lh3_D.C768.GFSv16beta.cheyenne_intel.20200915_090650_qga1v1/run/
Runs: SMS_Lh3_D.C768.GFSv15p2.cheyenne_intel.20200915_090650_qga1v1/run/
Both chgres.inv files are identical. Both namelist files are identical. Laurie

I was just looking through model_grid.F90. I noticed that the call to wgrib2 (grb2_mk_inv) that creates chgres.inv is done on all MPI tasks. It is possible one task may not finish before another task tries to use chgres.inv at line 640. A quick test would be to create chgres.inv on one task only, then add a barrier before all tasks try to read it. Just a wild guess, but that might explain the random nature of the error.

billsacks added a commit to ESMCI/cime that referenced this issue Sep 22, 2020
ufs_release_v1.1: remove unnecessary/incorrect configuration options for Cheyenne for the UFS; downgrade Intel 19.x.y to 18.m.n on Cheyenne, Gaea, Orion

### Description

This PR removes a whole bunch of unnecessary/incorrect configuration options for Cheyenne for the UFS and reverts Intel back to Intel-18.0.5 on Cheyenne, Gaea, Orion.

Beforehand, the CIME regression tests failed frequently in `chgres_cube.exe` for the highest resolution cases, see description here: ufs-community/ufs-mrweather-app#190

With this change and the associated change in the ufs-weather-model and NCEPLIBS-external (documentation only), the regression tests ran successfully (tried one time thus far, will conduct multiple runs to make sure I wasn't just lucky).
```
export UFS_DRIVER=nems
export UFS_INPUT=$CESMDATAROOT
export UFS_SCRATCH=/glade/work/heinzell/fv3/ufs-mrweather-app/ufs_scratch
qcmd -l walltime=3:00:00 -- "export UFS_DRIVER=nems; CIME_MODEL=ufs ./create_test --xml-testlist ../../src/model/FV3/cime/cime_config/testlist.xml --xml-machine cheyenne --xml-compiler intel --workflow ufs-mrweather_wo_post -j 4 --walltime 03:00:00"

...

heinzell@cheyenne3:/glade/work/heinzell/fv3/ufs-mrweather-app/ufs-mrweather-app-release-public-v1/cime/scripts> /glade/work/heinzell/fv3/ufs-mrweather-app/ufs_scratch/cs.status.20200917_210120_2ap63n | grep Overall
  ERS_Lh11.C96.GFSv15p2.cheyenne_intel (Overall: PASS) details:
  ERS_Lh11.C96.GFSv16beta.cheyenne_intel (Overall: PASS) details:
  PET_Lh11.C96.GFSv15p2.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3.C192.GFSv15p2.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3.C192.GFSv16beta.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3.C384.GFSv15p2.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3.C384.GFSv16beta.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3.C768.GFSv15p2.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3.C768.GFSv16beta.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3.C96.GFSv15p2.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3.C96.GFSv16beta.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3_D.C192.GFSv15p2.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3_D.C192.GFSv16beta.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3_D.C384.GFSv15p2.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3_D.C384.GFSv16beta.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3_D.C768.GFSv15p2.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3_D.C768.GFSv16beta.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3_D.C96.GFSv15p2.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3_D.C96.GFSv16beta.cheyenne_intel (Overall: PASS) details:
```

The PR also replaces a few tabs with whitespaces for consistent formatting in `config/ufs/machines/config_machines.xml`.

Note that we will need to (re-)create tag `ufs-v1.1.0` (or whatever the CIME convention is; this is what we use for all the UFS components and external libraries, and what is also consistent with MRW App release 1.0).
@climbfuji
Collaborator

climbfuji commented Sep 25, 2020

Just so that everyone is up to date: after downgrading the compiler, we still got crashes on Orion (but not on Cheyenne). I implemented George's fix in UFS_UTILS and this seems to solve the problems on Orion. I will test those changes (see NOAA-EMC/NCEPLIBS#118) on other platforms as well before we merge them and roll out a retagged version of NCEPLIBS on all platforms.

I will update the testing information in this comment as I am progressing:

  • orion.intel: three successful passes through all MRW App regression tests thus far
  • cheyenne.intel: one successful pass through all MRW App regression tests using Intel and GNU (with the same command)

@arunchawla-NOAA
Collaborator

This is now fixed.
