
Hanging on Cheyenne ... #190

Closed
uturuncoglu opened this issue Sep 10, 2020 · 63 comments

Comments

@uturuncoglu
Collaborator

@climbfuji @ligiabernardet I am having trouble with the model on Cheyenne: it hangs while reading static input files such as global_shdmin.0.144x0.144.grb for resolutions > C96. This was also the case with the new buildlib, so I don't think it is related to the build. Have you ever experienced the same problem? This was also reported previously in #184 (comment). Do we need to increase the resources used by the model? For example, C192 hangs/fails without any particular error, and I am using the following configuration options,

ntiles = 6
layout = 4, 6
write_groups: 1
write_tasks_per_group: 36

and a total of 180 processors.
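
For reference, a quick way to sanity-check the PE budget implied by those options; the arithmetic follows from the settings above, and the run-directory file names (input.nml, model_configure) are the usual UFS ones, assumed here rather than quoted from this thread:

```
# PE budget for the layout above:
#   compute tasks = ntiles * layout_x * layout_y         = 6 * 4 * 6 = 144
#   write tasks   = write_groups * write_tasks_per_group = 1 * 36    = 36
#   total         = 144 + 36                              = 180
# Check what actually got picked up in the run directory (file names assumed):
grep -E 'ntiles|layout' input.nml
grep -E 'write_groups|write_tasks_per_group' model_configure
```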

uturuncoglu changed the title from "Hanging on Orion ..." to "Hanging on Cheyenne ..." on Sep 10, 2020
@ligiabernardet
Collaborator

@uturuncoglu I have not received any reports of the model hanging. @llpcarson Any insight wrt hanging on Cheyenne?

@uturuncoglu
Collaborator Author

@ligiabernardet It is strange. I updated buildlib and I am still waiting to resolve this issue. Let me know if you see a similar issue.

@llpcarson
Collaborator

llpcarson commented Sep 10, 2020 via email

@ligiabernardet
Collaborator

@ufuk We are waiting on a PR of the updated build so we can merge it onto the release/public-v1 branch and conduct tests.

@uturuncoglu
Collaborator Author

@llpcarson Those options are consistent. Anyway, I'll make a PR soon and you can test it. All these strange things happen in my account; maybe something is wrong there. Let's see what you find in your tests.

@uturuncoglu
Collaborator Author

@ligiabernardet I created the PR at the app level.

@ligiabernardet
Collaborator

@uturuncoglu Does it hang all the time or occasionally?

@uturuncoglu
Collaborator Author

@ligiabernardet In my recent test all resolutions failed in the same way except the C96 ones.

@ligiabernardet
Collaborator

@uturuncoglu Here is a suggestion from @climbfuji: Are we using threading? If yes: Can we test compiling without OpenMP, or even easier, run with one OpenMP thread only, and see if this solves the problem?

@ligiabernardet
Collaborator

@llpcarson is running some tests on Cheyenne. Laurie, let us know what you find out.

@uturuncoglu
Collaborator Author

@climbfuji we are not using threading, at least for the following test:

/glade/scratch/turuncu/SMS_Lh3.C192.GFSv15p2.cheyenne_intel.20200909_155451_1sg3p2

and it still hangs/fails when reading the file.

@ligiabernardet thanks. I hope I am the only one having this issue.

@llpcarson
Collaborator

llpcarson commented Sep 11, 2020 via email

@uturuncoglu
Collaborator Author

@llpcarson That is great! Yes, it would be great if you ran the full test suite. Once you run it (if you run without specifying a compiler, such as --xml-compiler intel, it will run both the Intel and GNU tests), please let me know the directory and I can double-check the results. Thanks for your help.
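
For context, here is a sketch of that kind of full-suite invocation; the testlist path, workflow name, and walltime mirror the example given later in this thread and may need adjusting for your own checkout and scratch space:

```
# Run the full MRW App regression suite on Cheyenne from cime/scripts;
# omitting --xml-compiler runs both the Intel and GNU tests
CIME_MODEL=ufs ./create_test \
    --xml-testlist ../../src/model/FV3/cime/cime_config/testlist.xml \
    --xml-machine cheyenne \
    --workflow ufs-mrweather_wo_post \
    -j 4 --walltime 03:00:00
# add --xml-compiler intel (or gnu) to restrict the run to a single compiler
```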

@llpcarson
Collaborator

llpcarson commented Sep 11, 2020 via email

@uturuncoglu
Collaborator Author

@llpcarson I was having problems with C768 on Cheyenne as well. Is this on Cheyenne? Probably a couple of them will pass and a couple will fail. We might need to increase the allocated resources for C768 because it is not stable at this point. What do you think @GeorgeGayno-NOAA?

@GeorgeGayno-NOAA
Collaborator

@llpcarson I was having problems with C768 on Cheyenne as well. Is this on Cheyenne? Probably a couple of them will pass and a couple will fail. We might need to increase the allocated resources for C768 because it is not stable at this point. What do you think @GeorgeGayno-NOAA?

Is the model hanging or chgres_cube? I am more familiar with the latter.

@uturuncoglu
Collaborator Author

@GeorgeGayno-NOAA I think CHGRES is failing for C768. We are using 6 nodes with 6 cores per node, as you suggested. It runs in some cases and fails in others, so not every C768 test is failing.

@llpcarson
Collaborator

llpcarson commented Sep 11, 2020 via email

@uturuncoglu
Collaborator Author

@llpcarson You could still check the run/INPUT folder of one of the cases to see the CHGRES-generated files. If they are there, the model will pick them up and run. I hope it won't hang. What about other resolutions? Did you see any hanging issue with the model?
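
A minimal way to do that check, assuming the standard chgres_cube product names for the global grid (gfs_ctrl.nc, gfs_data.tile*.nc, sfc_data.tile*.nc) and a placeholder case path:

```
# List the chgres_cube products the forecast expects in run/INPUT (case path is a placeholder)
ls -l /glade/scratch/<user>/<case>/run/INPUT/gfs_ctrl.nc \
      /glade/scratch/<user>/<case>/run/INPUT/gfs_data.tile?.nc \
      /glade/scratch/<user>/<case>/run/INPUT/sfc_data.tile?.nc
```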

@GeorgeGayno-NOAA
Collaborator

On Cheyenne: Yes, chgres_cube is failing (seg-fault) for some of the C768 cases (but not all). The model/forecast jobs (the ones that had a successful chgres_cube) are still waiting in the queue. Laurie


Are the failures happening with a certain input data, like grib2 or nemsio?

@uturuncoglu
Collaborator Author

@GeorgeGayno-NOAA The default input type is GRIB2 and the test suite uses that one.

@llpcarson
Collaborator

llpcarson commented Sep 11, 2020 via email

@uturuncoglu
Collaborator Author

@llpcarson I could not find /glade/scratch/carson/ufs/. Is that path correct? Yes, the error is strange: it shows that the file is missing or corrupted, but all the cases use the same file. Did you also run the GNU tests?

@llpcarson
Collaborator

llpcarson commented Sep 11, 2020 via email

@climbfuji
Collaborator

climbfuji commented Sep 11, 2020 via email

@llpcarson
Collaborator

llpcarson commented Sep 11, 2020 via email

@uturuncoglu
Collaborator Author

@llpcarson _D is debug mode. I am not sure which options are changed, but if you need it I could check.

@uturuncoglu
Collaborator Author

Yes, I ran both Gnu and Intel. Each had failure and success for chgres_cube. Here's one of the run-dirs with a failure: /glade/scratch/carson/ufs/SMS_Lh3_D.C768.GFSv15p2.cheyenne_gnu.G.20200911_091828_ou9in9/run Does the _D part refer to a debug-mode compile? (just curious)

Yes, I checked your directory and it seems there is no build error, but all the C768 tests failed due to the failure in CHGRES.

@uturuncoglu
Collaborator Author

@llpcarson Yes, in some cases if you run again, CHGRES processes without any problem. I am not sure, but it could be related to the node allocation on Cheyenne. It might be nice to check on the other platforms.

@ligiabernardet
Collaborator

@llpcarson do you have a failed chgres_cube case that we can use to check a) the namelist and b) the GRIB2 inventory ./chgres.inv?
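
A sketch of what that check could look like; fort.41 is the namelist chgres_cube reads, while the case paths and GRIB2 input file name below are placeholders rather than names taken from this thread:

```
# Compare the chgres_cube namelists of a failing and a passing case (paths are placeholders)
diff <failing_case>/run/fort.41 <passing_case>/run/fort.41
# Rebuild the GRIB2 inventory by hand and compare it with the chgres.inv the failing run wrote;
# wgrib2 with only a file argument prints the inventory to stdout (input file name is a placeholder)
wgrib2 <failing_case>/run/INPUT/atm.input.grb2 > chgres.inv.manual
diff chgres.inv.manual <failing_case>/run/chgres.inv
```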

@uturuncoglu I confirm that we also have a failure of chgres_cube C768 on Orion.
Results from other platforms:

  • Jet: C768 passed (only 1 run tested - it takes more than a day in the queue due to reservations, so hard to do many runs)
  • Hera: C768 passed (only 1 run tested)
  • Orion: 1/19 tests that are part of RT crashed on chgres_cube (C768 RT failed on orion #194)
  • Stampede: waiting RT results from @climbfuji
  • Gaea: waiting C768 results from @climbfuji

@llpcarson
Collaborator

llpcarson commented Sep 17, 2020 via email

@uturuncoglu
Collaborator Author

@ligiabernardet Thanks for the update. I am not sure where the source of the problem is. Since we also have cases that run, I suspect chgres; it might have a memory leak or something similar.

@ligiabernardet
Collaborator

@GeorgeGayno-NOAA @climbfuji @uturuncoglu @arunchawla-NOAA @climbfuji @rsdunlapiv @llpcarson Do we have any other hypothesis or idea of what to try to get chgres_cube to work in CIME consistently?

Cheyenne: Occasional crashes of chgres_cube C768 when reading GRB2 file
Jet: C768 passed (only 1 run tested - it takes more than a day in the queue due to reservations, so hard to do many runs)
Hera: C768 passed (only 1 run tested)
Orion: 1/19 tests that are part of RT crashed on chgres_cube (#194)
Stampede: waiting RT results
Gaea: waiting C768 results

@GeorgeGayno-NOAA
Collaborator

GeorgeGayno-NOAA commented Sep 17, 2020

@GeorgeGayno-NOAA @climbfuji @uturuncoglu @arunchawla-NOAA @climbfuji @rsdunlapiv @llpcarson Do we have any other hypothesis or idea of what to try to get chgres_cube to work in CIME consistently?

Cheyenne: Occasional crashes of chgres_cube C768 when reading GRB2 file
Jet: C768 passed (only 1 run tested - it takes more than a day in the queue due to reservations, so hard to do many runs)
Hera: C768 passed (only 1 run tested)
Orion: 1/19 tests that are part of RT crashed on chgres_cube (#194)
Stampede: waiting RT results
Gaea: waiting C768 results

Is it always failing at the same spot (model_grid.F90 line 640)? And the failure occurs randomly? Do the RT tests run in sequence or simultaneously?

@climbfuji
Collaborator

@ligiabernardet @GeorgeGayno-NOAA @climbfuji @uturuncoglu @arunchawla-NOAA @jedwards4b @rsdunlapiv @llpcarson

I just got a successful run of all regression tests on Cheyenne with Intel. This is what I did:

  • reverted from Intel 19.0.5 to Intel 18.0.5 (because we did not see any problems on Hera and Jet, which use Intel 18)
  • removed a whole bunch of stuff from the CIME UFS config that was there for no reason

PRs:

@climbfuji
Collaborator

Ok, here we go ... just got one failure with Intel 18.0.5 on Cheyenne in my second round of tests (when running both Intel and GNU tests with the same command). Super annoying. Will see how the rest works out.

@climbfuji
Collaborator

@uturuncoglu is there a way to force the tests to run serially, i.e. only one regression test running at a time?

@climbfuji
Collaborator

@uturuncoglu another question, how do I change the default MPI job size for chgres in cime? I want the regression tests to run on a different number of nodes with a different number of tasks per node, still 36 tasks in total for C768. Thanks ...

@climbfuji
Collaborator

@jedwards4b Ufuk seems to be out or busy today, can you answer my basic questions in #190 (comment) and #190 (comment) by any chance? Thanks ...

@mvertens
Collaborator

mvertens commented Sep 18, 2020 via email

@jedwards4b
Collaborator

Default job size and pelayout are set in config_workflow.xml

You can run an individual test by naming that test on the create_test command line, for example:
./create_test SMS_Lh3_D.C768.GFSv15p2.cheyenne_intel --workflow ufs-mrweather_wo_post

@uturuncoglu
Collaborator Author

@climbfuji I am back. I am not sure about that. The tests are just built and submitted to the queue. Maybe the submission scripts could be sourced to create dependencies between them, but I am not sure that is currently supported, and if one of them fails the dependent jobs would also be killed. @jedwards4b is there any way to run the tests one by one?

@uturuncoglu
Collaborator Author

@climbfuji you can change the number of processors for CHGRES with

./xmlchange task_count=144 --subgroup case.chgres

but don't forget to run ./preview_namelists afterwards.
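
Putting those two steps together as a short sketch (the 144 is just the example value from above; the --subgroup option on xmlquery is assumed to mirror the one on xmlchange):

```
# From the case directory: resize the chgres job, then regenerate namelists and run scripts
./xmlchange task_count=144 --subgroup case.chgres
./preview_namelists
# Confirm the new value took effect (assumes xmlquery accepts --subgroup like xmlchange does)
./xmlquery task_count --subgroup case.chgres
```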

@uturuncoglu
Collaborator Author

@climbfuji BTW, I don't think the compiler version is the problem. We are seeing a similar issue on different platforms. Again, it could be a memory leak or something else in the CHGRES part. Did you have a chance to look at it?

@climbfuji
Collaborator

climbfuji commented Sep 18, 2020

@climbfuji BTW, I don't think the compiler version is the problem. We are seeing a similar issue on different platforms. Again, it could be a memory leak or something else in the CHGRES part. Did you have a chance to look at it?

I am looking, yes. The Intel 18 test is because we don't see any issues on Stampede, Hera, and Jet (all Intel 18), but we do on Orion (Intel 19). I just got another successful pass through the regression tests on Cheyenne with Intel 18, using 3*12 tasks (3 nodes with 12 tasks per node) for chgres_cube.exe instead of 6*6. I'll repeat this a few times to check that this is not yet another red herring.

@uturuncoglu
Collaborator Author

@climbfuji But if I understand correctly, you also got an error with Intel 18 on Cheyenne. So we still have a problem here, and it could lead to a failure in the MRW App. If you need to change from 6*6 to 3*12, it is very easy and I could make the required changes under buildnml.

@climbfuji
Collaborator

@climbfuji but if I understand correctly you got error with Intel 18 on Cheyenne also. So, we have still problem in here and it could lead to a failure in MR app. If you need to change from 6_6 to 3_12 it is very easy and I could make the required changes under buildnml.

This formatting in GitHub ... I know, I changed buildnml to use 3 times 12 instead of 6 times 6 tasks.

@climbfuji
Collaborator

climbfuji commented Sep 18, 2020

Update 20200918: OK, I got one successful pass through all Intel tests with this configuration. Will launch a few more over the weekend with both Intel and GNU.

Update 20200919: I got one more successful pass through all Intel tests with this configuration.

Update 20200919 II: I got another successful pass through all Intel tests with this configuration. I will do one more; if that one passes as well, I recommend merging (and retagging, where necessary) NOAA-EMC/NCEPLIBS-external#70, ESMCI/cime#3713, and ufs-community/ufs-weather-model#204 with its dependencies.

Update 20200920: a fourth successful pass, this time running Intel and GNU at the same time on Cheyenne. We should definitely merge all of the above PRs. In the long run, we need to find out if, and if so why, chgres_cube.exe produces more segfaults with Intel 19 than with Intel 18. There is a multitude of possibilities, ranging from different memory mapping strategies, other optimizations, and bugs in chgres_cube.exe or the underlying GRIB2 I/O layer, to bugs in the compiler. I have heard several times that Intel 19 (in particular Intel 19.0.x) caused problems. @uturuncoglu @panll @ligiabernardet @GeorgeGayno-NOAA

@GeorgeGayno-NOAA
Collaborator

There's a set of cases on cheyenne here: /glade/scratch/carson/ufs/mrw.test/stack/
Fails: SMS_Lh3_D.C768.GFSv16beta.cheyenne_intel.20200915_090650_qga1v1/run/
Runs: SMS_Lh3_D.C768.GFSv15p2.cheyenne_intel.20200915_090650_qga1v1/run/
Both chgres.inv files are identical. Both namelist files are identical. Laurie

I was just looking through model_grid.F90. I noticed that the call to wgrib2 (grb2_mk_inv) that creates chgres.inv is done on all MPI tasks. It is possible one task may not finish before another task tries to use chgres.inv at line 640. A quick test would be to create chgres.inv on one task only, then add a barrier before all tasks try to read it. Just a wild guess, but that might explain the random nature of the error.

billsacks added a commit to ESMCI/cime that referenced this issue Sep 22, 2020
ufs_release_v1.1: remove unnecessary/incorrect configuration options for Cheyenne for the UFS; downgrade Intel 19.x.y to 18.m.n on Cheyenne, Gaea, Orion

### Description

This PR removes a whole bunch of unnecessary/incorrect configuration options for Cheyenne for the UFS and reverts Intel back to Intel-18.0.5 on Cheyenne, Gaea, Orion.

Beforehand, the CIME regression tests failed frequently in `chgres_cube.exe` for the highest resolution cases, see description here: ufs-community/ufs-mrweather-app#190

With this change and the associated change in the ufs-weather-model and NCEPLIBS-external (documentation only), the regression tests ran successfully (tried one time thus far, will conduct multiple runs to make sure I wasn't just lucky).
```
export UFS_DRIVER=nems
export UFS_INPUT=$CESMDATAROOT
export UFS_SCRATCH=/glade/work/heinzell/fv3/ufs-mrweather-app/ufs_scratch
qcmd -l walltime=3:00:00 -- "export UFS_DRIVER=nems; CIME_MODEL=ufs ./create_test --xml-testlist ../../src/model/FV3/cime/cime_config/testlist.xml --xml-machine cheyenne --xml-compiler intel --workflow ufs-mrweather_wo_post -j 4 --walltime 03:00:00"

...

heinzell@cheyenne3:/glade/work/heinzell/fv3/ufs-mrweather-app/ufs-mrweather-app-release-public-v1/cime/scripts> /glade/work/heinzell/fv3/ufs-mrweather-app/ufs_scratch/cs.status.20200917_210120_2ap63n | grep Overall
  ERS_Lh11.C96.GFSv15p2.cheyenne_intel (Overall: PASS) details:
  ERS_Lh11.C96.GFSv16beta.cheyenne_intel (Overall: PASS) details:
  PET_Lh11.C96.GFSv15p2.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3.C192.GFSv15p2.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3.C192.GFSv16beta.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3.C384.GFSv15p2.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3.C384.GFSv16beta.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3.C768.GFSv15p2.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3.C768.GFSv16beta.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3.C96.GFSv15p2.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3.C96.GFSv16beta.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3_D.C192.GFSv15p2.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3_D.C192.GFSv16beta.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3_D.C384.GFSv15p2.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3_D.C384.GFSv16beta.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3_D.C768.GFSv15p2.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3_D.C768.GFSv16beta.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3_D.C96.GFSv15p2.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3_D.C96.GFSv16beta.cheyenne_intel (Overall: PASS) details:
```

The PR also replaces a few tabs with whitespaces for consistent formatting in `config/ufs/machines/config_machines.xml`.

Note that we will need to (re-)create tag `ufs-v1.1.0` (or whatever the CIME convention is; this is what we use for all the UFS components and external libraries, and what is also consistent with MRW App release 1.0).
@climbfuji
Collaborator

climbfuji commented Sep 25, 2020

Just so that everyone is up to date: after downgrading the compiler, we still got crashes on Orion (but not on Cheyenne). I implemented George's fix in UFS_UTILS and this seems to solve the problems on Orion. I will test those changes (see NOAA-EMC/NCEPLIBS#118) on other platforms as well before we merge them and roll out a retagged version of NCEPLIBS on all platforms.

I will update the testing information in this comment as I am progressing:

  • orion.intel: three successful passes through all MRW App regression tests thus far
  • cheyenne.intel: one successful pass through all MRW App regression tests using Intel and GNU (with the same command)

@arunchawla-NOAA
Collaborator

This is now fixed.
