Hanging on Cheyenne ... #190
Comments
@uturuncoglu I have not received any reports of the model hanging. @llpcarson Any insight wrt hanging on Cheyenne? |
@ligiabernardet it is strange. I updated buildlib and I am waiting to resolve this issue. Let me know if you see a similar issue. |
No, I haven't seen this lately on cheyenne. One thing to check is the
processor layout and the job-node-request. If these don't match, sometimes
the model will hang (use 48 tasks, but submit the job with 64, for example)
Laurie
|
@ufuk We are waiting on a PR of the updated build so we can merge it onto the release/public-v1 branch and conduct tests. |
@llpcarson those options are consistent. Anyway, I'll make a PR soon and you can test it. All of these strange things happen under my account, so maybe something is wrong there. Let's see what you find in your tests. |
@ligiabernardet I created a PR at the app level. |
@uturuncoglu Does it hang all the time or occasionally? |
@ligiabernardet in my recent tests, all resolutions failed in the same way except the C96 ones. |
@uturuncoglu Here is a suggestion from @climbfuji: Are we using threading? If yes: Can we test compiling without OpenMP, or even easier, run with one OpenMP thread only, and see if this solves the problem? |
@llpcarson is running some tests on Cheyenne. Laurie, let us know what you find out. |
@climbfuji we are not using threading, at least for the following test: /glade/scratch/turuncu/SMS_Lh3.C192.GFSv15p2.cheyenne_intel.20200909_155451_1sg3p2, and it still hangs/fails when reading the file. @ligiabernardet thanks. I hope I am the only one who has this issue. |
Ufuk, Ligia -
I ran the default MRW case, at C96, C384 and C768 and all 3 ran:
grib2 input, threaded (4), 20190829
I can try running the CIME reg-tests next (that's what that case is,
correct?)
Laurie
|
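@llpcarson That is great! Yes, if you run the full test suite that would be great. Once you run the test suite (if you run without specifying a compiler, such as `--xml-compiler intel`, that will run both Intel and GNU tests), please let me know the directory and I could double check the results. Thanks for your help. |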
Partial results to report:
All of the C96, C192 and C384 jobs have completed successfully.
Three of the C768 jobs crashed in chgres_cube (and so the forecast jobs
were killed for dependency-failure)
5 of the C768 cases are still in the batch queue waiting to run (these ran
chgres_cube successfully)
Will let you know when the C768 jobs start running...
|
@llpcarson I was also having problems with C768 on Cheyenne. Is this on Cheyenne? Probably a couple of them will pass and a couple will fail. We might need to increase the resources allocated for C768 because it is not stable at this point. What do you think @GeorgeGayno-NOAA? |
Is the model hanging or chgres_cube? I am more familiar with the latter. |
@GeorgeGayno-NOAA I think CHGRES is failing at C768. We are using 6 nodes with 6 cores per node, as you suggested. It runs in some cases and fails in others, so not every C768 case is failing. |
On cheyenne:
Yes, chgres cube is failing (seg-fault) for some of the C768 cases (but not
all).
The model/forecast jobs are still waiting in the queue (the ones that had a
successful chgres_cube)
Laurie
|
@llpcarson you could still check the run/INPUT folder of one of the cases to see the CHGRES-generated files. If they are there, the model will pick them up and run. I hope it won't hang. What about other resolutions? Did you see any hang issue with the model? |
Are the failures happening with a certain type of input data, like grib2 or nemsio? |
@GeorgeGayno-NOAA The default input type is GRIB2 and the test suite uses that one. |
Yes, the chgres_cube run worked for these cases that are waiting to run the
model. Failed for others. All C768. All of the other resolutions ran
without issue (at least I think so!)
Rundir is: /glade/scratch/carson/ufs/*
App dir is: /glade/scratch/carson/ufs/mrw.test/ufs-mrweather-app
The logfile from chgres_cube shows:
- FATAL ERROR: READING GRIB2 FILE
- IOSTAT IS: 0
but - it's the same file successfully read in other cases? One time Intel,
one time Gnu. Perhaps I'll try another full-test to see if it's consistent
|
@llpcarson I could not find /glade/scratch/carson/ufs/. Is it correct? Yes, the error is strange: it suggests that the file is missing or corrupted, but all the cases use the same file. Did you also run GNU tests? |
Yes, I ran both Gnu and Intel. Each had failure and success for
chgres_cube. Here's one of the run-dirs with a failure:
/glade/scratch/carson/ufs/SMS_Lh3_D.C768.GFSv15p2.cheyenne_gnu.G.20200911_091828_ou9in9/run
Does the _D part refer to a debug-mode compile? (just curious)
|
If it does, then this only applies to the model I guess, because chgres_cube is compiled as part of NCEPLIBS, which compiles in "production" mode.
I looked at your run directory:
cat chgres_cube.200911-105345.log
...
- CALL FieldScatter FOR INPUT GRID LONGITUDE.
- CALL FieldScatter FOR INPUT GRID LATITUDE.
#0 0x2b4035126aff in ???
#1 0x2b40357ae9bb in ???
#0 0x2ab63cbdbaff in ???
#1 0x2ab63d2639bb in ???
#0 0x2ab63cbdbaff in ???
#1 0x2ab63d2639bb in ???
#0 0x2ab63cbdbaff in ???
#1 0x2ab63d2639bb in ???
#0 0x2b2d04fcdaff in ???
#1 0x2b2d056559bb in ???
#0 0x2b2d04fcdaff in ???
#1 0x2b2d056559bb in ???
#0 0x2b2d04fcdaff in ???
#1 0x2b2d056559bb in ???
#0 0x2b2d04fcdaff in ???
#1 0x2b2d056559bb in ???
#0 0x2b2d04fcdaff in ???
#0 0x2b2d04fcdaff in ???
#1 0x2b2d056559bb in ???
#1 0x2b2d056559bb in ???
MPT ERROR: MPI_COMM_WORLD rank 21 has terminated without calling MPI_Finalize()
aborting job
MPT: Received signal 11
I also checked PET21.ESMF_LogFile for the MPI rank that reported the crash (first), but there is no useful information in the file.
Let me compile chgres_cube manually with debugging flags on, then copy your run directory and run the preprocessing step manually.
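For anyone repeating this triage on another failed run directory, the lookup can be scripted. The following is only a minimal sketch: it assumes the abort line has the same wording as the MPT message in the log excerpt above, and the zero-padding width of the PET log name is an assumption (the helper names `crashed_ranks` and `pet_log_name` are hypothetical).

```python
import re

# Matches the MPT abort line as it appears in the log excerpt in this thread.
MPT_ABORT = re.compile(
    r"MPI_COMM_WORLD rank (\d+) has terminated without calling MPI_Finalize\(\)"
)


def crashed_ranks(log_text):
    """Return the MPI ranks named in MPT abort lines, in order of appearance."""
    return [int(m.group(1)) for m in MPT_ABORT.finditer(log_text)]


def pet_log_name(rank, width=2):
    """Build the per-PET ESMF log name; `width` is an assumed zero-padding."""
    return f"PET{rank:0{width}d}.ESMF_LogFile"


sample = """MPT ERROR: MPI_COMM_WORLD rank 21 has terminated without calling MPI_Finalize()
aborting job
MPT: Received signal 11
"""

ranks = crashed_ranks(sample)
print(ranks, [pet_log_name(r) for r in ranks])
```

The padding width may differ for runs with more tasks, so it is worth checking the actual file names in the run directory before relying on it.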
|
I just re-ran the reg-test for the C768 cases only, and all 8 tests ran
chgres_cube without error (forecast/model are still in the queue). Very
frustrating!
And, unfortunately, even with a failed run directory, a re-run (with a
simple qsub script) completes without error.
Will check back later tonight to see if any of the model runs hang/finish :)
|
@llpcarson D is debug mode. I am not sure which options are changed, but I could check if you need. |
Yes, I checked your directory and it seems there is no build error, but all C768 tests failed due to the failure in CHGRES. |
@llpcarson yes, in some cases, if you run the model again, CHGRES completes without any problem. I am not sure, but it could be related to node allocation on Cheyenne. It would be good to check on the other platforms. |
@llpcarson do you have a failed chgres cube that we can use to check a) namelist and b) GRIB2 inventory ./chgres.inv? @uturuncoglu I confirm that we also have a failure of chgres_cube C768 on Orion.
|
There's a set of cases on cheyenne
here: /glade/scratch/carson/ufs/mrw.test/stack/
Fails: SMS_Lh3_D.C768.GFSv16beta.cheyenne_intel.20200915_090650_qga1v1/run/
Runs: SMS_Lh3_D.C768.GFSv15p2.cheyenne_intel.20200915_090650_qga1v1/run/
Both chgres.inv files are identical. Both namelist files are identical.
Laurie
On Thu, Sep 17, 2020 at 1:16 PM ligiabernardet wrote:
Results from other platforms:
- Jet: C768 passed (only 1 run tested; it takes more than a day in the queue due to reservations, so it is hard to do many runs)
- Hera: C768 passed (only 1 run tested)
- Orion: 1/19 tests that are part of RT crashed on chgres_cube (#194)
- Stampede: waiting RT results from @climbfuji
- Gaea: waiting C768 results from @climbfuji
|
@ligiabernardet thanks for the update. I am not sure where the source of the problem is. Although we also have cases that run, I suspect chgres; it might have a memory leak or similar. |
@GeorgeGayno-NOAA @climbfuji @uturuncoglu @arunchawla-NOAA @climbfuji @rsdunlapiv @llpcarson Do we have any other hypothesis or idea of what to try to get chgres_cube to work in CIME consistently? Cheyenne: Occasional crashes of chgres_cube C768 when reading GRB2 file |
Is it always failing at the same spot (model_grid.F90 line 640)? And the failure occurs randomly? Do the RT tests run in sequence or simultaneously? |
@ligiabernardet @GeorgeGayno-NOAA @climbfuji @uturuncoglu @arunchawla-NOAA @jedwards4b @rsdunlapiv @llpcarson I just got a successful run of all regression tests on Cheyenne with Intel. This is what I did:
PRs:
|
Ok, here we go ... just got one failure with Intel 18.0.5 on Cheyenne in my second round of tests (when running both Intel and GNU tests with the same command). Super annoying. Will see how the rest works out. |
@uturuncoglu is there a way to force the tests to run serially, i.e. only one regression test running at a time? |
@uturuncoglu another question, how do I change the default MPI job size for chgres in cime? I want the regression tests to run on a different number of nodes with a different number of tasks per node, still 36 tasks in total for C768. Thanks ... |
@jedwards4b Ufuk seems to be out or busy today, can you answer my basic questions in #190 (comment) and #190 (comment) by any chance? Thanks ... |
Ufuk will be back this afternoon. He was on PTO this morning.
--
Mariana Vertenstein
CESM Software Engineering Group Head
National Center for Atmospheric Research
Boulder, Colorado
Office 303-497-1349
Email: mvertens@ucar.edu
|
Default job size and pelayout are set in config_workflow.xml. You can run an individual test by naming that test on the create_test command line, for example: |
@uturuncoglu I am back. I am not sure about that. The tests are just built and submitted to the queue. Maybe the submission script could be modified to create dependencies between them, but I am not sure that is currently supported, and if one of them fails the dependent jobs are also killed. @jedwards4b is there any way to run the tests one by one? |
@climbfuji you could change the number of processors for CHGRES with `./xmlchange task_count=144 --subgroup case.chgres`, but don't forget to run `./preview_namelist` after it. |
@climbfuji BTW, I don't think the compiler version is the problem. We are seeing similar issues on different platforms. Again, it could be a memory leak or something else in the CHGRES part. Did you have a chance to look at it? |
I am looking, yes. The Intel 18 test is because we don't see any issues on stampede, hera and jet (all Intel 18), but on Orion (Intel 19). I just got another successful pass through the regression tests on Cheyenne with Intel 18, using |
@climbfuji but if I understand correctly, you also got an error with Intel 18 on Cheyenne. So we still have a problem here, and it could lead to a failure in the MR app. If you need to change from 6x6 to 3x12, it is very easy and I could make the required changes under buildnml. |
This formatting in GitHub ... I know, I changed buildnml to use 3 times 12 instead of 6 times 6 tasks. |
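As a side note on the "6 times 6" versus "3 times 12" discussion: both are ways of splitting the same 36 chgres_cube tasks. The sketch below simply enumerates the possible splits of a task count across nodes. The thread does not say which factor is the node count, so the (nodes, tasks per node) ordering here is an assumption, as is the helper name `layouts`; the 36 cores per node figure is the Cheyenne compute node size.

```python
CORES_PER_NODE = 36  # Cheyenne compute nodes have 36 cores each


def layouts(total_tasks, cores_per_node=CORES_PER_NODE):
    """Return (nodes, tasks_per_node) pairs whose product is total_tasks."""
    return [
        (total_tasks // tpn, tpn)
        for tpn in range(1, cores_per_node + 1)
        if total_tasks % tpn == 0
    ]


for nodes, tpn in layouts(36):
    print(f"{nodes:2d} nodes x {tpn:2d} tasks per node")
```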
20200918: Ok, I got one successful pass through all Intel tests with this configuration. Will launch a few more over the weekend with both Intel and GNU. Update 20200919: I got one more successful pass through all Intel tests with this configuration. Update 20200919 II: I got another successful pass through all Intel tests with this configuration. I will do one more; if that one passes as well, I recommend merging (and retagging, where necessary) NOAA-EMC/NCEPLIBS-external#70, ESMCI/cime#3713, and ufs-community/ufs-weather-model#204 with its dependencies. Update 20200920: a fourth successful pass, this time running Intel and GNU at the same time on Cheyenne. We should definitely merge all of the above PRs. In the long run, we need to find out if, and if yes why, |
I was just looking through model_grid.F90. I noticed that the call to wgrib2 (grb2_mk_inv) that creates chgres.inv is done on all MPI tasks. It is possible that one task has not finished writing before another task tries to use chgres.inv at line 640. A quick test would be to create chgres.inv on one task, then add a barrier before all tasks try to read it. Just a wild guess, but that might explain the random nature of the error. |
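The pattern suggested in the comment above (one task writes chgres.inv, then a barrier before every task reads it) can be illustrated with a small sketch. This is not the UFS_UTILS code: it uses Python threads and `threading.Barrier` as a stand-in for MPI tasks and an MPI barrier, and the task count, the function names and the inventory line written to the file are placeholders.

```python
import os
import tempfile
import threading

N_TASKS = 4  # stand-in for the MPI task count (placeholder value)


def run_task(rank, barrier, inv_path, results):
    # Only one task creates the inventory file ...
    if rank == 0:
        with open(inv_path, "w") as f:
            f.write("placeholder inventory record\n")
    # ... and every task waits here until task 0 has finished writing.
    barrier.wait()
    # After the barrier the file is complete, so all tasks can read it safely.
    with open(inv_path) as f:
        results[rank] = f.read()


def demo():
    results = {}
    barrier = threading.Barrier(N_TASKS)
    with tempfile.TemporaryDirectory() as tmp:
        inv_path = os.path.join(tmp, "chgres.inv")
        threads = [
            threading.Thread(target=run_task, args=(r, barrier, inv_path, results))
            for r in range(N_TASKS)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return results


results = demo()
print(len(results), len(set(results.values())))
```

Without the barrier, a task other than 0 could open the file before it exists or while it is only partly written, which would be consistent with the intermittent read error described in this thread.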
ufs_release_v1.1: remove unnecessary/incorrect configuration options for Cheyenne for the UFS; downgrade Intel 19.x.y to 18.m.n on Cheyenne, Gaea, Orion
### Description
This PR removes a whole bunch of unnecessary/incorrect configuration options for Cheyenne for the UFS and reverts Intel back to Intel-18.0.5 on Cheyenne, Gaea, Orion. Beforehand, the CIME regression tests failed frequently in `chgres_cube.exe` for the highest resolution cases, see description here: ufs-community/ufs-mrweather-app#190
With this change and the associated change in the ufs-weather-model and NCEPLIBS-external (documentation only), the regression tests ran successfully (tried one time thus far, will conduct multiple runs to make sure I wasn't just lucky).
```
export UFS_DRIVER=nems
export UFS_INPUT=$CESMDATAROOT
export UFS_SCRATCH=/glade/work/heinzell/fv3/ufs-mrweather-app/ufs_scratch
qcmd -l walltime=3:00:00 -- "export UFS_DRIVER=nems; CIME_MODEL=ufs ./create_test --xml-testlist ../../src/model/FV3/cime/cime_config/testlist.xml --xml-machine cheyenne --xml-compiler intel --workflow ufs-mrweather_wo_post -j 4 --walltime 03:00:00"
...
heinzell@cheyenne3:/glade/work/heinzell/fv3/ufs-mrweather-app/ufs-mrweather-app-release-public-v1/cime/scripts> /glade/work/heinzell/fv3/ufs-mrweather-app/ufs_scratch/cs.status.20200917_210120_2ap63n | grep Overall
  ERS_Lh11.C96.GFSv15p2.cheyenne_intel (Overall: PASS) details:
  ERS_Lh11.C96.GFSv16beta.cheyenne_intel (Overall: PASS) details:
  PET_Lh11.C96.GFSv15p2.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3.C192.GFSv15p2.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3.C192.GFSv16beta.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3.C384.GFSv15p2.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3.C384.GFSv16beta.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3.C768.GFSv15p2.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3.C768.GFSv16beta.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3.C96.GFSv15p2.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3.C96.GFSv16beta.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3_D.C192.GFSv15p2.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3_D.C192.GFSv16beta.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3_D.C384.GFSv15p2.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3_D.C384.GFSv16beta.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3_D.C768.GFSv15p2.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3_D.C768.GFSv16beta.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3_D.C96.GFSv15p2.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3_D.C96.GFSv16beta.cheyenne_intel (Overall: PASS) details:
```
The PR also replaces a few tabs with whitespaces for consistent formatting in `config/ufs/machines/config_machines.xml`. Note that we will need to (re-)create tag `ufs-v1.1.0` (or whatever the CIME convention is; this is what we use for all the UFS components and external libraries, and what is also consistent with MRW App release 1.0).
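When the suite is repeated many times to catch an intermittent failure, it can help to tally the `(Overall: ...)` lines rather than read them by eye. The following is a minimal sketch that assumes the line format shown in the `cs.status ... | grep Overall` output quoted in this PR description; the helper name `summarize` is hypothetical and the two sample lines are copied from that output.

```python
import re
from collections import Counter

# Matches lines such as "  TESTNAME (Overall: PASS) details:".
STATUS_LINE = re.compile(r"^\s*(\S+)\s+\(Overall:\s*(\w+)\)")


def summarize(status_text):
    """Map each test name to its overall status and count the statuses."""
    results = {}
    for line in status_text.splitlines():
        m = STATUS_LINE.match(line)
        if m:
            results[m.group(1)] = m.group(2)
    return results, Counter(results.values())


sample = """\
  SMS_Lh3.C768.GFSv15p2.cheyenne_intel (Overall: PASS) details:
  SMS_Lh3_D.C768.GFSv16beta.cheyenne_intel (Overall: PASS) details:
"""

results, counts = summarize(sample)
not_passed = [name for name, status in results.items() if status != "PASS"]
print(counts, not_passed)
```

Filtering on anything other than `PASS` keeps the check independent of which non-passing status strings the status script prints.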
Just so that everyone is up to date. After downgrading the compiler, we still got crashes on orion (but not on Cheyenne). I implemented George's fix in UFS_UTILS and this seems to solve the problems on orion. I will test those changes (see NOAA-EMC/NCEPLIBS#118) on other platforms as well before we merge them and roll out a retagged version of NCEPLIBS on all platforms. I will update the testing information in this comment as I am progressing:
|
this is now fixed |
@climbfuji @ligiabernardet I am having trouble with the model on Cheyenne: it hangs when reading static input files such as global_shdmin.0.144x0.144.grb for resolutions > C96. This was also the case with the new buildlib, so I think it is not related to the build. Have you ever experienced the same problem? This was also reported previously in #184 (comment). Do we need to increase the resources used by the model? For example, C192 is hanging/failing without any particular error, and I am using the following configuration options,
ntiles = 6
layout = 4, 6
write_groups: 1
write_tasks_per_group: 36
and a total of 180 processors.
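Since a mismatch between the processor layout and the job request is named elsewhere in this thread as a possible cause of hangs, the numbers above can be checked with a few lines of arithmetic. This is a sketch that assumes the usual FV3 convention: compute tasks are `ntiles * layout_x * layout_y`, and the write component adds `write_groups * write_tasks_per_group`. The function name `fv3_task_count` is hypothetical.

```python
def fv3_task_count(ntiles, layout, write_groups, write_tasks_per_group):
    """Total MPI tasks: forecast tasks on all tiles plus write-component tasks."""
    layout_x, layout_y = layout
    compute_tasks = ntiles * layout_x * layout_y
    write_tasks = write_groups * write_tasks_per_group
    return compute_tasks + write_tasks


total = fv3_task_count(
    ntiles=6, layout=(4, 6), write_groups=1, write_tasks_per_group=36
)
requested = 180  # processor count stated in the post
print(total, total == requested)
```

Here 6 x 4 x 6 = 144 compute tasks plus 36 write tasks gives 180, so the configuration in the post is self-consistent.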