MPI_Type_contiguous Encounters Invalid Count #2227

spanNOAA · 2024-04-04T21:48:04Z

Description

An MPI-related fatal error occurred during the execution of the code, leading to job cancellation.

To Reproduce:

Compilers: intel/2022.1.2, impi/2022.1.2, stack-intel/2021.5.0, stack-intel-oneapi-mpi/2021.5.1
Platform: Hera (Rocky 8)

copy the canned test case from /scratch2/BMC/wrfruc/Sijie.Pan/ufs-ar/comroot/fcst.test
submit slurm job submit_ufs_model.sh.
check the output file slurm-${jobid}.out

Additional context

The problem specifically arises on the 2304th core.

Output

ufs_model_crash.log

jkbk2004 · 2024-04-05T12:22:54Z

2304: Abort(805961730) on node 2304 (rank 2304 in comm 0): Fatal error in PMPI_Type_contiguous: Invalid count, error stack:
2304: PMPI_Type_contiguous(271): MPI_Type_contiguous(count=-2056576882, MPI_BYTE, new_type_p=0x7ffd536cb594) failed
2304: PMPI_Type_contiguous(238): Negative count, value is -2056576882

MPI_Type_contiguous count number can not be negative. There is no direct call of MPI_Type_contiguous inside model code base. @spanNOAA can you run exactly same canned case on other machine like orion/hercules? so we can see if we can isolate a root cause or mpi package installation issue or not?

jkbk2004 · 2024-04-05T12:37:51Z

2304: Abort(805961730) on node 2304 (rank 2304 in comm 0): Fatal error in PMPI_Type_contiguous: Invalid count, error stack:
2304: PMPI_Type_contiguous(271): MPI_Type_contiguous(count=-2056576882, MPI_BYTE, new_type_p=0x7ffd536cb594) failed
2304: PMPI_Type_contiguous(238): Negative count, value is -2056576882
MPI_Type_contiguous count number can not be negative. There is no direct call of MPI_Type_contiguous inside model code base. @spanNOAA can you run exactly same canned case on other machine like orion/hercules? so we can see if we can isolate a root cause or mpi package installation issue or not?

@junwang-noaa @DusanJovic-NOAA @spanNOAA I don't know if compiling with -traceback might be a good option to trace in this case.

JessicaMeixner-NOAA · 2024-04-05T15:07:55Z

Just wanted to post here that I also got this error as did @ChristianBoyer-NOAA from the physics team trying to run a C768 test case from the g-w (develop branch as of today).

DusanJovic-NOAA · 2024-04-05T15:17:51Z

Just wanted to post here that I also got this error as did Christian Boyer from the physics team trying to run a C768 test case from the g-w (develop branch as of today).

Do they also see this error on Hera? Could it be related to an update of the OS? Has anyone made a successful C768 run on Hera recently?

DusanJovic-NOAA · 2024-04-05T15:19:02Z

2304: Abort(805961730) on node 2304 (rank 2304 in comm 0): Fatal error in PMPI_Type_contiguous: Invalid count, error stack:
2304: PMPI_Type_contiguous(271): MPI_Type_contiguous(count=-2056576882, MPI_BYTE, new_type_p=0x7ffd536cb594) failed
2304: PMPI_Type_contiguous(238): Negative count, value is -2056576882
MPI_Type_contiguous count number can not be negative. There is no direct call of MPI_Type_contiguous inside model code base. @spanNOAA can you run exactly same canned case on other machine like orion/hercules? so we can see if we can isolate a root cause or mpi package installation issue or not?
@junwang-noaa @DusanJovic-NOAA @spanNOAA I don't know if compiling with -traceback might be a good option to trace in this case.

We do compile the code with -traceback flag by default.

JessicaMeixner-NOAA · 2024-04-05T15:26:53Z

@DusanJovic-NOAA - @ChristianBoyer-NOAA has not been able to successfully run C768 since the rocky8 transition. I just ran a case and got the same error and then saw this issue that reported the same problem. I have asked a few people and I don't know if anyone has successfully run C768 on hera since the rocky8 transition.

SamuelTrahanNOAA · 2024-04-08T11:47:13Z

I can run the global static nest configuration with both my modified global-workflow and the HAFS workflow. I haven't tried a globe without a nest.

EDIT: Those are both atmosphere-only forecast-only cases.

zhanglikate · 2024-04-08T14:48:22Z

Just wanted to post here that I got the same issue when I ran the C768 in Rocky 8 Hera.
I git clone the latest version of global work flow for Rocky 8 (April 2 version, commit c54fe98c4fe8d811907366d4ba6ff16347bf174c) and try the C768 run with ATM only, however, it always crash by showing the following information at Hera Rocky 8. While I did not see this issue in C384 and C96. This is the log file /scratch1/BMC/gsd-fv3-dev/NCEPDEV/global/Kate.Zhang/fv3gfs/comrot/TC768/logs/2020070100/gfsfcst.log

This is the job submit directory: /scratch2/BMC/gsd-fv3-dev/NCEPDEV/global/Kate.Zhang/fv3gfs/expdir/TC768

@JessicaMeixner-NOAA @DusanJovic-NOAA @spanNOAA @junwang-noaa

SamuelTrahanNOAA · 2024-04-08T14:54:57Z

Here are the relevant lines of @zhanglikate's log file.

EDIT: Here is just the error message:

4608: Abort(470417410) on node 4608 (rank 4608 in comm 0): Fatal error in PMPI_Type_contiguous: Invalid count, error stack:
4608: PMPI_Type_contiguous(271): MPI_Type_contiguous(count=-2057309534, MPI_BYTE, new_type_p=0x7ffc225e8994) failed
4608: PMPI_Type_contiguous(238): Negative count, value is -2057309534
   0: slurmstepd: error: *** STEP 58066771.0 ON h3c39 CANCELLED AT 2024-04-08T06:27:30 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: h15c49: tasks 5080-5119: Killed

Expand to see modules, versions, prologue, epilogue, etc.

Begin fcst.sh at Mon Apr  8 06:20:19 UTC 2024

... many lines of stuff ...

Running "module reset". Resetting modules to system default. The following $MODULEPATH directories have been removed: /scratch1/BMC/gmtb/software/modulefiles/generic /scratch2/NCEPDEV/nwprod/NCEPLIBS/modulefiles

Currently Loaded Modules:
  1) contrib                          42) sp/2.5.0
  2) intel/2022.1.2                   43) ip/4.3.0
  3) stack-intel/2021.5.0             44) grib-util/1.3.0
  4) impi/2022.1.2                    45) g2tmpl/1.10.2
  5) stack-intel-oneapi-mpi/2021.5.1  46) gsi-ncdiag/1.1.2
  6) gettext/0.19.8.1                 47) crtm-fix/2.4.0.1_emc
  7) libxcrypt/4.4.35                 48) git-lfs/2.10.0
  8) zlib/1.2.13                      49) crtm/2.4.0.1
  9) sqlite/3.43.2                    50) openblas/0.3.24
 10) util-linux-uuid/2.38.1           51) py-setuptools/63.4.3
 11) python/3.11.6                    52) py-numpy/1.23.4
 12) hpss/hpss                        53) bufr/11.7.0
 13) gempak/7.4.2                     54) gmake/3.82
 14) ncl/6.6.2                        55) wgrib2/2.0.8
 15) libjpeg/2.1.0                    56) py-cftime/1.0.3.4
 16) jasper/2.0.32                    57) py-netcdf4/1.5.8
 17) libpng/1.6.37                    58) libyaml/0.2.5
 18) openjpeg/2.3.1                   59) py-pyyaml/6.0
 19) eccodes/2.32.0                   60) py-markupsafe/2.1.3
 20) fftw/3.3.10                      61) py-jinja2/3.1.2
 21) nghttp2/1.57.0                   62) py-bottleneck/1.3.7
 22) curl/8.4.0                       63) py-numexpr/2.8.4
 23) proj/8.1.0                       64) py-et-xmlfile/1.0.1
 24) udunits/2.2.28                   65) py-openpyxl/3.1.2
 25) cdo/2.2.0                        66) py-pytz/2023.3
 26) R/3.5.0                          67) py-pyxlsb/1.0.10
 27) perl/5.38.0                      68) py-xlrd/2.0.1
 28) pkg-config/0.27.1                69) py-xlsxwriter/3.1.7
 29) hdf5/1.14.0                      70) py-xlwt/1.3.0
 30) snappy/1.1.10                    71) py-pandas/1.5.3
 31) zstd/1.5.2                       72) py-six/1.16.0
 32) c-blosc/1.21.5                   73) py-python-dateutil/2.8.2
 33) netcdf-c/4.9.2                   74) g2c/1.6.4
 34) netcdf-fortran/4.6.1             75) netcdf-cxx4/4.3.1
 35) antlr/2.7.7                      76) met/9.1.3
 36) gsl/2.7.1                        77) metplus/3.1.1
 37) nco/5.0.6                        78) py-packaging/23.1
 38) bacio/2.4.1                      79) py-xarray/2023.7.0
 39) w3emc/2.10.0                     80) prepobs/1.0.1
 40) prod_util/2.1.1                  81) fit2obs/1.0.0
 41) g2/3.4.5                         82) module_base.hera

... many lines of stuff ...

+ exglobal_forecast.sh[153]: unset OMP_NUM_THREADS
+ exglobal_forecast.sh[158]: /bin/cp -p /scratch1/BMC/gsd-fv3-dev/lzhang/Rocky8/global-workflow/exec/ufs_model.x /scratch1/NCEPDEV/stmp2/Kate.Zhang/RUNDIRS/TC768/fcst.424943/
+ exglobal_forecast.sh[159]: srun -l --export=ALL -n 6560 /scratch1/NCEPDEV/stmp2/Kate.Zhang/RUNDIRS/TC768/fcst.424943/ufs_model.x
   0:
   0:
   0: * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * .
   0:      PROGRAM ufs-weather-model HAS BEGUN. COMPILED       0.00     ORG: np23
   0:      STARTING DATE-TIME  APR 08,2024  06:21:38.287   99  MON   2460409
   0:
   0:
   0: MPI Library = Intel(R) MPI Library 2021.5 for Linux* OS
   0:
   0: MPI Version = 3.1

... many lines of stuff ...

   0: PASS: fcstRUN phase 1, n_atmsteps =              113 time is         1.956091
   0: PASS: fcstRUN phase 2, n_atmsteps =              113 time is         0.027836
   4:  ncells=           5
   4:  nlives=          12
   4:  nthresh=   18.0000000000000
4608: Abort(470417410) on node 4608 (rank 4608 in comm 0): Fatal error in PMPI_Type_contiguous: Invalid count, error stack:
4608: PMPI_Type_contiguous(271): MPI_Type_contiguous(count=-2057309534, MPI_BYTE, new_type_p=0x7ffc225e8994) failed
4608: PMPI_Type_contiguous(238): Negative count, value is -2057309534
   0: slurmstepd: error: *** STEP 58066771.0 ON h3c39 CANCELLED AT 2024-04-08T06:27:30 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: h15c49: tasks 5080-5119: Killed
srun: Terminating StepId=58066771.0
srun: error: h15c48: tasks 5040-5079: Killed
srun: error: h22c04: tasks 5320-5359: Killed

... many lines of stuff ...

+ exglobal_forecast.sh[1]: postamble exglobal_forecast.sh 1712557241 137
+ preamble.sh[70]: set +x
End exglobal_forecast.sh at 06:27:31 with error code 137 (time elapsed: 00:06:50)
+ JGLOBAL_FORECAST[1]: postamble JGLOBAL_FORECAST 1712557222 137
+ preamble.sh[70]: set +x
End JGLOBAL_FORECAST at 06:27:31 with error code 137 (time elapsed: 00:07:09)
+ fcst.sh[1]: postamble fcst.sh 1712557219 137
+ preamble.sh[70]: set +x
End fcst.sh at 06:27:31 with error code 137 (time elapsed: 00:07:12)
_______________________________________________________________
Start Epilog on node h3c39 for job 58066771 :: Mon Apr  8 06:27:31 UTC 2024
Job 58066771 finished for user Kate.Zhang in partition hera with exit code 137:0
_______________________________________________________________
End Epilogue Mon Apr  8 06:27:31 UTC 2024

junwang-noaa · 2024-04-08T14:58:07Z

@XiaqiongZhou-NOAA Please see issue here. My understanding is that you got the same error on wcoss2 and Orion. Would you please try the 3/11 model version (5b62e1a) on wcoss2 to see if you still got this error? Thanks

From Kate:
I got the model crash on both WCOSS2 and Orion with the same error information. The UFS model is the March 22 version. I also got the same error on Hercules with the UFS Feb.21 version.

Abort(1007294466) on node 2304 (rank 2304 in comm 0): Fatal error in PMPI_Type_contiguous: Invalid count, error stack:
2304: PMPI_Type_contiguous(275): MPI_Type_contiguous(count=-2056678757, MPI_BYTE, new_type_p=0x7ffe7b8d3b54) failed
2304: PMPI_Type_contiguous(243): Negative count, value is -2056678757

The log files are here:
/work2/noaa/stmp/xzhou/c768/logs/2020010200 (Hercules)
/lfs/h2/emc/ptmp/xiaqiong.zhou/c768_ctl/logs/2020010200 (WCOSS2)
/work/noaa/stmp/xzhou/c768/logs/2020010200 (Orion)

SamuelTrahanNOAA · 2024-04-08T15:03:59Z

My successful runs use an older version of the scripts, but they do use the latest code.

junwang-noaa · 2024-04-08T16:39:58Z

@SamuelTrahanNOAA are you running the C768 global in your global static nest configuration case?

zhanglikate · 2024-04-08T16:41:11Z

Judy has a GSL version working before, which was based on the EMC Jan2024 version: https://github.com/NOAA-GSL/global-workflow/tree/gsl_ufs_rt . However, it can not run after the OS transition to Rocky 8.

My successful runs use an older version of the scripts, but they do use the latest code.

spanNOAA · 2024-04-08T17:38:56Z

2304: Abort(805961730) on node 2304 (rank 2304 in comm 0): Fatal error in PMPI_Type_contiguous: Invalid count, error stack:
2304: PMPI_Type_contiguous(271): MPI_Type_contiguous(count=-2056576882, MPI_BYTE, new_type_p=0x7ffd536cb594) failed
2304: PMPI_Type_contiguous(238): Negative count, value is -2056576882
MPI_Type_contiguous count number can not be negative. There is no direct call of MPI_Type_contiguous inside model code base. @spanNOAA can you run exactly same canned case on other machine like orion/hercules? so we can see if we can isolate a root cause or mpi package installation issue or not?

I've attempted the canned case on Orion, and unfortunately, the same issue persists. Specifically, it still occurs on processor 2304. However, I have no problem with running C384.

SamuelTrahanNOAA · 2024-04-08T17:44:29Z

@SamuelTrahanNOAA are you running the C768 global in your global static nest configuration case?

I've run the C96, C192, and C384 with the latest version of my workflow.
In an hour or two, I'll test the C768 with the latest version. (I have to copy the new fix files and ICs I generated and regenerate the expdir.)

I have not merged the latest develop scripts. I'm still using older scripts, but I am using newer ufs-weather-model code. My code has two bug fixes, but they are unlikely to be related to this problem (#2201)

JessicaMeixner-NOAA · 2024-04-08T17:45:48Z

Has anyone opened a hera help desk ticket on this issue by any chance?

kayeekayee · 2024-04-08T19:13:39Z

GSL real time experiments ran the C768 case until 4/3 when the OS completely updated to Rocky8: /scratch1/BMC/gsd-fv3/rtruns/UFS-CAMsuite. Here is the vesion that works for C768 in our realtime:
12Jan24 global-workflow
UFS: 29Jan24, 625ac02
FV3: 29Jan24, bd38c56 (GSL: 28Feb24 , a439cc7)
UPP: 07Nov23, 78f369b
UFS_UTILS: 22Dev23, ce385ce
You can see the gfsfcst log here:
/scratch1/BMC/gsd-fv3/rtruns/UFS-CAMsuite/FV3GFSrun/rt_v17p8_ugwpv1_mynn/logs/2024040200/gfsfcst.log

junwang-noaa · 2024-04-08T21:10:12Z

@kayeekayee Thanks for the information. So model version on Jan 29, 2024 works fine.

I am wondering anyone runs C768 model with a more recent version. Since the same error showed up on wcoss2 and orion, I am thinking if it's the code updates that cause the problem.

SamuelTrahanNOAA · 2024-04-08T21:31:36Z

I'm able to run with this version of the code:

bug fixes: kchunk3d ignored, hailwat uninitialized in dycore, tile_num wrong for nests #2201

My test is a C768 resolution globe rotated and stretched, with a nest added inside one global tile. (The script calls it CASE=W768.) It won't run without the fixes in that PR due to some bugs in the nesting framework which break GFS physics.

EDIT: I can give people instructions on how to run the nested global configuration if you want to try my working test case. It uses the global-workflow, but an older version, and forecast-only.

junwang-noaa · 2024-04-09T00:26:54Z

Thanks, @SamuelTrahanNOAA. How many tasks are you using for the C768 global domain?

@spanNOAA @JessicaMeixner-NOAA @zhanglikate @XiaqiongZhou-NOAA Would you like to try Sam's version to build the executable and see if you can run the C768 test case?

SamuelTrahanNOAA · 2024-04-09T00:33:12Z

I'm using 2 threads. This is the task geometry:

Globe: six tiles of 16x10 (48 nodes)
Nest: one tile of 48x45 (108 nodes)
Write: two groups of 540 (54 nodes)

I don't know why the write groups need 27 compute nodes each, but they run out of memory if I give them less, even without the post.

The reason for this vast 210 node task geometry is that it finishes a five day forecast in under eight hours.

JessicaMeixner-NOAA · 2024-04-09T00:40:31Z

@ChristianBoyer-NOAA would you have time to try this? I will not have time to try this until next week, but will try it then.

zhanglikate · 2024-04-09T01:05:38Z

Sure. I can give a try. Please let me know how to test it in the global workflow environment. Thanks. Kate

…

On Mon, Apr 8, 2024 at 6:27 PM Jun Wang ***@***.***> wrote: Thanks, @SamuelTrahanNOAA <https://github.com/SamuelTrahanNOAA>. How many tasks are you using for the C768 global domain? @spanNOAA <https://github.com/spanNOAA> @JessicaMeixner-NOAA <https://github.com/JessicaMeixner-NOAA> @zhanglikate <https://github.com/zhanglikate> @XiaqiongZhou-NOAA <https://github.com/XiaqiongZhou-NOAA> Would you like to try Sam's version to build the executable and see if you can run the C768 test case? — Reply to this email directly, view it on GitHub <#2227 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/APJPDRDSYR22A57RJGTPKTDY4MYWJAVCNFSM6AAAAABFYBVFGSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANBTHEZDSNBVGI> . You are receiving this because you were mentioned.Message ID: ***@***.***>

SamuelTrahanNOAA · 2024-04-09T02:00:30Z

I doubt my PR will fix the problem, but you can try it if you wish. It should be a drop-in replacement for the sorc/ufs_model.fd directory in the global-workflow.

zhanglikate · 2024-04-09T02:11:09Z

@SamuelTrahanNOAA Can you send your code path to me? Thanks.

lisa-bengtsson · 2024-04-09T02:20:08Z

I wonder if it is related to the physics suite? Sam is running the global_nest_v1 suite, I'm not sure which physics suite GSL is running in their experiments referenced above at C768, but it would be interesting to know if it is specifically the GFS physics suite?

SamuelTrahanNOAA · 2024-04-09T02:41:16Z

I wonder if it is related to the physics suite? Sam is running the global_nest_v1 suite, I'm not sure which physics suite GSL is running in their experiments referenced above at C768, but it would be interesting to know if it is specifically the GFS physics suite?

No.

My successful global-workflow runs used the GFS suite.
My successful HAFS AR workflow runs used the global_nest_v1 suite.

Also: The crash is coming from the write component, not the compute ranks.

lisa-bengtsson · 2024-04-09T03:03:31Z

I wonder if it is related to the physics suite? Sam is running the global_nest_v1 suite, I'm not sure which physics suite GSL is running in their experiments referenced above at C768, but it would be interesting to know if it is specifically the GFS physics suite?

No.

My successful global-workflow runs used the GFS suite.

My successful HAFS AR workflow runs used the global_nest_v1 suite.

Also: The crash is coming from the write component, not the compute ranks.

Ok, thanks for clarifying!

SamuelTrahanNOAA · 2024-04-09T03:06:54Z

Can you send your code path to me? Thanks

It is better for you to compile it yourself. This might work:

cd global-workflow/sorc/ufs_model.fd
git stash
git remote add sam https://github.com/SamuelTrahanNOAA/ufs-weather-model
git fetch sam
git checkout -b nesting-fixes sam/nesting-fixes
git submodule sync
git submodule update --init --recursive --force
cd ..
./build_ufs.sh

SamuelTrahanNOAA · 2024-04-09T21:49:58Z

The failure was a negative length. Is it possible something is using signed 32-bit integers for lengths and a communication length went over 2**31?

WalterKolczynski-NOAA · 2024-04-09T23:20:08Z

If using -1 is appropriate for all resolutions, we can make that change in global workflow pretty quickly. Right now we are using a multiple of the resolution:

local restile=${CASE:1}

local ICHUNK2D=$((4*restile))
local JCHUNK2D=$((2*restile))
local ICHUNK3D=$((4*restile))
local JCHUNK3D=$((2*restile))
local KCHUNK3D=1

zhanglikate · 2024-04-09T23:23:29Z

If using -1 is appropriate for all resolutions, we can make that change in global workflow pretty quickly. Right now we are using a multiple of the resolution:
local restile=${CASE:1}

local ICHUNK2D=$((4*restile))
local JCHUNK2D=$((2*restile))
local ICHUNK3D=$((4*restile))
local JCHUNK3D=$((2*restile))
local KCHUNK3D=1

Can this also be applied to other low resolutions, e.g. C384 or C96? Thanks.

SamuelTrahanNOAA · 2024-04-09T23:43:40Z

If scripts know the ideal chunking then specifying that chunking would be the best option. Using -1 is a workaround. I'd restrict that to the resolutions that need it. Presently that appears to be C768 and higher. The nested globe configurations will have to use -1 since the tiles are different sizes.

junwang-noaa · 2024-04-10T01:17:50Z

@SamuelTrahanNOAA Thanks for figuring out a solution. Since this was working before, I am wondering what has been changed.
@WalterKolczynski-NOAA I remember using different chunksizes could impact the reading speed of downstream jobs (UPP, DA). You may need to check the run time of these jobs in G-W.

SamuelTrahanNOAA · 2024-04-10T02:35:48Z

Try this at the top of ush/parsing_model_configure_FV3.sh. It may provide a more satisfactory solution:

local restile=${CASE:1}
if [[ "${restile}" -gt 384 ]] ; then
  restile=384
fi

EDIT: I haven't tried that myself. The point is to try a smaller chunk size so it is likely to fit under the unknown constraint.

DusanJovic-NOAA · 2024-04-10T18:34:06Z

I made a change in fv3atm which will allow us to use both compression and relatively large chunk sizes. The change limits the value of kchunk3d to the minimum value of user specified kchunk3d value in model_configure (most of the time it's 1) and actual number of vertical levels for a given field.

Can you please rerun your tests with the code from 'fix_kchunk3d' branch in my ufs-wm and fv3atm forks:

https://github.com/DusanJovic-NOAA/ufs-weather-model/tree/fix_kchunk3d

SamuelTrahanNOAA · 2024-04-10T22:59:49Z

I'm running this now. I had to merge my nesting fixes, but it has already passed the failure point. My job takes about 7.8 hours, so I'll report back later.

SamuelTrahanNOAA · 2024-04-11T13:40:33Z

@DusanJovic-NOAA - I merged your branch into PR #2201 because I cannot continue my work without your changes. (That PR has other bug fixes.) I've updated my PR to indicate your fix is present as well. I hope this is okay.

If you want to do a separate PR, please let me know when you have it so I can reference it in #2202.

I'm rerunning regression tests in the combined PR now.

DusanJovic-NOAA · 2024-04-11T14:18:39Z

@DusanJovic-NOAA - I merged your branch into PR #2201 because I cannot continue my work without your changes. (That PR has other bug fixes.) I've updated my PR to indicate your fix is present as well. I hope this is okay.

If you want to do a separate PR, please let me know when you have it so I can reference it in #2202.

I'm rerunning regression tests in the combined PR now.

Thank you for merging it into your PR, we do not need separate PR. Hopefully #2201 will be merged soon.

junwang-noaa · 2024-04-11T14:21:40Z

@DusanJovic-NOAA @SamuelTrahanNOAA Thanks for fixing the issue!

SamuelTrahanNOAA · 2024-04-11T14:24:28Z

Can someone please test this pull request to confirm it fixes the problem?

#2201

I'd like someone else to confirm that branch works ASAP. Once they do, I'll ask @jkbk2004 and his friends to move the PR to the top of the queue.

Dusan's fix is in there, plus some fixes to other bugs. It is up to date with the head of develop.

zhanglikate · 2024-04-11T14:27:02Z

I can help with C768. Please let me know the commit number to confirm that I am testing the correct version.

…

On Apr 11, 2024, at 8:24 AM, Samuel Trahan (NOAA contractor) ***@***.***> wrote: Can someone please test this pull request to confirm it fixes the problem? #2201 <#2201> I'd like someone else to confirm that branch works ASAP. Once they do, I'll ask @jkbk2004 <https://github.com/jkbk2004> and his friends to move the PR to the top of the queue. Dusan's fix is in there, plus some fixes to other bugs. It is up to date with the head of develop. — Reply to this email directly, view it on GitHub <#2227 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/APJPDRFFL4RCGXE5SZYHKCDY42MLHAVCNFSM6AAAAABFYBVFGSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANBZHAYTMNJZG4>. You are receiving this because you were mentioned.

SamuelTrahanNOAA · 2024-04-11T14:29:46Z

The PR to fix this has been moved to the top of the queue.

@DusanJovic-NOAA @zhanglikate @kayeekayee @spanNOAA @ChristianBoyer-NOAA - Please test this branch ASAP to confirm it fixes your problem.

bug fixes: kchunk3d ignored, hailwat uninitialized in dycore, tile_num wrong for nests #2201

You can clone the branch like so:

git clone --recursive --branch nesting-fixes https://github.com/SamuelTrahanNOAA/ufs-weather-model

It will be merged soon, and we want to make sure it works. Hashes are:

ufs-weather-model: 5bf44a92
FV3: cbd207b
GFDL_atmos_cubed_sphere: d14b3cb

DusanJovic-NOAA · 2024-04-11T15:23:15Z

The c768 test I was running before finished 24h successfully using code from #2201 (commit 5bf44a9). It is still running but it will reach 30min time limit soon.

zhanglikate · 2024-04-11T15:28:13Z

The PR to fix this has been moved to the top of the queue.

@DusanJovic-NOAA @zhanglikate @kayeekayee @spanNOAA @ChristianBoyer-NOAA - Please test this branch ASAP to confirm it fixes your problem.

bug fixes: kchunk3d ignored, hailwat uninitialized in dycore, tile_num wrong for nests #2201

You can clone the branch like so:
git clone --recursive --branch nesting-fixes ssh://git@github.com/SamuelTrahanNOAA/ufs-weather-model
It will be merged soon, and we want to make sure it works. Hashes are:

ufs-weather-model: 5bf44a92

FV3: cbd207b

GFDL_atmos_cubed_sphere: d14b3cb

I finished my testing for more than 84 hrs using the April 2 version global workflow, which is working well. Thanks for all your helps. @SamuelTrahanNOAA @DusanJovic-NOAA @junwang-noaa @WalterKolczynski-NOAA

SamuelTrahanNOAA · 2024-04-11T15:29:50Z

Anyone who can confirm the PR 2201 version works, please do a review and approve here:

#2201

zhanglikate · 2024-04-11T15:34:27Z

I can not click the reviewer part, may need someone to add me. Thanks. Kate

…

On Apr 11, 2024, at 9:30 AM, Samuel Trahan (NOAA contractor) ***@***.***> wrote: Anyone who can confirm the PR 2201 version works, please do a review and approve here: #2201 <#2201> — Reply to this email directly, view it on GitHub <#2227 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/APJPDRAL77SAFHIFSEIEQ4TY42UANAVCNFSM6AAAAABFYBVFGSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANBZHE3DMOJTGE>. You are receiving this because you were mentioned.

SamuelTrahanNOAA · 2024-04-11T15:36:27Z

You don't need to click the reviewer part.

If you go to this page:

https://github.com/ufs-community/ufs-weather-model/pull/2201/files

You should see a green "Review Changes" button in the upper left.

zhanglikate · 2024-04-11T15:39:31Z

I did. Thanks very much.

…

On Apr 11, 2024, at 9:36 AM, Samuel Trahan (NOAA contractor) ***@***.***> wrote: You don't need to click the reviewer part. If you go to this page: https://github.com/ufs-community/ufs-weather-model/pull/2201/files <https://github.com/ufs-community/ufs-weather-model/pull/2201/files> You should see a green "Review Changes" button in the upper left. — Reply to this email directly, view it on GitHub <#2227 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/APJPDRC4JSBRNUIJTA3Z663Y42UZHAVCNFSM6AAAAABFYBVFGSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANBZHE4DCMRUGQ>. You are receiving this because you were mentioned.

ChristianBoyer-NOAA · 2024-04-11T15:44:34Z

@SamuelTrahanNOAA - I am unable to to clone the branch. It gives me a permissions/access error that I have pasted below. I'm not sure why it won't allow me to clone it. The changes were working when I changed the files myself in my workflow yesterday and this morning.

Once I can clone it, I will also run a test case immediately.

Cloning into 'ufs-weather-model'...
Warning: Permanently added the ECDSA host key for IP address '140.82.113.3' to the list of known hosts.
git@github.com: Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights and the repository exists.

SamuelTrahanNOAA · 2024-04-11T15:48:03Z

@SamuelTrahanNOAA - I am unable to to clone the branch. It gives me a permissions/access error that I have pasted below. I'm not sure why it won't allow me to clone it. The changes were working when I changed the files myself in my workflow yesterday and this morning.

Once I can clone it, I will also run a test case immediately.

Cloning into 'ufs-weather-model'... Warning: Permanently added the ECDSA host key for IP address '140.82.113.3' to the list of known hosts. git@github.com: Permission denied (publickey). fatal: Could not read from remote repository.

Please make sure you have the correct access rights and the repository exists.

Sorry, I typed "ssh://git@" as a force of habit. You can use "https://" instead.

git clone --recursive --branch nesting-fixes https://github.com/SamuelTrahanNOAA/ufs-weather-model

spanNOAA added the bug Something isn't working label Apr 4, 2024

This was referenced Apr 11, 2024

bug fixes: kchunk3d ignored, hailwat uninitialized in dycore, tile_num wrong for nests NOAA-EMC/fv3atm#806

Merged

bug fixes: kchunk3d ignored, hailwat uninitialized in dycore, tile_num wrong for nests #2201

Merged

junwang-noaa added this to Model infrastructure development FY2024Q3 Apr 12, 2024

junwang-noaa assigned DusanJovic-NOAA Apr 12, 2024

zach1221 closed this as completed in #2201 Apr 15, 2024

github-project-automation bot moved this to Done in Model infrastructure development FY2024Q3 Apr 15, 2024

JessicaMeixner-NOAA mentioned this issue Apr 17, 2024

Update ufs-weather-model hash in g-w NOAA-EMC/global-workflow#2490

Closed

guoqing-noaa mentioned this issue May 27, 2024

Need to update ufs_model to hash#281b32f or newer NOAA-EMC/global-workflow#2630

Closed

MPI_Type_contiguous Encounters Invalid Count #2227

MPI_Type_contiguous Encounters Invalid Count #2227

Comments

spanNOAA commented Apr 4, 2024 • edited Loading

Description

To Reproduce:

Additional context

Output

jkbk2004 commented Apr 5, 2024

jkbk2004 commented Apr 5, 2024

JessicaMeixner-NOAA commented Apr 5, 2024 • edited Loading

DusanJovic-NOAA commented Apr 5, 2024

DusanJovic-NOAA commented Apr 5, 2024

JessicaMeixner-NOAA commented Apr 5, 2024

SamuelTrahanNOAA commented Apr 8, 2024 • edited Loading

zhanglikate commented Apr 8, 2024

SamuelTrahanNOAA commented Apr 8, 2024 • edited Loading

junwang-noaa commented Apr 8, 2024 • edited Loading

SamuelTrahanNOAA commented Apr 8, 2024

junwang-noaa commented Apr 8, 2024

zhanglikate commented Apr 8, 2024

spanNOAA commented Apr 8, 2024

SamuelTrahanNOAA commented Apr 8, 2024

JessicaMeixner-NOAA commented Apr 8, 2024

kayeekayee commented Apr 8, 2024 • edited Loading

junwang-noaa commented Apr 8, 2024

SamuelTrahanNOAA commented Apr 8, 2024 • edited Loading

junwang-noaa commented Apr 9, 2024 • edited Loading

SamuelTrahanNOAA commented Apr 9, 2024

JessicaMeixner-NOAA commented Apr 9, 2024

zhanglikate commented Apr 9, 2024 via email

SamuelTrahanNOAA commented Apr 9, 2024

zhanglikate commented Apr 9, 2024

lisa-bengtsson commented Apr 9, 2024

SamuelTrahanNOAA commented Apr 9, 2024

lisa-bengtsson commented Apr 9, 2024

SamuelTrahanNOAA commented Apr 9, 2024 • edited Loading

SamuelTrahanNOAA commented Apr 9, 2024

WalterKolczynski-NOAA commented Apr 9, 2024

zhanglikate commented Apr 9, 2024

SamuelTrahanNOAA commented Apr 9, 2024

junwang-noaa commented Apr 10, 2024 • edited Loading

SamuelTrahanNOAA commented Apr 10, 2024 • edited Loading

DusanJovic-NOAA commented Apr 10, 2024

SamuelTrahanNOAA commented Apr 10, 2024

SamuelTrahanNOAA commented Apr 11, 2024

DusanJovic-NOAA commented Apr 11, 2024

junwang-noaa commented Apr 11, 2024

SamuelTrahanNOAA commented Apr 11, 2024

zhanglikate commented Apr 11, 2024 via email

SamuelTrahanNOAA commented Apr 11, 2024 • edited Loading

DusanJovic-NOAA commented Apr 11, 2024

zhanglikate commented Apr 11, 2024

SamuelTrahanNOAA commented Apr 11, 2024

zhanglikate commented Apr 11, 2024 via email

SamuelTrahanNOAA commented Apr 11, 2024

zhanglikate commented Apr 11, 2024 via email

ChristianBoyer-NOAA commented Apr 11, 2024

SamuelTrahanNOAA commented Apr 11, 2024

spanNOAA commented Apr 4, 2024 •

edited

Loading

JessicaMeixner-NOAA commented Apr 5, 2024 •

edited

Loading

SamuelTrahanNOAA commented Apr 8, 2024 •

edited

Loading

SamuelTrahanNOAA commented Apr 8, 2024 •

edited

Loading

junwang-noaa commented Apr 8, 2024 •

edited

Loading

kayeekayee commented Apr 8, 2024 •

edited

Loading

SamuelTrahanNOAA commented Apr 8, 2024 •

edited

Loading

junwang-noaa commented Apr 9, 2024 •

edited

Loading

SamuelTrahanNOAA commented Apr 9, 2024 •

edited

Loading

junwang-noaa commented Apr 10, 2024 •

edited

Loading

SamuelTrahanNOAA commented Apr 10, 2024 •

edited

Loading

SamuelTrahanNOAA commented Apr 11, 2024 •

edited

Loading