-
Notifications
You must be signed in to change notification settings - Fork 258
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Segfaults when running cpld aerosol model with GNU 9 and 10 #1115
Comments
@MinsukJi-NOAA May I ask what is the error message when running cpld_control_c96_p8 with GNU compiler? The line 321 in ufs-weather-model/GOCART/ESMF/UFS/Aerosol_Cap.F90 is % wrap % maplCap = MAPL_Cap("MAPL Driver", aerosol_routine_SS, cap_options=maplCapOptions, _RC) @rmontuoro FYI. |
@MinsukJi-NOAA, @junwang-noaa: I have built the current revision (e028578) of the ufs-weather-model on Hera using
This points to an issue with the Thompson MP scheme and/or its input files. |
Does the file exist in the input directory? |
@rmontuoro When I try running The error messages I posted earlier are from running
As @junwang-noaa pointed out, line 321 of Aerosol_Cap.F90 is: |
I ran the following on Cheyenne.gnu:
Both failed with the error:
|
To summarize, there appears to be two separate issues. On hera.gnu, I ran the following tests using develop
The cpld_debug_noaero_p8 test completed. The cpld_debug_p8 test failed at Using the new wave cap (which can be compiled and run in debug mode), I ran the following using 4d3650d
Both tests failed with
The field_table, diag_table, input.nml and qr_acr_qgV2.dat were identical (except for cplwav, cpwav2atm) between the cases run at develop and the cases run with the new wave cap. |
For this error: module_mp_thompson.F90 (unit = 63, file = 'qr_acr_qgV2.dat') |
Why would that only be the case for this particular test? The file is used all over the place in many other regression tests. |
Is the wave model also trying to read/write something with unit=63? (I'm trying to look in the wave code and haven't found this yet, so maybe not?) |
Denise tried to use an available unit, but it does not resolve the issue.
…On Wed, Apr 13, 2022 at 11:57 AM Jessica Meixner ***@***.***> wrote:
Is the wave model also trying to read/write something with unit=63? (I'm
trying to look in the wave code and haven't found this yet, so maybe not?)
—
Reply to this email directly, view it on GitHub
<#1115 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AI7D6TPRIC23QAIKCIMMKEDVE3VERANCNFSM5QZFYMHQ>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
@climbfuji I did a test by simply adding "convert=big_endian" in the open(63,...) in module_mp_thompson.F90 and I got the 'end of file during read' error. Not sure if similar compiler option was added during compiling for these particular tests. |
WW3 uses -fconvert=big-endian although I wouldn't think that would effect non-wave files |
Then, add |
@DusanJovic-NOAA Do you mean that when WW3 sets 'big endian', it propagates to other components? The failure happens only on intel. |
It does propagate to all other components that depend on WW3, including the top-level driver. WW3 should not set compiler flags as PUBLIC properties of the ww3_lib target:
Why is this done this way? No other component is doing that. All other components are either appending component specific flags to already defined |
Instead of adding |
I think it is best to explicitly add the endiannness in module_mp_thompson - way safer than relying on external (build) compiler options. |
Maybe some day we'll get away from binary files and not need to set big-endian, but for the time being we still need to be consistent. I don't think we can realistically switch it for WW3, but if we need to set it to PRIVATE so it does not propagate out, I'll be happy to attempt to figure out how to do that and make sure it gets updated in WW3. Just let me know. However I also agree with @climbfuji it's probably best to add the endianness to the module file that needs it. |
@climbfuji @DusanJovic-NOAA @JessicaMeixner-NOAA @DeniseWorthen Yes, the safer way is to add endianness in the module_mp_thompson.F90. I will ask Greg Thompson if he agrees to make changes to the code. |
I agree with @DusanJovic-NOAA, I am worried that the compile options set for ww3 are populated to other components, so the test running with/without ww3 will have different compile options and could generate different results, even though the reading file issue could be temporarily fixed by adding endiannness in module_mp_thompson. Anytime ww3 compile option change could impact atm run results, that is not good coding practice to me. Can we try if changing the ww3_lib from PUBLIC to PRIVATE resolves the issue? @kgerheiser Can we set ww3_lib to PRIVATE as Dusan suggested? |
Yes, I will fix the flags and make a PR. I think @DusanJovic-NOAA is correct about using |
I added 'convert=little_endian' to all the open statements in module_mp_thompson and re-ran the S2SWA test using the new cap. I no longer get the end-of-file error, but I do now get the cap error ( I will also test Kyle's solution. BTW, in ozinterp.f90 and h2ointerp.f90, there are existing 'convert' statements in the file open statements. |
I tried Kyle's cmake solution but the model appears to hang. I'm not sure what exactly is happening. The run directory is here /scratch1/NCEPDEV/stmp2/Denise.Worthen/FV3_RT/rt_gnumesh/test_flags |
@DeniseWorthen I understand now. Hopefully it'll run with the new mod_defs without further code modifications. It does look like w3igrmd is using the big_endian, but are all WW3 files? Because w3iogo w3iors and others also use binary files and need to be consistently compiled. |
@JessicaMeixner-NOAA Reverting the change to w3iogrmd.F90 and using your new mod_def produces a model hang (I've been giving it 5 minutes to start up and killing when it has not progressed). I don't understand why the convert modifier is required if the compiler is including it. |
This change seems odd to me. Especially as it's a one-line change and not changing for example the write above it, which should likely also be changed along with every other binary read or write in WW3. I wonder what NDSM is? Is the issue that somehow NDSM=63 and that's the issue? |
W3IOGR is being called in wav_shel_inp here:
NDSF(7) is unit 17 (which I confirmed). I've also confirmed now that if I compile the same code w/ Intel and Kyle's build fix (but no fix to mp_module_thompson, no fix to w3iogr), the model runs fine. |
So we still have an issue with gnu without the w3iogr fix though? I wonder if the mod_def needs to be created with a gnu executable? |
@DeniseWorthen I made mod_defs with gnu: |
That's right. We can't turn on Aerosols for GNU because of the issue in the the aerosol_cap. But I've been trying to add waves to the test (currently it is run w/o waves on gnu). If we resolve the propagation of compiler flags from WW3 to other components w/ Kyle's build fix, even though we're compiling the w3iogr with big_endian, I need to add that modifier or the model hangs for GNU. One thing that does seem odd is that it hangs rather than seg-faults. It seems in the past when I've run into endian-ness issues, the code generally seg-faults because it is getting nonsense numbers. |
The gnu mod_def didn't make a difference. |
I think I understand what may be happening. I did two final tests, with and without Kyle's build fix (to prevent propagation of the big-endian flag to all components). I did not change the open statements in either module_mp_thompson or w3iogrmd, but I added an inquiry to write the endian-ness of the open unit. With Kyle's build fix, all files report being 'little_endian', even w3iogrmd, even though we are specifically compiling WW3 with Without Kyle's build fix, as expected, the thompson files are reported as being big_endian and the model fails with the end-of-file error. It fails before getting to the ww3 startup. Checking the GNU documentation, it states that So, when we allow the big_endian flag from WW3 to go to all components, the main program also gets compiled with big_endian, and so WW3 is happy (but mp_thompson is not). When we try to force only WW3 to be compiled with big_endian, it has no impact because the endianness isn't being set for the main program. Does this make sense @kgerheiser ? |
Thanks @DeniseWorthen that's really interesting! Theoretically as long as WW3 is consistently compiled we could change to little endian, but that would have big consequences for operations and the community users, so we should think about possible ways to minimize that impact if that's something we want to consider. @DeniseWorthen let me know if you want a "little-endian" set of GNU mod_defs for any testing in the meantime. |
I actually think the safest solution might be to specify the |
@DeniseWorthen that does make sense. Good work on finding that.
That is non-standard Fortran. Ifort (and maybe gfortran) have extensions, but they are compiler dependent. |
Is there any particular reason why WW3 uses big-endian unformatted files on a little-endian system? |
@arunchawla-NOAA might know more of that history. We've always used big-endian on WW3. |
@kgerheiser Do you mean that the words you specify for convert setting in the open statement is compiler dependent? I wrote 'big-endian' but it looks like both ifort and gnu use "convert=big_endian". |
I mean that the |
Thanks for the clarification. I think we're all agreed that WW3 compiler flags should not be propagated to all components, so we want to use Kyle's build fix. Are these our two options then?
|
The way that the CMAKE is introduced in WW3 it's not just choosing the "fconvert=big_endian " for WW3/UWM it's doing that for everyone and everyone might not agree to that. It's also pretty impactful for operations for a GNU-only issue. That being said, putting in non-standard fortran in the code for a compiler issue seems like not a great option, but perhaps still preferable if our options are only 1 or 2. |
@JessicaMeixner-NOAA Thanks for the clarification regarding how big_endian is not configurable for just UWM. What solution would you propose? |
If the choices are 1 or 2, I personally think 2 is the way to go. The other solutions I've thought of adds unnecessary complication. Should I make an issue in WW3 to make sure others are okay with this addition? |
I think specifying It makes the option explicit in each open statement that needs it, and cuts out setting the endian flag for each compiler. I couldn't find a compiler that doesn't support this setting. Even the NAG Fortran compiler has it, which is really strict about non-standard code. |
we chose big_endian because we were on multiple different types of platforms and this allowed us to pass binary files acros platforms and not worry about endian. Choice of big_endian was arbitrary. Maybe we just remove it and move to creating mod def files on each platform ? |
Or do what Kyle suggests :). Maybe now is when we switch to a netcdf mod_def file and do away with binary outputs :) |
That'd be my vote and hope we will when we get the time! |
I made a branch based off of develop (https://github.com/kgerheiser/WW3/tree/explicit-big-endian) that adds Commit here: https://github.com/kgerheiser/WW3/commit/109f9fc546be7a5ee95296a8aed70d6524f465e2 |
Description
Currently, CI includes tests for
cpld_control_c96_p8
and its debug run.Due to changes introduced in #1071, the
cpld_control_c96_p8
test with GNU results in segfaults originating fromufs-weather-model/GOCART/ESMF/UFS/Aerosol_Cap.F90:321
for runs both in a Docker container and on Hera.Solution
Switch
cpld_control_c96_p8
withcpld_control_c96_noaero_p8
in CI. Also make changes toopnReqTest
script and CI (library update and input data update) that are needed after the aerosol update.Update
CI tests have been made and merged into the UFS weather model via #1087. However, this issue will be kept open to address the segfault issue when running the coupled aerosol model with GNU 9 and 10. The issue title has been updated accordingly.
The text was updated successfully, but these errors were encountered: