-
Notifications
You must be signed in to change notification settings - Fork 110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FVCOM_tools fails when using GNU #596
Comments
@GeorgeVandenberghe-NOAA I know you are very busy, but could we get a bit of your time? @dmwright526 is getting an odd seg fault when running with GNU and we are looking for some guidance. I will let him explain the problem. |
I would recommend compiling and running with debug flags |
Trying to build the tests, but the function call is missing some arguments:
And compared to the function defintion:
Looks like it's missing 2 arguments: |
@kgerheiser Use this hash instead: This was before the recent update we had to push through quickly due to wrong initialization of surface roughness length causing the MYNN scheme to crash. It will take me a bit to update the tests and dummy data to include the fields, so this will at least allow to see if you can diagnose the problem. |
Thanks, I was able to build and reproduce the problem. Intel debug flags didn't catch anything. |
@kgerheiser We know you are super busy. Any progress on this? Should we ask @arunchawla-NOAA for some of your time? |
Seems like this is a problem in the FVCOM code and should be fixed by the programmers who submitted the code with the missing parameter. Or am I missing something... |
@edwardhartnett This is legacy code that was developed for the HRRR to ingest the lake surface conditions. It was developed purely on NOAA machines with Intel compilers and porting it to GNU is an unexpected challenge without an obvious solution. The code was pushed to get this feature into RRFS development sooner for testing over the winter. I am putting more time into it this week to try to find a solution. |
Who is the original author(s) of the code. The problem in testing reported, was a binding mismatch the intel compiler let slide (for now). But it's fundamental. If the original author isn't available we will probably have to do a thorough analysis of the code for other possible errors Intel is letting slide and gnu is catching. I don't have a lot of time until after WCOSS2 goes live, after that I should be able to go after it. |
Thanks @GeorgeVandenberghe-NOAA. Like As I mentioned above, this code is modified from HRRR for ingesting lake surface conditions which was developed on Intel compilers. While it is likely I added additional errors to the code, to my knowledge it was never ported to GNU. |
OK, have you tried with the address santizer in GNU? Build with CFLAGS='-g -fsanitize=address -fno-omit-frame-pointer' and rerun the test. You will get much more info about why it fails. |
Is there a PR up that demonstrates this problem? Aren't the FVCOM tests run in CI? If so, how come CI is passing? If not, can we get them to run in CI? Then we could all see the failure, and also we already have address sanitizer running in CI... |
The FVCOM unit tests are currently commented out in the CMakeList.txt file in the test directory as to not keep raising the known error with GNU. Adding this line back in will allow the unit test to be compiled and then run to show the error. The address sanitizer flags, as I tested them, only specifies the fcst_nwp object (called named "fcst" in the code) as the culprit. I have tried digging deeper with both the flags mentioned in this thread and running through the GDB debugger, but have yet to be successful in isolating the problem. |
@GeorgeVandenberghe-NOAA @edwardhartnett @kgerheiser Do any of you have availability to take a look at this bug some more? I have been attempting to debug it over the past few days using the unit test (tests/fvcom_tools). When compiling with both Intel and GNU, the .nc files are able to be read in correctly in the sorc/fvcom_tools/module_ncio.f90 (routines used for directly reading in netCDF data) and sorc/fvcom_tools/module_nwp.f90 (routines used to populate specific surface variables). The unit test will segfault with GNU when attempting to use variables (fv3sst, fv3ice, fv3iceT, etc.) passed out of sorc/fvcom_tools/module_nwp.f90 (subroutine read_nwp) to tests/fvcom_tools/ftst_readfvcomnetcdf.F90. |
Can you please rebuild with both C and Fortran FLAGS including -fsantize=address? |
@edwardhartnett thank you for the suggestion! I tried with this flag on, but it did not change the segfault output or give any more information on the problem within the object. |
Is there a test that demonstrates this problem? |
@edwardhartnett The unit test currently in the repository will produce the error (make test) |
and to run its unit test on Hera. Part of ufs-community#596.
To reproduce the problem on Hera, use my branch at 9ea1e2d. I get the following error:
The segfault occurs at "type(fcst_nwp)" declaration line of the test:
|
Commenting out some of the procedure statements (0de2503) eliminates the 'seg fault'. The test fails later, as expected.
|
Adding a new - but simple - subroutine to the class (0de2503) results in a seg fault:
So, there is something about those class procedures/subroutines that GNU does not like. |
FVCOM_tools and the corresponding unit test creates a segfault when running the code using the GNU compilers. Code was originally tested and developed using Intel compilers, so the exact reason for the error is unclear. The segfault points to the fcst_nwp object and potentially one of the pointers within the object.
The text was updated successfully, but these errors were encountered: