Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Iasi debug fix #790

Merged
merged 15 commits into from
Sep 20, 2024
Merged

Iasi debug fix #790

merged 15 commits into from
Sep 20, 2024

Conversation

wx20jjung
Copy link
Contributor

@wx20jjung wx20jjung commented Sep 9, 2024

Description

When the GSI is built in debug mode, the code failed when the read_iasi routine called combine_radobs. In tracking down this problem several other minor problems were discovered and fixed.

Resolves #789

The first group of commits are some write format changes I missed in the output of the runtime directory. I also found where the maximum channel number was set to 3000. I changed this array to be dynamic and allocated for each instrument.

The second group of commits resolves 2 errors I found while trying to fix the debug issue. The first was an error some of the IASI temperature values were "NAN"s. I traced this back to some of the cscale values (a scaling factor for IASI radiances, in the BUFR file), were missing values. The missing cscale values are associated with the shortwave side of the water vapor region and the shortwave channels. None of these channels are currently used. These were only found in the direct broadcast data. I did not find an instance where the operational data had missing values. The second error was an integer overflow problem with the variable nread being passed into combine radobs. Nread is used in most satellite instrument reads but the main failure was from read_iasing (to be added later). Nread is a counter and is ultimately number_of_profiles * number_of_channels. This number exceeded the memory for an integer and was ultimately a negative number. In combine_radobs, nread is compared to the total number of elements of the thinning box or itxmax to the number of elements of the task thinning box. In this case, the task thinning box was negative. I added logic to the various read_* routines to pass the number of profiles kept on each task. This makes nread and itxmax consistent. This change caused the total counts to be different for some instruments. Here is an example from the gsistat file for IASI.
previous:
o-g 01 rad metop-b iasi 331026696 6148957 690460 0.24889E+06 0.24889E+06 0.36047 0.36047
new:
o-g 01 rad metop-b iasi 44632896 6148957 690460 0.24889E+06 0.24889E+06 0.36047 0.36047
The total counts after thinning and total counts used are identical along with the other statistics.

The actual failure of read_iasi in debug mode came down to putting an if() cycle in a different place. Read_iasi is now consistent with read_cris.

Type of change

  • [X ] Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

The main testing and cycling experiments were conducted on S4 at C192 resolution. After a single cycle spinup I setup a "control" and "experiment" and run both independently changing the GSIEXEC variable to point to the gsi master branch (control) and the IASI_debug_fix branch (experiment). The control and experiment were run through 4 cycle. Static tests were also conducted on Jet.

The ctests on Hera all passed:
[Jim.Jung@hfe10 build]$ ctest -j 6
Test project /scratch1/NCEPDEV/jcsda/Jim.Jung/save/ctests/update/build
Start 1: global_4denvar
Start 6: global_enkf
Start 2: rtma
Start 3: rrfs_3denvar_rdasens
Start 4: hafs_4denvar_glbens
Start 5: hafs_3denvar_hybens
1/6 Test #3: rrfs_3denvar_rdasens ............. Passed 492.94 sec
2/6 Test #2: rtma ............................. Passed 970.72 sec
3/6 Test #5: hafs_3denvar_hybens .............. Passed 1043.42 sec
4/6 Test #6: global_enkf ...................... Passed 1118.55 sec
5/6 Test #4: hafs_4denvar_glbens .............. Passed 1159.69 sec
6/6 Test #1: global_4denvar ................... Passed 1909.17 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 1909.21 sec

The rrfs ctest timed out on jet, all others passed.
Test project /lfs5/HFIP/hfv3gfs/Jim.Jung/noscrub/ctests/update/build
Start 1: global_4denvar
Start 2: rtma
Start 3: rrfs_3denvar_rdasens
Start 4: hafs_4denvar_glbens
Start 5: hafs_3denvar_hybens
Start 6: global_enkf
1/6 Test #6: global_enkf ...................... Passed 1779.50 sec
2/6 Test #2: rtma ............................. Passed 1938.22 sec
3/6 Test #1: global_4denvar ................... Passed 1991.97 sec
4/6 Test #5: hafs_3denvar_hybens .............. Passed 2000.38 sec
5/6 Test #4: hafs_4denvar_glbens .............. Passed 2310.42 sec

The rrfs test also failed on hercules
[jjung@hercules-login-3 build]$ ctest
Test project /work/noaa/nesdis-rdo1/jjung/noscrub/ctests/update/build
Start 1: global_4denvar
1/6 Test #1: global_4denvar ................... Passed 1741.09 sec
Start 2: rtma
2/6 Test #2: rtma ............................. Passed 965.23 sec
Start 3: rrfs_3denvar_rdasens
3/6 Test #3: rrfs_3denvar_rdasens .............***Failed 484.58 sec
Start 4: hafs_4denvar_glbens
4/6 Test #4: hafs_4denvar_glbens .............. Passed 1161.19 sec
Start 5: hafs_3denvar_hybens
5/6 Test #5: hafs_3denvar_hybens .............. Passed 1094.42 sec
Start 6: global_enkf
6/6 Test #6: global_enkf ...................... Passed 847.14 sec

83% tests passed, 1 tests failed out of 6

Total Test time (real) = 6293.65 sec

The following tests FAILED:
3 - rrfs_3denvar_rdasens (Failed)
The memory for rrfs_3denvar_rdasens_loproc_updat is 1106724 KBs. This has exceeded maximum allowable memory of 1098631 KBs, resulting in Failure memthresh of the regression test.

Checklist

  • [X ] My code follows the style guidelines of this project
  • [X ] I have performed a self-review of my own code
  • [X ] I have commented my code, particularly in hard-to-understand areas
  • [X ] New and existing tests pass with my changes
  • [X ] Any dependent changes have been merged and published

Please add @ADCollard, @DavidHuber-NOAA and @InnocentSouopgui-NOAA as reviewers.

…his change allows the number of channels to be defined at run time.
…were caused by the radiance scaling factor being missing in the DBNet data. Corrected sending in profile counts into combine_radobs instead of total channel counts in most satellite data read routines. Not all of them were tested. Several no longer exist.
Copy link
Collaborator

@DavidHuber-NOAA DavidHuber-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just a couple minor suggestions.

src/gsi/combine_radobs.f90 Show resolved Hide resolved
src/gsi/read_iasi.f90 Outdated Show resolved Hide resolved
@wx20jjung
Copy link
Contributor Author

@DavidHuber-NOAA
I have been struggling with this. I do not know what the true intent of nread is. Was it developed for 2D (horizontal) observations or 3D data like the radiances? I interpret the code as being for the former. My argument is based on the allocate statement for nloc and icrit a few lines down. Both of these variables use itxmax, which is the total counts of a 2D spatial (horizontal) grid. To be consistent, I feel ncounts1 (and nread) should also be from a 2D spatial grid and not multiplied by the number of channels.

An alternative would be to add another argument to combine_radobs specifically for ncounts and ncounts1, which would all be long integers from and to the various read routines.

@DavidHuber-NOAA
Copy link
Collaborator

@wx20jjung I agree with you on the intent; that it was developed for 2D observations and so it is unlikely to every overflow with your changes. That said, I think I like the alternative approach you proposed just to be safe.

@wx20jjung
Copy link
Contributor Author

As per a conversation with @DavidHuber-NOAA, I've separated the total channel counts read (nread) from the thinning box (itxmax). Now, if the channel counts read becomes larger than the integer size, it will only affect the write statements. I've re-run the ctests on hera.

Test project /scratch1/NCEPDEV/jcsda/Jim.Jung/scrub/ctests/update/build
Start 2: rtma
Start 4: hafs_4denvar_glbens
Start 5: hafs_3denvar_hybens
Start 1: global_4denvar
Start 6: global_enkf
Start 3: rrfs_3denvar_rdasens
1/6 Test #3: rrfs_3denvar_rdasens ............. Passed 496.10 sec
2/6 Test #2: rtma ............................. Passed 972.37 sec
3/6 Test #5: hafs_3denvar_hybens .............. Passed 1106.76 sec
4/6 Test #6: global_enkf ...................... Passed 1159.74 sec
5/6 Test #4: hafs_4denvar_glbens ..............***Failed 1345.07 sec
6/6 Test #1: global_4denvar ................... Passed 1969.79 sec

83% tests passed, 1 tests failed out of 6

Total Test time (real) = 1969.84 sec

The hafs_4denvar_glbens test failed the time limit test. I restarted the test and it passed.

ctest --rerun-failed --output-on-failure
Test project /scratch1/NCEPDEV/jcsda/Jim.Jung/scrub/ctests/update/build
Start 4: hafs_4denvar_glbens
1/1 Test #4: hafs_4denvar_glbens .............. Passed 1348.48 sec
100% tests passed, 0 tests failed out of 1
Total Test time (real) = 1348.51 sec

I will run the ctests on other machines when I get feedback from the rest of the reviewers.

@RussTreadon-NOAA
Copy link
Contributor

WCOSS2 (Dogwood) ctests

Install wx20jjung:IASI_debug_fix at a41a78a on Dogwood. Use develop at 9f44c87 as the control. ctests results are as follows

Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr790/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens .............   Passed  729.88 sec
2/6 Test #6: global_enkf ......................   Passed  972.64 sec
3/6 Test #5: hafs_3denvar_hybens ..............   Passed  1153.84 sec
4/6 Test #4: hafs_4denvar_glbens ..............   Passed  1334.73 sec
5/6 Test #1: global_4denvar ...................   Passed  1864.83 sec
6/6 Test #2: rtma .............................   Passed  1869.75 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 1869.76 sec

Copy link
Contributor

@ADCollard ADCollard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a major improvement to me.
Thanks, Jim.

Copy link
Contributor

@RussTreadon-NOAA RussTreadon-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, @wx20jjung. Just a few minor comments.

src/enkf/innovstats.f90 Outdated Show resolved Hide resolved
src/gsi/combine_radobs.f90 Show resolved Hide resolved
src/gsi/radinfo.f90 Show resolved Hide resolved
src/gsi/read_iasi.f90 Outdated Show resolved Hide resolved
src/gsi/read_seviri.f90 Outdated Show resolved Hide resolved
Co-authored-by: David Huber <69919478+DavidHuber-NOAA@users.noreply.github.com>
… make this routine consistent with other satellite data read routines. The other two files are formatting changes.
@wx20jjung
Copy link
Contributor Author

I've made the various changes, tested everything both in the current branch (IASI_debug_fix) and in my IASI-NG branch in a single cycle test. Both results are as expected. The various satellite parameters in the gsistat (counts, bias, std dev, etc) are identical with (iasi_debug_fix) and without (develop) these changes. I found no more '****' fields in any of the output.

Copy link
Contributor

@RussTreadon-NOAA RussTreadon-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @wx20jjung for the updates. Question remains for src/gsi/read_seviri.f90

src/gsi/read_seviri.f90 Outdated Show resolved Hide resolved
…vious changes in combine_radobs. Some minor changes remain, lile in read_cris, where now nrec = 999999 to make it consistent with other read routines. The changes associated in fixing the read_iasi failure in debug mode remain.
Copy link
Contributor

@RussTreadon-NOAA RussTreadon-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wx20jjung, one question and one request for change.

Copy link
Contributor

@RussTreadon-NOAA RussTreadon-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @wx20jjung . I installed wx20jjung:IASI_debug_fix on Cactus this afternoon with num_profiles removed from read_seviri.f90 in my working copy. All ctests passed.

Copy link
Collaborator

@DavidHuber-NOAA DavidHuber-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good. Thanks @wx20jjung!

@wx20jjung
Copy link
Contributor Author

ctests on herculese completed as expected
[jjung@hercules-login-4 build]$ ctest -j 6
Test project /work/noaa/nesdis-rdo1/jjung/noscrub/ctests/update/build
Start 1: global_4denvar
Start 2: rtma
Start 3: rrfs_3denvar_rdasens
Start 4: hafs_4denvar_glbens
Start 5: hafs_3denvar_hybens
Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens ............. Passed 613.04 sec
2/6 Test #6: global_enkf ...................... Passed 785.35 sec
3/6 Test #2: rtma ............................. Passed 1084.97 sec
4/6 Test #5: hafs_3denvar_hybens .............. Passed 1152.51 sec
5/6 Test #4: hafs_4denvar_glbens .............. Passed 1278.23 sec
6/6 Test #1: global_4denvar ................... Passed 1741.15 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 1741.16 sec

Results from jet had 2 problems:
Test project /lfs5/HFIP/hfv3gfs/Jim.Jung/noscrub/ctests/update/build
Start 1: global_4denvar
Start 2: rtma
Start 3: rrfs_3denvar_rdasens
Start 4: hafs_4denvar_glbens
Start 5: hafs_3denvar_hybens
Start 6: global_enkf
1/6 Test #6: global_enkf ...................... Passed 1225.07 sec
2/6 Test #2: rtma ............................. Passed 1391.81 sec
3/6 Test #5: hafs_3denvar_hybens .............. Passed 1637.28 sec
4/6 Test #4: hafs_4denvar_glbens ..............***Failed 1697.58 sec
5/6 Test #1: global_4denvar ................... Passed 2410.05 sec

The rrfs tests timed out and did not complete.
The runtime for hafs_4denvar_glbens_loproc_updat is 335.324241 seconds. This has exceeded maximum allowable threshold time of 328.320275 seconds,
resulting in Failure time-thresh of the regression test.

My hera tests have been in the queue for the past 12 hours. I removed them.

… initialized twice in read_gmi.f90, once before and after it was allocated. I removed the first instance.
@RussTreadon-NOAA
Copy link
Contributor

WCOSS2 (Cactus) ctests
Install wx20jjung:IASI_debug_fix at 74947da on Dogwood. Run ctests with following results:

Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr790/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens .............   Passed  734.97 sec
2/6 Test #6: global_enkf ......................   Passed  864.32 sec
3/6 Test #2: rtma .............................   Passed  971.09 sec
4/6 Test #5: hafs_3denvar_hybens ..............   Passed  1215.19 sec
5/6 Test #4: hafs_4denvar_glbens ..............   Passed  1334.75 sec
6/6 Test #1: global_4denvar ...................   Passed  1683.76 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 1683.77 sec

Copy link
Contributor

@RussTreadon-NOAA RussTreadon-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @wx20jjung for working through my comments. Looks good.

Approve.

@RussTreadon-NOAA RussTreadon-NOAA merged commit 2b1d706 into NOAA-EMC:develop Sep 20, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

IASI combine_radobs failure in debug mode
4 participants