
MERRA2 aerosol options for UFS and the coupled model #200

Closed · AnningCheng-NOAA opened this issue Nov 30, 2020 · 43 comments · Fixed by #254

Labels: feature (New feature or request)

@AnningCheng-NOAA (Contributor)

Add 0.5° × 0.625°, 72-level, ten-year MERRA2 aerosol climatological data as an option to replace the 5° × 5° OPAC aerosol data used to drive radiation and microphysics. Initial tests have been performed using the CCPP SCM. One-year C768L127 free-forecast runs are underway on Dell, Hera, and Orion.

@KateFriedman-NOAA linked a pull request Feb 3, 2021 that will close this issue
@KateFriedman-NOAA added the feature (New feature or request) label Feb 3, 2021
@AnningCheng-NOAA (Contributor, Author)

New tests are needed for CCPP=NO, interval='24:00:00', and export FHMAX_GFS_00=384 on Dell, Orion, and Hera, plus a few more cases. Is cycled testing needed? Does "Waves" need to be on?

@KateFriedman-NOAA (Member) commented Feb 3, 2021

@AnningCheng-NOAA For the MERRA2 changes, please run two tests on each platform:

Test 1 - combine several settings into a single 2.5-cycle cycled test:

  1. interval=24 (gfs_cyc=1); make sure one of the full cycles in the test is 00z (I suggest starting with the 18z half cycle). Feel free to set gfs_cyc=4 when running the setup scripts if there is no reason not to run the gfs for 06z, 12z, or 18z
  2. FHMAX_GFS_00=384 (to make sure you don't hit walltime)
  3. DO_WAVE=YES
  4. RUN_CCPP=YES

That should invoke the parts of the system I need to see tested for this.
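
For reference, here is a minimal sketch of how those settings might look in config.base (variable names as used in this thread; the exact file syntax is an assumption):

    # Hypothetical config.base excerpt for Test 1 (names from this thread):
    export gfs_cyc=1            # one full gfs cycle per day (interval=24)
    export FHMAX_GFS_00=384     # 384-hour forecast at 00z (walltime check)
    export DO_WAVE="YES"        # waves on
    export RUN_CCPP="YES"       # run with CCPP physics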

Test 2 - a 1.5-cycle cycled test on each platform with RUN_CCPP=NO, to make sure adding support for MERRA2 does not break the remaining support for IPD (I will be dropping IPD support in the coming months, but not just yet).

Let me know if you run into any issues with either test that you need help with. Thanks!

@KateFriedman-NOAA (Member)

I was just told that waves are known not to work with CCPP on. Therefore Test 1 can be done with DO_WAVE=NO, and I am OK with DO_WAVE=NO in config.base when RUN_CCPP=YES.
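
As a sketch, the guard implied here might look like this in config.base (hypothetical logic, not the actual file contents):

    # Hypothetical config.base guard (assumption, not from the repo):
    if [ "${RUN_CCPP}" = "YES" ]; then
        export DO_WAVE="NO"    # waves currently do not work with CCPP
    fi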

@yangfanglin (Contributor) commented Feb 3, 2021 via email

@KateFriedman-NOAA (Member)

I was hoping to get the final v16 changes from NCO into develop before removing the final IPD support, but with the implementation delay I see I can't do that. So yes, I guess it's time to remove all IPD-related definitions in the workflow.

I'll update my PR review comments for this. Thanks!

@KateFriedman-NOAA (Member)

@AnningCheng-NOAA OK, since waves don't work with CCPP right now I'm adjusting my test request: please run a 2.5-cycle cycled test on each platform with RUN_CCPP=YES, DO_WAVE=NO, and FHMAX_GFS_00=384 so we can make sure it is OK in cycled mode. No Test 2 anymore. Thanks! Sorry for the confusion on my end!

@AnningCheng-NOAA (Contributor, Author) commented Feb 4, 2021 via email

@KateFriedman-NOAA (Member)

@AnningCheng-NOAA I've started pulling them into the WCOSS-Dell $FIX_DIR ($FIX_DIR/fix_aer and $FIX_DIR/fix_lut) from your WCOSS-Dell set. Then I will rsync them to the FIX_DIRs on the Crays, Hera, Jet, and Orion, and make a new HPSS tarball of $FIX_DIR. I see the fix_aer files are quite large, so the copies/rsyncs will take a while. Will report back when done, thanks!

@AnningCheng-NOAA (Contributor, Author) commented Feb 4, 2021 via email

@KateFriedman-NOAA (Member)

@AnningCheng-NOAA I see your Mars set is now dated February 4th and the Venus set (what I pulled from) is December 24th. I should pull from your Mars set then? Please confirm, thanks!

@AnningCheng-NOAA (Contributor, Author) commented Feb 4, 2021 via email

@KateFriedman-NOAA (Member)

@AnningCheng-NOAA The new fix files are now in all $FIX_DIRs on WCOSS-Dell, WCOSS-Cray, Hera, and rzdm. I'm copying them to Orion and Jet this morning. Below is the listing of them on Hera under $FIX_DIR/fix_aer and $FIX_DIR/fix_lut. I'm also putting a fresh copy of $FIX_DIR on HPSS for our archival. You can now remove the paths to FIX_AER and FIX_LUT in config.base.emc.dyn, thanks.

-bash-4.2$ ll /scratch1/NCEPDEV/global/glopara/fix/
total 108
-rwxr-xr-x  1 glopara global   160 Oct  3  2019 0readme
drwxr-sr-x  2 glopara global  4096 Feb  4 17:05 fix_aer
drwxr-sr-x  5 glopara global 61440 Dec  2 18:18 fix_am
drwxr-sr-x  5 glopara global  4096 Jun 10  2019 fix_chem
drwxr-sr-x 10 glopara global  4096 Jul 28  2017 fix_fv3
drwxr-sr-x 10 glopara global  4096 Dec 31  2017 fix_fv3_gmted2010
drwxr-xr-x  6 glopara global  4096 Dec 13  2019 fix_gldas
drwxr-sr-x  2 glopara global  4096 Feb  4 15:37 fix_lut
drwxr-sr-x  2 glopara global  4096 Aug 31 14:11 fix_orog
drwxr-sr-x  2 glopara global  4096 Sep 13  2019 fix_sfc_climo
drwxr-sr-x  4 glopara global  4096 May 11  2018 fix_verif
drwxr-sr-x  2 glopara global  4096 Oct 26 14:59 fix_wave_gfs
-bash-4.2$ ll /scratch1/NCEPDEV/global/glopara/fix/fix_aer/
total 12693072
-rwxr-xr-x 1 glopara global 1018901352 Feb  4 15:49 merra2.aerclim.2003-2014.m01.nc
-rwxr-xr-x 1 glopara global 1018901352 Feb  4 15:50 merra2.aerclim.2003-2014.m02.nc
-rwxr-xr-x 1 glopara global 1018901352 Feb  4 15:43 merra2.aerclim.2003-2014.m03.nc
-rwxr-xr-x 1 glopara global 1018901352 Feb  4 15:53 merra2.aerclim.2003-2014.m04.nc
-rwxr-xr-x 1 glopara global 1018901352 Feb  4 15:41 merra2.aerclim.2003-2014.m05.nc
-rwxr-xr-x 1 glopara global 1018901352 Feb  4 15:54 merra2.aerclim.2003-2014.m06.nc
-rwxr-xr-x 1 glopara global 1018901352 Feb  4 15:47 merra2.aerclim.2003-2014.m07.nc
-rwxr-xr-x 1 glopara global 1018901352 Feb  4 15:50 merra2.aerclim.2003-2014.m08.nc
-rwxr-xr-x 1 glopara global 1018901352 Feb  4 15:42 merra2.aerclim.2003-2014.m09.nc
-rwxr-xr-x 1 glopara global 1018901352 Feb  4 15:52 merra2.aerclim.2003-2014.m10.nc
-rwxr-xr-x 1 glopara global 1018901352 Feb  4 15:46 merra2.aerclim.2003-2014.m11.nc
-rwxr-xr-x 1 glopara global 1018901352 Feb  4 15:44 merra2.aerclim.2003-2014.m12.nc
-rwxr-xr-x 1 glopara global   64218936 Feb  4 15:52 merra2C.aerclim.2003-2014.m01.nc
-rwxr-xr-x 1 glopara global   64218936 Feb  4 15:52 merra2C.aerclim.2003-2014.m02.nc
-rwxr-xr-x 1 glopara global   64218936 Feb  4 15:46 merra2C.aerclim.2003-2014.m03.nc
-rwxr-xr-x 1 glopara global   64218936 Feb  4 15:45 merra2C.aerclim.2003-2014.m04.nc
-rwxr-xr-x 1 glopara global   64218936 Feb  4 15:54 merra2C.aerclim.2003-2014.m05.nc
-rwxr-xr-x 1 glopara global   64218936 Feb  4 15:54 merra2C.aerclim.2003-2014.m06.nc
-rwxr-xr-x 1 glopara global   64218936 Feb  4 15:47 merra2C.aerclim.2003-2014.m07.nc
-rwxr-xr-x 1 glopara global   64218936 Feb  4 15:52 merra2C.aerclim.2003-2014.m08.nc
-rwxr-xr-x 1 glopara global   64218936 Feb  4 15:45 merra2C.aerclim.2003-2014.m09.nc
-rwxr-xr-x 1 glopara global   64218936 Feb  4 15:46 merra2C.aerclim.2003-2014.m10.nc
-rwxr-xr-x 1 glopara global   64218936 Feb  4 15:45 merra2C.aerclim.2003-2014.m11.nc
-rwxr-xr-x 1 glopara global   64218936 Feb  4 15:45 merra2C.aerclim.2003-2014.m12.nc
-bash-4.2$ ll /scratch1/NCEPDEV/global/glopara/fix/fix_lut/
total 73428
-rwxr-xr-x 1 glopara global   202000 Jun 24  2019 optics_BC.v1_3.dat
-rwxr-xr-x 1 glopara global   461637 Jun 24  2019 optics_DU.v15_3.dat
-rwxr-xr-x 1 glopara global 73711072 Jun 24  2019 optics_DU.v15_3.nc
-rwxr-xr-x 1 glopara global   202000 Jun 24  2019 optics_OC.v1_3.dat
-rwxr-xr-x 1 glopara global   502753 Jun 24  2019 optics_SS.v3_3.dat
-rwxr-xr-x 1 glopara global   101749 Jun 24  2019 optics_SU.v1_3.dat
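
After removing those paths, config.base.emc.dyn would presumably derive FIX_AER and FIX_LUT from the shared fix area; a hypothetical sketch (the derivation is an assumption, not the actual file contents):

    # Hypothetical excerpt: point at the shared FIX_DIR subdirectories
    export FIX_AER="${FIX_DIR}/fix_aer"
    export FIX_LUT="${FIX_DIR}/fix_lut"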

@AnningCheng-NOAA (Contributor, Author) commented Feb 5, 2021 via email

@KateFriedman-NOAA (Member)

Anning, I took a look and your jobs have appropriate resource requests. My most recent C768C384L127 test on Orion used more resources since I ran with waves on, so your smaller requests should be fine. It's possible the queues are very busy and/or the compute account allocation you're using is nearly exhausted. I see you're using fv3-cpu, which looks close to its allocation (via the saccount_params command):

        Project: fv3-cpu
                LevelFairshare=0.705    Core Hours Used (30 days)=2634324.7,30-day Allocation=2812246
                Partition Access: ALL
                Available QOSes: batch,debug,novel,urgent,windfall

You could try another compute account if you have access to one (check via the saccount_params command), but if the queues are busy you'll keep waiting.

@KateFriedman-NOAA (Member)

@AnningCheng-NOAA @yangfanglin FYI, after discussing the upcoming commit plan for develop with the other global-workflow code managers, we have decided to hold this work and PR #254 for a bit (~2-3 weeks). Since this PR moves the ufs-weather-model version forward to one that supports hpc-stack, we want to get the other hpc-stack changes into develop first. Please complete current testing and keep your branch synced with develop changes. You may leave the PR open. Thanks!

@AnningCheng-NOAA (Contributor, Author) commented Feb 9, 2021 via email

@KateFriedman-NOAA (Member)

You need some IC files for the analysis; the missing file is one of them. This is missing:

/scratch1/NCEPDEV/stmp2/Anning.Cheng/ROTDIRS/mcyc/gdas.20200204/00/atmos/gdas.t00z.abias_air

Where did you get your ICs? You'll need to pull out the following files that came from the same or a companion tarball:

  • gdas.t00z.abias
  • gdas.t00z.abias_air
  • gdas.t00z.abias_pc
  • gdas.t00z.radstat

Point me to your IC source and I'll see where those four files are. Thanks!
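
If the ICs came from an HPSS tarball, a hypothetical extraction of just those four files might look like this (the tarball and member paths are placeholders, not from this thread):

    # Hypothetical htar extraction; adjust the tarball and member paths:
    htar -xvf /NCEPDEV/hpss/path/to/ics.tar \
        ./gdas.t00z.abias ./gdas.t00z.abias_air \
        ./gdas.t00z.abias_pc ./gdas.t00z.radstat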

@AnningCheng-NOAA (Contributor, Author) commented Feb 9, 2021 via email

@AnningCheng-NOAA (Contributor, Author) commented Feb 9, 2021 via email

@KateFriedman-NOAA (Member)

Ah, I see you have the files there; they are just not in the atmos folder:

-bash-4.2$ ll /scratch1/NCEPDEV/stmp2/Anning.Cheng/ROTDIRS/mcyc/gdas.20200204/00/
total 3286744
drwxr-xr-x 5 Anning.Cheng climate      20480 Feb  9 16:31 atmos
-rw-r--r-- 1 Anning.Cheng stmp        859665 Feb  9 07:24 gdas.t00z.abias
-rw-r--r-- 1 Anning.Cheng stmp       1082939 Feb  9 07:24 gdas.t00z.abias_air
-rw-r--r-- 1 Anning.Cheng stmp        859665 Feb  9 07:24 gdas.t00z.abias_int
-rw-r--r-- 1 Anning.Cheng stmp        917490 Feb  9 07:24 gdas.t00z.abias_pc
-rw-r--r-- 1 Anning.Cheng stmp             0 Feb  9 07:24 gdas.t00z.loginc.txt
-rwxr-x--- 1 Anning.Cheng stmp    3361832960 Feb  9 07:32 gdas.t00z.radstat

Move all of those files (abias, abias_air, abias_int, abias_pc, loginc.txt, radstat) down into that atmos folder. Then retry your failed jobs.
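
A minimal sketch of that move, using the paths from the listing above:

    cd /scratch1/NCEPDEV/stmp2/Anning.Cheng/ROTDIRS/mcyc/gdas.20200204/00
    mv gdas.t00z.* atmos/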

@AnningCheng-NOAA (Contributor, Author) commented Feb 17, 2021 via email

@KateFriedman-NOAA (Member)

@AnningCheng-NOAA The cause of the error isn't jumping out at me; we usually see that error in the forecast jobs, not the analysis. @CatherineThomas-NOAA @CoryMartin-NOAA would you mind taking a look at Anning's failed analysis job on Orion? See the log below. He is testing the system after adding support for MERRA2. Thanks!

/work/noaa/stmp/acheng/ROTDIRS/mcyc/logs/2020020406/gdasanal.log

@CoryMartin-NOAA (Contributor)

I took a look; I'm not totally sure, but it seems like there is a problem reading the netCDF surface forecast files. Is there anything different in the sfcfNNN.nc files in this run compared to a standard version? Are you able to rerun the gdasfcst from the previous cycle and try it again? This looks like the error we were having before, where the model would write out 'bad' netCDF files that were then unreadable by GSI.

@CatherineThomas-NOAA (Contributor)

I was just getting ready to say the same thing. The values of tref in the sfcfNNN.nc files look reasonable at least. @KateFriedman-NOAA does Orion have similar netCDF problems as Hera?

@KateFriedman-NOAA (Member)

does Orion have similar netCDF problems as Hera?

@CatherineThomas-NOAA Not as frequently as Hera, but yes. I looked back at my Orion runs since last May and found HDF errors in the efcs jobs of a CCPP run (last November) and in analysis jobs while I was testing port2orion last June. No HDF errors in any of the short cycled runs I've done since then. I'm starting to test the full system using hpc-stack, so I'm keeping an eye out for these errors on both machines.

@AnningCheng-NOAA (Contributor, Author) commented Feb 18, 2021 via email

@AnningCheng-NOAA (Contributor, Author) commented Feb 22, 2021 via email

@KateFriedman-NOAA (Member) commented Feb 22, 2021

My run of the system (feature/hpc-stack) on Hera, with all components built using hpc-stack, was successful. I did not see any HDF5 errors, but I've only run 2.5 cycles so far. The GSI master doesn't yet support hpc-stack on other machines, so I can't perform the same test on Orion yet.

@CatherineThomas-NOAA @CoryMartin-NOAA Is there a GSI branch with stack support for Orion that I can try? Thanks!

@RussTreadon-NOAA (Contributor) commented Feb 22, 2021 via email

@KateFriedman-NOAA (Member)

Thanks @RussTreadon-NOAA ! I'll try that branch on Orion and WCOSS-Dell to test global-workflow feature/hpc-stack.

@RussTreadon-NOAA (Contributor) commented Feb 22, 2021 via email

@RussTreadon-NOAA (Contributor)

Note that /work/noaa/global/acheng/gfsv16_ccpp/modulefiles/module_base.orion loads

module use /apps/contrib/NCEPLIBS/orion/modulefiles
module load hdf5_parallel/1.10.6
module use /apps/contrib/NCEPLIBS/lib/modulefiles
module load netcdfp/4.7.4

when it executes gdasanal.

In contrast, /work/noaa/global/acheng/gfsv16_ccpp/sorc/gsi.fd builds DA with

module use /apps/contrib/NCEPLIBS/lib/modulefiles
module load netcdfp/4.7.4.release

The NOAA-EMC/GSI master also builds DA on Orion using netcdfp/4.7.4.release.

Might the difference between the workflow build and run modules cause problems?
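
If that mismatch is the culprit, one minimal fix sketch would be to make the run-time modulefile load the same netCDF the DA build uses (module names are from this thread; whether the .release variant is the right one to standardize on is an assumption):

    # Hypothetical module_base.orion change: match the GSI build's netCDF
    module use /apps/contrib/NCEPLIBS/lib/modulefiles
    module load netcdfp/4.7.4.release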

@RussTreadon-NOAA (Contributor)

FYI, a stand-alone GSI run script successfully ran the 2020020406 case on Orion using a global_gsi.x built from NOAA-EMC/GSI tag release/gfsda.v16.0.0. The run script loads modulefile.ProdGSI.orion found in gsi.fd/modulefiles.

@AnningCheng-NOAA (Contributor, Author) commented Feb 23, 2021 via email

@RussTreadon-NOAA (Contributor) commented Feb 23, 2021 via email

@AnningCheng-NOAA (Contributor, Author) commented Feb 23, 2021 via email

@RussTreadon-NOAA (Contributor) commented Feb 23, 2021 via email

@RussTreadon-NOAA (Contributor)

The following test has been run on Orion.

  • Copy "/work/noaa/global/acheng/para_gfs/mcyco" to "/work/noaa/da/Russ.Treadon/para_gfs/mcyco" and update it to run under my PTMP using acheng's HOMEgfs.
  • Populate "/work/noaa/stmp/rtreadon/ROTDIRS/mcyco" with files from "/work/noaa/stmp/acheng/ROTDIRS/mcyco".
  • rocotorewind and rocotoboot the 2020020406 gdasanal. The job requested 125 nodes with a lengthy estimated queue wait time, so I ran scancel, reduced the analysis job to 50 nodes, and resubmitted.

The job ran successfully up to the specified 1-hour wall-clock limit; global_gsi.x was 2/3 of the way through the second outer loop when the system killed the job. No netCDF or HDF5 errors in the job log file.

Anning's run used 125 nodes for gdasanal. I reverted to this, regenerated the xml, and resubmitted the 2020020406 gdasanal. The job is waiting in the queue.
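
For reference, a hypothetical form of the rewind/boot commands described above (the workflow xml and database file names are placeholders):

    rocotorewind -w mcyco.xml -d mcyco.db -c 202002040600 -t gdasanal
    rocotoboot   -w mcyco.xml -d mcyco.db -c 202002040600 -t gdasanal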

@AnningCheng-NOAA (Contributor, Author) commented Feb 24, 2021 via email

@RussTreadon-NOAA (Contributor)

My rerun of mcyco gdasanal for 2020020406 using 125 nodes ran overnight without any errors. The previous 50-node job was terminated after hitting the one-hour wall-clock limit, but based on the minimization stats in the log file it was reproducing the output of the 125-node job. This makes sense: GSI results do not vary with task count. Since the queue wait time for a 50-node job is less than that of a 125-node job, you should examine the resource settings in your parallel; you might get better throughput if you reduce the node (task) count and appropriately increase the wall-clock limit.
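
As a sketch, such a change might look like this in the workflow's resource config (the variable names are assumptions based on global-workflow conventions, not confirmed in this thread):

    # Hypothetical config.resources tweak for gdasanal:
    export npe_anal=2000           # e.g. 50 nodes x 40 tasks, down from 125 nodes
    export wtime_anal="02:00:00"   # longer wall-clock limit to compensate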

Based on your comments, Anning, it seems the gdasanal problem was not DA, but something in the workflow or compilation. Is this correct?

@AnningCheng-NOAA (Contributor, Author) commented Feb 25, 2021 via email

@RussTreadon-NOAA (Contributor)

Thanks for the confirmation. I'll stand down on this issue.

@KateFriedman-NOAA (Member)

PR #254 has been submitted and has closed this issue. Thank you @AnningCheng-NOAA for this addition and thank you @lgannoaa for testing/reviewing! Will send announcement to glopara listserv shortly.
