Migrate Jet to /lfs5 #2841

Closed
5 of 17 tasks
KateFriedman-NOAA opened this issue Aug 16, 2024 · 35 comments · Fixed by #2878
Assignees
Labels
feature New feature or request

Comments

@KateFriedman-NOAA
Member

KateFriedman-NOAA commented Aug 16, 2024

What new functionality do you need?

The /lfs4 filesystem has become unusable and users need to migrate to /lfs5. The global-workflow, libraries, and components (both internal and external) will need to be updated to use /lfs5.

What are the requirements for the new functionality?

The following need to be updated to use /lfs5:

  • spack-stack (need install of v1.6.0 on /lfs5) - /contrib/spack-stack/spack-stack-1.6.0/envs/gsi-addon-intel/install/modulefiles/Core
  • global-workflow
  • fix/FIX_DIR - relocated to /lfs5/HFIP/hfv3gfs/glopara/FIX/fix

The following need to be updated to use the migrated spack-stack install before global-workflow can be fully migrated:

  • gdas (?)
  • gfs-utils
  • gsi-enkf
  • gsi-monitor
  • gsi-utils
  • UPP
  • UFS_UTILS
  • ufs-weather-model
  • EMC_verif-global
  • wxflow (?)
  • obsproc - obsproc.v1.2.0-rd-gfsv17 tag cut and installed everywhere (Jet: /lfs5/HFIP/hfv3gfs/glopara/git/obsproc/v1.2.0)
  • prepobs - prepobs.v1.1.0-rd-gfsv17 tag cut and installed everywhere (Jet: /lfs5/HFIP/hfv3gfs/glopara/git/prepobs/v1.1.0)
  • Fit2Obs - v1.1.3 tag cut and installed everywhere (Jet: /lfs5/HFIP/hfv3gfs/glopara/git/Fit2Obs/v1.1.3)
  • tracker

Acceptance Criteria

All components build and run on /lfs5

Suggest a solution (optional)

No response

KateFriedman-NOAA added the feature (New feature or request) and triage (Issues that are triage) labels Aug 16, 2024
@DavidHuber-NOAA
Contributor

@InnocentSouopgui-NOAA Would you be able to help with this after spack-stack has been installed on /lfs5?

@DavidHuber-NOAA
Contributor

The spack-stack installation is being tracked here: JCSDA/spack-stack#1250

@InnocentSouopgui-NOAA
Contributor

Sure, I will do that. I am already tracking the migration of spack-stack on Jet.

@KateFriedman-NOAA
Member Author

Thanks @InnocentSouopgui-NOAA ! I can take care of Fit2Obs, obsproc, and prepobs. I can probably help with other components too after those are done. I am already working on unrelated updates to obsproc and prepobs so I'll fold the Jet updates into those efforts.

@KateFriedman-NOAA
Member Author

A new spack-stack/1.6.0 install is now available under /contrib on Jet (equivalent to the gsi-addon-dev env we had before): /contrib/spack-stack/spack-stack-1.6.0/envs/gsi-addon-intel/install/modulefiles/Core

@InnocentSouopgui-NOAA
Contributor

Thanks @InnocentSouopgui-NOAA ! I can take care of Fit2Obs, obsproc, and prepobs. I can probably help with other components too after those are done. I am already working on unrelated updates to obsproc and prepobs so I'll fold the Jet updates into those efforts.

@KateFriedman-NOAA, where are you with the external dependencies?
I built all the other components of Global Workflow (those that get built with the build_all.sh script inside sorc), and I want to start testing the cycling.

@KateFriedman-NOAA
Member Author

@InnocentSouopgui-NOAA Fit2Obs is done and installed on Jet here (note the new v1.1.3 version): /lfs5/HFIP/hfv3gfs/glopara/git/Fit2Obs/v1.1.3

Obsproc is in review (see NOAA-EMC/obsproc#92). We'll be going to v1.2 with this. I will let you know when it is installed on Jet.

I am planning to work on prepobs today and combine the work with our move to the new v1.1.0 version that went into ops. Will also install this on Jet when ready and inform you.

KateFriedman-NOAA added a commit to KateFriedman-NOAA/global-workflow that referenced this issue Aug 26, 2024
- Update to obsproc/v1.2.0 and prepobs/v1.1.0
- Revert back to glopara installs on Orion/Hercules
- Remove default version for obsproc in config.base

Refs NOAA-EMC#2291
Refs NOAA-EMC#2840
Refs NOAA-EMC#2841
@KateFriedman-NOAA
Member Author

@InnocentSouopgui-NOAA Updated obsproc/v1.2 is now installed on Jet: /lfs5/HFIP/hfv3gfs/glopara/git/obsproc/v1.2.0

I will have a PR shortly that will update develop to v1.2 everywhere, so you can either ingest that from develop into a branch you're using or update obsproc_run_ver=1.2.0 in run.spack.ver in advance.

@KateFriedman-NOAA
Member Author

From Jet admins:

Jet /lfs4 migration to /lfs5 
Due to ongoing issues with /lfs4 we request that all users migrate their active data from /lfs4 to /lfs5 
before next Wednesday 9/4. All projects that have quota on /lfs4 have been given quota on /lfs5.  
Weather permitting, on Wednesday 9/4 we plan to have an /lfs4 test outage from 1000  to ~1600 MT
to verify all /lfs4 dependences have been removed. If this test outage is successful we plan to make 
/lfs4 read only for 2 more weeks, then unmounting it ~9/17. 

@KateFriedman-NOAA
Member Author

KateFriedman-NOAA commented Aug 28, 2024

@InnocentSouopgui-NOAA Fit2Obs, obsproc, and prepobs are now ready and installed on Jet. See the checklist in the main issue comment for paths. You'll need to update fit2obs_ver=1.1.3, obsproc_run_ver=1.2.0, and prepobs_run_ver=1.1.0 in versions/run.spack.ver to use them. Also update BASE_GIT in workflow/hosts/jet.yaml to be /lfs5/HFIP/hfv3gfs/glopara/git.
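For reference, the three version settings named above might look like the lines below. This is only a sketch: it assumes the versions file uses plain export name=value lines (the form shown for ens_tracker_ver elsewhere in this thread), and it leaves out every other entry the real file contains.

```shell
# Sketch of the three settings only; assumed "export name=value" format.
# The real versions/run.spack.ver has other entries that are not shown.
export fit2obs_ver=1.1.3
export obsproc_run_ver=1.2.0
export prepobs_run_ver=1.1.0

# Echo the values back so a sourced copy can be checked by eye.
echo "fit2obs=${fit2obs_ver} obsproc=${obsproc_run_ver} prepobs=${prepobs_run_ver}"
```

The BASE_GIT change in workflow/hosts/jet.yaml is a separate one-line edit and is not covered by this sketch.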

@InnocentSouopgui-NOAA
Contributor

@InnocentSouopgui-NOAA Fit2Obs, obsproc, and prepobs are now ready and installed on Jet. See the checklist in the main issue comment for paths. You'll need to update fit2obs_ver=1.1.3, obsproc_run_ver=1.2.0, and prepobs_run_ver=1.1.0 in versions/run.spack.ver to use them. Also update BASE_GIT in workflow/hosts/jet.yaml to be /lfs5/HFIP/hfv3gfs/glopara/git.

Thanks Kate,
With this I will start testing the whole Global Workflow system.

@InnocentSouopgui-NOAA
Contributor

@KateFriedman-NOAA, there is an environment variable in the GSI module file that references a space on /lfs4; see below. Are you in charge of that one as well?

pushenv("GSI_BINARY_SOURCE_DIR", "/mnt/lfs4/HFIP/hfv3gfs/glopara/git/fv3gfs/fix/gsi/20240208")

@KateFriedman-NOAA
Member Author

@KateFriedman-NOAA, there is an environment variable in the GSI module file that references a space on /lfs4; see below. Are you in charge of that one as well?

pushenv("GSI_BINARY_SOURCE_DIR", "/mnt/lfs4/HFIP/hfv3gfs/glopara/git/fv3gfs/fix/gsi/20240208")

If you're talking about the fix/gsi/20240208 folder then yes. That should now be /lfs5/HFIP/hfv3gfs/glopara/FIX/fix/gsi/20240208. Ideally though that shouldn't be a hardcoded path. @RussTreadon-NOAA is there a way to make this not hardcoded?
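As an illustration of the path swap described above, the sketch below rewrites the old fix prefix to the new one with sed. It runs against a temporary one-line stand-in for the modulefile, not the real modulefiles/gsi_jet.intel.lua, and it says nothing about how the change should be reviewed or committed.

```shell
# Old and new prefixes are the ones quoted in this thread.
old_fix='/mnt/lfs4/HFIP/hfv3gfs/glopara/git/fv3gfs/fix'
new_fix='/lfs5/HFIP/hfv3gfs/glopara/FIX/fix'

# A temporary file stands in for the modulefile (assumption for the demo).
modulefile="$(mktemp)"
cat > "${modulefile}" << 'EOF'
pushenv("GSI_BINARY_SOURCE_DIR", "/mnt/lfs4/HFIP/hfv3gfs/glopara/git/fv3gfs/fix/gsi/20240208")
EOF

# '|' is used as the sed delimiter so the slashes in the paths need no
# escaping; -i.bak edits in place and keeps a backup of the original.
sed -i.bak "s|${old_fix}|${new_fix}|g" "${modulefile}"

updated_line="$(cat "${modulefile}")"
echo "${updated_line}"
rm -f "${modulefile}" "${modulefile}.bak"
```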

@RussTreadon-NOAA
Contributor

@InnocentSouopgui-NOAA , please follow the change control procedure described under GSI: How to Make Changes to update the GSI_BINARY_SOURCE_DIR path in modulefiles/gsi_jet.intel.lua. (FYI, I am not the GSI code manager. We do not have a GSI code manager.)

@KateFriedman-NOAA , GSI_BINARY_SOURCE_DIR was added when EIB split the gerrit GSI fix into ASCII & Binary files. ASCII files are managed via a GSI-fix submodule hash included as GSI fix. Binary fix files are managed in EIB space. Fortunately GSI binary fix files do not change frequently. I don't have a good suggestion as to how to remove a fixed path for GSI_BINARY_SOURCE_DIR.

A softening of the hardcoded path approach would be to have GSI_BINARY_SOURCE_DIR point at an EIB-maintained link. The link points at the most recent set of GSI binary fix files. For example, GSI modulefiles/gsi_hera.intel.lua could be updated to read

pushenv("GSI_BINARY_SOURCE_DIR", "/scratch1/NCEPDEV/global/glopara/fix/gsi/latest")

where latest is a soft link currently pointing at /scratch1/NCEPDEV/global/glopara/fix/gsi/20240208.

This way EIB can change the directory that the soft link latest points to without any changes to GSI_BINARY_SOURCE_DIR.

The disadvantage of this approach is that it is not readily apparent to GSI developers which snapshot of the GSI binary fix files they are using. Also, we still have a hardcoded path for GSI_BINARY_SOURCE_DIR ... it's just that we don't need to change the date string when GSI binary fix files are updated.
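The soft-link idea can be sketched in a scratch directory as follows. The directory layout mirrors the fix/gsi naming used above, but the 20240901 snapshot name is made up for the demo and nothing here touches a real fix area.

```shell
# Scratch area standing in for .../fix (assumption for the demo).
fix_root="$(mktemp -d)"
mkdir -p "${fix_root}/gsi/20240208" "${fix_root}/gsi/20240901"

# Point "latest" at the current snapshot.
# -s: symbolic link, -f: replace an existing link,
# -n: do not follow an existing link to a directory (otherwise the new
#     link would be created inside the old target).
ln -sfn 20240208 "${fix_root}/gsi/latest"
first_target="$(readlink "${fix_root}/gsi/latest")"

# Later, repoint the link; GSI_BINARY_SOURCE_DIR itself would not change.
ln -sfn 20240901 "${fix_root}/gsi/latest"
second_target="$(readlink "${fix_root}/gsi/latest")"

echo "latest: ${first_target} then ${second_target}"
rm -rf "${fix_root}"
```

Running readlink on the link is also one way a developer could see which snapshot is in use, which partly addresses the visibility concern.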

@KateFriedman-NOAA
Member Author

@RussTreadon-NOAA Thanks for the refresher on how GSI_BINARY_SOURCE_DIR came to be what it is. It sounds like replacing the old hardcoded path with the new hardcoded path is the easiest thing right now. I'm not a fan of the latest symlink because of the disadvantage you outlined.

@InnocentSouopgui-NOAA
Contributor

@KateFriedman-NOAA, a couple of things that require your attention

/parm/config/gfs/config.aero points to /lfs4/HFIP/hfv3gfs/glopara/data, which has not yet moved.

  • AERO_INPUTS_DIR="/lfs4/HFIP/hfv3gfs/glopara/data/gocart_emissions"

You must be busy, but whenever you can, please provide TC_tracker too. I need this for a full test of global workflow.

@InnocentSouopgui-NOAA
Contributor

@KateFriedman-NOAA, just a thought here. Would it be better to have the external dependencies of Global Workflow on /contrib so that they are storage independent?
That just crossed my mind.

@InnocentSouopgui-NOAA
Contributor

@KateFriedman-NOAA, there are other data that need to move.
In summary, I think the whole /lfs4/HFIP/hfv3gfs/glopara should be copied.

All of the following settings in workflow/hosts/jet.yaml reference a subdirectory of /lfs4/HFIP/hfv3gfs/glopara:

  • DMPDIR: '/lfs4/HFIP/hfv3gfs/glopara/dump'
  • BASE_IC: '/mnt/lfs4/HFIP/hfv3gfs/glopara/data/ICSDIR'
  • PACKAGEROOT: '/lfs4/HFIP/hfv3gfs/glopara/nwpara'
  • COMINsyn: '/lfs4/HFIP/hfv3gfs/glopara/com/gfs/prod/syndat'
  • COMINecmwf: /mnt/lfs4/HFIP/hfv3gfs/glopara/data/external_gempak/ecmwf
  • COMINnam: /mnt/lfs4/HFIP/hfv3gfs/glopara/data/external_gempak/nam
  • COMINukmet: /mnt/lfs4/HFIP/hfv3gfs/glopara/data/external_gempak/ukmet
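One way to double-check a list like this is to search a checkout for any remaining /lfs4 strings. The sketch below is only an illustration: find_lfs4_refs is a hypothetical helper, and the demo file is a three-line stand-in for workflow/hosts/jet.yaml rather than the real file. It assumes GNU grep.

```shell
# Hypothetical helper: print every text line under a directory that still
# mentions /lfs4. The fixed string '/lfs4/' also matches /mnt/lfs4/ paths.
find_lfs4_refs() {
  # -r recurse, -n show line numbers, -I skip binary files, -F fixed string.
  # "|| true" keeps a clean (no match) result from looking like a failure.
  grep -rnIF '/lfs4/' "${1:-.}" || true
}

# Self-contained demo on a throwaway directory standing in for a checkout.
demo_dir="$(mktemp -d)"
mkdir -p "${demo_dir}/workflow/hosts"
cat > "${demo_dir}/workflow/hosts/jet.yaml" << 'EOF'
BASE_GIT: '/lfs5/HFIP/hfv3gfs/glopara/git'
DMPDIR: '/lfs4/HFIP/hfv3gfs/glopara/dump'
BASE_IC: '/mnt/lfs4/HFIP/hfv3gfs/glopara/data/ICSDIR'
EOF

lfs4_hits="$(find_lfs4_refs "${demo_dir}")"
printf '%s\n' "${lfs4_hits}"
rm -rf "${demo_dir}"
```

In the demo, the DMPDIR and BASE_IC lines are reported and the BASE_GIT line, already on /lfs5, is not.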

@KateFriedman-NOAA
Member Author

@InnocentSouopgui-NOAA Here is the status of the various glopara folders moving from /lfs4 to /lfs5:

  • DMPDIR - it's on the move to /lfs5/HFIP/hfv3gfs/glopara/dump; it's quite large, so it will take a few days if not more
  • BASE_IC - also on the move to /lfs5/HFIP/hfv3gfs/glopara/data/ICSDIR
  • COM (/lfs4/HFIP/hfv3gfs/glopara/com) - now in place at /lfs5/HFIP/hfv3gfs/glopara/com

You don't need to worry about the other COMINs for gempak, we don't run/support gempak outside of WCOSS2. Same goes for PACKAGEROOT/nwpara folder, you can ignore that.

will it be better to have the external dependencies of Global Workflow on /contrib so that they are storage independent?

We don't have access to install on /contrib. We're also considering making the external packages into submodules of g-w develop so that would moot the need to install them.

please provide TC_tracker too

Will do, stay tuned...

@InnocentSouopgui-NOAA
Contributor

@KateFriedman-NOAA , can you add /lfs4/HFIP/hfv3gfs/glopara/git/fv3gfs to the list of data to move?

@KateFriedman-NOAA
Member Author

KateFriedman-NOAA commented Aug 30, 2024

@KateFriedman-NOAA , can you add /lfs4/HFIP/hfv3gfs/glopara/git/fv3gfs to the list of data to move?

@InnocentSouopgui-NOAA What do you need from that location? The fix files were under there, and I moved them to here on /lfs5: /lfs5/HFIP/hfv3gfs/glopara/FIX/fix

Last night I copied /lfs4/HFIP/hfv3gfs/glopara/git to here: /lfs5/HFIP/hfv3gfs/glopara/git_lfs4. This is just for safekeeping while we sort out the new space; I plan to remove this folder when done.
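A safekeeping copy of this kind, followed by a comparison of the two trees, could be sketched as below. The thread does not say which tool was used for the real copy; cp -a and diff -r are assumptions chosen for a small demo, and the directory contents are invented.

```shell
# Throwaway source and destination areas (stand-ins for the /lfs4 and
# /lfs5 glopara spaces; contents are made up for the demo).
src="$(mktemp -d)"
dest_parent="$(mktemp -d)"
mkdir -p "${src}/obsproc/v1.2.0"
echo "demo" > "${src}/obsproc/v1.2.0/README"

# -a preserves permissions, timestamps, and symlinks while recursing.
# Because git_lfs4 does not exist yet, it is created as a copy of src.
cp -a "${src}" "${dest_parent}/git_lfs4"

# diff -r exits 0 only when both trees have the same files and contents.
if diff -r "${src}" "${dest_parent}/git_lfs4" > /dev/null; then
  copy_status="match"
else
  copy_status="differ"
fi
echo "copy check: ${copy_status}"
rm -rf "${src}" "${dest_parent}"
```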

@InnocentSouopgui-NOAA
Contributor

Thank you @KateFriedman-NOAA . I missed the fact that you already relocated fix to /lfs5/HFIP/hfv3gfs/glopara/FIX/fix
It is just what I needed.
Thanks again.

@InnocentSouopgui-NOAA
Contributor

InnocentSouopgui-NOAA commented Sep 5, 2024

I am having a problem with cleanup jobs after a few cycles. After 24 hours (4 cycles of EnKF), all cleanup jobs start failing with the following message:

+ exglobal_cleanup.sh[46]: find_exclude_string+=' -name *prepbufr* -or -name *prepbufr* -or -name *cnvstat* -or -name *atmanl.nc -or'
+ exglobal_cleanup.sh[49]: find_exclude_string=' -name *prepbufr* -or -name *prepbufr* -or -name *cnvstat* -or -name *prepbufr* -or -name *prepbufr* -or -name *cnvstat* -or -name *atmanl.nc '
+ exglobal_cleanup.sh[52]: find /lfs5/NESDIS/nesdis-rdo2/Innocent.Souopgui/com/test46/gdas.20211222/18 -type f -not '(' -name '*prepbufr*' -or -name '*prepbufr*' -or -name '*cnvstat*' -or -name '*prepbufr*' -or -name '*prepbufr*' -or -name '*cnvstat*' -or -name '*atmanl.nc' ')' -delete
find: Failed to save initial working directory: No such file or directory
+ exglobal_cleanup.sh[1]: postamble exglobal_cleanup.sh 1725475572 1
+ preamble.sh[70]: set +x
End exglobal_cleanup.sh at 18:46:15 with error code 1 (time elapsed: 00:00:03)
+ JGLOBAL_CLEANUP[1]: postamble JGLOBAL_CLEANUP 1725475561 1
+ preamble.sh[70]: set +x
End JGLOBAL_CLEANUP at 18:46:15 with error code 1 (time elapsed: 00:00:14)
+ cleanup.sh[1]: postamble cleanup.sh 1725475554 1
+ preamble.sh[70]: set +x
End cleanup.sh at 18:46:15 with error code 1 (time elapsed: 00:00:21)
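One possible reading of the find message, offered as an assumption rather than a diagnosis: find reports that it could not record the directory it was started from, which can happen when the job's working directory has been removed before find runs. The sketch below shows a guard for that situation. safe_find is a hypothetical wrapper, the file names are invented, and this is not the fix from any PR.

```shell
# Hypothetical guard: make sure the shell's working directory still exists
# before calling find, and fall back to a temporary directory if it does not.
safe_find() {
  if [ ! -d "${PWD}" ]; then
    cd "${TMPDIR:-/tmp}" || return 1
  fi
  find "$@"
}

# Demo: remove the working directory out from under the shell, then clean a
# separate target directory while excluding prepbufr-style files.
work="$(mktemp -d)"
target="$(mktemp -d)"
touch "${target}/gdas.t18z.prepbufr" "${target}/gdas.t18z.atmf006.nc"
cd "${work}" && rmdir "${work}"   # the working directory no longer exists

safe_find "${target}" -type f -not '(' -name '*prepbufr*' -or -name '*cnvstat*' -or -name '*atmanl.nc' ')' -delete

remaining="$(ls "${target}")"
echo "${remaining}"
rm -rf "${target}"
```

In the demo, only the prepbufr file survives the cleanup, which matches the intent of the exclude list in the log above.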

@DavidHuber-NOAA
Contributor

I'm working on a fix for this. PR coming shortly.

@InnocentSouopgui-NOAA
Contributor

I'm working on a fix for this. PR coming shortly.

So can we ignore the problem for now, and move on with other testing in the migration?

@DavidHuber-NOAA
Contributor

Yes, I think so.

@DavidHuber-NOAA
Contributor

PR open: #2893

@InnocentSouopgui-NOAA
Contributor

What should we do about verif-global? It still depends on hpc-stack.

@malloryprow Also, it looks for data in spaces on /lfs4. Can you move those data to /lfs5? In particular:

  • /lfs4/HFIP/hfv3gfs/Mallory.Row/archive
  • /lfs4/HFIP/hfv3gfs/Mallory.Row/prepbufr
  • /lfs4/HFIP/hfv3gfs/Mallory.Row/obdata/ccpa_accum24hr

I am opening an issue on verif-global.

@DavidHuber-NOAA
Contributor

The statistics generated by verif-global during the execution of the global-workflow should run without loading hpc-stack modules. If that's not the case for Jet, then something is wrong.

However, the standalone mode still references those modules. The plan is to update verif-global after the installation of spack-stack v1.8.0. Until then, standalone mode requires some sort of manual intervention. This is true on almost all platforms (except S4, IIRC).

@InnocentSouopgui-NOAA
Contributor

@DavidHuber-NOAA
It was very suspicious to me as well. The whole global workflow ran without problems for more than 24 hours at resolutions C96/48 and C192/96.
At resolution C384/192, it ran the first 00Z cycle and failed on the second 00Z cycle. The failing task is gfsmetpg2o1.
That is what prompted me to look around, and I found the warnings for missing files.

I can't figure out why the task gfsmetpg2o1 failed in the first place. When you have a minute, you can check it out at
/lfs5/NESDIS/nesdis-rdo2/Innocent.Souopgui/expe/test46
/lfs5/NESDIS/nesdis-rdo2/Innocent.Souopgui/com/test46

@malloryprow
Contributor

What should we do about verif-global? It still depends on hpc-stack.

@malloryprow Also, it looks for data in spaces on /lfs4. Can you move those data to /lfs5? In particular:

  • /lfs4/HFIP/hfv3gfs/Mallory.Row/archive
  • /lfs4/HFIP/hfv3gfs/Mallory.Row/prepbufr
  • /lfs4/HFIP/hfv3gfs/Mallory.Row/obdata/ccpa_accum24hr

I am opening an issue on verif-global.

I copied over the data.

@InnocentSouopgui-NOAA
Contributor

@KateFriedman-NOAA , don't forget about TC_Tracker, we don't have it yet.
I am using a personal version for all the tests.

@KateFriedman-NOAA
Member Author

@InnocentSouopgui-NOAA Please see the email thread with the tracker folks. I installed a copy of @HananehJafary-NOAA 's branch here on Jet for testing: /lfs5/HFIP/hfv3gfs/glopara/git/TC_tracker/test_tracker

KateFriedman-NOAA added a commit to KateFriedman-NOAA/global-workflow that referenced this issue Sep 17, 2024
@InnocentSouopgui-NOAA
Contributor

@InnocentSouopgui-NOAA Please see the email thread with the tracker folks. I installed a copy of @HananehJafary-NOAA 's branch here on Jet for testing: /lfs5/HFIP/hfv3gfs/glopara/git/TC_tracker/test_tracker

It ran successfully at resolution C96 and C384.

@KateFriedman-NOAA
Member Author

It ran successfully at resolution C96 and C384.

@InnocentSouopgui-NOAA I just spoke with @HananehJafary-NOAA . We are going to stick with the test copy of TC_tracker on Jet for now. She is working on CMake-ing TC_tracker and finishing the spack-stack updates for Orion. Once we get the updated version from her that is CMake'd and supports all of the platforms via spack-stack I'll install it everywhere and move g-w to use it.

So for now for your work, I have renamed the "test_tracker" install to "v1.1.15.7" (/lfs5/HFIP/hfv3gfs/glopara/git/TC_tracker/v1.1.15.7). Please update versions/run.jet.ver to add:

export ens_tracker_ver=v1.1.15.7

Please retest with that change and let me know if it still works on Jet. The other platforms will continue using export ens_tracker_ver=feature-GFSv17_com_reorg as set in versions/spack.ver for now.

5 participants