Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[production/AQM.v7] Add ecFlow option to production branch for AQM #821

Merged
merged 27 commits into from
Jun 8, 2023

Conversation

chan-hoo
Copy link
Collaborator

@chan-hoo chan-hoo commented Jun 4, 2023

DESCRIPTION OF CHANGES:

  • Since PR [production/AQM.v7] Code merge for ecflow updates into the production/AQM.v7 #cf184119 #812 does not follow the structure of the current workflow, this PR integrates ecFlow scripts into the production branch separately.
  • To follow the NCO standards, the NCO variables such as COMROOT, DCOMROOT, OPSROOT, etc defined in config_default.yaml are renamed with the suffix _dfv (default value) and these variables are defined in job cards (/include/envir-p1.h).
  • DCOMIN variables can also be defined in envir-p1.h and they will replace the default values defined in ush/machine/wcoss2.
  • KEEPDATA, which is defined in /include/head.h is changed to a TRUE/FALSE flag in envir-p1.h to meet the standard in the current workflow.
  • In job cards, the current script /ush/load_modules_run_task.sh is used to load the necessary modules for each task as the rocoto workflow does in order for us to manage the module list in one place (/modulefiles/tasks/wcoss2).
  • The ecFlow scripts are automatically generated from their templates located in parm/ecflow to the home directory to follow the structure in /include/head.h when running `python3 generate_FV3LAM_wflow.py (Step 7 in my document below).
  • The quick start guide for both rocoto and ecFlow can be found here (Section 1.3, pp15 (rocoto), pp17 (ecflow)): https://drive.google.com/file/d/1zyCvHmuL_LWx0Yto6B1GXF04m-n6Kbs2/view?usp=drive_link

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

  • The results of rocoto and ecFlow are bit-to-bit identical.
  • hera.intel
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)

ISSUE:

Fixes issue mentioned in #797

CONTRIBUTORS:

@lgannoaa

@chan-hoo chan-hoo linked an issue Jun 5, 2023 that may be closed by this pull request
@chan-hoo chan-hoo marked this pull request as ready for review June 6, 2023 15:34
@chan-hoo
Copy link
Collaborator Author

chan-hoo commented Jun 6, 2023

@lgannoaa, I got the following error in the 'aqm_manager' task:

++ exaqm_manager.sh[81]: ls '/lfs/h2/emc/ptmp/chan-hoo.jeon/ecflow_aqm/para/com/aqm/v7.0/aqm_nco_aqmna13km.20230605/00/RESTART/*coupler.res'
ls: cannot access '/lfs/h2/emc/ptmp/chan-hoo.jeon/ecflow_aqm/para/com/aqm/v7.0/aqm_nco_aqmna13km.20230605/00/RESTART/*coupler.res': No such file or directory
+ exaqm_manager.sh[81]: aqm_forecast_RESTART_file=0

Please let me know how to fix this.

@chan-hoo
Copy link
Collaborator Author

chan-hoo commented Jun 6, 2023

@lgannoaa, @JianpingHuang-NOAA, as described above, you can find the quick start guide here (Ch 1.3 , pp 15): https://drive.google.com/file/d/1zyCvHmuL_LWx0Yto6B1GXF04m-n6Kbs2/view?usp=drive_link
If you have any problems or any idea to improve it, please let me know.

@chan-hoo
Copy link
Collaborator Author

chan-hoo commented Jun 6, 2023

I compared the results of 'dyn/phy_006' in the first cycle ('00') by rocoto and ecFlow. They are bit-to-bit identical.

@lgannoaa
Copy link

lgannoaa commented Jun 6, 2023

I recall management's guidance is for @chan-hoo to learn how to test AQM ecflow for merge PR #812
(1) Why created this PR 821 without merge PR 812 first?
(2) The COM directory name used in this PR 821 is not following NCO guidance in COM directory naming.
aqm_nco_aqmna13km.20230605/00
(3) We prefer the developer to run cycled parallel to test the AQM system to ensure quality and reproducibility. This PR only did a one cycle test. According to the developer, only one file "phy_006" in DATA directory was used to compare. It is not sufficient.

Please see the recommendation of the action below:
(1) Work with Jianping and Lin to finish PR #812
(2) Finish merge PR #812 task.
(3) Establish a testing platform using the AQM package merged with PR #812
(4) Test any new enhancement/modification this testing platform before PR approval.

@lgannoaa
Copy link

lgannoaa commented Jun 6, 2023

@lgannoaa, I got the following error in the 'aqm_manager' task:

++ exaqm_manager.sh[81]: ls '/lfs/h2/emc/ptmp/chan-hoo.jeon/ecflow_aqm/para/com/aqm/v7.0/aqm_nco_aqmna13km.20230605/00/RESTART/*coupler.res'
ls: cannot access '/lfs/h2/emc/ptmp/chan-hoo.jeon/ecflow_aqm/para/com/aqm/v7.0/aqm_nco_aqmna13km.20230605/00/RESTART/*coupler.res': No such file or directory
+ exaqm_manager.sh[81]: aqm_forecast_RESTART_file=0

Please let me know how to fix this.

This cycle 20230605/00 is the starting cycle of the run. aqm_manager job should not be run for this cycle. Please enter the following CMD:
ecflow_client --force=complete /emc_aqm/primary/00/aqm/v1.0/aqm_manager/aqm_manager
ecflow_client --force=complete /emc_aqm/primary/00/aqm/v1.0/aqm_manager/data_cleanup
ecflow_client --alter change event release_next_cycle set /emc_aqm/primary/00/aqm/v1.0/aqm_manager/aqm_manager

@chan-hoo
Copy link
Collaborator Author

chan-hoo commented Jun 6, 2023

@lgannoaa, (1) As mentioned above, PR 812 ignores the structure of the current UFS SRW App workflow. It just put the ecFlow scripts into the existing (pre-generated) workflow. Whenever a new workflow is generated for a different experiment, the scripts should be modified manually. (2) The COM directory name is based on the default value of "RUN" in the configuration file (config.yaml). If you change it in the configuration file, it will be changed to what you want. As can be seen in Table 1 (pp.4) of the NCO standards, COMIN=$COMROOT/$NET/$model_ver/$RUN.$PDY/$cyc. In the sample script, RUN="aqm_nco_aqmna13km". (3) I agree. You are the developer of the 'aqm_manager' task script. So I am asking for you to take a look at the error message.

@lgannoaa
Copy link

lgannoaa commented Jun 6, 2023

(1-1) It just put the ecFlow scripts into the existing (pre-generated) workflow.
The answer is: ecflow script is under ecf directory within the AQM package. This is a static location. This kind of solution is the same for all model running in NCO.
(1-2) Whenever a new workflow is generated for a different experiment, the scripts should be modified manually.
The answer is: If AQM ecflow is to be used for development. AQM workflow developer should modify experiment control generator python script to create the ecflow experiment configuration file when generate a new workflow for developer. For production, NCO use static experiment configuration file. This approach is the same with other model implementation.
(2) If COM can be configured, why not configure it to follow the NCO standard during testing review?
(3) You have agreed the (3) Establish a testing platform using the AQM package merged with PR #812
Please proceed with finish PR 812.

@chan-hoo
Copy link
Collaborator Author

chan-hoo commented Jun 7, 2023

@lgannoaa, I've fixed the issue on the aqm_manager script. It is working well now. I am running 1day-4cycle test again (with RUN=aqm).

@chan-hoo
Copy link
Collaborator Author

chan-hoo commented Jun 7, 2023

@lgannoaa, I am sorry if you feel like this PR ignores your PR. It is not at all. Your PR was really helpful for me to understand the structure of ecFlow and how it works. My concern is that your PR is too case-specific and it requires for a user to modify the scripts manually for his/her own run. I think that the other reviewers of UFS SRW App will not approve your approach when I add it to the develop branch. The next versions of AQM will continue to use the UFS SRW App. Therefore, we should follow the structure of the UFS SRW App even for ecFlow. I believe this PR follows the NCO standards as well. Please test this PR on your end and let me know if you find any issues.

@lgannoaa
Copy link

lgannoaa commented Jun 7, 2023

@chan-hoo, PR #812 represents our teamwork. @JianpingHuang-NOAA , Kai Wang and I took time to develop/fix issues in the AQM package. It was tested on both the rocoto workflow from Jianping and ecflow workflow from me. Jianping reviewed and approved the PR 812.

Management requested you to freeze the code change from the AQM package until the ecflow change (PR 812) is merged. You reported that you were not familiar with the ecflow workflow. Therefore, management asked me to teach you ecflow workflow and give you guidance for testing PR 812 in ecflow in order to merge the PR 812.
After you learned how to run ecflow workflow, you did not proceed with the management's assignment. Instead, you created a PR 821 and abandoned the PR 812.

If you want the package to generate an ecflow experiment control file automatically, you may work on adding this function to the Python utility. This utility currently generates only experiment control files for rocoto workflow. Therefore, it is in-line with the design to include ecflow function here. This isolated work can be done after merge ecflow PR 812. The work should not impact ecflow workflow in PR 812. In addition, this kind of approach will not work with NCO implementation because the experiment control files in production can be configured by NCO in static form. It is against NCO production policy to remove or modify it automatically.

Management have indicated in the ecflow merge meeting that any new change or enhancement can be done after the PR 812 have merged. Management also indicated that all change should be reviewed and approved by Jianping before merge for this implementation.

Any change request should be handled by github issue. Please proceed to merge PR 812. We have a few issue requests awaiting for merge PR 812 to be completed to submit. When can you finish merge work?

@lgannoaa
Copy link

lgannoaa commented Jun 7, 2023

@JianpingHuang-NOAA
Our PR 812 followed management assignment to:

  • Merge ecflow change into a code freeze version of the AQM package (production/AQM.v7 #cf1841).
  • Tested with both rocoto and ecflow workflow on cycled run and found identical.

Chan-Hoo introduced a new PR 821 by copy ecflow workflow source code from PR 812 and modified the workflow control mechanism. He did not proceed with merging PR 812.

I spot checked PR 821 using Chan-Hoo testing run location indicated below and found it:
[1]Contain error in job log file, and DATA directory naming assignment.
[2]Contain incorrect Experiment directory design.
[3]Contain incorrect forecast restart assignment.
[4]Contain out-of-scope modification to the code-freezed AQM package.
[5]PR submitted without full cycled test as reported above this PR comment.
My recommendation is following the management's direction to proceed with merging PR 812. You are the AQM implementation manager, would you please help to get management's decision on how we should proceed?

HOMEaqm: /lfs/h2/emc/lam/save/chan-hoo.jeon/prod_ecflow_test/ufs-srweather-appjob
log: /lfs/h2/emc/ptmp/chan-hoo.jeon/ecflow_aqm/para/output
expdir: /lfs/h2/emc/lam/save/chan-hoo.jeon/prod_ecflow_test/expt_dirs/aqm_ecflow_test
data: /lfs/h2/emc/ptmp/chan-hoo.jeon/ecflow_aqm/para/tmp

PR 821 review:
(1-1) Four job log files do not follow NCO production job log naming standard. There are:
get_extrn_ics, get_extrn_lbcs, nexus_gfs_sfc, and run_fcst
All jobs naming in PR 812 followed NCO standard.
(1-2) Four job DATA file directories do not follow NCO production naming standard. There are:
get_extrn_ics, get_extrn_lbcs, nexus_gfs_sfc, and run_fcst
All job DATA directories naming in PR 812 followed NCO standard.
(2) Experiment directory located in developer's personal directory. This is not going to work with NCO implementation.
PR 812 followed the NCO standard located within the AQM package under parm/config. Similar approach has been accepted in GFS implementation.
(3) Experiment directory contains rocoto workflow control files. This is not acceptable for NCO production. NCO production does not run the rocoto workflow.
PR 812 workflow control file does not contain rocoto workflow control files.
(4) ecflow workflow contain incorrect setup resulting unused empty directories structure "ecflow" under the Experiment directory:
---
cd /lfs/h2/emc/lam/save/chan-hoo.jeon/prod_ecflow_test/expt_dirs/aqm_ecflow_test/ecflow
ll
total 24K
drwxr-sr-x 2 chan-hoo.jeon lam 4.0K Jun 7 10:11 output
drwxr-sr-x 2 chan-hoo.jeon lam 4.0K Jun 7 10:11 lsf
drwxr-sr-x 2 chan-hoo.jeon lam 4.0K Jun 7 10:11 data
drwxr-sr-x 2 chan-hoo.jeon lam 4.0K Jun 7 10:11 com
find -type d
.
./lsf
./com
./data
./output
find -type f|wc -l
0
---
PR 812 does not have this error.
(5) var_defns.sh within the Experiment directory contains hardwired path to developer's personal location such as HOMEaqm, COMIN, COMOUT, DATAROOT. It does not follow implementation standards.
PR 812 Used relative directory assignment where the HOMEaqm, COMIN, and COMOUT value is assigned from ecflow workflow.
(6) WARMSTART_CYCLE_DIR variable in Experiment configuration file hardwired WARMSTART_CYCLE_DIR to developer's personal location. This approach is incorrect. First, there should be no hardwired directory naming point to the developer's personal location. Second, WARMSTART_CYCLE_DIR should not be needed. The job should look up the RESTART directory using workflow level COM location assignment. Third, the use of WARMSTART_CYCLE_DIR can be removed.
PR 812 does not need to use WARMSTART_CYCLE_DIR.
(7) Management requested AQM package freeze waiting for PR 812 to merge. However, the post job was modified to access forecast output from COM instead of from forecast DATA location. This modification should not be introduced at the current stage. Also, the fact that post job still reference to DATAFCST indicated the modification is not fully completed/tested.

@chan-hoo
Copy link
Collaborator Author

chan-hoo commented Jun 8, 2023

@lgannoaa,
(0) I'd like to focus on adding ecFlow to production branch in this PR. The work by @JianpingHuang-NOAA and Kai for the bias-correction scripts can be added in another PR. I don't want to discard their work at all either.
(1) I can't find any guidance for log file naming in the NCO standards. Please point me to the page of the NCO standards. (In your PR: aqm_run_fcst_00, in this PR: run_fcst).
(2) I think you don't understand how the workflow (AQMv.7/UFS-SRW-App) works yet. The location of EXPTDIR is defined in the configuration file (config.yaml). In your PR, you put a pre-generated var_defns.sh into 'PARM' directory. This approach is not acceptable because it ignores the workflow generation step in the workflow.
(3) Which files do you mean? I can't see any rocoto control files in my EXPTDIR.
(4) As you may know, this is not a workflow issue but an environment setup issue. Please correct me how I can set these environment variables.
(5) Similarly to (2), this is because I tested this PR in my directory and it is set in the 'workflow generation' step with 'config.yaml'. If you generate a workflow in an operation place, it will be changed.
(6) WARMSTART_CYCLE_DIR is defined in config.yaml, too. Moreover, it is not used for ecFlow in this PR either. It is used only for Rocoto.
(7) If it is a requirement by NCO. this can be modified easily.

@arunchawla-NOAA, @MatthewPyle-NOAA, @aerorahul, I'd like to hear your opinions on this PR.

@lgannoaa
Copy link

lgannoaa commented Jun 8, 2023

(0) @JianpingHuang-NOAA has already approved the PR 812. Both Jianping and Kai have already tested PR 812. Please merge PR 812 first.
(1) The log and DATA naming in PR 812 should not be changed because you can not find guidance.
(2) NCO uses a static workflow control file; that will be var_defns.sh. Using config.yaml to dynamically change var_defns.sh will not be acceptable for NCO implementation. This point has been communicated several times.
(3) The var_defns.sh is the only file that needs to be in the workflow control directory. The extra file used to dynamically generate the rocoto workflow should not be found in the workflow control directory. The forecast control files and name list should not be found in the workflow control directory.
(5)(6)(7) indicates this PR 821 is not ready.

@arunchawla-NOAA
Copy link

If this PR is able to run with ecflow and is not breaking the SRW app structure then please proceed to merge this PR to ensure ecflow capabilty is available and move on to the other tasks related to EE2 and science requirements. I do not want this to be dragged on anymore. Sorry I could not comment earlier.

@chan-hoo
Copy link
Collaborator Author

chan-hoo commented Jun 8, 2023

@arunchawla-NOAA, I am testing the ecflow option again now. Once it is complete (approx. in 2 hours), I'll merge this PR.

@chan-hoo
Copy link
Collaborator Author

chan-hoo commented Jun 8, 2023

@arunchawla-NOAA, @MatthewPyle-NOAA, @aerorahul, @BenjaminBlake-NOAA, @RatkoVasic-NOAA, @lgannoaa, @JianpingHuang-NOAA, I need at least one approval to merge this PR. Please approve this.

@RatkoVasic-NOAA
Copy link
Collaborator

@chan-hoo in nexus/*.ecf, shouldn't wallclock values be also parameters?

Copy link
Contributor

@aerorahul aerorahul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can give you an approval for merging, but it should come from the right person. I am not that.

@chan-hoo
Copy link
Collaborator Author

chan-hoo commented Jun 8, 2023

@RatkoVasic-NOAA, Your're right. It can be. I didn't have enough time to parameterize all the configuration variables for this production branch. I had to make this change in a week. As a quick solution, I just did it for the critical parameters to run the scripts. I'll parameterize all variables necessary for the ecf scripts in the develop branch later.

@chan-hoo
Copy link
Collaborator Author

chan-hoo commented Jun 8, 2023

@aerorahul @RatkoVasic-NOAA, thank you for your approvals!

@chan-hoo
Copy link
Collaborator Author

chan-hoo commented Jun 8, 2023

The results of ecflow are bit-to-bit identical to those of rocoto. Merging now.

@chan-hoo chan-hoo merged commit cac37b9 into ufs-community:production/AQM.v7 Jun 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants