Commit

Merge remote-tracking branch 'origin/develop' into fix/c768_hera
* origin/develop:
  Stage atmospheric backgrounds and UFS cubed-sphere history files (NOAA-EMC#2792)
  Check that a PR driver is still running before trying to kill it (NOAA-EMC#2799)
  Feature/get arch adds an empty archive job to GEFS system (NOAA-EMC#2772)
  Marine DA updates (NOAA-EMC#2802)
  Revert MSU FIX_DIRs back to glopara (NOAA-EMC#2811)
  Bugfix for updating label states in Jenkins (NOAA-EMC#2808)
  Clean-up temporary rundirs - take 2. (NOAA-EMC#2753)
  Change land surface for HR4 (NOAA-EMC#2787)
  Run METplus serially and correct the name of prod tasks (NOAA-EMC#2804)
  Update Java Agent launching script for Jenkins connections (NOAA-EMC#2762)
  Fix erroneous cdump addition (NOAA-EMC#2803)
  Update ocean post-processing triggers (NOAA-EMC#2784)
  Update the gfs_utils repository hash (NOAA-EMC#2801)
  Add fixes for metplus jobs when gfs_cyc=2 or 4 (NOAA-EMC#2791)
  Simplify resource-related variables, remove CDUMP where unneeded (NOAA-EMC#2727)
  Remove f000 from atmos rocoto tasks for replay cases (NOAA-EMC#2778)
DavidHuber-NOAA committed Aug 9, 2024
2 parents fbd0660 + e2c0f06 commit 17efba2
Showing 170 changed files with 2,438 additions and 2,862 deletions.
11 changes: 11 additions & 0 deletions .github/pull_request_template.md
@@ -33,6 +33,17 @@
# Change characteristics
- Is this a breaking change (a change in existing functionality)? YES/NO
- Does this change require a documentation update? YES/NO
- Does this change require an update to any of the following submodules? YES/NO (If YES, please add a link to any PRs that are pending.)
- [ ] EMC verif-global
- [ ] GDAS
- [ ] GFS-utils
- [ ] GSI
- [ ] GSI-monitor
- [ ] GSI-utils
- [ ] UFS-utils
- [ ] UFS-weather-model
- [ ] wxflow


# How has this been tested?
<!-- Please list any test you conducted, including the machine.
4 changes: 2 additions & 2 deletions ci/Jenkinsfile
@@ -282,10 +282,10 @@ pipeline {
steps {
script {
sh(script: """
labels=\$(gh pr view ${env.CHANGE_ID} --repo ${repo_url} --json labels --jq '.labels[].name')
labels=\$(${GH} pr view ${env.CHANGE_ID} --repo ${repo_url} --json labels --jq '.labels[].name')
for label in \$labels; do
if [[ "\$label" == *"${Machine}"* ]]; then
gh pr edit ${env.CHANGE_ID} --repo ${repo_url} --remove-label "\$label"
${GH} pr edit ${env.CHANGE_ID} --repo ${repo_url} --remove-label "\$label"
fi
done
""", returnStatus: true)
2 changes: 1 addition & 1 deletion ci/cases/gfsv17/ocnanal.yaml
@@ -17,7 +17,7 @@ base:
ACCOUNT: {{ 'HPC_ACCOUNT' | getenv }}

ocnanal:
SOCA_INPUT_FIX_DIR: {{ FIXgfs }}/gdas/soca/1440x1080x75/soca
SOCA_INPUT_FIX_DIR: {{ HOMEgfs }}/fix/gdas/soca/1440x1080x75/soca
SOCA_OBS_LIST: {{ HOMEgfs }}/sorc/gdas.cd/parm/soca/obs/obs_list.yaml
SOCA_NINNER: 100

8 changes: 4 additions & 4 deletions ci/scripts/check_ci.sh
@@ -50,14 +50,14 @@ fi
export GH

rocotostat=$(command -v rocotostat)
if [[ -z ${rocotostat+x} ]]; then
if [[ -z ${rocotostat} ]]; then
echo "rocotostat not found on system"
exit 1
else
echo "rocotostat being used from ${rocotostat}"
fi
rocotocheck=$(command -v rocotocheck)
if [[ -z ${rocotocheck+x} ]]; then
if [[ -z ${rocotocheck} ]]; then
echo "rocotocheck not found on system"
exit 1
else
@@ -70,7 +70,7 @@ pr_list=""
if [[ -f "${pr_list_dbfile}" ]]; then
pr_list=$("${HOMEgfs}/ci/scripts/utils/pr_list_database.py" --dbfile "${pr_list_dbfile}" --list Open Running) || true
fi
if [[ -z "${pr_list+x}" ]]; then
if [[ -z "${pr_list}" ]]; then
echo "no PRs open and ready to run cases on .. exiting"
exit 0
fi
@@ -124,7 +124,7 @@ for pr in ${pr_list}; do

for pslot_dir in "${pr_dir}/RUNTESTS/EXPDIR/"*; do
pslot=$(basename "${pslot_dir}") || true
if [[ -z "${pslot+x}" ]]; then
if [[ -z "${pslot}" ]]; then
echo "No experiments found in ${pslot_dir} .. exiting"
exit 0
fi
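Note on the `${var+x}` changes in this file: the following is a minimal sketch, not code from the commit, of why the old test could never fire for a variable that had already been assigned. The function name `describe_tests` and its output strings are hypothetical and exist only for illustration. `${value+x}` expands to `x` whenever `value` is set, even when it is set to the empty string, so `-z "${value+x}"` is only true for a variable that was never assigned.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reports how the old and new tests classify a value.
describe_tests() {
  local value="$1"   # always set inside the function, possibly to ""
  local old_test="non-empty" new_test="non-empty"

  # Old form: true only if 'value' is unset, which cannot happen here.
  if [[ -z "${value+x}" ]]; then old_test="empty"; fi

  # New form: true if 'value' is unset or set to the empty string.
  if [[ -z "${value}" ]]; then new_test="empty"; fi

  echo "old=${old_test} new=${new_test}"
}

describe_tests ""      # an empty result, e.g. from a failed $(command -v ...)
describe_tests "123"   # a non-empty result
```

Because `rocotostat=$(command -v rocotostat)` always assigns the variable, the old check would presumably have let a missing executable pass silently; the new `-z ${rocotostat}` form catches the empty string.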
11 changes: 7 additions & 4 deletions ci/scripts/driver.sh
@@ -77,8 +77,9 @@ pr_list=$(${GH} pr list --repo "${REPO_URL}" --label "CI-${MACHINE_ID^}-Ready" -

for pr in ${pr_list}; do
pr_dir="${GFS_CI_ROOT}/PR/${pr}"
[[ ! -d ${pr_dir} ]] && mkdir -p "${pr_dir}"
db_list=$("${ROOT_DIR}/ci/scripts/utils/pr_list_database.py" --add_pr "${pr}" --dbfile "${pr_list_dbfile}")
output_ci_single="${GFS_CI_ROOT}/PR/${pr}/output_single.log"
output_ci_single="${pr_dir}/output_single.log"
#############################################################
# Check if a Ready labeled PR has changed back from once set
# and in that case completely kill the previous driver.sh cron
@@ -107,7 +108,9 @@ for pr in ${pr_list}; do
echo -e "${pstree_out}" | grep -Pow "(?<=\()[0-9]+(?=\))" | xargs kill
fi
else
ssh "${driver_HOST}" 'pstree -A -p "${driver_PID}" | grep -Eow "[0-9]+" | xargs kill'
# Check if the driver is still running on the head node; if so, kill it and all child processes
#shellcheck disable=SC2029
ssh "${driver_HOST}" "pstree -A -p \"${driver_PID}\" | grep -Eow \"[0-9]+\" | xargs kill || echo \"Failed to kill process with PID: ${driver_PID}, it may not be valid.\""
fi
{
echo "Driver PID: Requested termination of ${driver_PID} and children on ${driver_HOST}"
@@ -141,7 +144,7 @@ pr_list=""
if [[ -f "${pr_list_dbfile}" ]]; then
pr_list=$("${ROOT_DIR}/ci/scripts/utils/pr_list_database.py" --dbfile "${pr_list_dbfile}" --list Open Ready) || true
fi
if [[ -z "${pr_list+x}" ]]; then
if [[ -z "${pr_list}" ]]; then
echo "no PRs open and ready for checkout/build .. exiting"
exit 0
fi
Expand All @@ -155,7 +158,7 @@ fi
for pr in ${pr_list}; do
# Skip pr's that are currently Building for when overlapping driver scripts are being called from within cron
pr_building=$("${ROOT_DIR}/ci/scripts/utils/pr_list_database.py" --display "${pr}" --dbfile "${pr_list_dbfile}" | grep Building) || true
if [[ -z "${pr_building+x}" ]]; then
if [[ -n "${pr_building}" ]]; then
continue
fi
id=$("${GH}" pr view "${pr}" --repo "${REPO_URL}" --json id --jq '.id')
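Note on the `ssh` quoting change in this file: the sketch below, which is not part of the commit, illustrates the difference between the single-quoted and double-quoted command strings. It uses `bash -c` as a local stand-in for `ssh "${driver_HOST}"`, on the assumption that both simply hand a command string to a fresh shell. The PID value and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative value; assigned (not exported) in the calling shell only.
driver_PID=4242

run_single_quoted() {
  # Single quotes: ${driver_PID} is passed literally and expanded by the
  # child shell, where the variable does not exist.
  bash -c 'echo "pid=[${driver_PID}]"'
}

run_double_quoted() {
  # Double quotes: the calling shell expands ${driver_PID} first, so the
  # child shell receives the actual number.
  bash -c "echo \"pid=[${driver_PID}]\""
}

run_single_quoted
run_double_quoted
```

With the old single-quoted form, `pstree -A -p` would likely have received an empty PID on the remote host. The new double-quoted form expands the PID locally, which is what ShellCheck's SC2029 warns about and why the directive disables that warning deliberately.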
19 changes: 9 additions & 10 deletions ci/scripts/tests/test_rocotostat.py
@@ -25,17 +25,16 @@
database_destination = os.path.join(testdata_full_path, 'database.db')
wget.download(database_url, database_destination)

try:
rocotostat = which('rocotostat')
except CommandNotFoundError:
rocotostat_cmd = which('rocotostat')
if not rocotostat_cmd:
raise CommandNotFoundError("rocotostat not found in PATH")

rocotostat.add_default_arg(['-w', os.path.join(testdata_path, 'workflow.xml'), '-d', os.path.join(testdata_path, 'database.db')])
rocotostat_cmd.add_default_arg(['-w', os.path.join(testdata_path, 'workflow.xml'), '-d', os.path.join(testdata_path, 'database.db')])


def test_rocoto_statcount():

result = rocoto_statcount(rocotostat)
result = rocoto_statcount(rocotostat_cmd)

assert result['SUCCEEDED'] == 20
assert result['FAIL'] == 0
@@ -47,15 +46,15 @@ def test_rocoto_statcount():

def test_rocoto_summary():

result = rocotostat_summary(rocotostat)
result = rocotostat_summary(rocotostat_cmd)

assert result['CYCLES_TOTAL'] == 1
assert result['CYCLES_DONE'] == 1


def test_rocoto_done():

result = rocotostat_summary(rocotostat)
result = rocotostat_summary(rocotostat_cmd)

assert is_done(result)

@@ -79,10 +78,10 @@ def test_rocoto_stalled():
database_destination = os.path.join(testdata_full_path, 'stalled.db')
wget.download(database_url, database_destination)

rocotostat = which('rocotostat')
rocotostat.add_default_arg(['-w', xml, '-d', db])
rocotostat_cmd = which('rocotostat')
rocotostat_cmd.add_default_arg(['-w', xml, '-d', db])

result = rocoto_statcount(rocotostat)
result = rocoto_statcount(rocotostat_cmd)

assert result['SUCCEEDED'] == 11
assert is_stalled(result)
143 changes: 125 additions & 18 deletions ci/scripts/utils/launch_java_agent.sh
@@ -1,8 +1,70 @@
#!/bin/env bash

set -e

# ==============================================================================
# Script Name: launch_java_agent.sh
#
# Description:
# This script automates the process of launching a Jenkins agent
# on a specified machine. It ensures that the necessary
# prerequisites are met, such as the availability of JAVA_HOME,
# the Jenkins agent launch directory, and proper authentication
# with GitHub.
#
# It then proceeds to check if the Jenkins node is online and
# decides whether to launch the Jenkins agent based on the node's
# status. The agent is launched in the background,
# and its PID is logged for reference.
#
# Prerequisites:
# JAVA_HOME must be set to a valid JDK installation.
# Jenkins agent launch directory must exist and be specified.
# GitHub CLI (gh) must be installed and authenticated for messaging
# from the Jenkins controller to GitHub PR via shell commands.
# Jenkins agent launch directory must exist and be specified.
# TODO: Must use GitHub CLI v2.25.1 (newer versions have issues)
# https://github.com/cli/cli/releases/download/v2.25.1/gh_2.25.1_linux_amd64.tar.gz
# Jenkins controller URL and authentication token must be provided.
# jenkins-secret-file:
# Must be present in the Jenkins agent launch directory.
# This file contains the secret key for the Jenkins agent
# established by the Jenkins administrator for each Node.
# jenkins_token:
# Must be present in the Jenkins agent launch directory.
# This file contains the user authentication token for the Jenkins controller
# to use the Remote API. This token can be generated by the user
# on the Jenkins controller.
# controller_user:
# Must be set to the Jenkins controller username corresponding to the jenkins_token.
#
# Usage: ./launch_java_agent.sh [-n] [-f]
# The optional '-n' (now) flag makes the script launch the Jenkins
# agent without waiting before trying again.
# The optional '-f' (force) flag makes the script launch the Jenkins agent regardless of the node status.
#
# ==============================================================================

force_launch="False"
skip_wait="False"
while getopts ":fnh" flag; do
case "${flag}" in
f) force_launch="True";;
n) skip_wait="True";;
h) echo "Usage: ./launch_java_agent.sh [-n] [-f]
Two mutually exclusive optional arguments:
-n (now) causes the script to launch the Jenkins agent without waiting before trying again.
-f (force) forces the script to launch the Jenkins agent regardless of its connection status."
exit 0 ;;
*) echo "Unknown flag: ${flag}"
exit 1;;
esac
done

controller_url="https://jenkins.epic.oarcloud.noaa.gov"
controller_user="terry.mcguinness"
controller_user=${controller_user:-"terry.mcguinness"}
controller_user_auth_token="jenkins_token"

HOMEgfs="$(cd "$(dirname "${BASH_SOURCE[0]}")/../../.." >/dev/null 2>&1 && pwd )"
host=$(hostname)

@@ -13,12 +75,10 @@ host=$(hostname)
source "${HOMEgfs}/ush/detect_machine.sh"
case ${MACHINE_ID} in
hera | orion | hercules | wcoss2)
echo "Launch Jenkins Java Controler on ${MACHINE_ID}"
;;
echo "Launch Jenkins Java Controler on ${MACHINE_ID}";;
*)
echo "Unsupported platform. Exiting with error."
exit 1
;;
exit 1;;
esac

LOG=lanuched_agent-$(date +%Y%m%d%M).log
@@ -43,9 +103,16 @@ echo "JAVA VERSION: "
${JAVA} -version

export GH="${HOME}/bin/gh"
command -v "${GH}"
[[ -f "${GH}" ]] || echo "gh is not installed in ${HOME}/bin"
${GH} --version

check_mark=$(gh auth status -t 2>&1 | grep "Token:" | awk '{print $1}') || true
if [[ "${check_mark}" != "✓" ]]; then
echo "gh not authenticating with emcbot token"
exit 1
fi
echo "gh authenticating with emcbot TOKEN ok"

if [[ -d "${JENKINS_AGENT_LANUCH_DIR}" ]]; then
echo "Jenkins Agent Lanuch Directory: ${JENKINS_AGENT_LANUCH_DIR}"
else
@@ -56,22 +123,62 @@ cd "${JENKINS_AGENT_LANUCH_DIR}"

if ! [[ -f agent.jar ]]; then
curl -sO "${controller_url}/jnlpJars/agent.jar"
echo "Updated agent.jar downloaded"
fi

if [[ ! -f "${controller_user_auth_token}" ]]; then
echo "User Jenkins authentication TOKEN to the controller for using the Remote API does not exist"
exit 1
fi
JENKINS_TOKEN=$(cat "${controller_user_auth_token}")

cat << EOF > parse.py
#!/usr/bin/env python3
import json,sys
with open(sys.argv[1], 'r') as file:
data = json.load(file)
print(data.get('offline','True'))
EOF
chmod u+x parse.py

JENKINS_TOKEN=$(cat jenkins_token)
check_node_online() {
rm -f curl_response
curl_response=$(curl --silent -u "${controller_user}:${JENKINS_TOKEN}" "${controller_url}/computer/${MACHINE_ID^}-EMC/api/json?pretty=true") || true
if [[ "${curl_response}" == "" ]]; then
echo "ERROR: Jenkins controller not reachable. Exiting with error."
exit 1
fi
echo -n "${curl_response}" > curl_response
./parse.py curl_response
}

lauch_agent () {
echo "Launching Jenkins Agent on ${host}"
command="nohup ${JAVA} -jar agent.jar -jnlpUrl ${controller_url}/computer/${MACHINE_ID^}-EMC/jenkins-agent.jnlp -secret @jenkins-secret-file -workDir ${JENKINS_WORK_DIR}"
echo -e "Launching Jenkins Agent on ${host} with the command:\n${command}" >& "${LOG}"
${command} >> "${LOG}" 2>&1 &
nohup_PID=$!
echo "Java agent running on PID: ${nohup_PID}" >> "${LOG}" 2>&1
}

if [[ "${force_launch}" == "True" ]]; then
lauch_agent
exit
fi

#
offline=$(curl --silent -u "${controller_user}:${JENKINS_TOKEN}" "${controller_url}/computer/${MACHINE_ID^}-EMC/api/json?pretty=true" | grep '\"offline\"' | awk '{gsub(/,/,"");print $3}') || true
echo "Jenkins Agent offline setting: ${offline}"
offline=$(set -e; check_node_online)

if [[ "${offline}" == "true" ]]; then
echo "Jenkins Agent is offline. Lanuching Jenkins Agent on ${host}"
command="nohup ${JAVA} -jar agent.jar -jnlpUrl ${controller_url}/computer/${MACHINE_ID^}-EMC/jenkins-agent.jnlp -secret @jenkins-secret-file -workDir ${JENKINS_WORK_DIR}"
echo -e "Lanuching Jenkins Agent on ${host} with the command:\n${command}" >& "${LOG}"
${command} >> "${LOG}" 2>&1 &
nohup_PID=$!
echo "Java agent running on PID: ${nohup_PID}" >> "${LOG}" 2>&1
echo "Java agent running on PID: ${nohup_PID}"
if [[ "${offline}" != "False" ]]; then
if [[ "${skip_wait}" != "True" ]]; then
echo "Jenkins Agent is offline. Waiting 5 more minutes to check again in the event it is a temp network issue"
sleep 300
offline=$(set -e; check_node_online)
fi
if [[ "${offline}" != "False" ]]; then
lauch_agent
else
echo "Jenkins Agent is online (nothing done)"
fi
else
echo "Jenkins Agent is online (nothing done)"
fi
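Note on the flag handling in this file: the following sketch, not code from the commit, shows how the `getopts ":fnh"` loop behaves for each kind of argument. The wrapper function `parse_flags` and its output format are hypothetical. It assumes the same option string as the script. The leading `:` puts `getopts` into silent mode, in which an unknown option sets the loop variable to `?` and stores the offending character in `OPTARG`.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the same option string used by the script.
parse_flags() {
  local force_launch="False" skip_wait="False" flag
  local OPTIND=1   # reset so the function can be called repeatedly

  while getopts ":fnh" flag; do
    case "${flag}" in
      f) force_launch="True" ;;
      n) skip_wait="True" ;;
      h) echo "Usage: parse_flags [-n] [-f]"; return 0 ;;
      # In silent mode the unknown character is in OPTARG; flag is "?".
      *) echo "Unknown flag: -${OPTARG}"; return 1 ;;
    esac
  done

  echo "force_launch=${force_launch} skip_wait=${skip_wait}"
}
```

Two consequences follow from standard `getopts` behaviour. A bare word such as `now` is not an option, so parsing stops and both settings stay `False`; only `-n` and `-f` have an effect. Printing `${flag}` in the fallback branch, as the script does, would show `?` rather than the character the user typed, which is why this sketch prints `${OPTARG}` instead.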
4 changes: 2 additions & 2 deletions docs/doxygen/mainpage.h
@@ -24,7 +24,7 @@ To setup an experiment, a python script <b>\c setup_expt.py</b> (located in <b>\
usage: setup_expt.py [-h] --pslot PSLOT
[--configdir CONFIGDIR] [--idate IDATE] [--icsdir ICSDIR]
[--resdetatmos RESDET] [--resensatmos RESENS] [--comroot COMROOT]
[--expdir EXPDIR] [--nens NENS] [--cdump CDUMP]
[--expdir EXPDIR] [--nens NENS] [--run RUN]
Setup files and directories to start a GFS parallel. Create EXPDIR, copy
config files Create ROTDIR experiment directory structure, link initial
@@ -52,7 +52,7 @@ To setup an experiment, a python script <b>\c setup_expt.py</b> (located in <b>\
(default: None)
--nens number of ensemble members
(default: 80)
--cdump CDUMP to start the experiment
--run RUN to start the experiment
(default: gdas)
The above script creates directories <b>\c EXPDIR</b> and <b>\c ROTDIR</b>. It will make links for initial conditions from a location provided via the <b>\c --icsdir</b> argument for a chosen resolution for the control <b>\c --resdetatmos</b> and the ensemble <b>\c --resensatmos</b>. Experiment name is controlled by the input argument <b>\c --pslot</b>. The script will ask user input in case any of the directories already exist. It will copy experiment configuration files into the <b>\c EXPDIR</b> from <b>\c CONFIGDIR</b>.
2 changes: 1 addition & 1 deletion docs/source/init.rst
@@ -384,7 +384,7 @@ The warm starts and other output from production are at C768 deterministic and C
What files should you pull for starting a new experiment with warm starts from production?
------------------------------------------------------------------------------------------

That depends on what mode you want to run -- forecast-only or cycled. Whichever mode, navigate to the top of your ``ROTDIR`` and pull the entirety of the tarball(s) listed below for your mode. The files within the tarball are already in the ``$CDUMP.$PDY/$CYC/$ATMOS`` folder format expected by the system.
That depends on what mode you want to run -- forecast-only or cycled. Whichever mode, navigate to the top of your ``ROTDIR`` and pull the entirety of the tarball(s) listed below for your mode. The files within the tarball are already in the ``$RUN.$PDY/$CYC/$ATMOS`` folder format expected by the system.

For forecast-only there are two tarballs to pull

4 changes: 2 additions & 2 deletions docs/source/setup.rst
@@ -145,7 +145,7 @@ The following command examples include variables for reference but users should

cd workflow
./setup_expt.py gfs cycled --idate $IDATE --edate $EDATE [--app $APP] [--start $START] [--gfs_cyc $GFS_CYC]
[--resdetatmos $RESDETATMOS] [--resdetocean $RESDETOCEAN] [--resensatmos $RESENSATMOS] [--nens $NENS] [--cdump $CDUMP]
[--resdetatmos $RESDETATMOS] [--resdetocean $RESDETOCEAN] [--resensatmos $RESENSATMOS] [--nens $NENS] [--run $RUN]
[--pslot $PSLOT] [--configdir $CONFIGDIR] [--comroot $COMROOT] [--expdir $EXPDIR] [--icsdir $ICSDIR]

where:
@@ -170,7 +170,7 @@ where:
* ``$RESDETOCEAN`` is the resolution of the ocean component of the deterministic forecast [default: 0.; determined based on atmosphere resolution]
* ``$RESENSATMOS`` is the resolution of the atmosphere component of the ensemble forecast [default: 192]
* ``$NENS`` is the number of ensemble members [default: 20]
* ``$CDUMP`` is the starting phase [default: gdas]
* ``$RUN`` is the starting phase [default: gdas]
* ``$PSLOT`` is the name of your experiment [default: test]
* ``$CONFIGDIR`` is the path to the config folder under the copy of the system you're using [default: $TOP_OF_CLONE/parm/config/]
* ``$COMROOT`` is the path to your experiment output directory. Your ``ROTDIR`` (rotating com directory) will be created using ``COMROOT`` and ``PSLOT``. [default: $HOME]