This document provides a comprehensive guide for setting up and running ERSAP workflows in JIRIAF. It covers the following key aspects:
- Project Identification: Defining a unique project ID for your workflow.
- Prometheus Setup: Deploying a Prometheus instance for monitoring using Helm.
- Workflow Deployment: Deploying workflows on EJFAT nodes and SLURM NERSC-ORNL nodes using Helm charts.
- JRM Setup on Local EJFAT Nodes: Setting up JRM on EJFAT nodes.
Before you begin, ensure you have the following prerequisites in place:
- Access to EJFAT, Perlmutter, and ORNL environments as required for your workflow.
- Kubernetes cluster set up and configured.
- kubectl command-line tool installed and configured to interact with your Kubernetes cluster.
- Helm 3.x installed on your local machine.
- Prometheus Operator installed on your Kubernetes cluster.
- Access to the JIRIAF Fireworks repository (https://github.com/JeffersonLab/jiriaf-fireworks).
- Sufficient permissions to deploy services and create namespaces in the Kubernetes cluster.
- Basic understanding of Kubernetes, Helm, and SLURM concepts.
- SSH access to relevant nodes (EJFAT, Perlmutter, ORNL) for deployment and troubleshooting.
Ensure all these prerequisites are met before proceeding with the workflow setup and deployment.
For the simplest case of deploying ERSAP workflows, first remove all existing workflows and JRM instances, then deploy the workflows from a clean state as described below.
- Define a unique project ID:
export ID=jlab-100g-nersc-ornl
This ID will be used consistently across all deployment steps.
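For reference, the Helm release names used in the later steps are all derived from this ID; a minimal illustration using the names that appear below:

```bash
export ID=jlab-100g-nersc-ornl

# Release names derived from this ID in the deployment steps below:
#   $ID-prom               -> jlab-100g-nersc-ornl-prom               (Prometheus)
#   $ID-job-ejfat-$INDEX   -> jlab-100g-nersc-ornl-job-ejfat-1        (EJFAT workflow)
#   $ID-job-$SITE-<number> -> jlab-100g-nersc-ornl-job-perlmutter-0   (SLURM NERSC-ORNL workflow)
```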
- Set up EJFAT nodes:
./main/local-ejfat/init-jrm/launch-nodes.sh
For detailed usage and customization, refer to the EJFAT Node Initialization README.
- Set up Perlmutter or ORNL nodes using JIRIAF Fireworks: Refer to the JIRIAF Fireworks repository for detailed instructions on setting up the nodes for workflow execution. For this simplest case, deploy JRMs on NERSC first, and then on ORNL.
Important: During this step, pay close attention to the port mappings created when deploying JRMs. These port assignments, specifically ERSAP_EXPORTER_PORT, PROCESS_EXPORTER_PORT, EJFAT_EXPORTER_PORT, and ERSAP_QUEUE_PORT, will need to be used in step 7 when deploying ERSAP workflows. Make sure to record these port assignments for each site (NERSC and ORNL) as they will be crucial for proper workflow deployment and monitoring.
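As a convenience, you might record the assigned ports for each site in a small environment file that can be consulted in step 7; the variable names come from this step, while the file name and values below are placeholders:

```bash
# Hypothetical ports file, e.g. ports-perlmutter.env; replace each value with
# the port actually assigned when the JRMs were deployed at that site.
export ERSAP_EXPORTER_PORT=20000
export PROCESS_EXPORTER_PORT=20001
export EJFAT_EXPORTER_PORT=20002
export ERSAP_QUEUE_PORT=20003
```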
- Check if there is already a Prometheus instance for this ID:
kubectl get svc -n monitoring
If there is no Prometheus instance named $ID-prom, deploy one by following the next step; otherwise, you can skip the next step.
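One way to perform this check non-interactively, assuming the service sits in the monitoring namespace as shown above:

```bash
# Prints the service if it exists; empty output means there is no Prometheus instance for this ID.
kubectl get svc -n monitoring | grep "$ID-prom"
```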
- Deploy Prometheus (skip this step if there is already a Prometheus instance for this ID):
cd main/prom
ID=jlab-100g-nersc-ornl
helm install $ID-prom prom/ --set Deployment.name=$ID
For more information on Prometheus deployment and configuration, see the Prometheus Helm Chart README.
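After the install completes, you can verify the release and its service; a minimal check, assuming the chart places its service in the monitoring namespace as in the previous step:

```bash
helm ls | grep "$ID-prom"        # release is installed
kubectl get svc -n monitoring    # $ID-prom service exists
kubectl get pods -n monitoring   # Prometheus pod is running
```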
- Deploy ERSAP workflow on EJFAT:
cd main/local-ejfat
./launch_job.sh
This script uses the following parameters:
ID=jlab-100g-nersc-ornl
INDEX=1 # This should be a unique index for each workflow instance
You can modify these parameters in the script as needed. For more details on EJFAT workflow deployment, consult the Local EJFAT README.
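Once the script finishes, the new release can be confirmed with Helm; a minimal check, assuming the release follows the $ID-job-ejfat-$INDEX naming used in the cleanup step below:

```bash
helm ls | grep "$ID-job-ejfat"   # lists the EJFAT workflow release(s) for this ID
```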
- Deploy ERSAP workflow on SLURM NERSC-ORNL:
cd main/slurm-nersc-ornl
./batch-job-submission.sh
This script uses the following default parameters:
ID="jlab-100g-nersc-ornl" SITE="perlmutter" ERSAP_EXPORTER_PORT_BASE=20000 JRM_EXPORTER_PORT_BASE=10000 TOTAL_NUMBER=2 # This is how many jobs will be deployed.
You can modify these ports in the batch-job-submission.sh script. For more information on SLURM NERSC-ORNL workflow deployment, refer to the SLURM NERSC-ORNL README.
Critical: The port values (ERSAP_EXPORTER_PORT, PROCESS_EXPORTER_PORT, EJFAT_EXPORTER_PORT, and ERSAP_QUEUE_PORT) used here must match the port assignments made during JRM deployment in step 3. Ensure that these ports align with the configuration in your site's setup. Verify the port assignments before deployment and update them if necessary; mismatched ports will result in monitoring failures and potential workflow issues.
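One way to keep these ports consistent is to update the base values in the script before running it; a hypothetical sketch using sed, assuming the variables appear in batch-job-submission.sh exactly as listed above:

```bash
cd main/slurm-nersc-ornl
# Replace the example bases with the values recorded during JRM deployment (step 3).
sed -i 's/^ERSAP_EXPORTER_PORT_BASE=.*/ERSAP_EXPORTER_PORT_BASE=20000/' batch-job-submission.sh
sed -i 's/^JRM_EXPORTER_PORT_BASE=.*/JRM_EXPORTER_PORT_BASE=10000/' batch-job-submission.sh
./batch-job-submission.sh
```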
- Check and delete deployed jobs:
To check the jobs that are deployed:
helm ls
To delete a deployed job on SLURM NERSC-ORNL:
helm uninstall $ID-job-$SITE-<number>
Replace $ID-job-$SITE-<number> with the name used during installation (e.g., jlab-100g-nersc-ornl-job-perlmutter-0).
To delete a deployed job on EJFAT:
helm uninstall $ID-job-ejfat-$INDEX
Replace $ID-job-ejfat-$INDEX with the name used during installation (e.g., jlab-100g-nersc-ornl-job-ejfat-0).
Important for EJFAT jobs: After uninstalling the Helm release, you must also manually delete the containers created by the charts on the EJFAT nodes:
- Log in to each EJFAT node used in your deployment.
- List all containers:
docker ps -a
- Identify the containers related to your job.
- Remove these containers using:
docker rm -f <container-id>
This manual cleanup step is necessary because the containers are created directly on the EJFAT nodes and are not managed by Kubernetes.
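If the containers carry the release name, the cleanup on each node can be narrowed with a filter; a hypothetical sketch, since the actual container naming depends on the charts:

```bash
# Run on each EJFAT node. The name filter is an assumption; adjust it to match
# how the charts actually name the containers on your nodes.
docker ps -a --filter "name=jlab-100g-nersc-ornl-job-ejfat"
docker rm -f $(docker ps -aq --filter "name=jlab-100g-nersc-ornl-job-ejfat")
```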
The EJFAT (ESnet-JLab FPGA Accelerated Transport) nodes are initialized using scripts in the init-jrm directory. These scripts set up the environment for deploying workflows.
Key components:
- node-setup.sh: Sets up individual EJFAT nodes
- launch-nodes.sh: Launches multiple EJFAT nodes
For detailed information, see the EJFAT Node Initialization README.
The Local EJFAT Helm charts are used to deploy workflows on EJFAT nodes.
Key features:
- Main chart located in the job/ directory
- Customizable deployment through values.yaml
- Includes templates for jobs, services, and monitoring
For usage instructions and details, refer to the Local EJFAT README.
The SLURM NERSC-ORNL Helm charts are designed for deploying workflows on Perlmutter and ORNL environments.
Key features:
- Supports site-specific configurations (Perlmutter and ORNL)
- Includes scripts for batch job submission
- Integrates with Prometheus monitoring
For detailed usage instructions, see the SLURM NERSC-ORNL README.
A custom Prometheus Helm chart sets up monitoring for the entire JIRIAF system.
Key components:
- Prometheus Server
- Persistent Volume for data storage
- Empty directory creation for persistent storage
For in-depth information, consult the Prometheus Helm Chart README.
The system is designed for seamless integration of workflows across different environments:
- Initialize EJFAT nodes using the init-jrm scripts.
- Deploy the Prometheus monitoring system using the provided Helm chart.
- Deploy workflows on EJFAT nodes using the Local EJFAT Helm charts.
- Deploy workflows on Perlmutter or ORNL using the SLURM NERSC-ORNL Helm charts.
All deployed workflows can be monitored by the single Prometheus instance, providing a unified view of the entire system.
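One way to confirm this unified view is to open the Prometheus UI for the instance deployed above; a minimal sketch, assuming the $ID-prom service listens on Prometheus's default port 9090:

```bash
# Forward the Prometheus service to your local machine, then browse
# http://localhost:9090/targets to confirm the EJFAT and SLURM NERSC-ORNL
# exporters are being scraped.
kubectl port-forward svc/$ID-prom -n monitoring 9090:9090
```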
Each component (EJFAT, SLURM NERSC-ORNL, Prometheus) can be customized through its respective values.yaml file and additional configuration options. Refer to the individual README files for specific customization details.
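Values can also be overridden at install time instead of editing values.yaml directly; a hypothetical sketch using the Prometheus chart (only Deployment.name is confirmed by the example above; my-values.yaml is a placeholder for your own overrides file):

```bash
# Override a single value on the command line ...
helm install $ID-prom prom/ --set Deployment.name=$ID

# ... or supply a custom values file for larger changes.
helm install $ID-prom prom/ -f my-values.yaml
```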
- Use standard Kubernetes commands (kubectl get, kubectl logs, kubectl describe) to diagnose issues.
- Check Prometheus metrics and alerts for system-wide monitoring.
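For example, a quick look at the monitoring namespace when metrics stop appearing might look like this (the pod name is illustrative; substitute the one reported by the first command):

```bash
kubectl get pods -n monitoring                        # is the Prometheus pod running?
kubectl describe pod <prometheus-pod> -n monitoring   # events, restarts, scheduling issues
kubectl logs <prometheus-pod> -n monitoring           # scrape or configuration errors
```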
For component-specific troubleshooting, consult the relevant README files linked above.