convert more from readme to rst
glvov-bdai committed Oct 28, 2024
1 parent 62f8d15 commit abb7002
Showing 4 changed files with 62 additions and 13 deletions.
61 changes: 50 additions & 11 deletions docs/source/features/ray.rst
@@ -32,7 +32,7 @@ Both resource-wrapped and tuning aggregate jobs dispatch individual jobs to a de
cluster, which leverages the cluster's resources (e.g., a single workstation node or multiple nodes)
to execute these jobs with workers in parallel and/or sequentially. By default, aggregate jobs use all \
available resources on each available GPU-enabled node for each sub-job worker. This can be changed through
-specifying the ``--num_workers_per_node`` argument, especially critical for parallel aggregate
+specifying the ``--num_workers`` argument, especially critical for parallel aggregate
job processing on local or virtual multi-GPU machines.
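
For example, on a workstation with two GPUs, a resource-wrapped aggregate job could be split into
two parallel workers as follows (a minimal sketch reusing the generic template shown later on this
page; the ``<JOB>`` placeholders are illustrative):

.. code-block:: bash

   # Divide the node's resources into two workers so the two sub-jobs run in parallel
   ./isaaclab.sh -p source/standalone/workflows/ray/wrap_isaac_ray_resources.py \
       --num_workers 2 \
       --sub_jobs <JOB0>+<JOB1>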

In resource-wrapped aggregate jobs, each sub-job and its
@@ -57,7 +57,7 @@ sweep configuration. This assumes homogeneous node resource composition for node

The following script can be used to submit aggregate
jobs to one or more Ray cluster(s), which can be used for
-running jobs on a remote cluster or simultaneous jobs with hetereogeneous
+running jobs on a remote cluster or simultaneous jobs with heterogeneous
resource requirements:

.. dropdown:: source/standalone/workflows/ray/submit_isaac_ray_job.py (submitting aggregate jobs)
@@ -76,6 +76,17 @@ The following script can be used to extract KubeRay Cluster information for aggr
:language: python
:emphasize-lines: 14-23
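
To get a rough sense of the information this script gathers, the same details can be inspected by
hand with ``kubectl``; the commands below are standard ``kubectl`` queries, and the head-service
naming is only an assumption based on typical KubeRay deployments.

.. code-block:: bash

   # List the KubeRay clusters in the current namespace
   kubectl get raycluster
   # Inspect the pods and the head service that expose each cluster's Ray address
   kubectl get pods
   kubectl get svc | grep head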

The following script can be used to easily create clusters on Google GKE.

.. dropdown:: source/standalone/workflows/ray/launch.py
:icon: code

.. literalinclude:: ../../../source/standalone/workflows/ray/launch.py
:language: python
:emphasize-lines: 14-44
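
Before creating anything, it may help to review the options the script exposes and then watch the
cluster come up; the ``-h`` listing below is the authoritative reference for the flags, and
``kubectl get pods`` is just a generic way to confirm that head and worker pods are being scheduled.

.. code-block:: bash

   # Show the cluster-creation options exposed by the script
   ./isaaclab.sh -p source/standalone/workflows/ray/launch.py -h
   # Watch the KubeRay head and worker pods come up (pod names are cluster-specific)
   kubectl get pods --watch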

The following script can be used t

**Installation**
----------------

@@ -164,7 +175,7 @@ Isaac SLURM support independent of Ray has been tested, unlike Ray SLURM.
Provided that there is a Ray cluster running with a correct configuration, select one of the following guides
that matches your Cluster configuration.

-Single-Node Ray Cluster (Local/VM)
+Single Ray Cluster (Local/VM)
''''''''''''''''''''''''''''''''''
1.) Testing that the cluster works:

@@ -177,16 +188,20 @@ Single-Node Ray Cluster (Local/VM)
.. code-block:: bash
# Generic Templates-----------------------------------
./isaaclab.sh -p source/standalone/workflows/ray/wrap_isaac_ray_resources.py -h
# No resource isolation; no parallelization:
./isaaclab.sh -p source/standalone/workflows/ray/wrap_isaac_ray_resources.py \
--sub_jobs <JOB0>+<JOB1>+<JOB2>
-# Automatic Resource Isolation; Option A: needed for parallelization
+# Automatic Resource Isolation; Example A: needed for parallelization
./isaaclab.sh -p source/standalone/workflows/ray/wrap_isaac_ray_resources.py \
---num_workers_per_node <NUM_TO_DIVIDE_TOTAL_RESOURCES_BY> \
---jobs <JOB0>+<JOB1>
-# Manual Resource Isolation; Option B: needed for parallelization
-./isaaclab.sh -p source/standalone/workflows/ray/wrap_isaac_ray_resources.py --num_cpu_per_job <CPU> \
---num_gpu_per_job <GPU> --gb_ram_per_job <RAM> --jobs <JOB0>+<JOB1>
+--num_workers <NUM_TO_DIVIDE_TOTAL_RESOURCES_BY> \
+--sub_jobs <JOB0>+<JOB1>
+# Manual Resource Isolation; Example B: needed for parallelization
+./isaaclab.sh -p source/standalone/workflows/ray/wrap_isaac_ray_resources.py --num_cpu_per_worker <CPU> \
+--gpu_per_worker <GPU> --ram_gb_per_worker <RAM> --sub_jobs <JOB0>+<JOB1>
+# Manual Resource Isolation; Example C: Needed for parallelization, for heterogeneous workloads
+./isaaclab.sh -p source/standalone/workflows/ray/wrap_isaac_ray_resources.py --num_cpu_per_worker <CPU> \
+--gpu_per_worker <GPU1> <GPU2> --ram_gb_per_worker <RAM> --sub_jobs <JOB0>+<JOB1>
# Examples----------------------------------------
# Two jobs, one after another
@@ -196,10 +211,18 @@ Single-Node Ray Cluster (Local/VM)

.. code-block:: bash
# Example A:
./isaaclab.sh -p source/standalone/workflows/ray/isaac_ray_tune.py \
--mode local \
--cfg_file hyperparameter_tuning/vision_cartpole_cfg.py \
--cfg_class CartpoleRGBNoTuneJobCfg --storage_path ~/isaac_cartpole
# Example B: Resource Wrapped
./isaaclab.sh -p source/standalone/workflows/ray/wrap_isaac_ray_resources.py --num_cpu_per_worker <CPU> \
--gpu_per_worker <GPU> --ram_gb_per_worker <RAM> \
--sub_jobs ./isaaclab.sh -p source/standalone/workflows/ray/isaac_ray_tune.py \
--mode local \
--cfg_file hyperparameter_tuning/vision_cartpole_cfg.py \
--cfg_class CartpoleRGBNoTuneJobCfg --storage_path ~/isaac_cartpole
Multiple-Node Ray Cluster
'''''''''''''''''''''''''
@@ -209,8 +232,24 @@ as well as functionality that is shared across both KubeRay and pure Ray cluster

KubeRay Specific
~~~~~~~~~~~~~~~~
`k9s <https://github.com/derailed/k9s>`_ is a great tool for monitoring your clusters.

1.) Verify cluster access, and that the correct operators are installed

-1.) Verify cluster access with ``kubectl cluster-info``
.. code-block:: bash
# Verify cluster access
kubectl cluster-info
# If using a manually managed cluster (not Autopilot or the like)
# verify that there are node pools
kubectl get nodes
# Check that the ray operator is installed on the cluster
# should list rayclusters.ray.io , rayjobs.ray.io , and rayservices.ray.io
kubectl get crds | grep ray
# Check that the NVIDIA Driver Operator is installed on the cluster
# should list clusterpolicies.nvidia.com
kubectl get crds | grep nvidia
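
If either set of CRDs is missing, the corresponding operator needs to be installed before clusters
can be created. A minimal sketch is shown below; the chart names, repository URLs, and namespace are
assumptions based on the upstream KubeRay and NVIDIA GPU Operator Helm charts, not something this
repository pins, so verify them against your environment.

.. code-block:: bash

   # Assumed upstream Helm charts; confirm chart names and versions for your cluster
   helm repo add kuberay https://ray-project.github.io/kuberay-helm/
   helm install kuberay-operator kuberay/kuberay-operator
   helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
   helm install gpu-operator nvidia/gpu-operator --namespace gpu-operator --create-namespace
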
2.) Still being copied from README

Multiple-Cluster Multiple-Node Ray
''''''''''''''''''''''''''''''''''
@@ -227,5 +266,5 @@ recreated! For KubeRay clusters, this can be done via

.. code-block:: bash
-kubectl get raycluster | egrep 'hyperparameter-tuner' | awk '{print $1}' | xargs kubectl delete raycluster
+kubectl get raycluster | egrep 'isaacray' | awk '{print $1}' | xargs kubectl delete raycluster
kubectl delete secret bucket-access
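
To confirm the teardown completed, standard ``kubectl`` queries can be used; once everything is
deleted, both commands below should report that no matching resources are found.

.. code-block:: bash

   # Should return no RayCluster resources after deletion
   kubectl get raycluster
   # Should report that the secret no longer exists
   kubectl get secret bucket-access
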
2 changes: 1 addition & 1 deletion source/standalone/workflows/ray/isaac_ray_util.py
@@ -301,7 +301,7 @@ def add_resource_arguments(
defaults: The default values for GPUs, CPUs, RAM, and Num Workers
cluster_create_defaults: Set to true to populate reasonable defaults for creating clusters.
Returns:
-_description_
+The argparser with the standard resource arguments.
"""
if defaults is None:
if cluster_create_defaults:
10 changes: 10 additions & 0 deletions source/standalone/workflows/ray/launch.py
@@ -14,6 +14,16 @@

"""This script helps create one or more KubeRay clusters.
This script assumes that there is an existing secret that provides credentials
to access cloud storage. This secret could be created with
.. code-block:: bash
gcloud auth login # https://cloud.google.com/sdk/docs/install
kubectl create secret generic bucket-access \
--from-file=key.json=/home/<USERNAME>/.config/gcloud/application_default_credentials.json \
--namespace=<your-namespace>
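
As a quick sanity check (not part of this script, just standard ``kubectl``), confirm the secret
exists in the namespace used above before launching clusters:

.. code-block:: bash

   kubectl get secret bucket-access --namespace=<your-namespace>
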
Usage:
.. code-block:: bash
2 changes: 1 addition & 1 deletion source/standalone/workflows/ray/wrap_isaac_ray_resources.py
@@ -17,7 +17,7 @@
If the desired resources for each sub-job are specified,
the maximum number of workers possible with the desired resources is created for each node
with GPU(s) in the cluster. It is also possible to split available node resources for each node
-into the desired number of workers with the ``--num_workers_per_node`` flag, to be able to easily
+into the desired number of workers with the ``--num_workers`` flag, to be able to easily
parallelize sub-jobs on multi-GPU nodes. Because Isaac Lab requires a GPU,
this ignores all CPU-only nodes such as loggers.
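
As a small usage sketch (the placeholder values are illustrative and mirror the examples in
``docs/source/features/ray.rst``), requesting specific per-worker resources might look like:

.. code-block:: bash

   # Create as many workers as fit with 8 CPUs, 1 GPU, and 16 GB RAM each, then run two sub-jobs
   ./isaaclab.sh -p source/standalone/workflows/ray/wrap_isaac_ray_resources.py \
       --num_cpu_per_worker 8 --gpu_per_worker 1 --ram_gb_per_worker 16 \
       --sub_jobs <JOB0>+<JOB1>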
