
Commit

move testing to individual steps
glvov-bdai committed Oct 28, 2024
1 parent 551003b commit 8159106
Showing 7 changed files with 153 additions and 390 deletions.
169 changes: 145 additions & 24 deletions docs/source/features/ray.rst
@@ -11,6 +11,10 @@ and hyperparameter tuning, both on local and remote configurations.

This functionality is experimental.

.. contents:: Table of Contents
:depth: 3
:local:

Overview
--------

@@ -85,15 +89,13 @@ The following script can be used to easily create clusters on Google GKE.
:language: python
:emphasize-lines: 14-44

**Installation**
----------------

The Ray functionality requires additional dependencies to be installed.

To use Ray without Kubernetes, like on a local computer or VM,
``kubectl`` is not required. For use on Kubernetes clusters with KubeRay,
such as Google Kubernetes Engine or Amazon Elastic Kubernetes Service, ``kubectl`` is required, and can
be installed via the `Kubernetes website <https://kubernetes.io/docs/tasks/tools/>`_.
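
As a rough sketch only (the exact package set is an assumption; consult the Ray documentation for the
extras you need), the Python dependencies and ``kubectl`` could be installed as follows:

.. code-block:: bash

    # Install Ray with the job-submission/dashboard and Tune extras into Isaac Lab's Python environment.
    ./isaaclab.sh -p -m pip install "ray[default,tune]"

    # Only needed for Kubernetes/KubeRay use: install kubectl from the official release channel.
    curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
    sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl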

@@ -125,6 +127,8 @@ multiple-GPU machines. This Ray server will run indefinitely until it is stopped
echo "import ray; ray.init(); import time; [time.sleep(10) for _ in iter(int, 1)]" | ./isaaclab.sh -p
More detailed steps for job submission are provided below in Dispatching Jobs and Tuning - Simple Ray Cluster.

KubeRay Clusters
''''''''''''''''
.. attention::
@@ -142,17 +146,19 @@ any cloud provider should work if one configures the following:
paths need to be configured correctly for everything to work. It's likely that
the example dockerfile will work out of the box and can be pushed to the registry, as
long as the base image has already been built as in the container guide
- A Kubernetes setup with available NVIDIA RTX (likely ``l4``, ``l40``, ``tesla-t4``, or ``a10``) GPU-passthrough node-pool resources,
that has access to your container registry/storage bucket and has the Ray operator enabled with correct IAM
permissions. This can be easily achieved with services such as Google GKE or AWS EKS,
provided that your account or organization has been granted a GPU-budget. It is recommended
to use manual Kubernetes services as opposed to "autopilot" services for cost-effective
experimentation as this way clusters can be completely shut down when not in use, although
this may require installing the `Nvidia GPU Operator <https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/google-gke.html>`_
- A ``kuberay.yaml.ninja`` file that describes how to allocate resources (already included for
Google Cloud)
- A storage bucket to dump experiment logs/checkpoints to, that the cluster has ``read/write`` access to
  (a short example sketch follows this list)
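
For illustration only (bucket, project, and service-account names below are placeholders, and other clouds
have equivalent tooling), a Google Cloud bucket could be created and shared with the cluster roughly as follows:

.. code-block:: bash

    # Placeholder names; substitute your own bucket, project, and service account.
    gsutil mb gs://<YOUR_BUCKET_NAME>
    gsutil iam ch serviceAccount:<CLUSTER_SA>@<PROJECT_ID>.iam.gserviceaccount.com:objectAdmin gs://<YOUR_BUCKET_NAME>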

More detailed setup steps for uploading the image, authentication, creating the cluster easily for Google GKE,
and job submission are provided as part of Dispatching Jobs and Tuning - Remote Ray Cluster Setup and Use below.

Ray Clusters (Without Kubernetes)
'''''''''''''''''''''''''''''''''
@@ -163,6 +169,9 @@ Ray Clusters (Without Kubernetes)
See the `Ray Clusters Overview <https://docs.ray.io/en/latest/cluster/getting-started.html>`_ or
`Anyscale <https://www.anyscale.com/product>`_ for more information.

More detailed setup steps for uploading the image and for job submission are provided as part of
Dispatching Jobs and Tuning - Remote Ray Cluster Setup and Use below.
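
As a minimal sketch (not specific to Isaac Lab, and assuming Ray is already installed on every machine),
a pure Ray cluster can be brought up manually with the Ray CLI:

.. code-block:: bash

    # On the head node (8265 is Ray's default dashboard/job-submission port):
    ray start --head --port=6379 --dashboard-port=8265

    # On each worker node, pointing at the head node:
    ray start --address='<HEAD_NODE_IP>:6379'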

SLURM Ray Cluster
'''''''''''''''''
See the `Ray Community SLURM support <https://docs.ray.io/en/latest/cluster/vms/user-guides/community/slurm.html#slurm-network-ray>`_
@@ -172,18 +181,22 @@ Isaac SLURM support independent of Ray has been tested, unlike Ray SLURM.
**Dispatching Jobs and Tuning**
-------------------------------

Select one of the following guides that matches your desired cluster configuration.

Simple Ray Cluster (Local/VM)
'''''''''''''''''''''''''''''

This guide assumes that there is a Ray cluster already running, and that the following scripts are run
locally on the cluster, or that the cluster's job submission address is known.

1.) Testing that the cluster works can be done as follows.

.. code-block:: bash
./isaaclab.sh -p source/standalone/workflows/ray/wrap_isaac_ray_resources.py --test
2.) Submitting resource-wrapped sub-jobs to the cluster can be done as follows.
**Ensure that sub-jobs are separated by the** ``+`` **delimiter.**

.. code-block:: bash
Expand All @@ -207,7 +220,25 @@ Single Ray Cluster (Local/VM)
# Two jobs, one after another
./isaaclab.sh -p source/standalone/workflows/ray/submit_isaac_ray_job.py --sub_jobs wrap_isaac_ray_resources.py --jobs ./isaaclab.sh -p source/standalone/workflows/rl_games/train.py --task Isaac-Cartpole-v0 --headless+./isaaclab.sh -p source/standalone/workflows/rl_games/train.py --task Isaac-Cartpole-RGB-Camera-Direct-v0 --headless --enable_cameras agent.params.config.max_epochs=150
3.) For tuning jobs, specify the hyperparameter sweep similar to :class:`RLGamesCameraJobCfg` in the following file:

.. dropdown:: source/standalone/workflows/ray/isaac_ray_tune.py (submitting aggregate jobs)
:icon: code

.. literalinclude:: ../../../source/standalone/workflows/ray/isaac_ray_tune.py
:language: python
:emphasize-lines: 17-35, 160-239

For example, see the Cartpole Example configurations.

.. dropdown:: source/standalone/workflows/ray/hyperparameter_tuning/vision_cartpole_cfg.py (submitting aggregate jobs)
:icon: code

.. literalinclude:: ../../../source/standalone/workflows/ray/hyperparameter_tuning/vision_cartpole_cfg.py
:language: python
:emphasize-lines: 17-35, 160-239

Submitting tuning aggregate jobs that create many individual sub-jobs can be tested as follows.

.. code-block:: bash
Expand All @@ -224,17 +255,30 @@ Single Ray Cluster (Local/VM)
--cfg_file hyperparameter_tuning/vision_cartpole_cfg.py \
--cfg_class CartpoleRGBNoTuneJobCfg --storage_path ~/isaac_cartpol
Remote Ray Cluster Setup and Use
'''''''''''''''''''''''''''''''''
This guide assumes that you want to create a cluster on a remote host or server. It includes steps shared
between KubeRay and pure Ray, as well as KubeRay-specific and Ray-specific steps. Follow Shared Steps Part I
first, then the steps for your desired configuration (KubeRay or pure Ray), then Shared Steps Part II.

Shared Steps Between KubeRay and Pure Ray Part I
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1.) Build the Isaac Ray image, and upload it to your container registry of choice.

.. code-block:: bash
# Login with NGC (nvcr.io) registry first, see docker steps in repo.
./isaaclab.sh -p docker/container.py start
docker build -t <REGISTRY/IMAGE_NAME> -f source/standalone/workflows/ray/cluster_configs/Dockerfile .
docker push <REGISTRY/IMAGE_NAME>
KubeRay Specific
~~~~~~~~~~~~~~~~
`k9s <https://github.com/derailed/k9s>`_ is a great tool for monitoring your clusters that can
easily be installed with ``snap install k9s --devmode``.

1.) Verify cluster access, and that the correct operators are installed.

.. code-block:: bash
# Verify cluster access
@@ -249,13 +293,90 @@ KubeRay Specific
# should list clusterpolicies.nvidia.com
kubectl get crds | grep nvidia
2.) Create the KubeRay cluster. This can be done automatically for Google GKE,
where instructions are included in the following creation file. More than one cluster
can be created at once. Each cluster can have heterogeneous resources if so desired.

.. dropdown:: source/standalone/workflows/ray/launch.py
:icon: code

.. literalinclude:: ../../../source/standalone/workflows/ray/launch.py
:language: python
:emphasize-lines: 14-44
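
If cluster creation succeeded, the resulting resources should be visible with standard ``kubectl``
commands (a rough sketch; the exact resource names depend on your configuration):

.. code-block:: bash

    # List the KubeRay cluster custom resources and their pods.
    kubectl get rayclusters
    kubectl get pods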

3.) Fetch the KubeRay cluster IP addresses. This can be done automatically for KubeRay clusters,
where instructions are included in the following fetching file.

.. dropdown:: source/standalone/workflows/ray/grok_cluster_with_kubectl.py
:icon: code

.. literalinclude:: ../../../source/standalone/workflows/ray/grok_cluster_with_kubectl.py
:language: python
:emphasize-lines: 14-23

Ray Specific
~~~~~~~~~~~~

1.) Verify cluster access and storage bucket access.

2.) Create a ``~/.cluster_config`` file, where ``name: <NAME> address: http://<IP>:<PORT>`` is on
a new line for each unique cluster. For one cluster, there should only be one line in this file.
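
For illustration only (cluster names, IP addresses, and ports below are placeholders; 8265 is Ray's
default job submission port), a two-cluster config file could be written as follows:

.. code-block:: bash

    # Placeholder names and addresses; substitute the head-node address of each of your clusters.
    cat <<EOF > ~/.cluster_config
    name: cluster-a address: http://10.0.0.10:8265
    name: cluster-b address: http://10.0.0.11:8265
    EOF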


Shared Steps Between KubeRay and Pure Ray Part II
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1.) Test that your cluster is operational with the following.

.. code-block:: bash
    # Test that NVIDIA GPUs are visible and that Ray is operational with the following command:
    ./isaaclab.sh -p source/standalone/workflows/ray/wrap_isaac_ray_resources.py \
    --jobs wrap_isaac_ray_resources.py --test
2.) Jobs can be submitted with the following script.

.. dropdown:: source/standalone/workflows/ray/submit_isaac_ray_job.py (submitting aggregate jobs)
:icon: code

.. literalinclude:: ../../../source/standalone/workflows/ray/submit_isaac_ray_job.py
:language: python
:emphasize-lines: 12-42

.. code-block:: bash
# General Template
    ./isaaclab.sh -p source/standalone/workflows/ray/wrap_isaac_ray_resources.py \
    --jobs <JOB0>+<JOB1>+<JOB2>
# Example 1: Submitting two jobs
    # (To run within a container, replace ./isaaclab.sh -p with /workspace/isaaclab/isaaclab.sh -p
    # and source/standalone/workflows/rl_games/train.py with /workspace/isaaclab/source/standalone/workflows/rl_games/train.py)
./isaaclab.sh -p source/standalone/workflows/ray/submit_isaac_ray_job.py --aggregate_jobs wrap_isaac_ray_resources.py --sub_jobs ./isaaclab.sh -p source/standalone/workflows/rl_games/train.py --task Isaac-Cartpole-v0 --headless+./isaaclab.sh -p source/standalone/workflows/rl_games/train.py --task Isaac-Cartpole-RGB-Camera-Direct-v0 --headless --enable_cameras agent.params.config.max_epochs=150
# For more than one cluster, and/or more than one aggregate job, separate aggregate jobs with the * delimiter.
3.) For tuning jobs, specify the hyperparameter sweep similar to :class:`RLGamesCameraJobCfg` in the following file:

.. dropdown:: source/standalone/workflows/ray/isaac_ray_tune.py (submitting aggregate jobs)
:icon: code

.. literalinclude:: ../../../source/standalone/workflows/ray/isaac_ray_tune.py
:language: python
:emphasize-lines: 17-35, 160-239

For example, see the Cartpole Example configurations.

.. dropdown:: source/standalone/workflows/ray/hyperparameter_tuning/vision_cartpole_cfg.py (submitting aggregate jobs)
:icon: code

.. literalinclude:: ../../../source/standalone/workflows/ray/hyperparameter_tuning/vision_cartpole_cfg.py
:language: python
:emphasize-lines: 17-35, 160-239

Tuning jobs can be submitted with the previous script as with other jobs.

.. code-block:: bash
    ./isaaclab.sh -p source/standalone/workflows/ray/submit_isaac_ray_job.py --aggregate_jobs \
    /workspace/isaaclab/isaaclab.sh -p /workspace/isaaclab/source/standalone/workflows/ray/isaac_ray_tune.py \
--cfg_file hyperparameter_tuning/vision_cartpole_cfg.py \
--cfg_class CartpoleRGBNoTuneJobCfg --storage_path <YOUR_STORAGE_BUCKET>
**Cluster Cleanup**
'''''''''''''''''''

