
[doc] Added MNIST training using KubeRay doc page #46123

Merged
6 commits merged into ray-project:master on Jul 17, 2024

Conversation

chungen04 (Contributor)

Why are these changes needed?

Added an MNIST training page using KubeRay to the documentation. The training workload is based on this sample (currently in the KubeRay repo):
https://github.com/ray-project/kuberay/tree/master/ray-operator/config/samples/pytorch-mnist

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(


```sh
# Download `ray-job.pytorch-mnist.yaml`
curl -LO https://raw.githubusercontent.com/ray-project/kuberay/v1.0.0/ray-operator/config/samples/pytorch-mnist/ray-job.pytorch-mnist.yaml
```

Contributor Author

Solved

@chungen04 chungen04 force-pushed the mnist-example branch 3 times, most recently from fc154fd to f212de7 Compare June 21, 2024 04:08

Feel free to adjust the `NUM_WORKERS` field and the `replicas` field under `workerGroupSpecs` in `rayClusterSpec` in the yaml file such that all the worker pods can reach `Running` status.

Member
Can you explain why the RayCluster looks like this? What is the expected state of the RayCluster based on the YAML configuration (e.g., replicas)?
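For reference, the expected state can be checked with standard kubectl commands once the RayJob is applied (a minimal sketch; resource names follow the sample but may differ in your cluster):

```sh
# List the RayCluster that KubeRay created for the RayJob.
kubectl get raycluster

# List the Pods. With `replicas: 2`, there should be one head Pod and two worker
# Pods in the `Running` status, plus the RayJob submitter Pod.
kubectl get pods
```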


# Train a PyTorch Model on Fashion MNIST with CPUs on Kubernetes

This example runs distributed training of a PyTorch model on Fashion MNIST with Ray Train. See [Train a PyTorch Model on Fashion MNIST](../../../train/examples/pytorch/torch_fashion_mnist_example.rst) for more details.
Member

../../../train/examples/pytorch/torch_fashion_mnist_example.rst => We seldom use relative paths. Would you mind adding a label to torch_fashion_mnist_example.rst and referring to the document via the label?


You might need to adjust some fields in the RayJob description YAML file so that it can run in your environment:
* `replicas` under `workerGroupSpecs` in `rayClusterSpec`: This field specifies the number of worker Pods that will be scheduled to the Kubernetes cluster. Each worker Pod and the head Pod, as described in the `template` field, require 2 CPUs. A RayJob submitter Pod requires 1 CPU. For example, if your machine has 8 CPUs, the maximum `replicas` value will be 2 to allow all Pods to reach the `Running` status.
* `NUM_WORKERS` under `runtimeEnvYAML` in `spec`: This field indicates the number of Ray actors to launch (see [Document](../../../train/api/doc/ray.train.ScalingConfig.rst) for more information). Each Ray actor must be served by a worker Pod in the Kubernetes cluster. Therefore, `NUM_WORKERS` must be less than or equal to `replicas`.
Member

Avoid relative paths.
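For reference, the `NUM_WORKERS`-to-`ScalingConfig` relationship described in the excerpt above can be illustrated with a minimal Ray Train sketch. This assumes the runtime environment exposes `NUM_WORKERS` as an environment variable; the actual training script lives in the linked KubeRay sample and may be structured differently:

```python
import os

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_func():
    # Per-worker PyTorch training loop for Fashion MNIST (omitted in this sketch).
    ...


# Each Ray Train worker is a Ray actor that needs to fit on a worker Pod,
# so NUM_WORKERS should not exceed `replicas` in the RayCluster spec.
num_workers = int(os.environ.get("NUM_WORKERS", "2"))

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=num_workers),
)
result = trainer.fit()
```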

@can-anyscale (Collaborator) left a comment

Can we also have a screenshot or link to view the newly added page? Thanks.

@chungen04 (Contributor Author)

> Can we also have a screenshot or link to view the newly added page? Thanks.

@can-anyscale Sure. Here they are:

[Screenshots of the newly added documentation page]

@angelinalg (Contributor) left a comment

Sorry for the delay. Thanks for pinging me.

@@ -0,0 +1,120 @@
(kuberay-mnist-training-example)=

# Train a PyTorch Model on Fashion MNIST with CPUs on Kubernetes
Contributor

Suggested change
# Train a PyTorch Model on Fashion MNIST with CPUs on Kubernetes
# Train a PyTorch model on Fashion MNIST with CPUs on Kubernetes


## Step 3: Create a RayJob

A RayJob consists of a RayCluster custom resource and a job that can be submitted to the RayCluster. With RayJob, KubeRay creates a RayCluster and submits a job when the cluster is ready. Here is a CPU-only RayJob description YAML file for MNIST training on a PyTorch model.
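The full manifest isn't reproduced in this conversation view. For orientation, here is a trimmed sketch of the shape such a RayJob manifest takes (field names follow the KubeRay v1 CRDs; values are illustrative placeholders; see the linked KubeRay sample for the actual file):

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-pytorch-mnist
spec:
  entrypoint: python <training-script>.py    # placeholder; the sample defines the real entrypoint
  runtimeEnvYAML: |
    env_vars:
      NUM_WORKERS: "2"                       # number of Ray Train workers
  rayClusterSpec:
    headGroupSpec:
      template: {}                           # head Pod spec (2 CPUs in the sample), omitted here
    workerGroupSpecs:
      - groupName: worker-group              # illustrative name
        replicas: 2                          # number of worker Pods
        template: {}                         # worker Pod spec (2 CPUs in the sample), omitted here
```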
Contributor

Suggested change
A RayJob consists of a RayCluster custom resource and a job that can be submitted to the RayCluster. With RayJob, KubeRay creates a RayCluster and submits a job when the cluster is ready. Here is a CPU-only RayJob description YAML file for MNIST training on a PyTorch model.
A RayJob consists of a RayCluster custom resource and a job that you can submit to the RayCluster. With RayJob, KubeRay creates a RayCluster and submits a job when the cluster is ready. The following is a CPU-only RayJob description YAML file for MNIST training on a PyTorch model.


You might need to adjust some fields in the RayJob description YAML file so that it can run in your environment:
* `replicas` under `workerGroupSpecs` in `rayClusterSpec`: This field specifies the number of worker Pods that will be scheduled to the Kubernetes cluster. Each worker Pod and the head Pod, as described in the `template` field, require 2 CPUs. A RayJob submitter Pod requires 1 CPU. For example, if your machine has 8 CPUs, the maximum `replicas` value will be 2 to allow all Pods to reach the `Running` status.
Contributor

Suggested change
* `replicas` under `workerGroupSpecs` in `rayClusterSpec`: This field specifies the number of worker Pods that will be scheduled to the Kubernetes cluster. Each worker Pod and the head Pod, as described in the `template` field, require 2 CPUs. A RayJob submitter Pod requires 1 CPU. For example, if your machine has 8 CPUs, the maximum `replicas` value will be 2 to allow all Pods to reach the `Running` status.
* `replicas` under `workerGroupSpecs` in `rayClusterSpec`: This field specifies the number of worker Pods that KubeRay schedules to the Kubernetes cluster. Each worker Pod and the head Pod, as described in the `template` field, requires 2 CPUs. A RayJob submitter Pod requires 1 CPU. For example, if your machine has 8 CPUs, the maximum `replicas` value is 2 to allow all Pods to reach the `Running` status.

# Create a RayJob
kubectl apply -f ray-job.pytorch-mnist.yaml

# Check existing Pods: According to `replicas`, there should be 2 worker Pods
Contributor

Suggested change
# Check existing Pods: According to `replicas`, there should be 2 worker Pods
# Check existing Pods: According to `replicas`, there should be 2 worker Pods.

kubectl apply -f ray-job.pytorch-mnist.yaml

# Check existing Pods: According to `replicas`, there should be 2 worker Pods
# Make sure all the Pods are in the `Running` status
Contributor

Suggested change
# Make sure all the Pods are in the `Running` status
# Make sure all the Pods are in the `Running` status.

After seeing `JOB_STATUS` marked as `SUCCEEDED`, you can check the training logs:

```sh
# Check Pods name
Contributor

Suggested change
# Check Pods name
# Check Pods name.

# rayjob-pytorch-mnist-nxmj2 0/1 Completed 0 38m
# rayjob-pytorch-mnist-raycluster-rkdmq-head-m4dsl 1/1 Running 0 38m

# Check training logs
Contributor

Suggested change
# Check training logs
# Check training logs.

# ...
```
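For reference, the `JOB_STATUS` mentioned above is reported on the RayJob resource itself (a minimal sketch; the submitter Pod name comes from the listing above and will differ in your cluster):

```sh
# The RayJob reports JOB_STATUS; wait until it shows SUCCEEDED.
kubectl get rayjob rayjob-pytorch-mnist

# The submitter Pod (the `Completed` Pod in the listing above) holds the driver logs.
kubectl logs rayjob-pytorch-mnist-nxmj2
```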

## Clean-up
Contributor

Suggested change
## Clean-up
## Clean up
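For reference, the clean-up step boils down to deleting the RayJob created earlier (a minimal sketch, assuming you applied the downloaded YAML file; KubeRay then removes the RayCluster and its Pods):

```sh
# Delete the RayJob; KubeRay cleans up the RayCluster it created for this job.
kubectl delete -f ray-job.pytorch-mnist.yaml
```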

@@ -1,5 +1,7 @@
:orphan:

.. _train-pytorch-fashion-mnist:

Train a PyTorch Model on Fashion MNIST
Contributor

Suggested change
Train a PyTorch Model on Fashion MNIST
Train a PyTorch model on Fashion MNIST


# Train a PyTorch Model on Fashion MNIST with CPUs on Kubernetes

This example runs distributed training of a PyTorch model on Fashion MNIST with Ray Train. See [Train a PyTorch Model on Fashion MNIST](train-pytorch-fashion-mnist) for more details.
Contributor

Suggested change
This example runs distributed training of a PyTorch model on Fashion MNIST with Ray Train. See [Train a PyTorch Model on Fashion MNIST](train-pytorch-fashion-mnist) for more details.
This example runs distributed training of a PyTorch model on Fashion MNIST with Ray Train. See [Train a PyTorch model on Fashion MNIST](train-pytorch-fashion-mnist) for more details.

@kevin85421 added the `go` label (add ONLY when ready to merge, run all tests) on Jul 14, 2024
@kevin85421 (Member)

@can-anyscale, the CI failure seems to be unrelated to this PR. I have rerun it, but it still failed. Is it OK to merge this PR, or do you have any suggestions to fix the issue?

@kevin85421 (Member)

@chungen04 I triggered the CI again. If it still fails, please rebase with the master branch.

Signed-off-by: chungen04 <b09901027@ntu.edu.tw>
Signed-off-by: chungen04 <b09901027@ntu.edu.tw>
Signed-off-by: chungen04 <b09901027@ntu.edu.tw>
Signed-off-by: chungen04 <b09901027@ntu.edu.tw>
Signed-off-by: chungen04 <b09901027@ntu.edu.tw>
Signed-off-by: chungen04 <b09901027@ntu.edu.tw>
@jjyao merged commit 424a876 into ray-project:master on Jul 17, 2024
5 checks passed
Labels: go (add ONLY when ready to merge, run all tests)

5 participants