
[doc] Added MNIST training using KubeRay doc page #46123

Merged
6 commits merged into ray-project:master on Jul 17, 2024

Conversation

chungen04 (Contributor)

Why are these changes needed?

Added an MNIST training page using KubeRay to the documentation. The training workload is based on this sample (currently in the KubeRay repo):
https://github.com/ray-project/kuberay/tree/master/ray-operator/config/samples/pytorch-mnist

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(


```sh
# Download `ray-job.pytorch-mnist.yaml`
curl -LO https://raw.githubusercontent.com/ray-project/kuberay/v1.0.0/ray-operator/config/samples/pytorch-mnist/ray-job.pytorch-mnist.yaml
```

Contributor Author

Solved

@chungen04 chungen04 force-pushed the mnist-example branch 3 times, most recently from fc154fd to f212de7 Compare June 21, 2024 04:08

Feel free to adjust the `NUM_WORKERS` field and the `replicas` field under `workerGroupSpecs` in `rayClusterSpec` in the yaml file such that all the worker pods can reach `Running` status.

Member
Can you explain why the RayCluster looks like this? What is the expected state of the RayCluster based on the YAML configuration (e.g., replicas)?
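For reference, the expected state can be checked with standard kubectl commands once the RayJob is applied (a minimal sketch; resource names follow the sample but may differ in your cluster):

```sh
# List the RayCluster that KubeRay created for the RayJob.
kubectl get raycluster

# List the Pods. With `replicas: 2`, there should be one head Pod and two worker
# Pods in the `Running` status, plus the RayJob submitter Pod.
kubectl get pods
```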


# Train a PyTorch Model on Fashion MNIST with CPUs on Kubernetes

This example runs distributed training of a PyTorch model on Fashion MNIST with Ray Train. See [Train a PyTorch Model on Fashion MNIST](../../../train/examples/pytorch/torch_fashion_mnist_example.rst) for more details.
Member

../../../train/examples/pytorch/torch_fashion_mnist_example.rst => We seldom use relative paths. Would you mind adding a label to torch_fashion_mnist_example.rst and referring to the document via the label?


You might need to adjust some fields in the RayJob description YAML file so that it can run in your environment:
* `replicas` under `workerGroupSpecs` in `rayClusterSpec`: This field specifies the number of worker Pods that will be scheduled to the Kubernetes cluster. Each worker Pod and the head Pod, as described in the `template` field, require 2 CPUs. A RayJob submitter Pod requires 1 CPU. For example, if your machine has 8 CPUs, the maximum `replicas` value will be 2 to allow all Pods to reach the `Running` status.
* `NUM_WORKERS` under `runtimeEnvYAML` in `spec`: This field indicates the number of Ray actors to launch (see [Document](../../../train/api/doc/ray.train.ScalingConfig.rst) for more information). Each Ray actor must be served by a worker Pod in the Kubernetes cluster. Therefore, `NUM_WORKERS` must be less than or equal to `replicas`.
Member

Avoid relative paths.
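For reference, the `NUM_WORKERS`-to-`ScalingConfig` relationship described in the excerpt above can be illustrated with a minimal Ray Train sketch. This assumes the runtime environment exposes `NUM_WORKERS` as an environment variable; the actual training script lives in the linked KubeRay sample and may be structured differently:

```python
import os

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_func():
    # Per-worker PyTorch training loop for Fashion MNIST (omitted in this sketch).
    ...


# Each Ray Train worker is a Ray actor that needs to fit on a worker Pod,
# so NUM_WORKERS should not exceed `replicas` in the RayCluster spec.
num_workers = int(os.environ.get("NUM_WORKERS", "2"))

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=num_workers),
)
result = trainer.fit()
```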

@can-anyscale (Collaborator) left a comment

Can we also have a screenshot or link to view the newly added page? Thanks.

@chungen04 (Contributor Author)

> Can we also have a screenshot or link to view the newly added page? Thanks.

@can-anyscale Sure. Here they are:

[Screenshots of the newly added documentation page]

@angelinalg (Contributor) left a comment

Sorry for the delay. Thanks for pinging me.

@@ -0,0 +1,120 @@
(kuberay-mnist-training-example)=

# Train a PyTorch Model on Fashion MNIST with CPUs on Kubernetes
Contributor

Suggested change
# Train a PyTorch Model on Fashion MNIST with CPUs on Kubernetes
# Train a PyTorch model on Fashion MNIST with CPUs on Kubernetes


## Step 3: Create a RayJob

A RayJob consists of a RayCluster custom resource and a job that can be submitted to the RayCluster. With RayJob, KubeRay creates a RayCluster and submits a job when the cluster is ready. Here is a CPU-only RayJob description YAML file for MNIST training on a PyTorch model.
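The full manifest isn't reproduced in this conversation view. For orientation, here is a trimmed sketch of the shape such a RayJob manifest takes (field names follow the KubeRay v1 CRDs; values are illustrative placeholders; see the linked KubeRay sample for the actual file):

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-pytorch-mnist
spec:
  entrypoint: python <training-script>.py    # placeholder; the sample defines the real entrypoint
  runtimeEnvYAML: |
    env_vars:
      NUM_WORKERS: "2"                       # number of Ray Train workers
  rayClusterSpec:
    headGroupSpec:
      template: {}                           # head Pod spec (2 CPUs in the sample), omitted here
    workerGroupSpecs:
      - groupName: worker-group              # illustrative name
        replicas: 2                          # number of worker Pods
        template: {}                         # worker Pod spec (2 CPUs in the sample), omitted here
```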
Contributor

Suggested change
A RayJob consists of a RayCluster custom resource and a job that can be submitted to the RayCluster. With RayJob, KubeRay creates a RayCluster and submits a job when the cluster is ready. Here is a CPU-only RayJob description YAML file for MNIST training on a PyTorch model.
A RayJob consists of a RayCluster custom resource and a job that you can submit to the RayCluster. With RayJob, KubeRay creates a RayCluster and submits a job when the cluster is ready. The following is a CPU-only RayJob description YAML file for MNIST training on a PyTorch model.


You might need to adjust some fields in the RayJob description YAML file so that it can run in your environment:
* `replicas` under `workerGroupSpecs` in `rayClusterSpec`: This field specifies the number of worker Pods that will be scheduled to the Kubernetes cluster. Each worker Pod and the head Pod, as described in the `template` field, require 2 CPUs. A RayJob submitter Pod requires 1 CPU. For example, if your machine has 8 CPUs, the maximum `replicas` value will be 2 to allow all Pods to reach the `Running` status.
Contributor

Suggested change
* `replicas` under `workerGroupSpecs` in `rayClusterSpec`: This field specifies the number of worker Pods that will be scheduled to the Kubernetes cluster. Each worker Pod and the head Pod, as described in the `template` field, require 2 CPUs. A RayJob submitter Pod requires 1 CPU. For example, if your machine has 8 CPUs, the maximum `replicas` value will be 2 to allow all Pods to reach the `Running` status.
* `replicas` under `workerGroupSpecs` in `rayClusterSpec`: This field specifies the number of worker Pods that KubeRay schedules to the Kubernetes cluster. Each worker Pod and the head Pod, as described in the `template` field, requires 2 CPUs. A RayJob submitter Pod requires 1 CPU. For example, if your machine has 8 CPUs, the maximum `replicas` value is 2 to allow all Pods to reach the `Running` status.

# Create a RayJob
kubectl apply -f ray-job.pytorch-mnist.yaml

# Check existing Pods: According to `replicas`, there should be 2 worker Pods
Contributor

Suggested change
# Check existing Pods: According to `replicas`, there should be 2 worker Pods
# Check existing Pods: According to `replicas`, there should be 2 worker Pods.

kubectl apply -f ray-job.pytorch-mnist.yaml

# Check existing Pods: According to `replicas`, there should be 2 worker Pods
# Make sure all the Pods are in the `Running` status
Contributor

Suggested change
# Make sure all the Pods are in the `Running` status
# Make sure all the Pods are in the `Running` status.

After seeing `JOB_STATUS` marked as `SUCCEEDED`, you can check the training logs:

```sh
# Check Pods name
Contributor

Suggested change
# Check Pods name
# Check Pods name.

# rayjob-pytorch-mnist-nxmj2 0/1 Completed 0 38m
# rayjob-pytorch-mnist-raycluster-rkdmq-head-m4dsl 1/1 Running 0 38m

# Check training logs
Contributor

Suggested change
# Check training logs
# Check training logs.

# ...
```
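For reference, the `JOB_STATUS` mentioned above is reported on the RayJob resource itself (a minimal sketch; the submitter Pod name comes from the listing above and will differ in your cluster):

```sh
# The RayJob reports JOB_STATUS; wait until it shows SUCCEEDED.
kubectl get rayjob rayjob-pytorch-mnist

# The submitter Pod (the `Completed` Pod in the listing above) holds the driver logs.
kubectl logs rayjob-pytorch-mnist-nxmj2
```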

## Clean-up
Contributor

Suggested change
## Clean-up
## Clean up
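For reference, the clean-up step boils down to deleting the RayJob created earlier (a minimal sketch, assuming you applied the downloaded YAML file; KubeRay then removes the RayCluster and its Pods):

```sh
# Delete the RayJob; KubeRay cleans up the RayCluster it created for this job.
kubectl delete -f ray-job.pytorch-mnist.yaml
```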

@@ -1,5 +1,7 @@
:orphan:

.. _train-pytorch-fashion-mnist:

Train a PyTorch Model on Fashion MNIST
Contributor

Suggested change
Train a PyTorch Model on Fashion MNIST
Train a PyTorch model on Fashion MNIST


# Train a PyTorch Model on Fashion MNIST with CPUs on Kubernetes

This example runs distributed training of a PyTorch model on Fashion MNIST with Ray Train. See [Train a PyTorch Model on Fashion MNIST](train-pytorch-fashion-mnist) for more details.
Contributor

Suggested change
This example runs distributed training of a PyTorch model on Fashion MNIST with Ray Train. See [Train a PyTorch Model on Fashion MNIST](train-pytorch-fashion-mnist) for more details.
This example runs distributed training of a PyTorch model on Fashion MNIST with Ray Train. See [Train a PyTorch model on Fashion MNIST](train-pytorch-fashion-mnist) for more details.

@kevin85421 added the `go` label (add ONLY when ready to merge, run all tests) on Jul 14, 2024
@kevin85421 (Member)

@can-anyscale, the CI failure seems to be unrelated to this PR. I have rerun it, but it still failed. Is it OK to merge this PR, or do you have any suggestions to fix the issue?

@kevin85421 (Member)

@chungen04 I triggered the CI again. If it still fails, please rebase with the master branch.

Signed-off-by: chungen04 <b09901027@ntu.edu.tw>
Signed-off-by: chungen04 <b09901027@ntu.edu.tw>
Signed-off-by: chungen04 <b09901027@ntu.edu.tw>
Signed-off-by: chungen04 <b09901027@ntu.edu.tw>
Signed-off-by: chungen04 <b09901027@ntu.edu.tw>
Signed-off-by: chungen04 <b09901027@ntu.edu.tw>
@jjyao merged commit 424a876 into ray-project:master on Jul 17, 2024
5 checks passed
Labels: go (add ONLY when ready to merge, run all tests)

5 participants