[doc] Added MNIST training using KubeRay doc page #46123
Conversation
```sh
# Download `ray-job.pytorch-mnist.yaml`
curl -LO https://raw.githubusercontent.com/ray-project/kuberay/v1.0.0/ray-operator/config/samples/pytorch-mnist/ray-job.pytorch-mnist.yaml
```
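A quick way to confirm the manifest landed locally before moving on (nothing KubeRay-specific, just a file check):

```sh
# Confirm the sample RayJob manifest downloaded successfully.
ls -l ray-job.pytorch-mnist.yaml
```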
Solved
(force-pushed from fc154fd to f212de7)
Feel free to adjust the `NUM_WORKERS` field and the `replicas` field under `workerGroupSpecs` in `rayClusterSpec` in the YAML file so that all the worker Pods can reach the `Running` status.
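As an illustration, one way to find both fields before editing them by hand (a minimal sketch; the exact line numbers depend on the sample file's layout):

```sh
# Locate the `replicas` and `NUM_WORKERS` fields in the downloaded sample.
grep -n -E 'replicas|NUM_WORKERS' ray-job.pytorch-mnist.yaml
```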
Can you explain why the RayCluster looks like this? What is the expected state of the RayCluster based on the YAML configuration (e.g., `replicas`)?
# Train a PyTorch Model on Fashion MNIST with CPUs on Kubernetes
This example runs distributed training of a PyTorch model on Fashion MNIST with Ray Train. See [Train a PyTorch Model on Fashion MNIST](../../../train/examples/pytorch/torch_fashion_mnist_example.rst) for more details.
`../../../train/examples/pytorch/torch_fashion_mnist_example.rst` => We seldom use relative paths. Would you mind adding a label to `torch_fashion_mnist_example.rst` and referring to the document via the label?
You might need to adjust some fields in the RayJob description YAML file so that it can run in your environment:
* `replicas` under `workerGroupSpecs` in `rayClusterSpec`: This field specifies the number of worker Pods that will be scheduled to the Kubernetes cluster. Each worker Pod and the head Pod, as described in the `template` field, require 2 CPUs. A RayJob submitter Pod requires 1 CPU. For example, if your machine has 8 CPUs, the maximum `replicas` value will be 2 to allow all Pods to reach the `Running` status.
* `NUM_WORKERS` under `runtimeEnvYAML` in `spec`: This field indicates the number of Ray actors to launch (see [Document](../../../train/api/doc/ray.train.ScalingConfig.rst) for more information). Each Ray actor must be served by a worker Pod in the Kubernetes cluster. Therefore, `NUM_WORKERS` must be less than or equal to `replicas`.
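As a sanity check on that CPU budget (head 2 + 2 workers × 2 + submitter 1 = 7 CPUs for `replicas: 2`), you can compare it against the cluster's actual capacity; a minimal sketch using standard `kubectl` output formatting:

```sh
# List each node's allocatable CPU to verify all Pods can be scheduled.
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu
```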
Avoid relative paths.
Can we also have a screenshot or link to view the newly added page? Thanks!
@can-anyscale Sure. Here they are:
Sorry for the delay. Thanks for pinging me.
@@ -0,0 +1,120 @@
(kuberay-mnist-training-example)=
# Train a PyTorch Model on Fashion MNIST with CPUs on Kubernetes |
Suggested change:
# Train a PyTorch model on Fashion MNIST with CPUs on Kubernetes
## Step 3: Create a RayJob

A RayJob consists of a RayCluster custom resource and a job that can be submitted to the RayCluster. With RayJob, KubeRay creates a RayCluster and submits a job when the cluster is ready. Here is a CPU-only RayJob description YAML file for MNIST training on a PyTorch model.
Suggested change:
A RayJob consists of a RayCluster custom resource and a job that you can submit to the RayCluster. With RayJob, KubeRay creates a RayCluster and submits a job when the cluster is ready. The following is a CPU-only RayJob description YAML file for MNIST training on a PyTorch model.
You might need to adjust some fields in the RayJob description YAML file so that it can run in your environment:
* `replicas` under `workerGroupSpecs` in `rayClusterSpec`: This field specifies the number of worker Pods that will be scheduled to the Kubernetes cluster. Each worker Pod and the head Pod, as described in the `template` field, require 2 CPUs. A RayJob submitter Pod requires 1 CPU. For example, if your machine has 8 CPUs, the maximum `replicas` value will be 2 to allow all Pods to reach the `Running` status.
Suggested change:
* `replicas` under `workerGroupSpecs` in `rayClusterSpec`: This field specifies the number of worker Pods that KubeRay schedules to the Kubernetes cluster. Each worker Pod and the head Pod, as described in the `template` field, requires 2 CPUs. A RayJob submitter Pod requires 1 CPU. For example, if your machine has 8 CPUs, the maximum `replicas` value is 2 to allow all Pods to reach the `Running` status.
```sh
# Create a RayJob
kubectl apply -f ray-job.pytorch-mnist.yaml

# Check existing Pods: According to `replicas`, there should be 2 worker Pods
# Make sure all the Pods are in the `Running` status
```

Suggested change:
# Check existing Pods: According to `replicas`, there should be 2 worker Pods.

Suggested change:
# Make sure all the Pods are in the `Running` status.
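For reference, a minimal way to watch the Pods reach `Running` (Pod names will differ; the RayCluster name suffix is generated):

```sh
# Watch Pod status until the head Pod and both worker Pods are Running.
kubectl get pods -w
```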
After seeing `JOB_STATUS` marked as `SUCCEEDED`, you can check the training logs:

```sh
# Check Pods name
```

Suggested change:
# Check Pods name.
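A minimal sketch of polling that status, assuming the RayJob keeps the sample name `rayjob-pytorch-mnist` and that the CRD exposes `status.jobStatus` (both taken from the KubeRay sample and CRD):

```sh
# Print the RayJob's job status; expect SUCCEEDED once training finishes.
kubectl get rayjob rayjob-pytorch-mnist -o jsonpath='{.status.jobStatus}'
```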
```sh
# rayjob-pytorch-mnist-nxmj2                          0/1   Completed   0   38m
# rayjob-pytorch-mnist-raycluster-rkdmq-head-m4dsl    1/1   Running     0   38m

# Check training logs
```

Suggested change:
# Check training logs.
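A sketch of pulling those logs from the submitter Pod listed above (the `-nxmj2` suffix is generated, so substitute the name from your own cluster):

```sh
# Print the training logs captured by the completed RayJob submitter Pod.
kubectl logs rayjob-pytorch-mnist-nxmj2
```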
```sh
# ...
```

## Clean-up

Suggested change:
## Clean up
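The clean-up command itself is elided in the quote above; presumably it is the usual delete, sketched here under that assumption:

```sh
# Delete the RayJob; KubeRay also tears down the RayCluster it created.
kubectl delete -f ray-job.pytorch-mnist.yaml
```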
@@ -1,5 +1,7 @@
:orphan:

.. _train-pytorch-fashion-mnist:

Train a PyTorch Model on Fashion MNIST
Suggested change:
Train a PyTorch model on Fashion MNIST
# Train a PyTorch Model on Fashion MNIST with CPUs on Kubernetes
This example runs distributed training of a PyTorch model on Fashion MNIST with Ray Train. See [Train a PyTorch Model on Fashion MNIST](train-pytorch-fashion-mnist) for more details.
Suggested change:
This example runs distributed training of a PyTorch model on Fashion MNIST with Ray Train. See [Train a PyTorch model on Fashion MNIST](train-pytorch-fashion-mnist) for more details.
@can-anyscale, the CI failure seems to be unrelated to this PR. I have rerun it, but it still failed. Is it OK to merge this PR, or do you have any suggestions to fix the issue?
@chungen04 I triggered the CI again. If it still fails, please rebase with the master branch.
Signed-off-by: chungen04 <b09901027@ntu.edu.tw>
Why are these changes needed?
Added an MNIST training page using KubeRay to the documentation. The training workload is based on this sample (currently in the KubeRay repo):
https://github.com/ray-project/kuberay/tree/master/ray-operator/config/samples/pytorch-mnist
Related issue number
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.