KEP-2170: Add Torch Distributed Runtime #2328
An existing kustomization (listing manager.yaml) is updated to set the controller namespace:

```diff
@@ -1,2 +1,4 @@
 resources:
 - manager.yaml
+# TODO (andreyvelich): Move it to overlays once we copy the JobSet manifests.
+namespace: kubeflow-system
```
A new kustomization that registers the runtime manifest:

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- torch-distributed.yaml
```
The new torch-distributed ClusterTrainingRuntime (new file; shown in three parts, split at the points where review comments were left):

```yaml
apiVersion: kubeflow.org/v2alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed
  labels:
    training.kubeflow.org/phase: pre-training
```

Review discussion on the runtime name and accelerator support:

- It would be useful to anticipate support for other accelerators like AMD/ROCm, so the naming of this runtime could take that into account.
- I think it is a good idea. @astefanutti, could you create a separate issue so we can track it?
- @andreyvelich thanks for the feedback, I've just created #2335. In addition to the resources, the container image set for the runtime is also coupled to the accelerator: the pytorch/pytorch:2.5.0-cuda12.4-cudnn9-runtime image used here is CUDA-specific, and for AMD ROCm it would need to be a ROCm-enabled PyTorch image (see the sketch below).
- Yeah, that is correct. We need to find an image that adds support for ROCm.
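As a rough, hypothetical sketch of the accelerator-specific variant discussed above (the torch-distributed-rocm name and the ROCm image placeholder are assumptions for illustration, not part of this PR):

```yaml
apiVersion: kubeflow.org/v2alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed-rocm      # hypothetical accelerator-specific name
  labels:
    training.kubeflow.org/phase: pre-training
spec:
  mlPolicy:
    numNodes: 1
    torch:
      numProcPerNode: auto
  template:
    spec:
      replicatedJobs:
        - name: trainer-node
          template:
            spec:
              template:
                spec:
                  containers:
                    - name: trainer
                      # Placeholder: a real runtime would pin a specific
                      # ROCm-enabled PyTorch image tag instead of the CUDA one.
                      image: rocm/pytorch:<rocm-pytorch-tag>
```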
Continuing the torch-distributed manifest, the runtime spec:

```yaml
spec:
  mlPolicy:
    numNodes: 1
    torch:
      numProcPerNode: auto
  template:
    spec:
      replicatedJobs:
        - name: trainer-node
          template:
            spec:
              template:
                spec:
                  containers:
                    - name: trainer
                      image: pytorch/pytorch:2.5.0-cuda12.4-cudnn9-runtime
```

Review discussion on choosing the PyTorch version:

- I was thinking: how will the user configure the PyTorch version of their choice?
- Data Scientists can always override the base image, or Platform Engineers can configure a runtime with the appropriate torch version. Do you see other use-cases? (A sketch of an image override follows below.)
- For a production team, I think you are mostly right. But in a research department, each researcher often uses a different PyTorch version, IMO. That discussion is a separate one, though, so I think we can merge this PR in any case.
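To illustrate the image-override path mentioned above, here is a minimal sketch of a TrainJob that references this runtime and swaps the trainer image. It assumes the KEP-2170 TrainJob API with runtimeRef and trainer.image fields; the job name and image tag are illustrative, and the exact field names may differ.

```yaml
apiVersion: kubeflow.org/v2alpha1
kind: TrainJob
metadata:
  name: torch-example            # illustrative name
spec:
  runtimeRef:
    name: torch-distributed      # the ClusterTrainingRuntime defined in this PR
  trainer:
    # Overrides the runtime's default PyTorch image for this job;
    # the tag below is only an example.
    image: pytorch/pytorch:2.4.0-cuda12.4-cudnn9-runtime
```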
Continuing the manifest, the trainer container's command prints the default runtime environment:

```yaml
# trainer container (continued)
command:
  - /bin/bash
  - -c
  - |
    echo "Torch Distributed Runtime"

    echo "--------------------------------------"
    echo "Torch Default Runtime Env"
    env | grep PET_

    pip list
```
A new overlay kustomization that installs the controller manifests (and JobSet) into the kubeflow-system namespace:

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- namespace.yaml
- ../../base/crds
- ../../base/manager
- ../../base/rbac
- ../../base/webhook
# TODO (andreyvelich): JobSet should support kubeflow-system namespace.
- https://github.com/kubernetes-sigs/jobset/releases/download/v0.6.0/manifests.yaml
images:
- name: kubeflow/training-operator-v2
  newTag: latest
secretGenerator:
- name: training-operator-v2-webhook-cert
  namespace: kubeflow-system
  options:
    disableNameSuffixHash: true
```

Review comment on lines 9-10 (the JobSet manifests reference): Let me tie the root problem here: kubernetes-sigs/jobset#713.
The namespace manifest (new file):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: kubeflow-system
```
A new overlay kustomization that enables the pre-training runtimes:

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../base/runtimes/pre-training
```
Review discussion on the runtime folder layout:

- Should we put this under a PyTorch folder so that we can group multiple runtime files?
- Should we create that subfolder once we support more runtimes? I would like us to have a discussion on which pre-training runtimes we want to support upstream.
- Sure, we can add it later. (A possible future layout is sketched below.)
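Purely for illustration, a per-framework grouping as discussed above might eventually look like the sketch below; the torch subdirectory and its contents are hypothetical and not part of this PR.

```yaml
# Hypothetical future layout (not proposed in this PR):
# the pre-training kustomization could reference one subdirectory per
# framework, each carrying its own kustomization.yaml.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- torch   # would contain torch-distributed.yaml and its kustomization.yaml
```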