KEP-2170: Add Torch Distributed Runtime #2328

Merged
9 changes: 0 additions & 9 deletions manifests/v2/base/kustomization.yaml

This file was deleted.

2 changes: 2 additions & 0 deletions manifests/v2/base/manager/kustomization.yaml
@@ -1,2 +1,4 @@
resources:
- manager.yaml
# TODO (andreyvelich): Move it to overlays once we copy the JobSet manifests.
namespace: kubeflow-system
2 changes: 2 additions & 0 deletions manifests/v2/base/rbac/kustomization.yaml
@@ -2,3 +2,5 @@ resources:
- role.yaml
- role_binding.yaml
- service_account.yaml
# TODO (andreyvelich): Move it to overlays once we copy the JobSet manifests.
namespace: kubeflow-system
4 changes: 4 additions & 0 deletions manifests/v2/base/runtimes/pre-training/kustomization.yaml
@@ -0,0 +1,4 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- torch-distributed.yaml
33 changes: 33 additions & 0 deletions manifests/v2/base/runtimes/pre-training/torch-distributed.yaml
@@ -0,0 +1,33 @@
apiVersion: kubeflow.org/v2alpha1
Contributor:

Should we put this under a PyTorch folder so that we can group multiple runtime files?

Member Author:

Should we create this subfolder once we support more runtimes?
I would like us to have a discussion about which pre-training runtimes we want to support upstream.

Contributor:

Sure, we can add it later.

kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed
Contributor:

It would be useful to anticipate support for other accelerators like AMD/ROCm, so naming this one something like torch-cuda-distributed would be more descriptive.

Member Author:

I think it is a good idea.
However, we are still discussing with @tenzen-y how we should propagate the accelerator type to the runtime labels, given that it overlaps with Kueue's ResourceFlavor concept.
As you can see, by default we don't attach any resources to the Trainer container in this pre-training runtime.

@astefanutti Could you create a separate issue so we can track it?
E.g. support for AMD accelerators in the Torch Distributed Training Runtime.

Contributor:

@andreyvelich thanks for the feedback; I've just created #2335.

In addition to the resources, the container image set for the runtime is also coupled to the accelerator.

For instance, pytorch/pytorch:2.5.0-cuda12.4-cudnn9-runtime makes that training runtime specific to NVIDIA CUDA.

For AMD ROCm, that would need to be something like rocm/pytorch:rocm6.2.3_ubuntu22.04_py3.10_pytorch_release_2.3.0.

Member Author:

Yeah, that is correct. We need to find an image that adds ROCm support.
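
For illustration only, not part of this PR: a ROCm variant of this runtime might look like the sketch below. The rocm/pytorch image tag comes from the discussion above; the torch-rocm-distributed name is hypothetical, and the remaining fields simply mirror the torch-distributed manifest in this diff.

apiVersion: kubeflow.org/v2alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-rocm-distributed  # hypothetical name
  labels:
    training.kubeflow.org/phase: pre-training
spec:
  mlPolicy:
    numNodes: 1
    torch:
      numProcPerNode: auto
  template:
    spec:
      replicatedJobs:
        - name: trainer-node
          template:
            spec:
              template:
                spec:
                  containers:
                    - name: trainer
                      # Image suggested in the thread above; couples this runtime to AMD ROCm.
                      image: rocm/pytorch:rocm6.2.3_ubuntu22.04_py3.10_pytorch_release_2.3.0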

  labels:
    training.kubeflow.org/phase: pre-training
spec:
  mlPolicy:
    numNodes: 1
    torch:
      numProcPerNode: auto
  template:
    spec:
      replicatedJobs:
        - name: trainer-node
          template:
            spec:
              template:
                spec:
                  containers:
                    - name: trainer
                      image: pytorch/pytorch:2.5.0-cuda12.4-cudnn9-runtime
Contributor:

I was wondering: how will the user configure the PyTorch version of their choice?

Member Author:

Data Scientists can always override the base image, or Platform Engineers can configure the runtime with the appropriate Torch version.
From my past experience and discussions, ML teams usually try to stick with a single PyTorch version for all of their ML development.

Do you see other use cases?

Member:

> From my past experience and discussions, ML teams usually try to stick with a single PyTorch version for all of their ML development.

For production teams, I think you are mostly right. But in a research department, each researcher often uses a different PyTorch version, IMO.

That discussion is separate from this PR, though, so I think we can merge it in any case.
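
As a side note for readers, a minimal sketch of the image override discussed here, assuming the v2alpha1 TrainJob API from KEP-2170 (runtimeRef plus trainer.image); the job name and the 2.4.1 tag are hypothetical:

apiVersion: kubeflow.org/v2alpha1
kind: TrainJob
metadata:
  name: my-torch-job  # hypothetical name
spec:
  runtimeRef:
    name: torch-distributed
  trainer:
    # Overrides the runtime's default pytorch/pytorch:2.5.0-cuda12.4-cudnn9-runtime image.
    image: pytorch/pytorch:2.4.1-cuda12.4-cudnn9-runtime  # hypothetical tag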

                      command:
                        - /bin/bash
                        - -c
                        - |
                          echo "Torch Distributed Runtime"

                          echo "--------------------------------------"
                          echo "Torch Default Runtime Env"
                          env | grep PET_

                          pip list
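
A note on the env | grep PET_ line above: torchrun reads environment variables with the PET_ (PyTorch Elastic Training) prefix as defaults for its CLI flags, which is presumably how the mlPolicy settings reach the container. A rough sketch of the equivalence, with illustrative values and a hypothetical train.py:

# Values the controller would derive from mlPolicy (illustrative):
#   PET_NNODES=1
#   PET_NPROC_PER_NODE=auto
# torchrun then behaves as if invoked with:
torchrun --nnodes=1 --nproc-per-node=auto train.py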
2 changes: 2 additions & 0 deletions manifests/v2/base/webhook/kustomization.yaml
@@ -10,3 +10,5 @@ patches:
kind: ValidatingWebhookConfiguration
configurations:
- kustomizeconfig.yaml
# TODO (andreyvelich): Move it to overlays once we copy the JobSet manifests.
namespace: kubeflow-system
18 changes: 18 additions & 0 deletions manifests/v2/overlays/only-manager/kustomization.yaml
@@ -0,0 +1,18 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- namespace.yaml
- ../../base/crds
- ../../base/manager
- ../../base/rbac
- ../../base/webhook
# TODO (andreyvelich): JobSet should support kubeflow-system namespace.
- https://github.com/kubernetes-sigs/jobset/releases/download/v0.6.0/manifests.yaml
Member (commenting on lines +9 to +10):

Let me tie this to the root problem: kubernetes-sigs/jobset#713

images:
- name: kubeflow/training-operator-v2
  newTag: latest
secretGenerator:
- name: training-operator-v2-webhook-cert
  namespace: kubeflow-system
  options:
    disableNameSuffixHash: true
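
For reference, and as an assumption about intended usage rather than anything stated in this PR, the overlay would typically be applied with kustomize via kubectl:

# --server-side avoids the client-side annotation size limit on large CRDs,
# such as the ones shipped with JobSet.
kubectl apply --server-side -k manifests/v2/overlays/only-manager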
4 changes: 4 additions & 0 deletions manifests/v2/overlays/only-manager/namespace.yaml
@@ -0,0 +1,4 @@
apiVersion: v1
kind: Namespace
metadata:
  name: kubeflow-system
4 changes: 4 additions & 0 deletions manifests/v2/overlays/only-runtimes/kustomization.yaml
@@ -0,0 +1,4 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../base/runtimes/pre-training
6 changes: 5 additions & 1 deletion manifests/v2/overlays/standalone/kustomization.yaml
@@ -2,7 +2,11 @@ apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- namespace.yaml
- ../../base
- ../../base/crds
- ../../base/manager
- ../../base/rbac
- ../../base/webhook
- ../../base/runtimes/pre-training
# TODO (andreyvelich): JobSet should support kubeflow-system namespace.
- https://github.com/kubernetes-sigs/jobset/releases/download/v0.6.0/manifests.yaml
images: