Support Stateful JobSet #572

Open
Tracked by #2170
tenzen-y opened this issue May 14, 2024 · 9 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@tenzen-y
Member

What would you like to be added:
I would like JobSet to support creating a single PVC and mounting the resulting PV into selected replicatedJobs, like this:

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: volume-sample
spec:
  volumePolicy:
    replicatedJobs:
    - name: A
    - name: B
    volumeClaimTemplates:
    - metadata:
        name: pretrained-model
      spec:
        accessModes: [ "ReadWriteMany" ]
        storageClassName: "my-storage-class"
        resources:
          requests:
            storage: 1000Gi
  replicatedJobs:
  - name: workers
[...]

In this example, JobSet creates a PVC named "pretrained-model", and the resulting PV is then mounted into the replicatedJobs listed in .spec.volumePolicy.replicatedJobs.

This feature is similar to kubernetes/kubernetes#115066

Why is this needed:
In large distributed training, we often store the base model once and want to share that pre-trained model with all workers, so that we avoid downloading it many times.

@tenzen-y
Member Author

cc: @ahg-g @andreyvelich

@tenzen-y
Member Author

/kind feature

@k8s-ci-robot added the kind/feature label May 14, 2024
@danielvegamyhre
Contributor

Is this shared volume containing the pretrained model not possible to do using a regular PV, and specifying a PVC in the JobTemplate? Or is the purpose of this just to provide some automation and lifecycle management for the PV so the user doesn't need to specify the PV manifest separately?
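
For reference, the manual approach this question describes might look roughly like the sketch below: the user creates the PVC separately and each pod template references it by claim name. The PVC name, storage class, and size are taken from the example in the issue; the container name, image, and mount path are hypothetical placeholders, and the storage class is assumed to support ReadWriteMany.

```yaml
# Manually managed PVC: created and deleted by the user, independently of the JobSet.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pretrained-model
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: my-storage-class
  resources:
    requests:
      storage: 1000Gi
---
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: volume-sample
spec:
  replicatedJobs:
  - name: workers
    replicas: 1
    template:              # Job template
      spec:
        template:          # Pod template
          spec:
            restartPolicy: Never
            containers:
            - name: trainer                                  # hypothetical container name
              image: registry.example.com/trainer:latest     # hypothetical image
              volumeMounts:
              - name: pretrained-model
                mountPath: /models                           # hypothetical mount path
            volumes:
            - name: pretrained-model
              persistentVolumeClaim:
                claimName: pretrained-model                  # must match the PVC above
```

With this approach the volumes and volumeMounts stanzas have to be repeated in every replicatedJob that needs the model, and the PVC outlives the JobSet unless the user deletes it.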

@andreyvelich

> Or is the purpose of this just to provide some automation and lifecycle management for the PV so the user doesn't need to specify the PV manifest separately?

That's right, we want to manage the lifecycle of the storage on the controller side, not on the client side.
This should make it simpler to fine-tune models where every worker shares the pre-trained model and dataset that we download in the master pod.

@ahg-g
Contributor

ahg-g commented May 21, 2024

I like this feature; we should track it under the 0.7 release.

@danielvegamyhre
Contributor

@tenzen-y I'm tentatively labeling the issue to be marked as part of the v0.7.0 release, under the assumption you plan to work on this - let me know if that's not the case.

@andreyvelich

@danielvegamyhre I can try to take this feature sometime between mid-Q4 and Q1.
Do we know when you are planning to cut v0.7.0 for JobSet?

@danielvegamyhre
Contributor

> @danielvegamyhre I can try to take this feature sometime between mid-Q4 and Q1. Do we know when you are planning to cut v0.7.0 for JobSet?

Sounds good. We have a release cycle of roughly 3 months, with 0.6 planned to be released any day now. So 0.7 will be around October 1st, and we can plan 0.8 for around January 1st.

For now I've removed the 0.7 label from this issue, and we can tentatively plan on including it in 0.8. I'll follow up on this once we get closer to that time of year.

@tenzen-y
Member Author

Sorry for the delay.
Actually, I'm okay with Andrey taking this issue.

Maybe, from the Kubeflow v2 perspective, we need to prioritize the additional JobSet features.
But I guess that is irrelevant to the JobSet community.

5 participants