KEP-2170: Implement runtime framework #2248

Merged

tenzen-y
Member

@tenzen-y tenzen-y commented Sep 4, 2024

What this PR does / why we need it:
Brief Design: https://docs.google.com/presentation/d/1HyEsBa7hxWpIoBXaX6uECiB48FWB85SG1kx15mO8hug/edit#slide=id.g30596bfee76_0_202

I implemented the runtime framework interfaces.
The responsibilities are the following:

  • /runtime.v2/core: This contains the actual Kubeflow Job Pipelines, such as TrainingRuntime (an internal concept, not the CRD).
    These pipelines build objects or create reconcile builders. We will add more pipelines in the future, such as SingleHostTrainingRuntime.

  • /runtime.v2/framework: This contains the Kubeflow Job Pipeline Framework, which provides the following extension points. We will add more extension points in the future.

    • WatchExtensionPlugin
    • EnforcePodGroupPolicyPlugin
    • EnforceMLPolicyPlugin
    • CustomValidationPlugin
    • ComponentBuilderPlugin
  • /runtime.v2/framework/plugins: This contains the Kubeflow Job Pipeline Framework plugins, which implement the Framework extension points. Each plugin is executed at the extension points it implements.

    • coscheduling
    • jobset (Under development)
    • mpi
    • plainml
    • torch

Additionally, I did not implement all of the plugins. I will open an issue and delegate the remaining plugin implementations to contributors who are interested in this project.
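
The split between framework, extension points, and plugins described above can be sketched roughly as follows. This is a minimal illustration only: the `Info` fields, the method signatures, the `torchSketch` plugin, and the `example.com/framework` label are all assumptions made for the example, not the interfaces added by this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// Info is a hypothetical stand-in for the runtime information passed between plugins.
type Info struct {
	NumNodes int
	Labels   map[string]string
}

// Plugin is the base interface every plugin implements.
type Plugin interface {
	Name() string
}

// The two interfaces below mirror two of the extension points named above.
// Their method signatures are assumptions for illustration.
type EnforceMLPolicyPlugin interface {
	Plugin
	EnforceMLPolicy(info *Info) error
}

type CustomValidationPlugin interface {
	Plugin
	Validate(info *Info) error
}

// Framework groups registered plugins by the extension points they implement.
type Framework struct {
	enforceMLPlugins  []EnforceMLPolicyPlugin
	validationPlugins []CustomValidationPlugin
}

// New registers each plugin under every extension point it satisfies.
func New(plugins ...Plugin) *Framework {
	f := &Framework{}
	for _, p := range plugins {
		if e, ok := p.(EnforceMLPolicyPlugin); ok {
			f.enforceMLPlugins = append(f.enforceMLPlugins, e)
		}
		if v, ok := p.(CustomValidationPlugin); ok {
			f.validationPlugins = append(f.validationPlugins, v)
		}
	}
	return f
}

// RunEnforceMLPolicyPlugins calls the plugins in registration order and stops at the first error.
func (f *Framework) RunEnforceMLPolicyPlugins(info *Info) error {
	for _, p := range f.enforceMLPlugins {
		if err := p.EnforceMLPolicy(info); err != nil {
			return fmt.Errorf("plugin %q: %w", p.Name(), err)
		}
	}
	return nil
}

// RunCustomValidationPlugins runs every validation plugin and joins their errors.
func (f *Framework) RunCustomValidationPlugins(info *Info) error {
	var errs []error
	for _, p := range f.validationPlugins {
		if err := p.Validate(info); err != nil {
			errs = append(errs, fmt.Errorf("plugin %q: %w", p.Name(), err))
		}
	}
	return errors.Join(errs...)
}

// torchSketch is a toy plugin that implements both extension points.
type torchSketch struct{}

func (torchSketch) Name() string { return "torch-sketch" }

func (torchSketch) EnforceMLPolicy(info *Info) error {
	if info.Labels == nil {
		info.Labels = map[string]string{}
	}
	info.Labels["example.com/framework"] = "torch"
	return nil
}

func (torchSketch) Validate(info *Info) error {
	if info.NumNodes < 1 {
		return errors.New("numNodes must be at least 1")
	}
	return nil
}

func main() {
	f := New(torchSketch{})
	info := &Info{NumNodes: 2}
	if err := f.RunEnforceMLPolicyPlugins(info); err != nil {
		fmt.Println("enforce failed:", err)
		return
	}
	fmt.Println(info.Labels, f.RunCustomValidationPlugins(info))
}
```

In this sketch a plugin opts into an extension point simply by implementing the matching interface, so adding a new extension point would not require touching existing plugins.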

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #
Part-of #2290

Checklist:

  • Docs included if any changes are user facing

@tenzen-y tenzen-y force-pushed the second-implementation-for-traininig-v2 branch 6 times, most recently from 92b1dd1 to 4195338 Compare September 6, 2024 17:49
@coveralls

coveralls commented Sep 6, 2024

Pull Request Test Coverage Report for Build 11372731768

Details

  • 9 of 9 (100.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals Coverage Status
Change from base Build 11330381194: 0.0%
Covered Lines: 73
Relevant Lines: 73

💛 - Coveralls

@tenzen-y tenzen-y force-pushed the second-implementation-for-traininig-v2 branch 18 times, most recently from d220851 to caa8564 Compare September 10, 2024 17:40
sigs.k8s.io/controller-runtime v0.17.3
sigs.k8s.io/jobset v0.5.2
sigs.k8s.io/kueue v0.6.3

Member

Why do we need the Kueue dependency?

Member Author

This dependency comes from the following line:

PodRequests: kueuelr.TotalRequests(&spec.podSpec),

This allows us to set the appropriate required resources for the PodGroup. If we remove this dependency, we need to copy Kueue's "TotalRequests" function here. I believe that just copying and pasting is not the ideal approach.

Member

It would be great to reduce the dependency. Maybe copy and paste is okay in this case as long as we provide a reference to the original source

Member Author

That is not a simple implementation. It is quite complex and spans multiple files and many lines of code.
So, I would propose keeping the dependency here and then (after kube 1.32) switching to the kube library, as I mentioned in #2280.

Member

Yes, we discussed this offline with @tenzen-y: after k/k separates this utility function, we will remove the dependency on Kueue.
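
To give a feel for what a "total requests" helper has to compute, here is a heavily simplified sketch of the usual Kubernetes rule for a pod's effective requests: per resource, the larger of the summed regular containers and the biggest init container, plus pod overhead. The types and the function name are hypothetical, quantities are plain integers, and the sketch ignores restartable init containers, LimitRange defaults, and RuntimeClass handling, which is part of why the real implementation is much larger.

```go
package main

import "fmt"

// ResourceList maps a resource name (for example "cpu" in millicores) to a quantity.
// Real code would use corev1.ResourceList; plain integers keep the sketch dependency-free.
type ResourceList map[string]int64

// PodSpecSketch is a hypothetical, trimmed-down stand-in for corev1.PodSpec.
type PodSpecSketch struct {
	InitContainers []ResourceList // requests of each init container
	Containers     []ResourceList // requests of each regular container
	Overhead       ResourceList   // pod overhead, if any
}

// totalRequests returns the effective requests of one pod:
// max(sum of containers, largest init container) + overhead, per resource.
func totalRequests(spec PodSpecSketch) ResourceList {
	total := ResourceList{}
	// Regular containers run concurrently, so their requests add up.
	for _, c := range spec.Containers {
		for name, q := range c {
			total[name] += q
		}
	}
	// Init containers run one at a time, so only the largest one matters.
	for _, c := range spec.InitContainers {
		for name, q := range c {
			if q > total[name] {
				total[name] = q
			}
		}
	}
	// Overhead is added on top of whichever value won above.
	for name, q := range spec.Overhead {
		total[name] += q
	}
	return total
}

func main() {
	spec := PodSpecSketch{
		InitContainers: []ResourceList{{"cpu": 1000, "memory": 128}},
		Containers:     []ResourceList{{"cpu": 500, "memory": 256}, {"cpu": 250, "memory": 256}},
		Overhead:       ResourceList{"cpu": 100},
	}
	fmt.Println(totalRequests(spec))
}
```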

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
@tenzen-y tenzen-y force-pushed the second-implementation-for-traininig-v2 branch from fd7aab4 to 2aaae2b Compare October 14, 2024 23:46
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
…indexes for the TrainJobs

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Comment on lines +106 to +109
options := defaultOptions
for _, opt := range opts {
	opt(&options)
}

Member

Do we need default options for the Info object?

Member Author

Suggested change
-options := defaultOptions
-for _, opt := range opts {
-	opt(&options)
-}
+options := InfoOptions{}
+for _, opt := range opts {
+	opt(&options)
+}

Do you recommend this?

Member

Maybe. I am trying to understand how you are planning to use InfoOptions in other parts.
@tenzen-y What are the differences between Info{} and InfoOptions{}?

Member Author

InfoOptions is the object used to set up the Info object.
This approach allows us to dynamically specify the parameters for the Info.

If we get rid of InfoOptions, we need to specify all parameters every time or pass the Info object directly.

Member Author

This is a well-known approach for avoiding functions like the following:

// Even if some parameters are not used, all of them must be specified.
func Foo(paramA string, paramB int, paramC int32, paramD bool, paramE int64)
// After introducing `infoOptions`.
func Foo(params ...InfoOption)

Member

I see. Then maybe let's name it InfoOption instead of defaultOptions to make it clearer, since we are not going to have default values for the Info object.

Member Author

> we are not going to have default values for Info object.

Actually, this is the default InfoOptions value. Here, the default is simply an empty struct.
So the line below initializes options with the default values (currently the default has only empty fields):

options := defaultOptions

Member Author

@tenzen-y tenzen-y Oct 16, 2024

In the future, we need to consider whether we should add default parameters to InfoOptions, for example common labels and annotations.

Member

I see, that makes sense.
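
The functional-options pattern discussed in this thread can be sketched as below. The names `InfoOptions`, `InfoOption`, and `defaultOptions` come from the thread; the fields, the `WithLabels` and `WithAnnotations` helpers, and the `NewInfo` constructor are assumptions made for the example and may not match the merged code.

```go
package main

import "fmt"

// InfoOptions collects the optional settings used to build an Info.
// The fields here are illustrative assumptions, not the PR's actual fields.
type InfoOptions struct {
	labels      map[string]string
	annotations map[string]string
}

// InfoOption mutates InfoOptions; callers pass only the ones they need.
type InfoOption func(*InfoOptions)

// defaultOptions is the starting point for every call. It is empty today,
// but common defaults could be added here later without touching callers.
var defaultOptions = InfoOptions{}

// WithLabels is a hypothetical option that sets the labels.
func WithLabels(l map[string]string) InfoOption {
	return func(o *InfoOptions) { o.labels = l }
}

// WithAnnotations is a hypothetical option that sets the annotations.
func WithAnnotations(a map[string]string) InfoOption {
	return func(o *InfoOptions) { o.annotations = a }
}

// Info is a trimmed-down stand-in for the runtime Info object.
type Info struct {
	Labels      map[string]string
	Annotations map[string]string
}

// NewInfo copies the defaults, applies each option in order, and builds the Info.
func NewInfo(opts ...InfoOption) *Info {
	options := defaultOptions // struct copy, so the package-level default is not modified
	for _, opt := range opts {
		opt(&options)
	}
	return &Info{Labels: options.labels, Annotations: options.annotations}
}

func main() {
	// No options: the (currently empty) defaults apply.
	fmt.Println(NewInfo())
	// Only the parameter the caller cares about is specified.
	fmt.Println(NewInfo(WithLabels(map[string]string{"team": "ml"})))
}
```

Starting from `defaultOptions` rather than `InfoOptions{}` behaves identically while the default is empty; the difference only shows up once defaults such as common labels are introduced.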

Comment on lines +48 to +53
for rName := range info.TotalRequests {
	info.TotalRequests[rName] = runtime.TotalResourceRequest{
		Replicas:    numNodes,
		PodRequests: info.TotalRequests[rName].PodRequests,
	}
}

Member

How are we using this while enforcing the MLPolicy ?

Member Author

@tenzen-y tenzen-y Oct 16, 2024

~~ if info == nil || info.MLPolicy != nil { ~~

I wanted to implement this in line 43. Maybe I failed to rebase.
Let me fix this.

Member Author

Never mind the above comments.

Member Author

spec:
  mlPolicy:
    numNodes: 1

We can imagine this situation.

Member

I see, so we override the value that we set here, right?

for _, spec := range options.podSpecReplicas {
	info.TotalRequests[spec.name] = TotalResourceRequest{
		Replicas: spec.replicas,
		// TODO: Need to address LimitRange and RuntimeClass.
		PodRequests: kueuelr.TotalRequests(&spec.podSpec),
	}
}

Member Author

Yes, the plainml plugin code updates it with the proper value.
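
The override described in this thread can be sketched end to end as follows. `TotalResourceRequest`, `Replicas`, `PodRequests`, and `TotalRequests` come from the snippets above; `replicaSpec`, `newInfo`, and `enforceNumNodes` are hypothetical names, and `PodRequests` is simplified to a plain map.

```go
package main

import "fmt"

// TotalResourceRequest mirrors the shape quoted in the review thread;
// PodRequests is simplified to a plain map for this sketch.
type TotalResourceRequest struct {
	Replicas    int32
	PodRequests map[string]int64
}

// Info is a trimmed-down, hypothetical version of the runtime Info object.
type Info struct {
	TotalRequests map[string]TotalResourceRequest
}

// replicaSpec is a hypothetical input describing one replicated job in the runtime template.
type replicaSpec struct {
	name        string
	replicas    int32
	podRequests map[string]int64
}

// newInfo seeds TotalRequests from the replica counts in the runtime template.
func newInfo(specs []replicaSpec) *Info {
	info := &Info{TotalRequests: make(map[string]TotalResourceRequest, len(specs))}
	for _, s := range specs {
		info.TotalRequests[s.name] = TotalResourceRequest{
			Replicas:    s.replicas,
			PodRequests: s.podRequests,
		}
	}
	return info
}

// enforceNumNodes overrides Replicas with numNodes when it is set,
// keeping the per-pod requests computed earlier.
func enforceNumNodes(info *Info, numNodes *int32) {
	if info == nil || numNodes == nil {
		return
	}
	for name, req := range info.TotalRequests {
		req.Replicas = *numNodes
		info.TotalRequests[name] = req
	}
}

func main() {
	info := newInfo([]replicaSpec{
		{name: "node", replicas: 4, podRequests: map[string]int64{"cpu": 2000}},
	})
	one := int32(1) // corresponds to spec.mlPolicy.numNodes: 1
	enforceNumNodes(info, &one)
	fmt.Println(info.TotalRequests["node"].Replicas)
}
```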

)

var (
TrainingRuntimeContainerRuntimeClassKey = ".trainingRuntimeSpec.jobSetTemplateSpec.replicatedJobs.podTemplateSpec.runtimeClassName"

Member

Should the indexer key match the Golang struct field name or the JSON name?
E.g. trainingRuntimeSpec is named .spec:

Spec TrainingRuntimeSpec `json:"spec,omitempty"`

Member Author

We can specify an arbitrary key name, but the key is global within the training-operator.
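
The point about index keys can be illustrated with a toy registry. This is not controller-runtime's API: `indexer`, `IndexField`, `Add`, `Lookup`, and `runtimeSketch` are hypothetical names for a dependency-free sketch. It only shows that the key is an opaque string chosen by the operator, and that one shared namespace of keys makes uniqueness matter more than matching the Go or JSON field names.

```go
package main

import "fmt"

// runtimeSketch is a hypothetical stand-in for a TrainingRuntime object.
type runtimeSketch struct {
	Name              string
	RuntimeClassNames []string
}

// extractFunc returns the values an object should be indexed under.
type extractFunc func(runtimeSketch) []string

// indexer is a toy registry: one extractor per key, keys shared by the whole process.
type indexer struct {
	extractors map[string]extractFunc
	entries    map[string]map[string][]string // key -> value -> object names
}

func newIndexer() *indexer {
	return &indexer{
		extractors: map[string]extractFunc{},
		entries:    map[string]map[string][]string{},
	}
}

// IndexField registers an extractor under an arbitrary key and rejects duplicates.
func (ix *indexer) IndexField(key string, fn extractFunc) error {
	if _, exists := ix.extractors[key]; exists {
		return fmt.Errorf("index key %q is already registered", key)
	}
	ix.extractors[key] = fn
	ix.entries[key] = map[string][]string{}
	return nil
}

// Add indexes the object under every registered key.
func (ix *indexer) Add(obj runtimeSketch) {
	for key, fn := range ix.extractors {
		for _, v := range fn(obj) {
			ix.entries[key][v] = append(ix.entries[key][v], obj.Name)
		}
	}
}

// Lookup returns the names of objects indexed under key=value.
func (ix *indexer) Lookup(key, value string) []string {
	return ix.entries[key][value]
}

// The key string is taken from the diff above; any other unique string would work equally well.
const runtimeClassKey = ".trainingRuntimeSpec.jobSetTemplateSpec.replicatedJobs.podTemplateSpec.runtimeClassName"

func main() {
	ix := newIndexer()
	byClass := func(r runtimeSketch) []string { return r.RuntimeClassNames }
	if err := ix.IndexField(runtimeClassKey, byClass); err != nil {
		fmt.Println(err)
		return
	}
	ix.Add(runtimeSketch{Name: "torch-distributed", RuntimeClassNames: []string{"nvidia"}})
	fmt.Println(ix.Lookup(runtimeClassKey, "nvidia"))
	// Registering the same key a second time fails in this sketch.
	fmt.Println(ix.IndexField(runtimeClassKey, byClass))
}
```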

@andreyvelich
Member

We should be ready to merge this.
Thank you for this great work @tenzen-y!
/lgtm
/assign @terrytangyuan @johnugeorge @kannon92

@andreyvelich: GitHub didn't allow me to assign the following users: kannon92.

Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide


Contributor

@kannon92 kannon92 left a comment

I don't really have context to review this at the moment.

I'll leave that to kubeflow members.

@tenzen-y
Member Author

> We should be ready to merge this. Thank you for this great work @tenzen-y! /lgtm /assign @terrytangyuan @johnugeorge @kannon92

Thanks for the review!

@andreyvelich
Member

/approve

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

@johnugeorge
Member

/hold cancel

@google-oss-prow google-oss-prow bot merged commit 62a058d into kubeflow:master Oct 17, 2024
40 checks passed
@tenzen-y tenzen-y deleted the second-implementation-for-traininig-v2 branch October 17, 2024 18:35
saileshd1402 pushed a commit to saileshd1402/training-operator that referenced this pull request Dec 2, 2024
* KEP-2170: Implement runtime framework interfaces

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Remove grep dependency

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* KEP-2170: Implement ValidateObjects interface to the runtime framework

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* KEP-2170: Expose the TrainingRuntime and ClusterTrainingRuntime Kind

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* KEP-2170: Remove unneeded scheme field from the internal TrainingRuntime

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Rephrase the error message

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Distinguish TrainingRuntime and ClusterTrainingRuntime when creating indexes for the TrainJobs

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Propagate the TrainJob labels and annotations to the JobSet

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Remove PodAnnotations from the runtime info

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Implement TrainingRuntime ReplicatedJob validation

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Add TODO comments

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Replace queueSuspendedTrainJob with queueSuspendedTrainJobs

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

---------

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>
google-oss-prow bot pushed a commit that referenced this pull request Dec 9, 2024
* Added test for create-pytorchjob.ipynb

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* fix yaml syntax

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* Fix uses path

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* Add actions/checkout

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* Add bash to action.yaml

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* Install pip dependencies step

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* Add quotes for args

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* Add jupyter

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* Add nbformat_minor: 5 to fix invalid format error

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* Fix job name

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* test papermill-args-yaml

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* testing multi line args

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* testing multi line args1

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* testing multi line args2

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* testing multi line args3

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* Parameterize sdk install

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* Remove unnecessary output

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* nbformat normailze

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* [SDK] Training Client Conditions related unit tests (#2253)

* test: add unit test for get_job_conditions function of training client

Signed-off-by: Bobbins228 <mcampbel@redhat.com>

* test: add unit test for is_job_created function of training client

Signed-off-by: Bobbins228 <mcampbel@redhat.com>

* test: add unit test for is_job_running function of training client

Signed-off-by: Bobbins228 <mcampbel@redhat.com>

* test: add unit test for is_job_restarting function of training client

Signed-off-by: Bobbins228 <mcampbel@redhat.com>

* test: add unit test for is_job_failed function of training client

Signed-off-by: Bobbins228 <mcampbel@redhat.com>

* test: add unit test for is_job_succeded function of training client

Signed-off-by: Bobbins228 <mcampbel@redhat.com>

* test: improve job condition unit tests efficiency

Signed-off-by: Bobbins228 <mcampbel@redhat.com>

---------

Signed-off-by: Bobbins228 <mcampbel@redhat.com>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* [SDK] test: add unit test for list_jobs method of the training_client (#2267)

Signed-off-by: wei-chenglai <qazwsx0939059006@gmail.com>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* KEP-2170: Generate clientset, openapi spec for the V2 APIs (#2273)

Generate clientset, informers, listers and open api spec
for v2alpha1 APIs.

Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* [SDK] Use torchrun to create PyTorchJob from function (#2276)

* [SDK] Use torchrun to create PyTorchJob from function

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update PyTorchJob SDK example

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add consts for entrypoint

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add check for num procs per worker

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* [SDK] test: add unit test for get_job_logs method of the training_client (#2275)

Signed-off-by: wei-chenglai <qazwsx0939059006@gmail.com>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* [v2alpha] Move GV related codebase (#2281)

Move GV related codebase in v2alpha

Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* KEP-2170: Implement runtime framework (#2248)

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* Add DeepSpeed Example with Pytorch Operator (#2235)

Signed-off-by: Syulin7 <735122171@qq.com>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API (#2283)

* KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Rename RuntimeRef in runtime framework

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* KEP-2170: Adding CEL validations on v2 TrainJob CRD (#2260)

Signed-off-by: Akshay Chitneni <achitneni@apple.com>
Co-authored-by: Akshay Chitneni <achitneni@apple.com>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* Upgrade Deepspeed demo dependencies (#2294)

Signed-off-by: Syulin7 <735122171@qq.com>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* KEP-2170: Add manifests for Kubeflow Training V2 (#2289)

* KEP-2170: Add manifests for Kubeflow Training V2

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix invalid name for webhook config in cert

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix integration tests

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Move kubebuilder markers to runtime framework

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Use Kubernetes recommended labels

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* FSDP Example for T5 Fine-Tuning and PyTorchJob (#2286)

* FSDP Example with PyTorchJob and T5 Fine-Tuning

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Modify text

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* KEP-2170: Implement TrainJob Reconciler to manage objects (#2295)

* KEP-2170: Implement TrainJob Reconciler to manage objects

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Mode dep-crds to manifests/external-crds

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Rename run with runtime

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

---------

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* Remove Prometheus Monitoring doc (#2301)

Signed-off-by: Sophie <sophy010017@gmail.com>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* KEP-2170: Decouple JobSet from TrainJob (#2296)

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* KEP-2170: Strictly verify the CRD marker validation and defaulting in the integration testings (#2304)

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* KEP-2170: Initialize runtimes before the manager starts (#2306)

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* KEP-2170: Generate Python SDK for Kubeflow Training V2 (#2310)

* Generate SDK models for the Training V2 APIs

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Create pyproject.toml config

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Remove comments

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix pre-commit

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* KEP-2170: Create model and dataset initializers (#2303)

* KEP-2170: Create model and dataset initializers

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add abstract classes

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add storage URI to config

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update .gitignore

Co-authored-by: Kevin Hannon <kehannon@redhat.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix the misspelling for initializer

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add .pt and .pth to ignore_patterns

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Kevin Hannon <kehannon@redhat.com>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* KEP-2170: Implement JobSet, PlainML, and Torch Plugins (#2308)

* KEP-2170: Implement JobSet and PlainML Plugins

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix nil pointer exception for Trainer

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix unit tests in runtime package

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix unit tests

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix integration tests

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix lint

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Implement Torch Plugin

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Use list for the Info envs

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix golang ci

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix Torch plugin

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Use K8s sets
Update error return
Use ptr.Deref() for nil values

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Use client.Object for Build() call

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Remove DeepCopy

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Remove MLPolicy and PodGroupPolicy from the Info object

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Inline error

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Remove SDK jar file

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add integration test for Torch plugin

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add TODO to calculate PodGroup values in unit tests

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Revert the change to add original Runtime Policies to Info

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Create const for the DefaultJobReplicas

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Check if PodLabels is empty

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* KEP-2170: Implement Initializer builders in the JobSet plugin  (#2316)

* KEP-2170: Implement Initializer builder in the JobSet plugin

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update the SDK models

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Remove Info from Initializer builder

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update manifests

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update pkg/constants/constants.go

Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Use var for envs

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Remove check manifests from GitHub actions

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Move consts to JobSet plugin

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* KEP-2170: Add the TrainJob state transition design (#2298)

* KEP-2170: Add the TrainJob state transition design

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Replace actual jobs with TrainJob

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Remove the JobSet conditions propagation and Add expanding runtime framework interfaces for each plugin

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Expand the Creation Failed reasons

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Rename Completed to Complete

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

---------

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* Update tf job examples to tf v2 (#2270)

* mnist with summaries updaetd to TF v2

Signed-off-by: yelias <yossi.elias@nokia.com>

* tf_sample updaetd to TF v2

Signed-off-by: yelias <yossi.elias@nokia.com>

* Add mnist_utils and update dist-mnist

Signed-off-by: yelias <yossi.elias@nokia.com>

* Add mnist_utils and update dist-mnist

Signed-off-by: yelias <yossi.elias@nokia.com>

* Remove old example - estimator-API, this example has been replaced by distribution_strategy

Signed-off-by: yelias <yossi.elias@nokia.com>

* Small fix

Signed-off-by: yelias <yossi.elias@nokia.com>

* Remove unsupported powerPC dockerfiles

Signed-off-by: yelias <yossi.elias@nokia.com>

* Fix typo in copyright

Signed-off-by: yelias <yossi.elias@nokia.com>

---------

Signed-off-by: yelias <yossi.elias@nokia.com>
Co-authored-by: yelias <yossi.elias@nokia.com>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* KEP-2170: Add TrainJob conditions (#2322)

* KEP-2170: Implement TrainJob conditions

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Fix API comments

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Make condition message constants

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Stop connecting condition type and reason in JobSet plugin

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

---------

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* Pin Gloo repository in JAX Dockerfile to a specific commit (#2329)

This commit pins the Gloo repository to a specific commit (43b7acbf) in
the JAX Dockerfile to prevent build failures caused by a recent bug
introduced in the Gloo codebase. By locking the version of Gloo to
a known working commit, we ensure that the JAX build remains stable and
functional until the issue is resolved upstream.

The build failure occurs when compiling the gloo/transport/tcp/buffer.cc
file due to an undefined __NR_gettid constant, which was introduced
after the pinned commit. By using this commit, we bypass the issue and
allow the build to complete successfully.

Signed-off-by: Sandipan Panda <samparksandipan@gmail.com>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* [fix] Resolve v2alpha API exceptions (#2317)

Resolve v2alpha API exceptions by adding necessary listType validations.

Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* Upgrade Kubernetes to v1.30.7 (#2332)

* Upgrade Kubernetes to v1.30.7

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Use typed event handlers and predicates in job controllers

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Re-organize pkg/common/util/reconciler.go

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Update installation instructions in README

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

---------

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* Ignore cache exporting errors in the image building workflows (#2336)

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* KEP-2170: Add Torch Distributed Runtime (#2328)

* KEP-2170: Add Torch Distributed Runtime

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add pip list

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* Refine the server-side apply installation args (#2337)

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* Add openapi-generator CLI option to skip SDK v2 test generation (#2338)

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* Upgrade kustomization files to Kustomize v5 (#2326)

Signed-off-by: oksanabaza <obazylie@redhat.com>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* Pin accelerate package version in trainer (#2340)

* Pin accelerate package version in trainer

Signed-off-by: Gavrish Prabhu <gavrish.prabhu@nutanix.com>

* include new line to pass pre-commit hook

Signed-off-by: Gavrish Prabhu <gavrish.prabhu@nutanix.com>

---------

Signed-off-by: Gavrish Prabhu <gavrish.prabhu@nutanix.com>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* Replace papermill command with bash script

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* Typo fix

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* Move Checkout step outside action.yaml file

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* Add newline EOF in script

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* Pass python dependencies as args and pin versions

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* Update Usage

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* Install dependencies in yaml

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* fix ipynb

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* set bash flags

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* Update script args and add more kubernetes versions for tests

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* add gang-scheduler-name to template

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* move go setup to template

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

* remove -p parameter from script

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>

---------

Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>
Signed-off-by: Bobbins228 <mcampbel@redhat.com>
Signed-off-by: wei-chenglai <qazwsx0939059006@gmail.com>
Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Syulin7 <735122171@qq.com>
Signed-off-by: Akshay Chitneni <achitneni@apple.com>
Signed-off-by: Sophie <sophy010017@gmail.com>
Signed-off-by: yelias <yossi.elias@nokia.com>
Signed-off-by: Sandipan Panda <samparksandipan@gmail.com>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: oksanabaza <obazylie@redhat.com>
Signed-off-by: Gavrish Prabhu <gavrish.prabhu@nutanix.com>
Co-authored-by: Mark Campbell <mcampbel@redhat.com>
Co-authored-by: Wei-Cheng Lai <qazwsx0939059006@gmail.com>
Co-authored-by: Varsha <varshaprasad96@gmail.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Co-authored-by: yu lin <735122171@qq.com>
Co-authored-by: Akshay Chitneni <akshayadatta@gmail.com>
Co-authored-by: Akshay Chitneni <achitneni@apple.com>
Co-authored-by: Sophie Hsu <112261858+sophie0730@users.noreply.github.com>
Co-authored-by: Kevin Hannon <kehannon@redhat.com>
Co-authored-by: YosiElias <73485442+YosiElias@users.noreply.github.com>
Co-authored-by: yelias <yossi.elias@nokia.com>
Co-authored-by: Sandipan Panda <87253083+sandipanpanda@users.noreply.github.com>
Co-authored-by: Antonin Stefanutti <astefanutti@users.noreply.github.com>
Co-authored-by: Oksana Bazylieva <61097730+oksanabaza@users.noreply.github.com>
Co-authored-by: Gavrish Prabhu <gavrish.prabhu@nutanix.com>