Upgrade Kubernetes to v1.30.7 #2332
3db15dd
to
17fc956
Compare
dee7021
to
85e3863
Compare
/ok-to-test
/rerun-all
85e3863
to
0b18aa9
Compare
@@ -74,6 +74,7 @@ jobs:
 - name: Create k8s Kind Cluster
   uses: helm/kind-action@v1.10.0
   with:
+    version: v0.25.0
It seems that this version specification brought the CI errors. Do we need this?
Actually, the error is also present with the current version. I've speculatively upgraded it, but it turns out that the kind-action still downloads the kubectl binary from storage.googleapis.com, while the latest versions are now hosted on dl.k8s.io.
I've changed it to reference helm/kind-action#127 by SHA until a new version of the action gets released.
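To make the hosting change concrete, here is a minimal sketch of how the two download URLs differ. The helper name `kubectl_url` is hypothetical and is not part of kind-action; the dl.k8s.io path follows the layout used in the upstream kubectl installation instructions, and the storage.googleapis.com layout is an assumption about the legacy location.

```python
# Sketch only: kubectl_url is a hypothetical helper, not kind-action code.
LEGACY_BASE = "https://storage.googleapis.com/kubernetes-release/release"  # assumed legacy layout
CURRENT_BASE = "https://dl.k8s.io/release"  # layout from the kubectl install docs


def kubectl_url(version: str, os_name: str = "linux", arch: str = "amd64",
                legacy: bool = False) -> str:
    """Build the kubectl binary download URL for a given release version."""
    base = LEGACY_BASE if legacy else CURRENT_BASE
    return f"{base}/{version}/bin/{os_name}/{arch}/kubectl"


if __name__ == "__main__":
    # New location, where recent releases are published.
    print(kubectl_url("v1.30.7"))
    # Legacy location that the action was still using at the time.
    print(kubectl_url("v1.30.7", legacy=True))
```

Only the base URL changes between the two locations in this sketch, which is why a recent kubectl version can be missing from the legacy host while the action itself otherwise keeps working.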
Oh, I see. Thanks.
In that case, could you open an issue so that we can switch to a dedicated release tag once the new version is released?
Yes, I had already created helm/kind-action#128 earlier :)
Thank you for opening that!
0b18aa9
to
1d5c90c
Compare
@astefanutti It seems that we specify a non-existent K8s version: https://github.com/kubeflow/training-operator/actions/runs/12006038657/job/33464332223?pr=2332 Could you replace those based on https://hub.docker.com/r/kindest/node/tags?
1d5c90c
to
b97168b
Compare
@tenzen-y I've just re-pushed with a downgrade to the kindest image v1.30.6, as the image hasn't been published for v1.30.7.
There is still one issue with the SDK where some methods have
b97168b
to
04fd79e
Compare
The issue with the SDK generation seems similar to OpenAPITools/openapi-generator#10236. It's been reported for the legacy Python generator, but a quick look at the code shows the unit test generation template hasn't changed. I've added a replacement into the post-generation script that replaces
Note that the issue is only with the generated unit tests. I'm not sure how useful they are. They are currently deleted for the SDK v2.
04fd79e
to
05a4fdb
Compare
The SDK e2e tests now pass. There are some SDK unit tests that fail because openapi-generator does not generate types that are declared as "aliases" to the object type like
I think we can remove those generated unit tests from the V1 SDK since we don't use them.
Thank you for doing this @astefanutti!
I left a few comments.
sigs.k8s.io/jobset v0.5.2
sigs.k8s.io/kueue v0.6.3
sigs.k8s.io/scheduler-plugins v0.28.9
sigs.k8s.io/structured-merge-diff/v4 v4.4.1
Why do we need this package?
It's coming from one of the SSA apply configuration files in the generated Golang client, specifically pkg/client/applyconfiguration/internal/internal.go.
I don't think this is actually used, but it's needed to compile the generated client.
I see, that makes sense.
hack/python-sdk/post_gen.py
Outdated
@@ -26,7 +26,9 @@
 ("import kubeflow.training", "from kubeflow.training.models import *"),
 ("kubeflow.training.models.v1\/.*.v1.", "V1"),
 ("kubeflow.training.models.kubeflow/org/v1/", "kubeflow_org_v1_"),
+("kubeflow.training.models.runtime/raw_extension.runtime\.", "Runtime"),
Where do we use it?
And why do we have the runtime model in our APIs?
I don't think it's used in the APIs, but it's being added by the OpenAPI specification generation script here:
https://github.com/kubernetes/code-generator/blob/b15df6411b47bf6e80bfc63947af6b436b2e05c6/kube_codegen.sh#L365-L367
It's hard-coded and it doesn't seem like there is an easy way to get rid of it. That being said, I don't think it actually impacts anything.
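For readers unfamiliar with the post-generation script, the hunk above adds one more (pattern, replacement) pair to a list of regular-expression rewrites. The following is a rough sketch of how such a list could be applied, not the actual contents of hack/python-sdk/post_gen.py: the function name `apply_replacements` and the sample input are hypothetical, and only the two pairs shown are taken from the diff.

```python
import re

# Two (pattern, replacement) pairs copied from the diff above. Raw strings are
# used here so that the escaped dot is passed to the regex engine unchanged.
REPLACEMENTS = [
    (r"kubeflow.training.models.kubeflow/org/v1/", "kubeflow_org_v1_"),
    (r"kubeflow.training.models.runtime/raw_extension.runtime\.", "Runtime"),
]


def apply_replacements(text: str) -> str:
    """Apply every regex replacement to the text, in list order."""
    for pattern, replacement in REPLACEMENTS:
        text = re.sub(pattern, replacement, text)
    return text


if __name__ == "__main__":
    # Hypothetical line as it might appear in generated code.
    sample = "kubeflow.training.models.runtime/raw_extension.runtime.RawExtension()"
    print(apply_replacements(sample))
```

Under these assumptions, the new pair collapses the path-like module reference emitted by the generator into a plain class name prefix.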
kube::codegen::gen_helpers \
  --boilerplate "${TRAINING_OPERATOR_ROOT}/hack/boilerplate/boilerplate.go.txt" \
  "${TRAINING_OPERATOR_ROOT}/pkg/apis"
@astefanutti @tenzen-y Is this the new recommended way to use kube codegen?
Yes, the previous command has been completely removed.
Right, the "old" way has recently been EOL'ed: kubernetes/code-generator@2e5be31.
@@ -27,7 +27,7 @@ cd manifests/overlays/standalone
 kustomize edit set image kubeflow/training-operator=${TRAINING_CI_IMAGE}

 echo "Installing training operator manifests"
-kustomize build . | kubectl apply -f -
+kustomize build . | kubectl apply --server-side=true -f -
Do we want to use server-side apply to deploy the operator?
Client-side apply now fails because the size of the CRDs has increased beyond the (default) maximum size of the "last-applied" annotation. It's a recurring issue, kubernetes/kubectl#712, that's often faced with CRDs.
Note that server-side apply may eventually become the default for the kubectl apply command: kubernetes/enhancements#3805.
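As a rough illustration of why large CRDs hit this limit: client-side apply stores the whole applied object as JSON in the kubectl.kubernetes.io/last-applied-configuration annotation, and the API server caps the total size of an object's annotations at 256 KiB (262144 bytes). The sketch below estimates whether a manifest would still fit. The function name `fits_client_side_apply` is hypothetical, and the serialization only approximates what kubectl actually writes.

```python
import json

# The API server rejects objects whose annotation keys and values exceed 256 KiB in total.
MAX_ANNOTATIONS_BYTES = 256 * 1024
LAST_APPLIED = "kubectl.kubernetes.io/last-applied-configuration"


def fits_client_side_apply(manifest: dict) -> bool:
    """Estimate whether the last-applied annotation would fit within the limit."""
    # Approximation: kubectl serializes the applied object as compact JSON.
    serialized = json.dumps(manifest, separators=(",", ":"))
    annotations = manifest.get("metadata", {}).get("annotations", {})
    total = sum(len(k.encode()) + len(v.encode()) for k, v in annotations.items())
    total += len(LAST_APPLIED.encode()) + len(serialized.encode())
    return total <= MAX_ANNOTATIONS_BYTES


if __name__ == "__main__":
    small = {"kind": "ConfigMap", "metadata": {"name": "demo"}, "data": {"k": "v"}}
    # Stand-in for a CRD with a very large embedded OpenAPI schema.
    large = {"kind": "CustomResourceDefinition", "metadata": {"name": "big"},
             "spec": {"description": "x" * 300_000}}
    print(fits_client_side_apply(small), fits_client_side_apply(large))
```

Server-side apply avoids the problem because it tracks ownership in managedFields rather than in that annotation.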
I see, thanks for sharing
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
05a4fdb
to
aaa79c0
Compare
@andreyvelich thanks, that was my assumption as well. I've added the |
@@ -45,7 +45,7 @@ go run "${repo_root}"/hack/swagger/main.go ${VERSION} >"${SWAGGER_CODEGEN_FILE}"
 echo "Removing previously generated files ..."
 rm -rf "${SDK_OUTPUT_PATH}"/docs/KubeflowOrgV1*.md "${SDK_OUTPUT_PATH}"/kubeflow/training/models "${SDK_OUTPUT_PATH}"/kubeflow/training/*.py "${SDK_OUTPUT_PATH}"/test/test_*.py
 echo "Generating Python SDK for Training Operator ..."
-java -jar "${SWAGGER_CODEGEN_JAR}" generate -i "${repo_root}"/hack/python-sdk/swagger.json -g python -o "${SDK_OUTPUT_PATH}" -c "${SWAGGER_CODEGEN_CONF}"
+java -jar "${SWAGGER_CODEGEN_JAR}" generate -i "${repo_root}"/hack/python-sdk/swagger.json -g python --global-property apiTests=false,modelTests=false -o "${SDK_OUTPUT_PATH}" -c "${SWAGGER_CODEGEN_CONF}"
I didn't know that swagger has this flag to disable test generation.
Could we add the same flag for the V2 SDK, so we can get rid of the git clean command?
training-operator/hack/python-sdk-v2/gen-sdk.sh
Lines 53 to 54 in aaa79c0
# TODO (andreyvelich): Discuss if we should use these test files.
git clean -f ${SDK_OUTPUT_PATH}/test
Absolutely. Maybe I can do that in a follow-up PR or you'd prefer to include that here?
It's fine, we can do it as a followup PR.
Thanks for this effort @astefanutti!
/lgtm
Thanks.
Basically, lgtm
@@ -198,26 +197,26 @@ func (r *JAXJobReconciler) SetupWithManager(mgr ctrl.Manager, controllerThreads
 DeleteFunc: util.OnDependentDeleteFuncGeneric(r.Expectations),
 }
 // inject watching for job related pod
-if err = c.Watch(source.Kind(mgr.GetCache(), &corev1.Pod{}), eventHandler, predicates); err != nil {
+if err = c.Watch(source.Kind[client.Object](mgr.GetCache(), &corev1.Pod{}, eventHandler, predicates)); err != nil {
-if err = c.Watch(source.Kind[client.Object](mgr.GetCache(), &corev1.Pod{}, eventHandler, predicates)); err != nil {
+if err = c.Watch(source.Kind[*corev1.Pod](mgr.GetCache(), &corev1.Pod{}, eventHandler, predicates)); err != nil {
Could we use the exact type parameter here?
The same question applies to all of the Job controllers.
I think it'd be possible, but one instance of eventHandler and genericPredicates would have to be created per type, as they would not be reusable for different types.
I think predicates could then also be removed, to always use the generic version.
Happy to do it if having one event handler and predicates instance per type is OK for you. WDYT?
I’ve pushed f59f171 that should cover it so we can see what that looks like.
It looks like the tests are failing. Did you try to run them locally, @astefanutti?
@andreyvelich sorry for the noise, I should have pushed that extra commit somewhere else.
I've fixed it and the tests should pass now.
@astefanutti I'm happy with accepting better refactoring :)
 return err
 }
 // inject watching for job related service
-if err = c.Watch(source.Kind(mgr.GetCache(), &corev1.Service{}), eventHandler, predicates); err != nil {
+if err = c.Watch(source.Kind[client.Object](mgr.GetCache(), &corev1.Service{}, eventHandler, predicates)); err != nil {
-if err = c.Watch(source.Kind[client.Object](mgr.GetCache(), &corev1.Service{}, eventHandler, predicates)); err != nil {
+if err = c.Watch(source.Kind[*corev1.Service](mgr.GetCache(), &corev1.Service{}, eventHandler, predicates)); err != nil {
 return err
 }
 // skip watching volcano PodGroup if volcano PodGroup is not installed
 if _, err = mgr.GetRESTMapper().RESTMapping(schema.GroupKind{Group: v1beta1.GroupName, Kind: "PodGroup"},
 v1beta1.SchemeGroupVersion.Version); err == nil {
 // inject watching for job related volcano PodGroup
-if err = c.Watch(source.Kind(mgr.GetCache(), &v1beta1.PodGroup{}), eventHandler, genericPredicates); err != nil {
+if err = c.Watch(source.Kind[client.Object](mgr.GetCache(), &v1beta1.PodGroup{}, eventHandler, genericPredicates)); err != nil {
-if err = c.Watch(source.Kind[client.Object](mgr.GetCache(), &v1beta1.PodGroup{}, eventHandler, genericPredicates)); err != nil {
+if err = c.Watch(source.Kind[*v1beta1.PodGroup](mgr.GetCache(), &v1beta1.PodGroup{}, eventHandler, genericPredicates)); err != nil {
 return err
 }
 }
 // skip watching scheduler-plugins PodGroup if scheduler-plugins PodGroup is not installed
 if _, err = mgr.GetRESTMapper().RESTMapping(schema.GroupKind{Group: schedulerpluginsv1alpha1.SchemeGroupVersion.Group, Kind: "PodGroup"},
 schedulerpluginsv1alpha1.SchemeGroupVersion.Version); err == nil {
 // inject watching for job related scheduler-plugins PodGroup
-if err = c.Watch(source.Kind(mgr.GetCache(), &schedulerpluginsv1alpha1.PodGroup{}), eventHandler, genericPredicates); err != nil {
+if err = c.Watch(source.Kind[client.Object](mgr.GetCache(), &schedulerpluginsv1alpha1.PodGroup{}, eventHandler, genericPredicates)); err != nil {
-if err = c.Watch(source.Kind[client.Object](mgr.GetCache(), &schedulerpluginsv1alpha1.PodGroup{}, eventHandler, genericPredicates)); err != nil {
+if err = c.Watch(source.Kind[*schedulerpluginsv1alpha1.PodGroup](mgr.GetCache(), &schedulerpluginsv1alpha1.PodGroup{}, eventHandler, genericPredicates)); err != nil {
ce31d7b
to
eb452fb
Compare
/rerun-all |
bc31553
to
ff7f60f
Compare
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
ff7f60f
to
f59f171
Compare
Thank you for the update!
Mostly lgtm!
pkg/common/util/reconciler.go
Outdated
import (
"fmt"
"reflect"

corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"sigs.k8s.io/controller-runtime/pkg/event"

kubeflowv1 "github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1"
"github.com/kubeflow/training-operator/pkg/controller.v1/common"
"github.com/kubeflow/training-operator/pkg/controller.v1/expectation"
commonutil "github.com/kubeflow/training-operator/pkg/util"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
Could you group the imports like this:
<Go std libs>
<Third party libs>
<Our own libs>
I've updated it. I've also merged the two files, reconciler.go and reconciler_generic.go, now that it's generics all along.
Thanks!
kubeflowv1 "github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1"
"github.com/kubeflow/training-operator/pkg/controller.v1/common"
"github.com/kubeflow/training-operator/pkg/controller.v1/expectation"
log "github.com/sirupsen/logrus"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/runtime"
"k8s.io/apimachinery/pkg/runtime/schema"
"sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/event"
"sigs.k8s.io/controller-runtime/pkg/predicate"
ditto
@@ -27,7 +27,7 @@ cd manifests/overlays/standalone
 kustomize edit set image kubeflow/training-operator=${TRAINING_CI_IMAGE}

 echo "Installing training operator manifests"
-kustomize build . | kubectl apply -f -
+kustomize build . | kubectl apply --server-side=true -f -
I'm ok with follow-up.
Could you update the installation documentation in the following?
I've updated the README. I'll raise a PR in the website repository promptly.
Thanks!
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Thank you for this great contribution!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: tenzen-y The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
* Upgrade Kubernetes to v1.30.7
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* Use typed event handlers and predicates in job controllers
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* Re-organize pkg/common/util/reconciler.go
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* Update installation instructions in README
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
---------
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>
* Added test for create-pytorchjob.ipynb Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * fix yaml syntax Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Fix uses path Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Add actions/checkout Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Add bash to action.yaml Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Install pip dependencies step Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Add quotes for args Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Add jupyter Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Add nbformat_minor: 5 to fix invalid format error Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Fix job name Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * test papermill-args-yaml Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * testing multi line args Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * testing multi line args1 Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * testing multi line args2 Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * testing multi line args3 Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Parameterize sdk install Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Remove unnecessary output Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * nbformat normailze Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * [SDK] Training Client Conditions related unit tests (#2253) * test: add unit test for get_job_conditions function of training client Signed-off-by: Bobbins228 <mcampbel@redhat.com> * test: add unit test for is_job_created function of training client Signed-off-by: Bobbins228 <mcampbel@redhat.com> * test: add unit test for is_job_running function of training client Signed-off-by: Bobbins228 <mcampbel@redhat.com> * test: add unit test for is_job_restarting function of training 
client Signed-off-by: Bobbins228 <mcampbel@redhat.com> * test: add unit test for is_job_failed function of training client Signed-off-by: Bobbins228 <mcampbel@redhat.com> * test: add unit test for is_job_succeded function of training client Signed-off-by: Bobbins228 <mcampbel@redhat.com> * test: improve job condition unit tests efficiency Signed-off-by: Bobbins228 <mcampbel@redhat.com> --------- Signed-off-by: Bobbins228 <mcampbel@redhat.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * [SDK] test: add unit test for list_jobs method of the training_client (#2267) Signed-off-by: wei-chenglai <qazwsx0939059006@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * KEP-2170: Generate clientset, openapi spec for the V2 APIs (#2273) Generate clientset, informers, listers and open api spec for v2alpha1 APIs. Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * [SDK] Use torchrun to create PyTorchJob from function (#2276) * [SDK] Use torchrun to create PyTorchJob from function Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Update PyTorchJob SDK example Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add consts for entrypoint Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add check for num procs per worker Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * [SDK] test: add unit test for get_job_logs method of the training_client (#2275) Signed-off-by: wei-chenglai <qazwsx0939059006@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * [v2alpha] Move GV related codebase (#2281) Move GV related codebase in v2alpha Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * KEP-2170: 
Implement runtime framework (#2248) * KEP-2170: Implement runtime framework interfaces Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * Remove grep dependency Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * KEP-2170: Implement ValidateObjects interface to the runtime framework Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * KEP-2170: Expose the TrainingRuntime and ClusterTrainingRuntime Kind Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * KEP-2170: Remove unneeded scheme field from the internal TrainingRuntime Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * Rephrase the error message Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * Distinguish TrainingRuntime and ClusterTrainingRuntime when creating indexes for the TrainJobs Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * Propagate the TrainJob labels and annotations to the JobSet Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * Remove PodAnnotations from the runtime info Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * Implement TrainingRuntime ReplicatedJob validation Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * Add TODO comments Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * Replace queueSuspendedTrainJob with queueSuspendedTrainJobs Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> --------- Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Add DeepSpeed Example with Pytorch Operator (#2235) Signed-off-by: Syulin7 <735122171@qq.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API (#2283) * KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Rename RuntimeRef in runtime framework Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: sailesh duddupudi 
<saileshradar@gmail.com> * KEP-2170: Adding CEL validations on v2 TrainJob CRD (#2260) Signed-off-by: Akshay Chitneni <achitneni@apple.com> Co-authored-by: Akshay Chitneni <achitneni@apple.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Upgrade Deepspeed demo dependencies (#2294) Signed-off-by: Syulin7 <735122171@qq.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * KEP-2170: Add manifests for Kubeflow Training V2 (#2289) * KEP-2170: Add manifests for Kubeflow Training V2 Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix invalid name for webhook config in cert Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix integration tests Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Move kubebuilder markers to runtime framework Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Use Kubernetes recommended labels Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * FSDP Example for T5 Fine-Tuning and PyTorchJob (#2286) * FSDP Example with PyTorchJob and T5 Fine-Tuning Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Modify text Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * KEP-2170: Implement TrainJob Reconciler to manage objects (#2295) * KEP-2170: Implement TrainJob Reconciler to manage objects Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * Mode dep-crds to manifests/external-crds Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * Rename run with runtime Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> --------- Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Remove 
Prometheus Monitoring doc (#2301) Signed-off-by: Sophie <sophy010017@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * KEP-2170: Decouple JobSet from TrainJob (#2296) Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * KEP-2170: Strictly verify the CRD marker validation and defaulting in the integration testings (#2304) Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * KEP-2170: Initialize runtimes before the manager starts (#2306) Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * KEP-2170: Generate Python SDK for Kubeflow Training V2 (#2310) * Generate SDK models for the Training V2 APIs Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Create pyproject.toml config Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Remove comments Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix pre-commit Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * KEP-2170: Create model and dataset initializers (#2303) * KEP-2170: Create model and dataset initializers Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add abstract classes Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add storage URI to config Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Update .gitignore Co-authored-by: Kevin Hannon <kehannon@redhat.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix the misspelling for initializer Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add .pt and .pth to ignore_patterns Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: 
Andrey Velichkevich <andrey.velichkevich@gmail.com> Co-authored-by: Kevin Hannon <kehannon@redhat.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * KEP-2170: Implement JobSet, PlainML, and Torch Plugins (#2308) * KEP-2170: Implement JobSet and PlainML Plugins Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix nil pointer exception for Trainer Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix unit tests in runtime package Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix unit tests Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix integration tests Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix lint Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Implement Torch Plugin Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Use list for the Info envs Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix golang ci Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix Torch plugin Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Use K8s sets Update error return Use ptr.Deref() for nil values Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Use client.Object for Build() call Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Remove DeepCopy Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Remove MLPolicy and PodGroupPolicy from the Info object Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Inline error Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Remove SDK jar file Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add integration test for Torch plugin Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add TODO to calculate PodGroup values in unit tests Signed-off-by: Andrey Velichkevich 
<andrey.velichkevich@gmail.com> * Revert the change to add original Runtime Policies to Info Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Create const for the DefaultJobReplicas Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Check if PodLabels is empty Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * KEP-2170: Implement Initializer builders in the JobSet plugin (#2316) * KEP-2170: Implement Initializer builder in the JobSet plugin Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Update the SDK models Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Remove Info from Initializer builder Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Update manifests Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Update pkg/constants/constants.go Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Use var for envs Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Remove check manifests from GitHub actions Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Move consts to JobSet plugin Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * KEP-2170: Add the TrainJob state transition design (#2298) * KEP-2170: Add the TrainJob state transition design Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * Replace actual jobs with TrainJob Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * Remove the JobSet conditions propagation and Add expanding runtime framework interfaces for each plugin Signed-off-by: Yuki 
Iwai <yuki.iwai.tz@gmail.com> * Expand the Creation Failed reasons Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * Rename Completed to Complete Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> --------- Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Update tf job examples to tf v2 (#2270) * mnist with summaries updaetd to TF v2 Signed-off-by: yelias <yossi.elias@nokia.com> * tf_sample updaetd to TF v2 Signed-off-by: yelias <yossi.elias@nokia.com> * Add mnist_utils and update dist-mnist Signed-off-by: yelias <yossi.elias@nokia.com> * Add mnist_utils and update dist-mnist Signed-off-by: yelias <yossi.elias@nokia.com> * Remove old example - estimator-API, this example has been replaced by distribution_strategy Signed-off-by: yelias <yossi.elias@nokia.com> * Small fix Signed-off-by: yelias <yossi.elias@nokia.com> * Remove unsupported powerPC dockerfiles Signed-off-by: yelias <yossi.elias@nokia.com> * Fix typo in copyright Signed-off-by: yelias <yossi.elias@nokia.com> --------- Signed-off-by: yelias <yossi.elias@nokia.com> Co-authored-by: yelias <yossi.elias@nokia.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * KEP-2170: Add TrainJob conditions (#2322) * KEP-2170: Implement TrainJob conditions Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * Fix API comments Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * Make condition message constants Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> * Stop connecting condition type and reason in JobSet plugin Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> --------- Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Pin Gloo repository in JAX Dockerfile to a specific commit (#2329) This commit pins the Gloo repository to a specific commit (43b7acbf) in the JAX Dockerfile to prevent build failures caused by a recent bug introduced in the Gloo codebase. 
By locking the version of Gloo to a known working commit, we ensure that the JAX build remains stable and functional until the issue is resolved upstream. The build failure occurs when compiling the gloo/transport/tcp/buffer.cc file due to an undefined __NR_gettid constant, which was introduced after the pinned commit. By using this commit, we bypass the issue and allow the build to complete successfully. Signed-off-by: Sandipan Panda <samparksandipan@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * [fix] Resolve v2alpha API exceptions (#2317) Resolve v2alpha API exceptions by adding necessary listType validations. Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Upgrade Kubernetes to v1.30.7 (#2332) * Upgrade Kubernetes to v1.30.7 Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Use typed event handlers and predicates in job controllers Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Re-organize pkg/common/util/reconciler.go Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Update installation instructions in README Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> --------- Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Ignore cache exporting errors in the image building workflows (#2336) Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * KEP-2170: Add Torch Distributed Runtime (#2328) * KEP-2170: Add Torch Distributed Runtime Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add pip list Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Refine the server-side apply installation args (#2337) Signed-off-by: Yuki Iwai 
<yuki.iwai.tz@gmail.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Add openapi-generator CLI option to skip SDK v2 test generation (#2338) Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Upgrade kustomization files to Kustomize v5 (#2326) Signed-off-by: oksanabaza <obazylie@redhat.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Pin accelerate package version in trainer (#2340) * Pin accelerate package version in trainer Signed-off-by: Gavrish Prabhu <gavrish.prabhu@nutanix.com> * include new line to pass pre-commit hook Signed-off-by: Gavrish Prabhu <gavrish.prabhu@nutanix.com> --------- Signed-off-by: Gavrish Prabhu <gavrish.prabhu@nutanix.com> Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Replace papermill command with bash script Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Typo fix Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Move Checkout step outside action.yaml file Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Add newline EOF in script Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Pass python dependencies as args and pin versions Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Update Usage Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Install dependencies in yaml Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * fix ipynb Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * set bash flags Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * Update script args and add more kubernetes versions for tests Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * add gang-scheduler-name to template Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * move go setup to template Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> * remove -p parameter from script Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> 
--------- Signed-off-by: sailesh duddupudi <saileshradar@gmail.com> Signed-off-by: Bobbins228 <mcampbel@redhat.com> Signed-off-by: wei-chenglai <qazwsx0939059006@gmail.com> Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com> Signed-off-by: Syulin7 <735122171@qq.com> Signed-off-by: Akshay Chitneni <achitneni@apple.com> Signed-off-by: Sophie <sophy010017@gmail.com> Signed-off-by: yelias <yossi.elias@nokia.com> Signed-off-by: Sandipan Panda <samparksandipan@gmail.com> Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> Signed-off-by: oksanabaza <obazylie@redhat.com> Signed-off-by: Gavrish Prabhu <gavrish.prabhu@nutanix.com> Co-authored-by: Mark Campbell <mcampbel@redhat.com> Co-authored-by: Wei-Cheng Lai <qazwsx0939059006@gmail.com> Co-authored-by: Varsha <varshaprasad96@gmail.com> Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com> Co-authored-by: yu lin <735122171@qq.com> Co-authored-by: Akshay Chitneni <akshayadatta@gmail.com> Co-authored-by: Akshay Chitneni <achitneni@apple.com> Co-authored-by: Sophie Hsu <112261858+sophie0730@users.noreply.github.com> Co-authored-by: Kevin Hannon <kehannon@redhat.com> Co-authored-by: YosiElias <73485442+YosiElias@users.noreply.github.com> Co-authored-by: yelias <yossi.elias@nokia.com> Co-authored-by: Sandipan Panda <87253083+sandipanpanda@users.noreply.github.com> Co-authored-by: Antonin Stefanutti <astefanutti@users.noreply.github.com> Co-authored-by: Oksana Bazylieva <61097730+oksanabaza@users.noreply.github.com> Co-authored-by: Gavrish Prabhu <gavrish.prabhu@nutanix.com>
What this PR does / why we need it:
This PR includes:
Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
This is the first PR to fix #2291: it upgrades Kubernetes to 1.30 first, and will be followed by #2330 to upgrade to 1.31 after this one is merged.
This supersedes #2299.
Checklist: