Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade RayJob API to v1 #1802

Merged
merged 1 commit into from
Apr 2, 2024
Merged

Conversation

astefanutti
Copy link
Member

@astefanutti astefanutti commented Mar 5, 2024

What this PR does / why we need it:

KubeRay has promoted the RayJob API from v1alpha1 to v1 with KubeRay v1.0.0. While the v1alpha1 version is still served in KubeRay v1.1.0, it'll be removed in v1.2.0.

Which issue(s) this PR fixes:

Fixes #1335.

Special notes for your reviewer:

Supersedes #1435.

Does this PR introduce a user-facing change?

Upgrade RayJob API to v1

ACTION REQUIRED: If you use KubeRay older than v1.0.0, you'll have to upgrade your existing installation
to KubeRay v1.0.0, or any more recent version, that supports KubeRay v1 APIs, for it to
remain compatible with Kueue.

@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Mar 5, 2024
@k8s-ci-robot k8s-ci-robot requested a review from mimowo March 5, 2024 09:58
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 5, 2024
@k8s-ci-robot k8s-ci-robot requested a review from trasc March 5, 2024 09:58
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 5, 2024
Copy link

netlify bot commented Mar 5, 2024

Deploy Preview for kubernetes-sigs-kueue ready!

Name Link
🔨 Latest commit 379e4ba
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/660c39ecf54efc0008ae9628
😎 Deploy Preview https://deploy-preview-1802--kubernetes-sigs-kueue.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@mbzomowski
Copy link

Hey, apologies if I'm jumping the gun here, but I'm currently trying to get v1 RayJobs to work with Kueue - I pulled down your PR changes, and installed Kueue through helm, but I'm getting an error whenever I try to create a new RayJob now:

Error from server (InternalError): error when creating "ray-job.sample.yaml": Internal error occurred: failed calling webhook "mrayjob.kb.io": failed to call webhook: the server could not find the requested resource

Looking at the Kueue controller logs, it seems like the mutating (and I assume validating) webhook is still for v1alpha1 RayJobs:

{"level":"info","ts":"2024-03-12T22:31:38.10128149Z","logger":"controller-runtime.builder","caller":"builder/webhook.go:158","msg":"Registering a mutating webhook","GVK":"ray.io/v1alpha1, Kind=RayJob","path":"/mutate-ray-io-v1alpha1-rayjob"}

Am I doing something wrong here? Possibly installing Kueue in the wrong way to utilize these changes?

@astefanutti
Copy link
Member Author

Hey, apologies if I'm jumping the gun here, but I'm currently trying to get v1 RayJobs to work with Kueue - I pulled down your PR changes, and installed Kueue through helm, but I'm getting an error whenever I try to create a new RayJob now:

No pb at all :) Thanks a lot for testing this!

Looking at the Kueue controller logs, it seems like the mutating (and I assume validating) webhook is still for v1alpha1 RayJobs:

{"level":"info","ts":"2024-03-12T22:31:38.10128149Z","logger":"controller-runtime.builder","caller":"builder/webhook.go:158","msg":"Registering a mutating webhook","GVK":"ray.io/v1alpha1, Kind=RayJob","path":"/mutate-ray-io-v1alpha1-rayjob"}

Am I doing something wrong here? Possibly installing Kueue in the wrong way to utilize these changes?

That "GVK":"ray.io/v1alpha1, Kind=RayJob" really looks like it should not be there if the changes from that PR would be taken into account, specifically: https://github.com/kubernetes-sigs/kueue/pull/1802/files#diff-24b8d3ba7104004ba575becf8db0a4ee47c9062f716a744f1afbbdb7bfded0d9R50

How exactly it is you've tested it? Could you double check the deployment uses the container image that's built with the changes from that PR?

@mbzomowski
Copy link

Ah whoops - realizing now that just running helm install ... after pulling down your changes wasn't enough.

I ran make install deploy, but now I'm getting the following error:

  Warning  Failed     14s (x2 over 29s)  kubelet            Failed to pull image "gcr.io/k8s-staging-kueue/kueue:v0.7.0-devel-33-g0414b335": rpc error: code = NotFound desc = failed to pull and unpack image "gcr.io/k8s-staging-kueue/kueue:v0.7.0-devel-33-g0414b335": failed to resolve reference "gcr.io/k8s-staging-kueue/kueue:v0.7.0-devel-33-g0414b335": gcr.io/k8s-staging-kueue/kueue:v0.7.0-devel-33-g0414b335: not found

I'm guessing I should build the new image locally and use that? Are there instructions for testing anywhere? Is make install deploy enough to try out the changes once I get the image built?

@mbzomowski
Copy link

Oof nevermind, sorry just found this: https://kueue.sigs.k8s.io/docs/installation/#build-and-install-from-source

I'll try this again

@mbzomowski
Copy link

Ok, I built & installed from source and confirmed I'm now getting the v1 RayJob mutating & admission webhooks. kubectl -f apply'd a RayJob, and the RayJob gets created, and so does the workload, but nothing else. And I'm seeing this in the kueue-controller logs:

{"level":"error","ts":"2024-03-13T17:27:04.793883358Z","logger":"cert-rotation","caller":"rotator/rotator.go:322","msg":"could not refresh CA and server certs","error":"Operation cannot be fulfilled on secrets \"kueue-webhook-server-cert\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).refreshCertIfNeeded.func1\n\t/go/pkg/mod/github.com/open-policy-agent/cert-controller@v0.10.1/pkg/rotator/rotator.go:322\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection\n\t/go/pkg/mod/k8s.io/apimachinery@v0.29.2/pkg/util/wait/wait.go:145\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\t/go/pkg/mod/k8s.io/apimachinery@v0.29.2/pkg/util/wait/backoff.go:461\ngit.luolix.top/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).refreshCertIfNeeded\n\t/go/pkg/mod/github.com/open-policy-agent/cert-controller@v0.10.1/pkg/rotator/rotator.go:350\ngit.luolix.top/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).Start\n\t/go/pkg/mod/github.com/open-policy-agent/cert-controller@v0.10.1/pkg/rotator/rotator.go:278\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/manager/runnable_group.go:223"}

@astefanutti
Copy link
Member Author

Ok, I built & installed from source and confirmed I'm now getting the v1 RayJob mutating & admission webhooks. kubectl -f apply'd a RayJob, and the RayJob gets created, and so does the workload, but nothing else. And I'm seeing this in the kueue-controller logs:

{"level":"error","ts":"2024-03-13T17:27:04.793883358Z","logger":"cert-rotation","caller":"rotator/rotator.go:322","msg":"could not refresh CA and server certs","error":"Operation cannot be fulfilled on secrets \"kueue-webhook-server-cert\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).refreshCertIfNeeded.func1\n\t/go/pkg/mod/github.com/open-policy-agent/cert-controller@v0.10.1/pkg/rotator/rotator.go:322\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection\n\t/go/pkg/mod/k8s.io/apimachinery@v0.29.2/pkg/util/wait/wait.go:145\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\t/go/pkg/mod/k8s.io/apimachinery@v0.29.2/pkg/util/wait/backoff.go:461\ngit.luolix.top/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).refreshCertIfNeeded\n\t/go/pkg/mod/github.com/open-policy-agent/cert-controller@v0.10.1/pkg/rotator/rotator.go:350\ngit.luolix.top/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).Start\n\t/go/pkg/mod/github.com/open-policy-agent/cert-controller@v0.10.1/pkg/rotator/rotator.go:278\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/manager/runnable_group.go:223"}

That error message happens when optimistic locking fails during an update, but the operation is simply retried so it's not an issue.

For the RayJob and the associated workload resource, checking their statuses should provide the reason why it's not started / running.

@mbzomowski
Copy link

RayJob is stuck in Initializing status:

Status:
  Job Deployment Status:  Initializing
  Job Id:                 rayjob-sample-mpmnf
  Ray Cluster Name:       rayjob-sample-raycluster-c2hgb
  Ray Cluster Status:
    Desired CPU:     0
    Desired GPU:     0
    Desired Memory:  0
    Desired TPU:     0
    Head:
  Start Time:  2024-03-13T19:43:01Z

Workload status doesn't show anything out of the ordinary:

  Conditions:
    Last Transition Time:  2024-03-13T19:43:01Z
    Message:               Quota reserved in ClusterQueue cluster-queue
    Reason:                QuotaReserved
    Status:                True
    Type:                  QuotaReserved
    Last Transition Time:  2024-03-13T19:43:01Z
    Message:               The workload is admitted
    Reason:                Admitted
    Status:                True
    Type:                  Admitted

RayCluster is showing as suspended, though:

  Normal  Deleted  4m (x4 over 9m)  raycluster-controller  Deleted Pods for RayCluster default/rayjob-sample-raycluster-c2hgb due to suspension

Also seeing these errors in the kueue-controller logs (in addition to that other one above):

{"level":"error","ts":"2024-03-13T19:41:54.423509864Z","caller":"jobframework/reconciler.go:202","msg":"Removing finalizer","controller":"rayjob","controllerGroup":"ray.io","controllerKind":"RayJob","RayJob":{"name":"rayjob-sample","namespace":"default"},"namespace":"default","name":"rayjob-sample","reconcileID":"c3bc5a6e-4af2-46a9-9656-939200d808b5","job":"default/rayjob-sample","gvk":"ray.io/v1, Kind=RayJob","error":"Operation cannot be fulfilled on workloads.kueue.x-k8s.io \"rayjob-rayjob-sample-8f26e\": StorageError: invalid object, Code: 4, Key: /registry/kueue.x-k8s.io/workloads/default/rayjob-rayjob-sample-8f26e, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 4c144f8e-b609-4495-84c1-3a71a73dc4d6, UID in object meta: ","stacktrace":"sigs.k8s.io/kueue/pkg/controller/jobframework.(*JobReconciler).ReconcileGenericJob\n\t/workspace/pkg/controller/jobframework/reconciler.go:202\nsigs.k8s.io/kueue/pkg/controller/jobframework.(*genericReconciler).Reconcile\n\t/workspace/pkg/controller/jobframework/reconciler.go:1003\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":"2024-03-13T19:41:54.423578204Z","caller":"controller/controller.go:329","msg":"Reconciler error","controller":"rayjob","controllerGroup":"ray.io","controllerKind":"RayJob","RayJob":{"name":"rayjob-sample","namespace":"default"},"namespace":"default","name":"rayjob-sample","reconcileID":"c3bc5a6e-4af2-46a9-9656-939200d808b5","error":"Operation cannot be fulfilled on workloads.kueue.x-k8s.io \"rayjob-rayjob-sample-8f26e\": StorageError: invalid object, Code: 4, Key: /registry/kueue.x-k8s.io/workloads/default/rayjob-rayjob-sample-8f26e, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 4c144f8e-b609-4495-84c1-3a71a73dc4d6, UID in object meta: ","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/internal/controller/controller.go:227"}

@astefanutti
Copy link
Member Author

@mbzomowski I've been able to reproduce. That issue is actually unrelated to that PR and should be fixed with #1844.

@mbzomowski
Copy link

Rebased with the new changes & everything works! Thanks so much @astefanutti!

@astefanutti
Copy link
Member Author

@mbzomowski great, thanks for the feedback, and the thorough testing!

Copy link
Contributor

@trasc trasc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 25, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: eeb318d227895819982a137aae623f27907664e9

@astefanutti
Copy link
Member Author

@alculquicondor @tenzen-y @andrewsykim @kevin85421 Do we want to get this in the next Kueue minor release, now that KubeRay v1.1.0 has been released?

@tenzen-y
Copy link
Member

@alculquicondor @tenzen-y @andrewsykim @kevin85421 Do we want to get this in the next Kueue minor release, now that KubeRay v1.1.0 has been released?

Based on this discussion: #1435 (comment)

I agree with the migration of KubeRay v1 in Kueue v0.7.

@alculquicondor
Copy link
Contributor

sgtm, please rebase

And put ACTION REQUIRED in the release notes to ask users to upgrade to a version of kuberay that supports v1 (1.0?)

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 29, 2024
@k8s-ci-robot k8s-ci-robot requested a review from trasc March 29, 2024 12:22
@k8s-ci-robot k8s-ci-robot added release-note-action-required Denotes a PR that introduces potentially breaking changes that require user action. and removed release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Mar 29, 2024
@astefanutti
Copy link
Member Author

PR rebased, and release notes updated. PTAL, thanks.

@tenzen-y
Copy link
Member

tenzen-y commented Apr 2, 2024

@astefanutti Could you address CI errors?

@astefanutti
Copy link
Member Author

/retest

@astefanutti
Copy link
Member Author

@tenzen-y retesting as I think these were transient failures.

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 2, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 261fce03970d3454a8a9b35fb6516b8050daff94

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: astefanutti, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 2, 2024
@k8s-ci-robot k8s-ci-robot merged commit af116ab into kubernetes-sigs:main Apr 2, 2024
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.7 milestone Apr 2, 2024
@astefanutti astefanutti deleted the pr-13 branch April 3, 2024 13:30
@astefanutti
Copy link
Member Author

@astefanutti Could you update documentation and examples in a follow-up?

@tenzen-y as far as I can see, the documentation and examples have been updated in the scope of that PR already, unless you have specific changes in mind?

On the other, while giving it another look, I've found another follow-up that's covered by #1941.

@andrewsykim
Copy link
Member

Sorry I'm late to review this, @astefanutti do we also want to update the go module version of kuberay? I ran into some corner cases where webhook logic drops fields because the imported type in Kueue doesn't have all the needed fields.

@andrewsykim
Copy link
Member

We should update to v1.1, but make sure it's backwards compatible with a cluster on v1.0

@andrewsykim
Copy link
Member

Oh ignore me, looks like it's using v1.1 already :)

@tenzen-y
Copy link
Member

tenzen-y commented Apr 3, 2024

@astefanutti Could you update documentation and examples in a follow-up?

@tenzen-y as far as I can see, the documentation and examples have been updated in the scope of that PR already, unless you have specific changes in mind?

On the other, while giving it another look, I've found another follow-up that's covered by #1941.

Oops, you're right.
NVM

@astefanutti
Copy link
Member Author

Oops, you're right. NVM

No pb, just wanted to make sure I was not missing anything!

vsoch pushed a commit to researchapps/kueue that referenced this pull request Apr 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-action-required Denotes a PR that introduces potentially breaking changes that require user action. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Kuberay v1 CRDs
7 participants