Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] KEP: MultiKueue external custom Job support #2458

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

mszadkow
Copy link
Contributor

What type of PR is this?

/kind documentation

What this PR does / why we need it:

New KEP that explains the way to introduce MultiKueue external custom Job support.

Which issue(s) this PR fixes:

Fixes #2349

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/documentation Categorizes issue or PR as related to documentation. labels Jun 20, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mszadkow
Once this PR has been reviewed and has the lgtm label, please assign alculquicondor for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 20, 2024
@k8s-ci-robot
Copy link
Contributor

Hi @mszadkow. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 20, 2024
Copy link

netlify bot commented Jun 20, 2024

Deploy Preview for kubernetes-sigs-kueue ready!

Name Link
🔨 Latest commit ec73442
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/668d159967d169000886f87b
😎 Deploy Preview https://deploy-preview-2458--kubernetes-sigs-kueue.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@mszadkow mszadkow marked this pull request as ready for review June 25, 2024 10:10
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 25, 2024
@trasc
Copy link
Contributor

trasc commented Jun 25, 2024

/cc @alculquicondor @tenzen-y

@alculquicondor
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 26, 2024
@alculquicondor
Copy link
Contributor

@dgrove-oss is this something that interests you and do you have any thoughts on the proposal?


## Motivation

Many Kueue adaptors have company specific CustomResources.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Many Kueue adaptors have company specific CustomResources.
Many Kueue adopters have company specific CustomResources.

?

Provide a way to integrate external controllers to manage Jobs in the Kueue.
MulikueueCluster API needs to be modified to ensure that one multikueue cluster is only managed by one admission check controller.

1. Add ControllerName to the `MultiKueueClusterSpec`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a non-starting point for this proposal.

Asking users to create one MultiKueueCluster object for each integration is too much burden.

Sure, they just need one for all of their CRDs, but what if they need to support a native jobs + in-house jobs + third-party jobs that offer MK support. Now they have at least 3 configurations for the same thing.

How does a MultiKueueConfig that supports all of these jobs look like?

IIUC, your intention with a separate object is to have conditions separate for each external admission check. Then let's maybe add an object just for that, that only one controller processes each time, separate from MultiKueueCluster.

Copy link
Contributor

@trasc trasc Jun 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One alternative to this, described later, is to have a custom condition type for the MultiKueueCluster that reflects the connection stale for each ControllerName using it.

Asking users to create one MultiKueueCluster object for each integration is too much burden.

It should not be for each integration, is for each configured controller , the controller can support any number of adapters.

How does a MultiKueueConfig that supports all of these jobs look like?

Nothing changes for MultiKueueConfig, they are just used by an AdmissionCheck (which already has a ControllerName)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One alternative to this, described later, is to have a custom condition type for the MultiKueueCluster that reflects the connection stale for each ControllerName using it.

This sounds much better.

}
}
```

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please expand on:

  • which subcomponents of MK each external controller is implementing (even if provided by a framework)
  • how does a configuration for multiple of these controllers look like from the ClusterQueue point of view.

Copy link
Contributor

@dgrove-oss dgrove-oss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The integration test scenario seemed reasonable to me.

Providing a mechanism to register an external controller and have it coordinate with the builtin components about which GVKs each is responsible seems right.

There was not sufficient detail for me as a MultiKueue outsider to understand how the pieces fit together and what I'd need to do to add an external custom Job. Is this KEP also supposed to cover external Job adapters too?


Internally the MultiKueue uses a set of controller-runtime client, one for each MultiKueueCluster, the clients are used both by the MultiKueue internals and passed to the job adapters for the custom jobs synchronization.

The clients cannot be share with an external controller and assuming the external controller can get the same connectivity to the worker clusters might be wrong, especially when we are taking about FS mounted kubeconfigs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The clients cannot be share with an external controller and assuming the external controller can get the same connectivity to the worker clusters might be wrong, especially when we are taking about FS mounted kubeconfigs.
The clients cannot be shared with an external controller and assuming the external controller can get the same connectivity to the worker clusters might be wrong, especially when we are taking about FS mounted kubeconfigs.

@alculquicondor
Copy link
Contributor

Is this KEP also supposed to cover external Job adapters too?

Yes, it's supposed to enable the easy creation of them. I agree that the proposal needs more detail.

@tenzen-y
Copy link
Member

/cc @alculquicondor @tenzen-y

ACK
Let me check this KEP

@alculquicondor
Copy link
Contributor

/retitle [WIP] KEP: MultiKueue external custom Job support

@k8s-ci-robot k8s-ci-robot changed the title KEP: MultiKueue external custom Job support. [WIP] KEP: MultiKueue external custom Job support Jun 28, 2024
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 28, 2024
@mszadkow mszadkow force-pushed the doc/add-custom-job-support branch from 72d5e1e to ec73442 Compare July 9, 2024 10:48
@k8s-ci-robot
Copy link
Contributor

@mszadkow: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kueue-verify-main ec73442 link true /test pull-kueue-verify-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/documentation Categorizes issue or PR as related to documentation. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Provide a way to implement a MultiKueue support for the custom Jobs
6 participants