
nrt: cache: add support for dedicated informer #599

Merged
1 commit merged into kubernetes-sigs:master on Jan 18, 2024

Conversation

ffromani
Contributor

@ffromani ffromani commented Jun 30, 2023

What type of PR is this?

/kind bug
/kind api-change

What this PR does / why we need it:

Add an optional dedicated informer for the NodeResourceTopology scheduler plugin.
For its reconciliation mechanism to converge, the plugin needs to list pods in terminal phase which have not been deleted yet.
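
For context, a minimal client-go sketch of the difference between a filtered, scheduler-style pod informer and a dedicated, unfiltered one. The field selector below is an illustrative assumption, not copied from the plugin code:

```go
package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	listersv1 "k8s.io/client-go/listers/core/v1"
)

// newPodListers contrasts the two setups discussed in this PR.
// The field selector mimics the usual "non-terminal pods only" filtering,
// so pods in Succeeded/Failed phase that have not been deleted yet are
// invisible to the filtered lister.
func newPodListers(client kubernetes.Interface) (filtered, unfiltered listersv1.PodLister) {
	shared := informers.NewSharedInformerFactoryWithOptions(client, 0,
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.FieldSelector = "status.phase!=Succeeded,status.phase!=Failed"
		}))

	// Dedicated, unfiltered factory: terminal-but-not-deleted pods stay
	// visible, which is what the cache reconciliation needs to converge.
	dedicated := informers.NewSharedInformerFactory(client, 0)

	return shared.Core().V1().Pods().Lister(), dedicated.Core().V1().Pods().Lister()
}
```

The dedicated informer is optional and disabled by default, so existing deployments keep the filtered behavior unless they opt in.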

Which issue(s) this PR fixes:

Fixes #598

Special notes for your reviewer:

Technically an API change, but it is backward compatible (optional fields) and done like many others we did in the past. Still labeled for transparency.
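
The backward-compatible shape could look roughly like the sketch below; the field and constant names are illustrative assumptions, not the names actually merged in this PR:

```go
package example

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

const (
	// Hypothetical mode names for the cache informer.
	CacheInformerShared    = "Shared"    // reuse the framework's filtered informer (previous behavior)
	CacheInformerDedicated = "Dedicated" // spawn a separate, unfiltered pod informer
)

// NodeResourceTopologyMatchArgs sketches the optional addition: leaving the
// new field unset keeps the previous behavior, which is why the change is
// backward compatible.
type NodeResourceTopologyMatchArgs struct {
	metav1.TypeMeta `json:",inline"`

	// CacheInformerMode selects which informer feeds the plugin cache.
	// +optional
	CacheInformerMode *string `json:"cacheInformerMode,omitempty"`
}
```

Because the field is a pointer and optional, older configurations deserialize unchanged.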

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 30, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ffromani

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 30, 2023
@ffromani
Contributor Author

@Huang-Wei @dankensk PTAL when you have time. Would adding a separate, optional, disabled-by-default informer be acceptable? (please check the related issue for context) I'm also investigating, on the kubelet side, whether and how terminal pods should be reported.

@ffromani
Contributor Author

/cc @swatisehgal @Tal-or

@netlify

netlify bot commented Oct 10, 2023

Deploy Preview for kubernetes-sigs-scheduler-plugins ready!

🔨 Latest commit: e466bc4
🔍 Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-scheduler-plugins/deploys/65a814f2e45b12000952aebd
😎 Deploy Preview: https://deploy-preview-599--kubernetes-sigs-scheduler-plugins.netlify.app

@ffromani ffromani changed the title WIP: POC: nrt: cache: add support for dedicated informer nrt: cache: add support for dedicated informer Oct 10, 2023
@ffromani ffromani marked this pull request as ready for review October 10, 2023 09:49
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 10, 2023
@ffromani
Contributor Author

@Huang-Wei @zwpaper since I mentioned this PR when reviewing #655, any design objections from you folks? Note this is still on hold, as we're still checking whether the fix is better done in the kubelet. But would this PR raise concerns from the scheduling side?

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 28, 2023
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 30, 2023
@ffromani
Contributor Author

@Huang-Wei @zwpaper since I mentioned this PR when reviewing #655, any design objections from you folks? Note this is still on hold, as we're still checking whether the fix is better done in the kubelet. But would this PR raise concerns from the scheduling side?

gentle ping :) no major objections I guess?

@ffromani
Contributor Author

ffromani commented Dec 4, 2023

assuming no design blockers, @Tal-or @swatisehgal PTAL
cc @PiotrProkop for awareness

@ffromani
Contributor Author

ffromani commented Dec 5, 2023

I'm also investigating, on the kubelet side, whether and how terminal pods should be reported.

Any conclusion on this?

Not yet, but we're heading in the direction that this is a bug on the kubelet side. It will need better docs (and agreement about the desired state) anyway; the discussion needs to be revived.

Perhaps we should also mention in the PR description that fixing this in the kubelet is a possible alternative. Do we want to position this in a way that the plan is to fix this in the kubelet in the long run, and this is a workaround in the short term? Can we add more details about that in the PR description?

That's a very good point. If everything goes well, the kubelet fix will land in 1.30. This change will ensure scheduler compatibility with older versions (any older supported version), which seems like a reasonable price to pay, assuming there are no other design concerns.

@zwpaper
Member

zwpaper commented Dec 7, 2023

hi @ffromani, sorry for the late reply, I was really busy those days.

I have not been able to find time to look into the changes; it would be better for other reviewers to take it.

@ffromani
Contributor Author

ffromani commented Jan 5, 2024

gentle ping @swatisehgal @Tal-or

Contributor

@swatisehgal swatisehgal left a comment


Thanks for your patience and your work on this!

I am happy to proceed with the current approach, as the introduction of the informer mode is a nice addition to the current implementation and provides a way to resolve the inconsistency between the pod data obtained from the podresources API and that visible to the scheduler.

But I do feel that, in the long term, resolving this on the kubelet side would be a more appropriate fix, as exposing resources of pods that are terminal does not seem appropriate to me. Irrespective of how we go about things from the kubelet point of view, I think we have a way forward to handle this now as well as in the future, with some minor adjustments to the default mode if needed.

I left comments suggesting a minor addition to the tests and explaining why, IMO, we should reevaluate the current default value of the mode.

Review threads (since resolved): pkg/noderesourcetopology/cache/store_test.go, apis/config/v1/defaults.go
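
One of those threads concerns apis/config/v1/defaults.go. A minimal sketch, reusing the hypothetical CacheInformerMode field and constants from the earlier sketch, of how a default could be applied there; the names and the chosen default are assumptions, not the merged code:

```go
package example

// SetDefaults_NodeResourceTopologyMatchArgs is a sketch in the style of
// apis/config/v1/defaults.go: it only fills the hypothetical mode field
// when the user left it unset.
func SetDefaults_NodeResourceTopologyMatchArgs(args *NodeResourceTopologyMatchArgs) {
	if args.CacheInformerMode == nil {
		// The review threads discuss whether this default should be revisited.
		mode := CacheInformerDedicated
		args.CacheInformerMode = &mode
	}
}
```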
@Tal-or
Contributor

Tal-or commented Jan 7, 2024

Thank you for working on this, LGTM (not adding the official label since I see there are other comments)

@ffromani
Contributor Author

ffromani commented Jan 8, 2024

Thanks for your patience and your work on this!

I am happy to proceed with the current approach, as the introduction of the informer mode is a nice addition to the current implementation and provides a way to resolve the inconsistency between the pod data obtained from the podresources API and that visible to the scheduler.

But I do feel that, in the long term, resolving this on the kubelet side would be a more appropriate fix, as exposing resources of pods that are terminal does not seem appropriate to me. Irrespective of how we go about things from the kubelet point of view, I think we have a way forward to handle this now as well as in the future, with some minor adjustments to the default mode if needed.

Thank you for the valuable feedback. You raise (and raised, now that I remember) a great point that makes me still torn about this approach. The problem is indeed the representation of completed pods.

The core issue is: should a completed (either failed or succeeded) pod be accounted for? IOW, should resources accounted to completed pods be reused if the need arises? There are valid points for both approaches.
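
To make the distinction concrete, a minimal sketch of the "completed pod" check being debated; this is purely illustrative, not the plugin's code:

```go
package example

import corev1 "k8s.io/api/core/v1"

// isTerminal reports whether a pod has completed (succeeded or failed) but may
// still exist in the apiserver because it has not been deleted yet. Whether the
// resources of such a pod should still be accounted is the open question here.
func isTerminal(pod *corev1.Pod) bool {
	return pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed
}
```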

What is more and more evident, anyway, is the fact that I'm still ruminating about the merit of the approach. This in itself is a strong indicator that we should NOT bake this workaround into the main codebase. Should the workaround be needed, there are ways to add it in third-party codebases.

At the same time, out of practicality, the issue we're trying to fix here is real and affects real-world deployments.
Hence I'm pivoting this PR to remove all but the parts which make integration of an external informer possible, and propose this as the next step.

@swatisehgal
Contributor

swatisehgal commented Jan 8, 2024

The core issue is: should a completed (either failed or succeeded) pod be accounted for? IOW, should resources accounted to completed pods be reused if the need arises? There are valid points for both approaches.

IMO the answer to your question should be NO (in the long term), but I understand that we are not in a perfect world and we can't achieve that immediately. Solving this in the scheduler obviously brings us to a better state than the one we are currently at. To be a bit more explicit, the way I see this working out in my head is as follows:

  1. We merge your current PR, which adds support for informer modes (dedicated, the default, and shared) in the scheduler plugin: we end up in a state where both the podresources API and the scheduler plugin monitor all pods, including terminal pods. This is not ideal but resolves the reconciliation issue.
  2. PodresourceAPI reports about resources of pods in terminal phase (kubernetes/kubernetes#119423) is fixed in the kubelet, either as a bug, by fully removing terminal pods from the podresources API List endpoint, or by allowing some sort of filtering.
  3. The exporters (NFD/RTE, which consume podresources API updates) and the scheduler plugin default back to shared mode, as both the scheduler plugin and the podresources API would then be accounting only non-terminal pods (the ideal state!).

What is more and more evident, anyway, is the fact that I'm still ruminating about the merit of the approach. This in itself is a strong indicator that we should NOT bake this workaround into the main codebase. Should the workaround be needed, there are ways to add it in third-party codebases.

At the same time, out of practicality, the issue we're trying to fix here is real and affects real-world deployments. Hence I'm pivoting this PR to remove all but the parts which make integration of an external informer possible, and propose this as the next step.

Just want to state that I am happy with the PR in its current form with some minor updates but we can totally explore a more streamlined PR if you think that would be better.

@ffromani
Contributor Author

The core issue is: should a completed (either failed or succeeded) pod be accounted for? IOW, should resources accounted to completed pods be reused if the need arises? There are valid points for both approaches.

IMO the answer to your question should be NO (in the long term), but I understand that we are not in a perfect world and we can't achieve that immediately. Solving this in the scheduler obviously brings us to a better state than the one we are currently at. To be a bit more explicit, the way I see this working out in my head is as follows:

1. We merge your current PR, which adds support for informer modes (dedicated, the default, and shared) in the scheduler plugin: we end up in a state where both the podresources API and the scheduler plugin monitor all pods, including terminal pods. This is not ideal but resolves the reconciliation issue.

2. [PodresourceAPI reports about resources of pods in terminal phase kubernetes/kubernetes#119423](https://github.com/kubernetes/kubernetes/issues/119423) is fixed in the kubelet, either as a bug, by fully removing terminal pods from the podresources API List endpoint, or by allowing some sort of filtering.

3. The exporters (NFD/RTE, which consume podresources API updates) and the scheduler plugin default back to shared mode, as both the scheduler plugin and the podresources API would then be accounting only non-terminal pods (the ideal state!).

What is more and more evident, anyway, is the fact that I'm still ruminating about the merit of the approach. This in itself is a strong indicator that we should NOT bake this workaround into the main codebase. Should the workaround be needed, there are ways to add it in third-party codebases.
At the same time, out of practicality, the issue we're trying to fix here is real and affects real-world deployments. Hence I'm pivoting this PR to remove all but the parts which make integration of an external informer possible, and propose this as the next step.

Just want to state that I am happy with the PR in its current form with some minor updates but we can totally explore a more streamlined PR if you think that would be better.

Thanks for sharing your insights! I was contemplating the best course of action based on your feedback.
On one hand, this is a workaround on the scheduler side, there's no question about it.
On the other hand, without this workaround the caching mechanism of the plugin can end up in an unrecoverable stall when basic Kubernetes functionality, like Jobs, is in use. The only fix is to restart the scheduler (!!!) when stalls are detected, and detecting a stall is not trivial in the first place.

All of this is bad UX at best, and a crippling blow to the plugin functionality in these scenarios.

The best course of action would be to agree on a standard in the Kubernetes, or at least node, community, but this is unlikely to happen anytime soon: kubernetes/kubernetes#119423 got little traction to begin with, and in all fairness this is a nontrivial effort.

This would leave us in a state in which the plugin functionality is severely diminished until the oldest supported version (current-2, if I'm not mistaken) is fixed, which will take at least months from now.

For all of these reasons, and because reviewers and maintainers are at worst neutral, if not mildly positive for practical reasons, about this change, I'm removing the hold and moving forward.

Granted, we should, no, we must keep working toward a conclusive and clean fix. This change can't be the long-term solution, even though "temporary" solutions tend to stick around forever. Let's steer the boat for once.

@ffromani
Contributor Author

/unhold

Begrudgingly. It's a workaround, but we can't keep a crippled plugin until we have a fixed N-2 version. We're talking about at the very least kube 1.32.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 17, 2024
Add an option to create and use a separate informer
to get unfiltered pod listing from the apiserver.

Due to a mismatched view with respect to the kubelet,
the plugin needs access to pods in terminal phase
which are not deleted, to make sure the reconciliation
is done correctly.

xref: kubernetes-sigs#598
xref: kubernetes/kubernetes#119423

Signed-off-by: Francesco Romani <fromani@redhat.com>
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 17, 2024
@ffromani
Contributor Author

Addressed all comments and fixed a subtle bug where pods were not filtered according to the new option value. Now they are.
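
A minimal sketch of the kind of mode-dependent filtering described above, reusing the hypothetical constants from the earlier sketch; the function name and logic are assumptions, not the actual fix:

```go
package example

import corev1 "k8s.io/api/core/v1"

// acceptPod sketches filtering that honors the configured mode: in dedicated
// mode terminal pods are kept so reconciliation can converge; in shared mode
// they are ignored, matching the filtered-informer behavior.
func acceptPod(pod *corev1.Pod, mode string) bool {
	if mode == CacheInformerDedicated {
		return true
	}
	return pod.Status.Phase != corev1.PodSucceeded && pod.Status.Phase != corev1.PodFailed
}
```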

Contributor

@swatisehgal swatisehgal left a comment


/lgtm

Thanks for addressing all my review comments.

/assign @Huang-Wei
for approval

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 18, 2024
@swatisehgal
Contributor

/hold
to allow time in case @Huang-Wei would like to review and prevent inadvertent merge.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 18, 2024
@k8s-ci-robot k8s-ci-robot merged commit 93c518b into kubernetes-sigs:master Jan 18, 2024
10 checks passed
@ffromani
Contributor Author

/hold to allow time in case @Huang-Wei would like to review and prevent inadvertent merge.

Looks like the bot was faster. We can revert if needed.

@ffromani ffromani deleted the nrt-extra-informer branch January 18, 2024 13:16
@swatisehgal
Contributor

Looks like the bot was faster. We can revert if needed.

Ah! I didn't expect that! Next time I will be more careful.

@Huang-Wei Please let us know if you have concerns here.

@Huang-Wei
Contributor

@Huang-Wei Please let us know if you have concerns here.

No concern from my side.

Successfully merging this pull request may close these issues.

[NodeResourceTopology] the plugin should be able to list pods in terminal state