-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Request PRR approval for two KEPs: "In-place Pod Vertical Scaling" and "Kubelet CRI extension for In-Place Pod Vertical Scaling" #2474
Conversation
Hi @vinaykul. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/assign @derekwaynecarr |
/ok-to-test |
Related-to: #2273 |
Tracking issue for this enhancement is #1287 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vinayakankugoyal you will need to finish filling out your kep.yaml and set a PRR approver:
enhancements/keps/NNNN-kep-template/kep.yaml
Lines 26 to 38 in 44dac05
# The target maturity stage in the current dev cycle for this KEP. | |
stage: alpha|beta|stable | |
# The most recent milestone for which work toward delivery of this KEP has been | |
# done. This can be the current (upcoming) milestone, if it is being actively | |
# worked on. | |
latest-milestone: "v1.19" | |
# The milestone at which this feature was, or is targeted to be, at each stage. | |
milestone: | |
alpha: "v1.19" | |
beta: "v1.20" | |
stable: "v1.22" |
Hi @ehashman - I have no context about this KEP. Was I added by mistake? |
@vinayakankugoyal Yes, Elana was responding to me. Please ignore. |
@ehashman @dchen1107 @derekwaynecarr Please review PRR section for the two KEPs related to this feature. Thanks, |
sorry @vinayakankugoyal! Different Vinay K! darn autocomplete 🤐 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/assign
Please rebase and I'll get to these ASAP.
keps/sig-node/2273-kubelet-container-resources-cri-api-changes/kep.yaml
Outdated
Show resolved
Hide resolved
/test pull-enhancements-verify |
Done. |
keps/sig-node/2273-kubelet-container-resources-cri-api-changes/README.md
Show resolved
Hide resolved
keps/sig-node/2273-kubelet-container-resources-cri-api-changes/README.md
Show resolved
Hide resolved
keps/sig-node/2273-kubelet-container-resources-cri-api-changes/README.md
Show resolved
Hide resolved
keps/sig-node/2273-kubelet-container-resources-cri-api-changes/README.md
Outdated
Show resolved
Hide resolved
PRR review comments addressed in above commit. |
Hi @vinaykul, this looks almost ready to go. It is still missing sections for Upgrade/Downgrade Strategy and Version Skew Strategy per my comment above. Please complete these: https://github.com/kubernetes/enhancements/tree/master/keps/NNNN-kep-template#upgrade--downgrade-strategy |
### Version Skew Strategy | ||
Kubelet and the CRI runtime versions are expected to match so we don't have to worry about. | ||
Previous versions of clients that are unaware of the new ResizePolicy fields would set them | ||
to nil. API server mutates such updates by copying non-nil values from old Pod to current Pod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens when the API server supports the feature but the kubelets do not? Keep in mind that kubelets can be up to n-2 versions behind the API server.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case feature would not work as expected. Enabling the feature-flag will allow apiserver to accept PATCH to PodSpec.Containers[].Resources field, but kubelet will interpret it as a change to the container definition and restart the container. What's worse is that on restart it would apply the new resources spec and thus it may oversubscribe the node.
I'm not sure if there's a good way to mitigate this without ugly stop-gap apiserver hacks to check node version. Is stretching out alpha to n+2 a reasonable approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ehashman I've updated the KEPs to target 1.24 for beta. This way we will not have to worry about the above issue in n-2 kubelet versions surprising anyone as alpha has feature-gate default-off.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case feature would not work as expected. Enabling the feature-flag will allow apiserver to accept PATCH to PodSpec.Containers[].Resources field, but kubelet will interpret it as a change to the container definition and restart the container. What's worse is that on restart it would apply the new resources spec and thus it may oversubscribe the node.
I'm not sure if there's a good way to mitigate this without ugly stop-gap apiserver hacks to check node version. Is stretching out alpha to n+2 a reasonable approach?
This is why we ask these questions, to catch them before we've implemented it without a strategy :)
I don't think we should delay beta, but this would delay GA, since we'd need to ensure that the feature gate was on in kubelets before removing it in the API server.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch indeed! I believe we turn the feature gate on in beta. So, is a surprise such as this OK in beta?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It shouldn't be a surprise, but rather, a documented failure mode.
Please update the writeup for version skew strategy documenting possible failure modes, and this is good to go from a PRR perspective. |
1287-in-place-update-pod-resources 2273-kubelet-container-resources-cri-api-changes
Updated the version skew capturing the above issue. I noticed some features are beta false. We are good if we go beta true at n+2. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
I think PRR is good to go for alpha.
@derekwaynecarr @dchen1107 can you please give this a final approval?
I requested @Random-Liu a last minute review on CRI-related extension last Tuesday. Through the offline discussion, he didn't have much concern since both containerd and runc are doing checkpoint already. But both of us can imagine some state reconciliation during containerd restart to avoid the potential race condition. We agreed those details can be addressed during the implementation phase. I am approving this KEP. /lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dchen1107, ehashman, vinaykul The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
allocations will see values that may not represent actual allocations. As a | ||
mitigation, this change needs to be documented and highlighted in the | ||
release notes, and in top-level Kubernetes documents. | ||
1. Resizing memory lower: Lowering cgroup memory limits may not work as pages |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
memory qos may help on v2 depending on a threshold we apply for memory.high
.
see: #2571
agree with @dchen1107 and see no major issues for proceeding. |
Hello @vinaykul 👋, 1.22 Docs Shadow here. Please follow the steps detailed in the documentation to open a PR against dev-1.22 branch in the k/website repo. This PR can be just a placeholder at this time and must be created before Fri July 9, 11:59 PM PDT. Also, take a look at Documenting for a release to familiarize yourself with the docs requirement for the release. Thank you! 🙏 |
As per updated process for KEP inclusion into release milestone, a Production Readiness Review (PRR) section is needed. Adding it, with answers required for items for alpha feature target.