Proposal for instance readiness KEP #1692
Should we have an example of Unknown?
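For illustration, a minimal sketch of what an `Unknown` readiness condition could look like, using the upstream `metav1.Condition` type that this KEP builds on. The `Reason` and `Message` values here are invented, not part of the proposal:

```go
package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// An "Unknown" Ready condition: readiness could not be determined,
	// e.g. because no plan has run yet. Reason/Message are hypothetical.
	cond := metav1.Condition{
		Type:               "Ready",
		Status:             metav1.ConditionUnknown,
		Reason:             "NoPlanExecuted", // hypothetical reason value
		Message:            "readiness cannot be determined: no plan has been executed yet",
		LastTransitionTime: metav1.NewTime(time.Now()),
	}
	fmt.Printf("%+v\n", cond)
}
```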
implementation detail warning :)
Is it your expectation that we will create the "message" by joining resource events (which is where we can see, for example, that a pod can't be scheduled)? That is what's in my mental model; I don't know how you would do it another way. If that is the case, is there a way to identify the resource within the message, or a defined format for the message?
I also assume we will use the constraints defined in the provided schema, which limits Reason to 1024 characters and Message to 32768, so we may need to be mindful of that and have a solution for it.
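For a sense of scale, a minimal Go sketch of how such a joined message could be built while respecting those schema limits. The `resourceEvent` type and `joinMessage` helper are hypothetical, not part of KUDO:

```go
package readiness

import "strings"

// Schema limit from metav1.Condition: Message is capped at 32768
// characters (Reason at 1024).
const maxMessageLen = 32768

// resourceEvent is a hypothetical pairing of a resource identifier with
// the event text explaining why that resource is not ready.
type resourceEvent struct {
	Resource string // e.g. "pod/myapp-0"
	Event    string // e.g. "FailedScheduling: 0/3 nodes are available"
}

// joinMessage builds a single human-readable Message from per-resource
// events and truncates it so it never exceeds the schema limit.
func joinMessage(events []resourceEvent) string {
	parts := make([]string, 0, len(events))
	for _, e := range events {
		parts = append(parts, e.Resource+": "+e.Event)
	}
	msg := strings.Join(parts, "; ")
	if len(msg) > maxMessageLen {
		msg = msg[:maxMessageLen-3] + "..."
	}
	return msg
}
```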
Are you asking whether the resource name will be machine-parseable out of that? 🤔 But yeah, the idea is that we'll join information we can get from the resources that are not ready. It won't be machine-parseable, or at least that was not my intention: the message's goal is to be human-readable, while the reason should be something a machine can rely on. If machine-parseability is a hard requirement, we can try to support it.
Yeah, good point, we should think about those constraints during implementation. It won't be a problem for Reason, as that should be an enum.
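A sketch of what such a closed Reason enum could look like; the values are hypothetical. An enum trivially satisfies the 1024-character limit and stays reliable for machine consumption:

```go
package readiness

// ReadyReason is a hypothetical closed set of Reason values for the
// Ready condition; actual values would be defined in the KEP.
type ReadyReason string

const (
	ReasonResourcesHealthy  ReadyReason = "ResourcesHealthy"
	ReasonResourcesNotReady ReadyReason = "ResourcesNotReady"
	ReasonPlanFatalError    ReadyReason = "PlanFatalError"
)
```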
Do we also need to filter out resources of other plans? E.g. what if there are additional pods running (monitoring or third-party tools) that are not ready? Currently, we set `kudo.dev/plan` as an annotation, but not as a label.
+1 to zen-dog's question (which aligns a little with the earlier mention of job pods).
Additionally, if a plan execution launches a temporary resource (say, a service for a short duration) or a pod that runs to completion, is that resource involved in the status? Is there a way to opt out of being involved? Should we design an opt-out?
Yeah, good point. I think I like the idea of an opt-out, as it's hard to capture these cases without one: plans can be used in any way imaginable. Would that, in your mind, solve the problem you pointed out, @zen-dog?
Opt-out (or maybe rather opt-in?) should work. We would only consider resources matching `'kudo.dev/plan' in (deploy, update)`, but extend the set with more plans if an operator developer specifies them. This is still full of edge cases, though, as resources can be reused between plans, and other plans, e.g. `upgrade`, can have one-time migration Jobs, etc.
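A minimal Go sketch of such a filter using the upstream selector API, assuming `kudo.dev/plan` were promoted from an annotation to a label (which, as noted above, is not the case today). The `planSelector` helper is hypothetical:

```go
package readiness

import (
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/selection"
)

// planSelector builds a selector like "kudo.dev/plan in (deploy,update)"
// that matches only resources created by the given plans.
func planSelector(plans ...string) (labels.Selector, error) {
	req, err := labels.NewRequirement("kudo.dev/plan", selection.In, plans)
	if err != nil {
		return nil, err
	}
	return labels.NewSelector().Add(*req), nil
}
```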
I'm not sure it's enough to simply pull all resources and filter by labels. The issue is that the deploy plan may have multiple steps that deploy resources, but at any given time the plan can land in `FATAL_ERROR` and cancel the execution of subsequent steps. If we were to pull only the deployed resources with a filter, we could end up with a half-executed deploy plan in `FATAL_ERROR` but a "Ready" condition reporting "True", because all the resources that were deployed are healthy even though the operator itself isn't.
I think this touches a lot of the problems we'll see with drift detection. I'm not sure whether opt-out or opt-in is the better solution here. Maybe apply this only to the "deploy" plan by default, and allow other plans to opt in?
You actually brought up a very good edge case: what if the plan fails? 🤔 If the last plan is in `FATAL_ERROR`, wouldn't it make sense for Ready to be "False"? 🤔
Or should it rather be Unknown then? I think ideally it should be "degraded", https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apimachinery/pkg/apis/meta/v1/types.go#L1288-L1298, but that does not exist yet :)
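Putting the discussed behavior together, a minimal sketch of the decision order, assuming a plan in `FATAL_ERROR` short-circuits readiness before resource health is considered. All names here are illustrative, not KUDO's actual API:

```go
package readiness

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// planStatus mirrors KUDO's plan execution states; only the value
// relevant here is spelled out, and the names are illustrative.
type planStatus string

const executionFatalError planStatus = "FATAL_ERROR"

// readyStatus derives the Ready condition from the last plan's state and
// the aggregated health of the deployed resources. A plan in FATAL_ERROR
// short-circuits to False: a half-executed deploy must not report
// Ready=True just because every resource it did deploy is healthy.
func readyStatus(lastPlan planStatus, allResourcesHealthy bool) metav1.ConditionStatus {
	if lastPlan == executionFatalError {
		return metav1.ConditionFalse
	}
	if allResourcesHealthy {
		return metav1.ConditionTrue
	}
	return metav1.ConditionFalse
}
```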
Not sure what to do for the Service type. Maybe we should drop it from the listed resources?
Yeah, probably a good idea for phase 1. We can always add it later, but I think it'll require custom code and more than Condition checking.
I've removed Services for now; I'll add them back once we have a clear path on how to assess readiness for them.