Refactor deployment status propagation #5077
Conversation
@jonjohnsonjr: 1 warning.
In response to this:
This change brings the Deployment -> Revision status propagation more in
line with how other reconcilers work. Since this is at the edge between
knative and kubernetes resource conventions, we need to transform the
deployment conditions to conform to our conventions, e.g. by inverting
appsv1.DeploymentReplicaFailure to be a happy condition and exposing a
top-level happy state ("Ready") for the deployment.

There are some changes to behavior:
- An event is no longer emitted when the deployment times out.
- We surface the underlying DeploymentProgressing Reason/Message instead of hard-coding our own.
- We surface the DeploymentReplicaFailure message as well.
Fixes #4416
Part of #5076
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
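For readers skimming the thread, here is a minimal, self-contained sketch of the kind of transformation the description is talking about. It uses the real appsv1/corev1 types, but the knativeConditions struct and propagate helper are hypothetical stand-ins for the PR's depCondSet machinery, not the actual implementation:

```go
package example

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// knativeConditions is a hypothetical stand-in for the depCondSet-managed status:
// two "happy" conditions plus a derived top-level Ready.
type knativeConditions struct {
	Progressing     corev1.ConditionStatus // mirrors appsv1.DeploymentProgressing
	ReplicaSetReady corev1.ConditionStatus // inverse of appsv1.DeploymentReplicaFailure
	Ready           corev1.ConditionStatus // top-level happy state for the deployment
}

// propagate maps the Deployment's Kubernetes-convention conditions onto
// knative-convention ones.
func propagate(d *appsv1.Deployment) knativeConditions {
	kc := knativeConditions{
		// Unknown until the Deployment reports Progressing one way or the other.
		Progressing: corev1.ConditionUnknown,
		// Absence of DeploymentReplicaFailure means no failure has occurred.
		ReplicaSetReady: corev1.ConditionTrue,
	}
	for _, cond := range d.Status.Conditions {
		switch cond.Type {
		case appsv1.DeploymentProgressing:
			// Mirror the Deployment's own Progressing status (the PR also copies
			// its Reason/Message instead of hard-coding our own).
			kc.Progressing = cond.Status
		case appsv1.DeploymentReplicaFailure:
			// Invert: ReplicaFailure=True is a bad state in Kubernetes terms, so
			// the corresponding happy condition becomes False, and vice versa.
			if cond.Status == corev1.ConditionTrue {
				kc.ReplicaSetReady = corev1.ConditionFalse
			} else if cond.Status == corev1.ConditionFalse {
				kc.ReplicaSetReady = corev1.ConditionTrue
			}
		}
	}
	// Derive the top-level happy state from the dependent conditions.
	switch {
	case kc.Progressing == corev1.ConditionTrue && kc.ReplicaSetReady == corev1.ConditionTrue:
		kc.Ready = corev1.ConditionTrue
	case kc.Progressing == corev1.ConditionFalse || kc.ReplicaSetReady == corev1.ConditionFalse:
		kc.Ready = corev1.ConditionFalse
	default:
		kc.Ready = corev1.ConditionUnknown
	}
	return kc
}
```

The real code records Reason/Message on each condition as well; this sketch only tracks the status values to keep the overall shape visible.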
func (ds *deploymentStatus) initializeConditions() {
	depCondSet.Manage(ds).InitializeConditions()
	// The absence of this condition means no failure has occurred.
	depCondSet.Manage(ds).MarkTrue(deploymentConditionReplicaSetReady)
You are not checking for the absence of this condition?
Moved this around a bit to hopefully make it clearer what's going on. This will get overwritten by the deployment's condition if it's present, otherwise we assume it's true.
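To spell out the "overwritten if present, assumed true otherwise" ordering in isolation, here is a condensed, hypothetical sketch (replicaSetReady is not the PR's code, just the same idea boiled down to one condition):

```go
package example

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// replicaSetReady condenses initializeConditions plus propagation for the single
// ReplicaSetReady condition discussed above.
func replicaSetReady(d *appsv1.Deployment) corev1.ConditionStatus {
	// initializeConditions: absence of DeploymentReplicaFailure means no failure
	// has occurred, so the happy condition starts out True...
	ready := corev1.ConditionTrue
	// ...and propagation only overwrites it when the Deployment actually reports
	// a replica failure.
	for _, cond := range d.Status.Conditions {
		if cond.Type == appsv1.DeploymentReplicaFailure && cond.Status == corev1.ConditionTrue {
			ready = corev1.ConditionFalse
		}
	}
	return ready
}
```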
// The autoscaler mutates the deployment pretty often, which would cause us
// to flip back and forth between Ready and Unknown every time we scale up
// or down.
if !rev.Status.IsActivationRequired() {
I kind of wonder if this should be folded into the propagation 🤔
Why did this move?
I kind of wonder if this should be folded into the propagation
Possibly... I'm honestly not sure why we have this at all -- we're basically masking the underlying deployment status instead of exposing it. If we just exposed it, it seems like we'd have equivalent semantics (scaled to zero deployment is ready anyway, and we'd surface fatal errors faster if we didn't have to wait for activation to timeout...).
Why did this move?
There might be a better way to do this. We're currently calling MarkDeploying
before creating the deployment. This was overwriting that, so I moved PropagateDeploymentStatus into this else clause so that we don't call it when we first create the deployment.
I'm not sure if we care to keep the "Deploying" thing now that we have this, but I was trying to minimize the test impact.
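Roughly the reconcile ordering being described, with hypothetical names (reconcileDeployment, markDeploying, propagateDeploymentStatus, isActivationRequired) standing in for the real reconciler; the get/create closures are placeholders for the lister/client calls:

```go
package example

import (
	appsv1 "k8s.io/api/apps/v1"
	apierrs "k8s.io/apimachinery/pkg/api/errors"
)

// revisionStatus is a hypothetical stand-in for the Revision's status helpers.
type revisionStatus struct {
	deploying          bool
	activationRequired bool
}

func (rs *revisionStatus) markDeploying()             { rs.deploying = true }
func (rs *revisionStatus) isActivationRequired() bool { return rs.activationRequired }
func (rs *revisionStatus) propagateDeploymentStatus(ds *appsv1.DeploymentStatus) {
	// Map the Deployment's conditions onto the Revision's (see the earlier sketch).
}

// reconcileDeployment sketches the ordering discussed above: MarkDeploying only on
// the create path, and propagation only once the deployment exists and activation
// is not in progress, so it neither clobbers "Deploying" nor flips Ready/Unknown
// on every autoscaler resize.
func reconcileDeployment(
	rs *revisionStatus,
	get func() (*appsv1.Deployment, error),
	create func() (*appsv1.Deployment, error),
) error {
	deployment, err := get()
	if apierrs.IsNotFound(err) {
		rs.markDeploying()
		_, err := create()
		return err
	} else if err != nil {
		return err
	}
	if !rs.isActivationRequired() {
		// The "else clause" mentioned above.
		rs.propagateDeploymentStatus(&deployment.Status)
	}
	return nil
}
```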
		case corev1.ConditionFalse:
			depCondSet.Manage(s).MarkFalse(deploymentConditionProgressing, cond.Reason, cond.Message)
		}
	case appsv1.DeploymentReplicaFailure:
Does this condition only have general failures across the board or is it triggered by even a single replica failure? I'm mostly making sure we don't mark the Revision as unready and subsequently cause backlash in the routing layer if only a single replica fails.
As far as I can tell, this just surfaces any error that happens when a ReplicaSet attempts to create or delete a pod, see here.
From the Deployment's description:
// ReplicaFailure is added in a deployment when one of its pods fails to be created
// or deleted.
And the ReplicaSet description expands on that:
// ReplicaSetReplicaFailure is added in a replica set when one of its pods fails to be created
// due to insufficient quota, limit ranges, pod security policy, node selectors, etc. or deleted
// due to kubelet being down or finalizers are failing.
E.g. a pod crashing wouldn't trigger this, but being unable to recreate a new pod would.
Re: the routing layer, my understanding is that a Revision being Ready=False wouldn't blackhole a revision that's already routed, but would prevent a revision from being routed in the first place.
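And since the description also promises to surface that condition's message, here is a tiny hypothetical helper (not the PR's code) showing where the Reason/Message would come from:

```go
package example

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// replicaFailure returns the Reason/Message of a true DeploymentReplicaFailure
// condition, if the Deployment carries one. Per the discussion above, quota or
// admission errors while (re)creating pods show up here; a crashing pod does not.
func replicaFailure(d *appsv1.Deployment) (reason, message string, failed bool) {
	for _, cond := range d.Status.Conditions {
		if cond.Type == appsv1.DeploymentReplicaFailure && cond.Status == corev1.ConditionTrue {
			return cond.Reason, cond.Message, true
		}
	}
	return "", "", false
}
```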
Does this supersede #4136? If so, please close the other PR or add a "Fixes" clause for it as well.
Seems like it, but looking at #496 (the issue it claims to fix) it seems like we want to surface an error if the underlying deployment can't be created at all. That falls under #5076 as well, since I'm proposing we have something like
I don't think #4136 fixes #496, but it does seem to fix #4416 🤷‍♂️
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jonjohnsonjr, mattmoor

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
/retest
1 similar comment
/retest
/retest
26e6417 to d0b208a
/retest
/lgtm
The following tests are currently flaky. Running them again to verify...
Job pull-knative-serving-integration-tests expended all 3 retries without success.
/retest