Refactor deployment status propagation #5077

Merged

Conversation

jonjohnsonjr
Contributor

This change brings the Deployment -> Revision status propagation more in
line with how other reconcilers work. Since this sits at the boundary between
Knative and Kubernetes resource conventions, we need to transform the
deployment conditions to conform to our conventions, e.g. by inverting
appsv1.DeploymentReplicaFailure into a happy condition and exposing a
top-level happy state ("Ready") for the deployment.

There are some changes to behavior:

  • An event is no longer emitted when the deployment times out.
  • We surface the underlying DeploymentProgressing Reason/Message instead
    of hard-coding our own.
  • We surface the DeploymentReplicaFailure message as well.

Fixes #4416

Part of #5076
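
To make the intended transformation concrete, here is a minimal, self-contained sketch of the idea (not the PR's actual code; the deploymentStatus type, the propagate helper, and the example input are illustrative stand-ins for the real condition-set machinery):

```go
// Illustrative sketch only: invert the Kubernetes DeploymentReplicaFailure
// condition (where True means failure) into a Knative-style happy condition,
// and roll the result up into a single top-level Ready state.
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// deploymentStatus is a simplified stand-in for the PR's condition-set backed
// status type: one happy-convention Ready plus Reason/Message.
type deploymentStatus struct {
	Ready   corev1.ConditionStatus
	Reason  string
	Message string
}

// propagate turns Kubernetes-convention Deployment conditions into a
// Knative-convention top-level Ready state.
func propagate(conds []appsv1.DeploymentCondition) deploymentStatus {
	// The absence of DeploymentReplicaFailure means no failure has occurred,
	// so start out optimistic.
	ds := deploymentStatus{Ready: corev1.ConditionTrue}
	for _, c := range conds {
		switch c.Type {
		case appsv1.DeploymentReplicaFailure:
			// Invert: ReplicaFailure=True is a failure in Kubernetes terms,
			// so it drives the happy-convention Ready to False.
			if c.Status == corev1.ConditionTrue {
				ds = deploymentStatus{corev1.ConditionFalse, c.Reason, c.Message}
			}
		case appsv1.DeploymentProgressing:
			// Surface the underlying Reason/Message (e.g.
			// ProgressDeadlineExceeded) instead of hard-coding our own.
			if c.Status == corev1.ConditionFalse {
				ds = deploymentStatus{corev1.ConditionFalse, c.Reason, c.Message}
			}
		}
	}
	return ds
}

func main() {
	// Made-up input, roughly what a quota failure would look like.
	out := propagate([]appsv1.DeploymentCondition{{
		Type:    appsv1.DeploymentReplicaFailure,
		Status:  corev1.ConditionTrue,
		Reason:  "FailedCreate",
		Message: "pods \"foo-deployment-xyz\" is forbidden: exceeded quota",
	}})
	fmt.Printf("Ready=%s Reason=%s Message=%q\n", out.Ready, out.Reason, out.Message)
}
```

Starting from Ready=True encodes the convention that the absence of a failure condition means no failure has occurred.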

@googlebot googlebot added the cla: yes (Indicates the PR's author has signed the CLA) label on Aug 6, 2019
@knative-prow-robot knative-prow-robot added the area/API (API objects and controllers), size/L (Denotes a PR that changes 100-499 lines, ignoring generated files), and area/test-and-release (flags unit/e2e/conformance/perf test issues for product features) labels on Aug 6, 2019
Contributor

@knative-prow-robot knative-prow-robot left a comment

@jonjohnsonjr: 1 warning.

In response to this: (the PR description quoted above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jonjohnsonjr
Contributor Author

/assign @dgerd @mattmoor

func (ds *deploymentStatus) initializeConditions() {
depCondSet.Manage(ds).InitializeConditions()
// The absence of this condition means no failure has occurred.
depCondSet.Manage(ds).MarkTrue(deploymentConditionReplicaSetReady)
Member

You are not checking for the absence of this condition?

Contributor Author

Moved this around a bit to hopefully make it clearer what's going on. This will get overwritten by the deployment's condition if one is present; otherwise we assume it's true.
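
A tiny sketch of that default-then-overwrite pattern, assuming simplified stand-ins (the condition struct, markReplicaSetReady, and reconcileReplicaFailure below are hypothetical, not the PR's depCondSet-based code):

```go
package status

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// condition is a simplified stand-in for the condition managed by depCondSet.
type condition struct {
	Status  corev1.ConditionStatus
	Reason  string
	Message string
}

// markReplicaSetReady records the happy-convention ReplicaSetReady condition.
func markReplicaSetReady(dst *condition, status corev1.ConditionStatus, reason, message string) {
	*dst = condition{Status: status, Reason: reason, Message: message}
}

// reconcileReplicaFailure defaults ReplicaSetReady to True (absence of
// DeploymentReplicaFailure means no failure has occurred), then lets the
// Deployment's own condition, if present, overwrite that default.
func reconcileReplicaFailure(dst *condition, conds []appsv1.DeploymentCondition) {
	markReplicaSetReady(dst, corev1.ConditionTrue, "", "")
	for _, c := range conds {
		if c.Type == appsv1.DeploymentReplicaFailure {
			// Invert: ReplicaFailure=True means ReplicaSetReady=False.
			inverted := corev1.ConditionTrue
			switch c.Status {
			case corev1.ConditionTrue:
				inverted = corev1.ConditionFalse
			case corev1.ConditionUnknown:
				inverted = corev1.ConditionUnknown
			}
			markReplicaSetReady(dst, inverted, c.Reason, c.Message)
		}
	}
}
```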

pkg/apis/serving/v1alpha1/revision_lifecycle.go: 3 outdated review threads (resolved)
// The autoscaler mutates the deployment pretty often, which would cause us
// to flip back and forth between Ready and Unknown every time we scale up
// or down.
if !rev.Status.IsActivationRequired() {
Member

I kind of wonder if this should be folded into the propagation 🤔

Why did this move?

Contributor Author

I kind of wonder if this should be folded into the propagation

Possibly... I'm honestly not sure why we have this at all -- we're basically masking the underlying deployment status instead of exposing it. If we just exposed it, it seems like we'd have equivalent semantics (a scaled-to-zero deployment is ready anyway, and we'd surface fatal errors faster if we didn't have to wait for activation to time out...).

Why did this move?

There might be a better way to do this. We're currently calling MarkDeploying before creating the deployment. This was overwriting that, so I moved PropagateDeploymentStatus into this else clause so that we don't call it when we first create the deployment.

I'm not sure if we care to keep the "Deploying" thing now that we have this, but I was trying to minimize the test impact.
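
Roughly the ordering being described, as a hedged sketch; the reconcileDeployment shape and the revisionStatus interface below are assumptions made for illustration, with only the MarkDeploying / PropagateDeploymentStatus names taken from this thread:

```go
package reconciler

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
)

// revisionStatus is a hypothetical stand-in for the Revision status helpers
// named in this thread; the real types live in pkg/apis/serving.
type revisionStatus interface {
	MarkDeploying(reason string)
	PropagateDeploymentStatus(ds *appsv1.DeploymentStatus)
}

// reconcileDeployment sketches the ordering discussed above: status is only
// propagated from a Deployment that already exists, so the initial
// "Deploying" marking is not immediately overwritten.
func reconcileDeployment(ctx context.Context, status revisionStatus,
	get func(ctx context.Context) (*appsv1.Deployment, bool, error),
	create func(ctx context.Context) error) error {

	deployment, found, err := get(ctx)
	if err != nil {
		return err
	}
	if !found {
		// First reconcile: the Deployment doesn't exist yet.
		status.MarkDeploying("Deploying")
		return create(ctx)
	}
	// Subsequent reconciles: surface the Deployment's conditions.
	status.PropagateDeploymentStatus(&deployment.Status)
	return nil
}
```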

case corev1.ConditionFalse:
depCondSet.Manage(s).MarkFalse(deploymentConditionProgressing, cond.Reason, cond.Message)
}
case appsv1.DeploymentReplicaFailure:
Contributor

Does this condition only have general failures across the board or is it triggered by even a single replica failure? I'm mostly making sure we don't mark the Revision as unready and subsequently cause backlash in the routing layer if only a single replica fails.

Contributor Author

As far as I can tell, this just surfaces any error that happens when a ReplicaSet attempts to create or delete a pod, see here.

From the Deployment's description:

// ReplicaFailure is added in a deployment when one of its pods fails to be created
// or deleted.

And the ReplicaSet description expands on that:

// ReplicaSetReplicaFailure is added in a replica set when one of its pods fails to be created
// due to insufficient quota, limit ranges, pod security policy, node selectors, etc. or deleted
// due to kubelet being down or finalizers are failing.

E.g. a pod crashing wouldn't trigger this, but being unable to create a replacement pod would.

Re: the routing layer, my understanding is that a Revision being Ready=False wouldn't blackhole a revision that's already routed, but would prevent a revision from being routed in the first place.
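
For reference, a small illustrative helper (not from the PR) showing how the condition under discussion can be pulled out of a Deployment so its Reason/Message, e.g. a quota failure, can be surfaced:

```go
package status

import appsv1 "k8s.io/api/apps/v1"

// replicaFailure returns the DeploymentReplicaFailure condition, if the
// Deployment is reporting one. Kubernetes only sets this when a ReplicaSet
// fails to create or delete a pod (quota, limit ranges, node selectors,
// kubelet down, stuck finalizers, ...), not when an existing pod crashes.
func replicaFailure(d *appsv1.Deployment) (appsv1.DeploymentCondition, bool) {
	for _, c := range d.Status.Conditions {
		if c.Type == appsv1.DeploymentReplicaFailure {
			return c, true
		}
	}
	return appsv1.DeploymentCondition{}, false
}
```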

@markusthoemmes
Contributor

Does this supersede #4136? If so, please close the other PR or add a "Fixes" clause for it as well.

@jonjohnsonjr
Contributor Author

Does this supersede #4136

Seems like it, but looking at #496 (the issue it claims to fix), it seems like we want to surface an error if the underlying deployment can't be created at all. That falls under #5076 as well, since I'm proposing we have something like MarkReconcileError to surface any error that occurs during reconciliation, but that would be a separate change.

I don't think #4136 fixes #496, but it does seem to fix #4416 🤷‍♂️

Member

@mattmoor mattmoor left a comment

/lgtm
/approve

@knative-prow-robot knative-prow-robot added the lgtm (Indicates that a PR is ready to be merged) label on Aug 7, 2019
@knative-prow-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jonjohnsonjr, mattmoor

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files) label on Aug 7, 2019
@jonjohnsonjr
Contributor Author

/retest

@mattmoor
Member

mattmoor commented Aug 8, 2019

/retest

@jonjohnsonjr
Contributor Author

Failed to update Service: Internal error occurred: no kind "Service" is registered for version "serving.knative.dev/v1beta1" in scheme "k8s.io/apiextensions-apiserver/pkg/apiserver/apiserver.go:52"


@jonjohnsonjr
Contributor Author

/retest

@jonjohnsonjr jonjohnsonjr force-pushed the propagate-deployment-status branch from 26e6417 to d0b208a on August 8, 2019 18:59
@knative-prow-robot knative-prow-robot removed the lgtm (Indicates that a PR is ready to be merged) label on Aug 8, 2019
@jonjohnsonjr
Contributor Author

@mattmoor I had to rebase to resolve conflicts with #5074; can I get another LGTM? 😄

@jonjohnsonjr
Contributor Author

service_test.go:137: Service service-robfjbzy was not updated with annotation in its RevisionTemplateSpec: Internal error occurred: no kind "Service" is registered for version "serving.knative.dev/v1beta1" in scheme "k8s.io/apiextensions-apiserver/pkg/apiserver/apiserver.go:52"

/retest

Member

@mattmoor mattmoor left a comment

/lgtm

@knative-prow-robot knative-prow-robot added the lgtm (Indicates that a PR is ready to be merged) label on Aug 8, 2019
@knative-test-reporter-robot

The following tests are currently flaky. Running them again to verify...

| Test name | Retries |
| --- | --- |
| pull-knative-serving-integration-tests | 3/3 |

Job pull-knative-serving-integration-tests expended all 3 retries without success.

@jonjohnsonjr
Contributor Author

blue_green_test.go:103: Failed to update Service: Internal error occurred: no kind "Service" is registered for version "serving.knative.dev/v1beta1" in scheme "k8s.io/apiextensions-apiserver/pkg/apiserver/apiserver.go:52"

/retest

Successfully merging this pull request may close these issues.

ResourceQuota error isn't reflected in kservice status