Refactor deployment status propagation #5077

Merged

Conversation

jonjohnsonjr
Contributor

This change brings the Deployment -> Revision status propagation more in
line with how other reconcilers work. Since this sits at the boundary between
Knative and Kubernetes resource conventions, we need to transform the
deployment conditions to conform to our conventions, e.g. by inverting
appsv1.DeploymentReplicaFailure into a happy condition and exposing a
top-level happy state ("Ready") for the deployment.

There are some changes to behavior:

  • An event is no longer emitted when the deployment times out.
  • We surface the underlying DeploymentProgressing Reason/Message instead
    of hard-coding our own.
  • We surface the DeploymentReplicaFailure message as well.

Fixes #4416

Part of #5076
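
To make the intended transformation concrete, here is a minimal, self-contained sketch of the idea (not the PR's actual code; the deploymentStatus type, the propagate helper, and the example input are illustrative stand-ins for the real condition-set machinery):

```go
// Illustrative sketch only: invert the Kubernetes DeploymentReplicaFailure
// condition (where True means failure) into a Knative-style happy condition,
// and roll the result up into a single top-level Ready state.
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// deploymentStatus is a simplified stand-in for the PR's condition-set backed
// status type: one happy-convention Ready plus Reason/Message.
type deploymentStatus struct {
	Ready   corev1.ConditionStatus
	Reason  string
	Message string
}

// propagate turns Kubernetes-convention Deployment conditions into a
// Knative-convention top-level Ready state.
func propagate(conds []appsv1.DeploymentCondition) deploymentStatus {
	// The absence of DeploymentReplicaFailure means no failure has occurred,
	// so start out optimistic.
	ds := deploymentStatus{Ready: corev1.ConditionTrue}
	for _, c := range conds {
		switch c.Type {
		case appsv1.DeploymentReplicaFailure:
			// Invert: ReplicaFailure=True is a failure in Kubernetes terms,
			// so it drives the happy-convention Ready to False.
			if c.Status == corev1.ConditionTrue {
				ds = deploymentStatus{corev1.ConditionFalse, c.Reason, c.Message}
			}
		case appsv1.DeploymentProgressing:
			// Surface the underlying Reason/Message (e.g.
			// ProgressDeadlineExceeded) instead of hard-coding our own.
			if c.Status == corev1.ConditionFalse {
				ds = deploymentStatus{corev1.ConditionFalse, c.Reason, c.Message}
			}
		}
	}
	return ds
}

func main() {
	// Made-up input, roughly what a quota failure would look like.
	out := propagate([]appsv1.DeploymentCondition{{
		Type:    appsv1.DeploymentReplicaFailure,
		Status:  corev1.ConditionTrue,
		Reason:  "FailedCreate",
		Message: "pods \"foo-deployment-xyz\" is forbidden: exceeded quota",
	}})
	fmt.Printf("Ready=%s Reason=%s Message=%q\n", out.Ready, out.Reason, out.Message)
}
```

Starting from Ready=True encodes the convention that the absence of a failure condition means no failure has occurred.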

@googlebot googlebot added the cla: yes (Indicates the PR's author has signed the CLA) label on Aug 6, 2019
@knative-prow-robot knative-prow-robot added the area/API (API objects and controllers), size/L (Denotes a PR that changes 100-499 lines, ignoring generated files), and area/test-and-release (flags unit/e2e/conformance/perf test issues for product features) labels on Aug 6, 2019
Contributor

@knative-prow-robot knative-prow-robot left a comment

@jonjohnsonjr: 1 warning.

In response to this: (the PR description quoted above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jonjohnsonjr
Contributor Author

/assign @dgerd @mattmoor

func (ds *deploymentStatus) initializeConditions() {
depCondSet.Manage(ds).InitializeConditions()
// The absence of this condition means no failure has occurred.
depCondSet.Manage(ds).MarkTrue(deploymentConditionReplicaSetReady)
Member

You are not checking for the absence of this condition?

Contributor Author

Moved this around a bit to hopefully make it clearer what's going on. This will get overwritten by the deployment's condition if one is present; otherwise we assume it's true.
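
A tiny sketch of that default-then-overwrite pattern, assuming simplified stand-ins (the condition struct, markReplicaSetReady, and reconcileReplicaFailure below are hypothetical, not the PR's depCondSet-based code):

```go
package status

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// condition is a simplified stand-in for the condition managed by depCondSet.
type condition struct {
	Status  corev1.ConditionStatus
	Reason  string
	Message string
}

// markReplicaSetReady records the happy-convention ReplicaSetReady condition.
func markReplicaSetReady(dst *condition, status corev1.ConditionStatus, reason, message string) {
	*dst = condition{Status: status, Reason: reason, Message: message}
}

// reconcileReplicaFailure defaults ReplicaSetReady to True (absence of
// DeploymentReplicaFailure means no failure has occurred), then lets the
// Deployment's own condition, if present, overwrite that default.
func reconcileReplicaFailure(dst *condition, conds []appsv1.DeploymentCondition) {
	markReplicaSetReady(dst, corev1.ConditionTrue, "", "")
	for _, c := range conds {
		if c.Type == appsv1.DeploymentReplicaFailure {
			// Invert: ReplicaFailure=True means ReplicaSetReady=False.
			inverted := corev1.ConditionTrue
			switch c.Status {
			case corev1.ConditionTrue:
				inverted = corev1.ConditionFalse
			case corev1.ConditionUnknown:
				inverted = corev1.ConditionUnknown
			}
			markReplicaSetReady(dst, inverted, c.Reason, c.Message)
		}
	}
}
```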

pkg/apis/serving/v1alpha1/revision_lifecycle.go: 3 outdated review threads (resolved)
// The autoscaler mutates the deployment pretty often, which would cause us
// to flip back and forth between Ready and Unknown every time we scale up
// or down.
if !rev.Status.IsActivationRequired() {
Member

I kind of wonder if this should be folded into the propagation 🤔

Why did this move?

Contributor Author

I kind of wonder if this should be folded into the propagation

Possibly... I'm honestly not sure why we have this at all -- we're basically masking the underlying deployment status instead of exposing it. If we just exposed it, it seems like we'd have equivalent semantics (a scaled-to-zero deployment is ready anyway, and we'd surface fatal errors faster if we didn't have to wait for activation to time out...).

Why did this move?

There might be a better way to do this. We're currently calling MarkDeploying before creating the deployment. This was overwriting that, so I moved PropagateDeploymentStatus into this else clause so that we don't call it when we first create the deployment.

I'm not sure if we care to keep the "Deploying" thing now that we have this, but I was trying to minimize the test impact.
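
Roughly the ordering being described, as a hedged sketch; the reconcileDeployment shape and the revisionStatus interface below are assumptions made for illustration, with only the MarkDeploying / PropagateDeploymentStatus names taken from this thread:

```go
package reconciler

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
)

// revisionStatus is a hypothetical stand-in for the Revision status helpers
// named in this thread; the real types live in pkg/apis/serving.
type revisionStatus interface {
	MarkDeploying(reason string)
	PropagateDeploymentStatus(ds *appsv1.DeploymentStatus)
}

// reconcileDeployment sketches the ordering discussed above: status is only
// propagated from a Deployment that already exists, so the initial
// "Deploying" marking is not immediately overwritten.
func reconcileDeployment(ctx context.Context, status revisionStatus,
	get func(ctx context.Context) (*appsv1.Deployment, bool, error),
	create func(ctx context.Context) error) error {

	deployment, found, err := get(ctx)
	if err != nil {
		return err
	}
	if !found {
		// First reconcile: the Deployment doesn't exist yet.
		status.MarkDeploying("Deploying")
		return create(ctx)
	}
	// Subsequent reconciles: surface the Deployment's conditions.
	status.PropagateDeploymentStatus(&deployment.Status)
	return nil
}
```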

case corev1.ConditionFalse:
depCondSet.Manage(s).MarkFalse(deploymentConditionProgressing, cond.Reason, cond.Message)
}
case appsv1.DeploymentReplicaFailure:
Contributor

Does this condition only have general failures across the board or is it triggered by even a single replica failure? I'm mostly making sure we don't mark the Revision as unready and subsequently cause backlash in the routing layer if only a single replica fails.

Contributor Author

As far as I can tell, this just surfaces any error that happens when a ReplicaSet attempts to create or delete a pod, see here.

From the Deployment's description:

// ReplicaFailure is added in a deployment when one of its pods fails to be created
// or deleted.

And the ReplicaSet description expands on that:

// ReplicaSetReplicaFailure is added in a replica set when one of its pods fails to be created
// due to insufficient quota, limit ranges, pod security policy, node selectors, etc. or deleted
// due to kubelet being down or finalizers are failing.

E.g. a pod crashing wouldn't trigger this, but being unable to create a replacement pod would.

Re: the routing layer, my understanding is that a Revision being Ready=False wouldn't blackhole a revision that's already routed, but would prevent a revision from being routed in the first place.
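
For reference, a small illustrative helper (not from the PR) showing how the condition under discussion can be pulled out of a Deployment so its Reason/Message, e.g. a quota failure, can be surfaced:

```go
package status

import appsv1 "k8s.io/api/apps/v1"

// replicaFailure returns the DeploymentReplicaFailure condition, if the
// Deployment is reporting one. Kubernetes only sets this when a ReplicaSet
// fails to create or delete a pod (quota, limit ranges, node selectors,
// kubelet down, stuck finalizers, ...), not when an existing pod crashes.
func replicaFailure(d *appsv1.Deployment) (appsv1.DeploymentCondition, bool) {
	for _, c := range d.Status.Conditions {
		if c.Type == appsv1.DeploymentReplicaFailure {
			return c, true
		}
	}
	return appsv1.DeploymentCondition{}, false
}
```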

@markusthoemmes
Contributor

Does this supersede #4136? If so, please close the other PR or add a "Fixes" clause for it as well.

@jonjohnsonjr
Contributor Author

Does this supersede #4136

Seems like it, but looking at #496 (the issue it claims to fix), it seems like we want to surface an error if the underlying deployment can't be created at all. That falls under #5076 as well, since I'm proposing we have something like MarkReconcileError to surface any error that occurs during reconciliation, but that would be a separate change.

I don't think #4136 fixes #496, but it does seem to fix #4416 🤷‍♂️

Member

@mattmoor mattmoor left a comment

/lgtm
/approve

@knative-prow-robot knative-prow-robot added the lgtm (Indicates that a PR is ready to be merged) label on Aug 7, 2019
@knative-prow-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jonjohnsonjr, mattmoor

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files) label on Aug 7, 2019
@jonjohnsonjr
Contributor Author

/retest

@mattmoor
Member

mattmoor commented Aug 8, 2019

/retest

@jonjohnsonjr
Contributor Author

Failed to update Service: Internal error occurred: no kind "Service" is registered for version "serving.knative.dev/v1beta1" in scheme "k8s.io/apiextensions-apiserver/pkg/apiserver/apiserver.go:52"


@jonjohnsonjr
Contributor Author

/retest

@jonjohnsonjr jonjohnsonjr force-pushed the propagate-deployment-status branch from 26e6417 to d0b208a on August 8, 2019 18:59
@knative-prow-robot knative-prow-robot removed the lgtm (Indicates that a PR is ready to be merged) label on Aug 8, 2019
@jonjohnsonjr
Contributor Author

@mattmoor I had to rebase to resolve conflicts with #5074; can I get another LGTM? 😄

@jonjohnsonjr
Contributor Author

service_test.go:137: Service service-robfjbzy was not updated with annotation in its RevisionTemplateSpec: Internal error occurred: no kind "Service" is registered for version "serving.knative.dev/v1beta1" in scheme "k8s.io/apiextensions-apiserver/pkg/apiserver/apiserver.go:52"

/retest

Member

@mattmoor mattmoor left a comment

/lgtm

@knative-prow-robot knative-prow-robot added the lgtm (Indicates that a PR is ready to be merged) label on Aug 8, 2019
@knative-test-reporter-robot

The following tests are currently flaky. Running them again to verify...

| Test name | Retries |
| --- | --- |
| pull-knative-serving-integration-tests | 3/3 |

Job pull-knative-serving-integration-tests expended all 3 retries without success.

@jonjohnsonjr
Contributor Author

blue_green_test.go:103: Failed to update Service: Internal error occurred: no kind "Service" is registered for version "serving.knative.dev/v1beta1" in scheme "k8s.io/apiextensions-apiserver/pkg/apiserver/apiserver.go:52"

/retest

Successfully merging this pull request may close these issues.

ResourceQuota error isn't reflected in kservice status