apps: deployment config stuck in the new state should respect timeoutSeconds #17000

mfojtik · 2017-10-23T15:02:22Z

With this patch the deployment config controller marks the deployment as failed (timeout) once it has been in the 'new' state for longer than timeoutSeconds. This generally happens when the deployment is not able to create the deployer pod (for example, because of quota). We should not wait indefinitely for quota to become available.

mfojtik · 2017-10-23T15:02:44Z

need unit test...

tnozicka · 2017-10-23T16:27:43Z

pkg/apps/controller/deploymentconfig/deploymentconfig_controller.go

+			// In case we fail to created the deployer pod within strategy
+			// timeoutSeconds do not leave the deployment config in New state forever
+			// but timeout.
+			if deployutil.ConfigExceededTimeoutSeconds(config, latestRC) {


@mfojtik I think you need to cancel that deployment (RC) as well.

I would rather transition it to failed.

this is the DC part, cancellation of latest deployment must be part of the Handle() because I need client access.

mfojtik · 2017-10-24T11:48:11Z

@tnozicka @Kargakis @smarterclayton PTAL

Basically I use timeoutSeconds from strategy to transition rollout to "failed" state when the deployment is stuck in "new" state for longer than timeoutSeconds (10 minutes is the cluster default).
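The transition described above can be sketched as a small pure function. This is only an illustration under stated assumptions: the `Phase` type, its constants, and `nextPhase` are hypothetical names, not the controller's actual code; only the reason string is taken from this PR.

```go
package main

import "fmt"

// Phase is a hypothetical stand-in for the deployment status kept on the RC.
type Phase string

const (
	PhaseNew     Phase = "New"
	PhaseRunning Phase = "Running"
	PhaseFailed  Phase = "Failed"
)

// reasonUnableToCreateDeployerPod mirrors the message added in this PR.
const reasonUnableToCreateDeployerPod = "unable to create deployer pod"

// nextPhase fails a rollout that has stayed in New for longer than
// timeoutSeconds. Any other phase is returned unchanged with an empty reason.
func nextPhase(current Phase, ageSeconds, timeoutSeconds int64) (Phase, string) {
	if current == PhaseNew && ageSeconds > timeoutSeconds {
		return PhaseFailed, reasonUnableToCreateDeployerPod
	}
	return current, ""
}

func main() {
	// A rollout stuck in New for 601s with the 600s default times out.
	phase, reason := nextPhase(PhaseNew, 601, 600)
	fmt.Println(phase, reason)
}
```

Only rollouts that are still in New are affected; a rollout that already has a running deployer pod is left to its own timeout handling.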

mfojtik · 2017-10-24T11:50:26Z

pkg/apps/apis/apps/types.go

@@ -113,6 +113,7 @@ const (
 	DeploymentCancelledNewerDeploymentExists  = "newer deployment was found running"
 	DeploymentFailedUnrelatedDeploymentExists = "unrelated pod with the same name as this deployment is already running"
 	DeploymentFailedDeployerPodNoLongerExists = "deployer pod no longer exists"
+	DeploymentFailedUnableToCreateDeployerPod = "unable to create deployer pod"


this looks nice in oc status:

dc/test deploys docker.io/library/centos:7
  deployment #4 failed 44 minutes ago: unable to create deployer pod
  deployment #3 deployed 2 hours ago - 1 pod
  deployment #2 deployed 2 hours ago

tnozicka

I would prefer this code in DeploymentController instead of DeploymentConfigController. Logically this is about watching RC and transitioning it to FailedPhase. DeploymentController is the state machine that updates deployment phases and doing it from outside will only force conflicts.

tnozicka · 2017-10-24T11:56:48Z

pkg/apps/controller/deploymentconfig/deploymentconfig_controller.go

+	if !deployutil.IsNewDeployment(deployment) {
+		return nil
+	}
+	return retry.RetryOnConflict(retry.DefaultBackoff, func() error {


what if the deployment transitions to succeeded?
you have declared that you want to act only on state new, but now you ignore that with the retry...

I think a check that the deployment is still new is OK; honestly, I don't think that will ever happen

@mfojtik 409?

after a 409, none of your previous assertions is valid any longer, and transitioning from Succeeded to Failed might have unexpected side effects
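The concern in this exchange can be illustrated with a retry loop that refreshes the object and re-evaluates the "still new" precondition on every attempt. This is a minimal sketch, not the code in this PR: `store`, `deployment`, `errConflict`, and `failIfStillNew` are hypothetical, the in-memory store only imitates resourceVersion-based conflicts, and the phase strings follow the wording used in the comments above.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict imitates an HTTP 409 from the API server.
var errConflict = errors.New("conflict (HTTP 409)")

type deployment struct {
	phase           string
	resourceVersion int
}

// store is a hypothetical in-memory stand-in for the API server.
type store struct{ current deployment }

func (s *store) get() deployment { return s.current }

// update rejects writes that are based on a stale resourceVersion.
func (s *store) update(d deployment) error {
	if d.resourceVersion != s.current.resourceVersion {
		return errConflict
	}
	d.resourceVersion++
	s.current = d
	return nil
}

// failIfStillNew re-checks the precondition on every attempt, so a conflict
// never turns a deployment that has left the New phase into Failed.
func failIfStillNew(s *store, cached deployment, attempts int) (bool, error) {
	d := cached
	for i := 0; i < attempts; i++ {
		if d.phase != "New" {
			return false, nil // someone else moved it on; leave it alone
		}
		d.phase = "Failed"
		err := s.update(d)
		if err == nil {
			return true, nil
		}
		if !errors.Is(err, errConflict) {
			return false, err
		}
		d = s.get() // refresh, then evaluate the precondition again
	}
	return false, errConflict
}

func main() {
	// The cached copy says New, but the stored object already succeeded.
	s := &store{current: deployment{phase: "Succeeded", resourceVersion: 2}}
	changed, err := failIfStillNew(s, deployment{phase: "New", resourceVersion: 1}, 3)
	fmt.Println(changed, err, s.get().phase)
}
```

With the precondition inside the loop, the 409 causes a fresh read and the Succeeded deployment is never overwritten.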

tnozicka · 2017-10-24T12:00:57Z

pkg/apps/util/util.go

@@ -782,6 +788,23 @@ func DeploymentsForCleanup(configuration *deployapi.DeploymentConfig, deployment
 	return relevantDeployments
 }

+func ConfigExceededTimeoutSeconds(config *deployapi.DeploymentConfig, latestRC *v1.ReplicationController) bool {


DeploymentExceededTimeoutSeconds

right, I was too lazy to change it, now you beat me to it :)

tnozicka · 2017-10-24T12:03:53Z

pkg/apps/util/util.go

+	var timeoutSeconds int64
+	if params := config.Spec.Strategy.RollingParams; params != nil {
+		timeoutSeconds = deployapi.DefaultRollingTimeoutSeconds
+		if params.TimeoutSeconds != nil {


I think this is the wrong property; check the description here:

origin/pkg/apps/apis/apps/types.go

Lines 272 to 273 in fb1679e

// TimeoutSeconds is the time to wait for updates before giving up. If the
// value is nil, a default will be used.

Seems like activeDeadlineSeconds is what we use for timeout

origin/pkg/apps/apis/apps/types.go

Lines 219 to 221 in fb1679e

// ActiveDeadlineSeconds is the duration in seconds that the deployer pods for this deployment
// config may be active on a node before the system actively tries to terminate them.
ActiveDeadlineSeconds *int64

also general timeout shouldn't be dependent on strategy

I think we want // TimeoutSeconds is the time to wait for updates before giving up. If the
// value is nil, a default will be used.

Because literally this thing is about not getting any progress since we are not able to create deployer pod. ActiveDeadlineSeconds sets the duration for the deployer pod, but in this case there is no deployer pod.

tnozicka · 2017-10-24T12:05:28Z

pkg/apps/util/util.go

+			timeoutSeconds = *params.TimeoutSeconds
+		}
+	}
+	return int64(time.Since(latestRC.CreationTimestamp.Time)*time.Second) > timeoutSeconds


I need to check how we do rollbacks because in that case creationTimestamp might be an issue

the RC has to be in "new" state for this to trigger... it is only in new when it is new :)

Yeah, we do rollbacks differently from upstream anyway (we always create a new RC).
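Separately from the rollback question, the comparison in the hunk above multiplies the elapsed `time.Duration` by `time.Second`, while the later revision quoted further down in this thread calls `.Seconds()` instead. A `Duration` is already a nanosecond count, so the two expressions give very different numbers. The sketch below uses only the standard `time` package; the function and variable names are illustrative.

```go
package main

import (
	"fmt"
	"time"
)

// fiveSeconds is a sample elapsed time used for the comparison.
var fiveSeconds = 5 * time.Second

// elapsedScaled reproduces the shape of the expression in the hunk: the
// Duration (a nanosecond count) is multiplied by time.Second (1e9).
func elapsedScaled(d time.Duration) int64 {
	return int64(d * time.Second)
}

// elapsedSeconds converts the Duration to whole seconds.
func elapsedSeconds(d time.Duration) int64 {
	return int64(d.Seconds())
}

func main() {
	fmt.Println(elapsedScaled(fiveSeconds))  // 5000000000000000000
	fmt.Println(elapsedSeconds(fiveSeconds)) // 5
}
```

With the scaled form, any nonzero elapsed time compares as far larger than a 600-second timeout, and durations of roughly ten seconds or more overflow int64.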

tnozicka

@mfojtik this is what I had in mind. The only question remaining for me is where we take that timeout

tnozicka · 2017-10-24T14:44:58Z

pkg/apps/controller/deploymentconfig/deploymentconfig_controller.go

@@ -128,6 +128,7 @@ func (c *DeploymentConfigController) Handle(config *deployapi.DeploymentConfig)
 			return err
 		}
 	}
+


tnozicka · 2017-10-24T14:52:19Z

/retest

tnozicka · 2017-10-24T15:05:14Z

@mfojtik @smarterclayton @Kargakis I feel like we are missing a proper deployment timeout in the DC API object; the closest thing to it is DC.Spec.Strategy.ActiveDeadlineSeconds. A timeout should be a property of DC.Spec.Strategy (or DC.Spec), not of a particular strategy, and CustomDeploymentStrategyParams is missing the timeout @mfojtik wants to use. I think it would be better to rework activeDeadlineSeconds, as it is common to all strategies and is already used as the timeout for the deployer pod.

mfojtik · 2017-10-24T16:59:10Z

flake: #17024

/retest

mfojtik · 2017-10-24T19:59:04Z

/retest

tnozicka · 2017-10-25T09:07:10Z

pkg/apps/controller/deployer/deployer_controller.go

+		if deployutil.RolloutExceededTimeoutSeconds(config, deployment) {
+			nextStatus = deployapi.DeploymentStatusFailed
+			updatedAnnotations[deployapi.DeploymentStatusReasonAnnotation] = deployapi.DeploymentFailedUnableToCreateDeployerPod
+			break


@mfojtik this would be a good place to log

tnozicka · 2017-10-25T10:12:57Z

pkg/apps/util/util.go

+		}
+	}
+	// For "custom" strategy use the default for recreate strategy.
+	if timeoutSeconds == 0 {


I don't think this will work. If one of the strategies will have TimeoutSeconds=0 this will overwrite it

you should probably switch on strategy type
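A sketch of the "switch on strategy type" suggestion, using hypothetical, simplified types rather than the real deployapi structs. Selecting the parameters by type and testing the pointer for nil keeps an explicit 0 distinct from "not set". The 600-second default is the value mentioned in this thread.

```go
package main

import "fmt"

// defaultTimeoutSeconds is the default mentioned in this thread.
const defaultTimeoutSeconds int64 = 600

type strategyType string

const (
	strategyRolling  strategyType = "Rolling"
	strategyRecreate strategyType = "Recreate"
	strategyCustom   strategyType = "Custom"
)

// params and strategy are hypothetical, trimmed-down stand-ins for the
// strategy types in the real API.
type params struct{ TimeoutSeconds *int64 }

type strategy struct {
	Type           strategyType
	RollingParams  *params
	RecreateParams *params
}

// timeoutFor picks the parameters that belong to the configured strategy and
// falls back to the default only when no timeout was specified at all.
func timeoutFor(s strategy) int64 {
	var p *params
	switch s.Type {
	case strategyRolling:
		p = s.RollingParams
	case strategyRecreate:
		p = s.RecreateParams
	}
	if p != nil && p.TimeoutSeconds != nil {
		return *p.TimeoutSeconds // an explicit 0 is returned as 0
	}
	return defaultTimeoutSeconds
}

func main() {
	zero := int64(0)
	rolling := strategy{Type: strategyRolling, RollingParams: &params{TimeoutSeconds: &zero}}
	fmt.Println(timeoutFor(rolling))                        // 0
	fmt.Println(timeoutFor(strategy{Type: strategyCustom})) // 600
}
```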

tnozicka · 2017-10-25T11:46:22Z

pkg/apps/util/util.go

+// (like quota, etc...). In that case deployer controller use this function to
+// measure if the created deployment (RC) exceeded the timeout.
+func RolloutExceededTimeoutSeconds(config *deployapi.DeploymentConfig, latestRC *v1.ReplicationController) bool {
+	return int64(time.Since(latestRC.CreationTimestamp.Time).Seconds()) > GetTimeoutSecondsForStrategy(config)


if none of the strategies matches, or if the timeout is explicitly set to 0, it will always fail the deployment

@tnozicka 1) can never happen because we guard in validation?, 2) isn't that expected?

usually 0 means ignore the timeout as it wouldn't make sense otherwise

timeoutSeconds is optional and, if not specified, defaults to 600s. Setting it explicitly to 0 means that I want a 0 timeout because I'm crazy ;-)

setting 0 means I want no timeout, because if I don't specify it, it defaults to 600

ok, you beat me to it.

tnozicka · 2017-10-25T11:49:51Z

pkg/apps/util/util_test.go

@@ -586,3 +586,22 @@ func TestRemoveCondition(t *testing.T) {
 		}
 	}
 }
+
+func TestRolloutExceededTimeoutSeconds(t *testing.T) {


I'd like to see a test table containing setting different strategies and not specified timeouts mixed with creationTimestamps values
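One possible shape for such a table, written as a plain program so it runs on its own. Everything here is an assumption for illustration: `exceeded` is a simplified function under test (nil means the 600s default, a value of 0 or less means no timeout), and the case names and ages are made up. Comparing the result against the expectation directly reports a mismatch in either direction.

```go
package main

import "fmt"

const defaultTimeoutSeconds int64 = 600

// exceeded is a hypothetical, simplified function under test.
func exceeded(timeoutSeconds *int64, ageSeconds int64) bool {
	timeout := defaultTimeoutSeconds
	if timeoutSeconds != nil {
		timeout = *timeoutSeconds
	}
	if timeout <= 0 {
		return false // no timeout requested
	}
	return ageSeconds > timeout
}

func int64ptr(v int64) *int64 { return &v }

type testCase struct {
	name           string
	timeoutSeconds *int64
	ageSeconds     int64
	expectTimeout  bool
}

var testCases = []testCase{
	{name: "default timeout, young deployment", ageSeconds: 10, expectTimeout: false},
	{name: "default timeout, old deployment", ageSeconds: 601, expectTimeout: true},
	{name: "explicit timeout exceeded", timeoutSeconds: int64ptr(5), ageSeconds: 10, expectTimeout: true},
	{name: "explicit timeout not exceeded", timeoutSeconds: int64ptr(20), ageSeconds: 10, expectTimeout: false},
	{name: "zero disables the timeout", timeoutSeconds: int64ptr(0), ageSeconds: 100000, expectTimeout: false},
}

// runTable returns the names of the cases whose result differs from the
// expectation, whether the timeout was missed or reported unexpectedly.
func runTable() []string {
	var failed []string
	for _, tc := range testCases {
		if got := exceeded(tc.timeoutSeconds, tc.ageSeconds); got != tc.expectTimeout {
			failed = append(failed, tc.name)
		}
	}
	return failed
}

func main() {
	fmt.Println("failing cases:", len(runTable()))
}
```

In a real `_test.go` file the loop body would call `t.Errorf` with `tc.name` instead of collecting names.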

mfojtik · 2017-10-25T12:20:12Z

pkg/apps/controller/deployer/deployer_controller.go

+		// deployer pod (quota, etc..) we should respect the timeoutSeconds in the
+		// config strategy and transition the rollout to failed instead of waiting for
+		// the deployment pod forever.
+		config, err := deployutil.DecodeDeploymentConfig(deployment, c.codec)


Long term, we want to get rid of this. We should copy timeoutSeconds into an annotation so we don't have to decode the config here. This also applies to the other fields we check, I guess.

@tnozicka ^

tnozicka · 2017-10-25T12:20:23Z

pkg/apps/util/util.go

+func RolloutExceededTimeoutSeconds(config *deployapi.DeploymentConfig, latestRC *v1.ReplicationController) bool {
+	timeoutSeconds := GetTimeoutSecondsForStrategy(config)
+	// If user set the timeoutSeconds to 0, we assume there should be no timeout.
+	if timeoutSeconds == 0 {


nit, defensive would be to check timeoutSeconds <= 0

tnozicka · 2017-10-25T12:21:43Z

pkg/apps/controller/deployer/deployer_controller.go

+			nextStatus = deployapi.DeploymentStatusFailed
+			updatedAnnotations[deployapi.DeploymentStatusReasonAnnotation] = deployapi.DeploymentFailedUnableToCreateDeployerPod
+			c.emitDeploymentEvent(deployment, v1.EventTypeWarning, "RolloutTimeout", fmt.Sprintf("Rollout for %q failed to create deployer pod (timeoutSeconds: %ds)", deployutil.LabelForDeploymentV1(deployment), deployutil.GetTimeoutSecondsForStrategy(config)))
+			glog.V(4).Infof("Failing deployment %s/%s as we timeout out while waiting for the deployer pod to be created", deployment.Namespace, deployment.Name)


s/timeout out/timed out

: reached timeout while

tnozicka · 2017-10-25T12:24:58Z

pkg/apps/util/util_test.go

+	now := time.Now()
+	tests := []struct {
+		name                   string
+		config                 func(int64) *deployapi.DeploymentConfig


s/func(int64) *deployapi.DeploymentConfig/*deployapi.DeploymentConfig/

tnozicka · 2017-10-25T12:25:07Z

pkg/apps/util/util_test.go

+	tests := []struct {
+		name                   string
+		config                 func(int64) *deployapi.DeploymentConfig
+		timeoutSeconds         int64


tnozicka · 2017-10-25T12:25:14Z

pkg/apps/util/util_test.go

+		name                   string
+		config                 func(int64) *deployapi.DeploymentConfig
+		timeoutSeconds         int64
+		deploymentCreationTime time.Time


tnozicka · 2017-10-25T12:25:51Z

pkg/apps/util/util_test.go

+			config: func(timeoutSeconds int64) *deployapi.DeploymentConfig {
+				config := deploytest.OkDeploymentConfig(1)
+				config.Spec.Strategy.RecreateParams.TimeoutSeconds = &timeoutSeconds
+				return config


fill in timeout and creationTime

tnozicka · 2017-10-25T12:26:29Z

pkg/apps/util/util_test.go

+				config := deploytest.OkDeploymentConfig(1)
+				config.Spec.Strategy.RecreateParams.TimeoutSeconds = &timeoutSeconds
+				return config
+			},


call the lambda immediately == s/}/}()/

tnozicka · 2017-10-25T14:05:09Z

pkg/apps/util/util_test.go

+			config: func(timeoutSeconds int64) *deployapi.DeploymentConfig {
+				config := deploytest.OkDeploymentConfig(1)
+				config.Spec.Strategy = deploytest.OkRollingStrategy()
+				config.Spec.Strategy.RollingParams.TimeoutSeconds = &timeoutSeconds


I would assign int64ptr(10) directly here, without the whole taking-it-as-an-argument thing
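The two suggestions for the test table (store a built config rather than a builder function, and set the timeout through a pointer helper) might look like the sketch below. The `config` type and `okConfig` are hypothetical stand-ins for the deployapi type and the deploytest helper, and `int64ptr` is assumed to be a small local helper.

```go
package main

import "fmt"

// config is a hypothetical stand-in for *deployapi.DeploymentConfig.
type config struct{ TimeoutSeconds *int64 }

// okConfig is a hypothetical stand-in for deploytest.OkDeploymentConfig.
func okConfig() *config { return &config{} }

func int64ptr(v int64) *int64 { return &v }

type testCase struct {
	name   string
	config *config
}

var cases = []testCase{
	{
		name: "explicit timeout",
		// The function literal is invoked immediately, so the table
		// stores the finished config rather than a builder.
		config: func() *config {
			c := okConfig()
			c.TimeoutSeconds = int64ptr(10)
			return c
		}(),
	},
	{
		name:   "timeout not specified",
		config: okConfig(),
	},
}

func main() {
	for _, tc := range cases {
		fmt.Println(tc.name, tc.config.TimeoutSeconds != nil)
	}
}
```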

tnozicka · 2017-10-25T14:14:57Z

pkg/apps/util/util_test.go

+		if tc.expectTimeout && !gotTimeout {
+			t.Errorf("[%s]: expected timeout, but got no timeout", tc.name)
+		}
+	}


@mfojtik what if !tc.expectTimeout && gotTimeout ?

tnozicka · 2017-10-25T14:21:43Z

pkg/apps/util/util_test.go

+				config := deploytest.OkDeploymentConfig(1)
+				config.Spec.Strategy.RecreateParams.TimeoutSeconds = nil
+				return config
+			}(0),


You don't have to change this just because I said so, but notice how the argument is pointless here :)

🤷‍♂️

tnozicka · 2017-10-25T14:28:46Z

pkg/apps/util/util.go

-			OwnerReferences: []metav1.OwnerReference{*controllerRef},
+			Labels:            controllerLabels,
+			OwnerReferences:   []metav1.OwnerReference{*controllerRef},
+			CreationTimestamp: defaultCreationTimestamp,


@mfojtik why? we will now send objects to API server with CreationTimestamp set?

huh, I didn't notice this is not just for test...


tnozicka · 2017-10-25T14:55:47Z

/lgtm

openshift-merge-robot · 2017-10-25T14:55:59Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mfojtik, tnozicka


Needs approval from an approver in each of these OWNERS Files:

~~pkg/apps/OWNERS~~ [mfojtik]


openshift-bot · 2017-10-25T17:34:59Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-merge-robot · 2017-10-25T21:01:06Z

Automatic merge from submit-queue (batch tested with PRs 17020, 17026, 17000, 17010).

smarterclayton · 2017-10-26T03:09:16Z

There may be strategies in the future that don't use a pod, so there might be no timeout. Probably needs to be discussed around long term behavior of DC

openshift-merge-robot assigned soltysh and 0xmichalis Oct 23, 2017

openshift-merge-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 23, 2017

openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Oct 23, 2017

tnozicka reviewed Oct 23, 2017


mfojtik force-pushed the dc-timeout branch from a904877 to 3ac577c Compare October 24, 2017 11:46

openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 24, 2017

openshift-merge-robot added the needs-api-review label Oct 24, 2017

mfojtik commented Oct 24, 2017


tnozicka reviewed Oct 24, 2017


mfojtik force-pushed the dc-timeout branch from 3ac577c to f6388e5 Compare October 24, 2017 12:57

tnozicka reviewed Oct 24, 2017
