apps: deployment config stuck in the new state should respect timeoutSeconds #17000

Merged
merged 1 commit into from
Oct 25, 2017


mfojtik
Contributor

@mfojtik mfojtik commented Oct 23, 2017

Fixes: #16962

With this patch, the deployment config controller marks the deployment as failed (timeout) once it reaches timeoutSeconds while the deployment status is still 'new'. This generally happens when the deployer pod cannot be created (for example, because of quota). We should not wait indefinitely for quota to become available.

@openshift-merge-robot openshift-merge-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 23, 2017
@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Oct 23, 2017
@mfojtik
Contributor Author

mfojtik commented Oct 23, 2017

need unit test...

// In case we fail to create the deployer pod within the strategy
// timeoutSeconds, do not leave the deployment config in the New state
// forever, but time out.
if deployutil.ConfigExceededTimeoutSeconds(config, latestRC) {
Contributor

@mfojtik I think you need to cancel that deployment (RC) as well.

Contributor Author

i will rather transition it to fail.

Contributor Author

this is the DC part, cancellation of latest deployment must be part of the Handle() because I need client access.

@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 24, 2017
@mfojtik
Contributor Author

mfojtik commented Oct 24, 2017

@tnozicka @Kargakis @smarterclayton PTAL

Basically I use timeoutSeconds from strategy to transition rollout to "failed" state when the deployment is stuck in "new" state for longer than timeoutSeconds (10 minutes is the cluster default).

@@ -113,6 +113,7 @@ const (
DeploymentCancelledNewerDeploymentExists = "newer deployment was found running"
DeploymentFailedUnrelatedDeploymentExists = "unrelated pod with the same name as this deployment is already running"
DeploymentFailedDeployerPodNoLongerExists = "deployer pod no longer exists"
DeploymentFailedUnableToCreateDeployerPod = "unable to create deployer pod"
Contributor Author

this looks nice in oc status:

dc/test deploys docker.io/library/centos:7
  deployment #4 failed 44 minutes ago: unable to create deployer pod
  deployment #3 deployed 2 hours ago - 1 pod
  deployment #2 deployed 2 hours ago

Contributor

@tnozicka left a comment

I would prefer this code in DeploymentController instead of DeploymentConfigController. Logically this is about watching RC and transitioning it to FailedPhase. DeploymentController is the state machine that updates deployment phases and doing it from outside will only force conflicts.

if !deployutil.IsNewDeployment(deployment) {
return nil
}
return retry.RetryOnConflict(retry.DefaultBackoff, func() error {
Contributor

What if the deployment transitions to succeeded?
You have declared that you want to act only on state new, but now you ignore that with the retry...

Contributor Author

i think a check if the deployment is still new is ok, honestly I don't think that will ever happen

Contributor

@mfojtik 409?

Contributor

Once you get a 409, none of your previous assertions is valid any longer, and transitioning from Succeeded to Failed might reveal unexpected side effects.
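
To illustrate the concern about retrying after a 409, here is a minimal sketch that re-reads the object and re-checks its phase on every attempt. It does not use the real OpenShift or client-go types: `rc`, `store`, `errConflict`, `failIfStillNew`, and the phase strings are all hypothetical stand-ins chosen for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for an HTTP 409 from the API server (hypothetical).
var errConflict = errors.New("conflict: object was modified")

// rc is a simplified, hypothetical stand-in for a ReplicationController.
type rc struct {
	resourceVersion int
	phase           string
}

// store is an in-memory stand-in for the API server with optimistic locking.
type store struct{ current rc }

func (s *store) get() rc { return s.current }

// update succeeds only if the caller saw the latest resourceVersion.
func (s *store) update(obj rc) error {
	if obj.resourceVersion != s.current.resourceVersion {
		return errConflict
	}
	obj.resourceVersion++
	s.current = obj
	return nil
}

// failIfStillNew marks the deployment failed. It re-reads the object and
// re-checks the phase on every attempt, so a conflict never overwrites a
// phase that another writer set in the meantime.
func failIfStillNew(s *store, attempts int, beforeUpdate func()) (bool, error) {
	for i := 0; i < attempts; i++ {
		latest := s.get()
		if latest.phase != "New" {
			return false, nil // someone else moved it on; leave it alone
		}
		if beforeUpdate != nil {
			beforeUpdate() // hook used to simulate a concurrent writer
		}
		latest.phase = "Failed"
		err := s.update(latest)
		if err == nil {
			return true, nil
		}
		if !errors.Is(err, errConflict) {
			return false, err
		}
	}
	return false, errConflict
}

// raceDemo simulates another controller completing the rollout between our
// read and our write, and returns the final phase.
func raceDemo() string {
	s := &store{current: rc{resourceVersion: 1, phase: "New"}}
	raced := false
	_, _ = failIfStillNew(s, 3, func() {
		if !raced {
			raced = true
			s.current.phase = "Complete"
			s.current.resourceVersion++
		}
	})
	return s.current.phase
}

func main() {
	fmt.Println(raceDemo())
}
```

In this sketch the first write hits a conflict, the second attempt sees that the phase is no longer "New", and the completed rollout is left untouched.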

@@ -782,6 +788,23 @@ func DeploymentsForCleanup(configuration *deployapi.DeploymentConfig, deployment
return relevantDeployments
}

func ConfigExceededTimeoutSeconds(config *deployapi.DeploymentConfig, latestRC *v1.ReplicationController) bool {
Contributor

DeploymentExceededTimeoutSeconds

Contributor Author

right, I was lazy to change it, now you beat me to it :)

var timeoutSeconds int64
if params := config.Spec.Strategy.RollingParams; params != nil {
timeoutSeconds = deployapi.DefaultRollingTimeoutSeconds
if params.TimeoutSeconds != nil {
Contributor

I think this is the wrong property. Check the description here:

// TimeoutSeconds is the time to wait for updates before giving up. If the
// value is nil, a default will be used.

Seems like activeDeadlineSeconds is what we use for timeout
// ActiveDeadlineSeconds is the duration in seconds that the deployer pods for this deployment
// config may be active on a node before the system actively tries to terminate them.
ActiveDeadlineSeconds *int64

Contributor

also general timeout shouldn't be dependent on strategy

Contributor Author

I think we want TimeoutSeconds:

// TimeoutSeconds is the time to wait for updates before giving up. If the
// value is nil, a default will be used.

Because literally this thing is about not getting any progress since we are not able to create deployer pod. ActiveDeadlineSeconds sets the duration for the deployer pod, but in this case there is no deployer pod.

timeoutSeconds = *params.TimeoutSeconds
}
}
return int64(time.Since(latestRC.CreationTimestamp.Time)*time.Second) > timeoutSeconds
Contributor

I need to check how we do rollbacks because in that case creationTimestamp might be an issue

Contributor Author

the RC has to be in "new" state for this to trigger... it is only in new when it is new :)

Contributor

Yeah, we do rollbacks differently from upstream anyway (we always create a new RC).

Contributor

@tnozicka left a comment

@mfojtik this is what I had in mind. The only question remaining for me is where we take that timeout from.

@@ -128,6 +128,7 @@ func (c *DeploymentConfigController) Handle(config *deployapi.DeploymentConfig)
return err
}
}

Contributor

leftover?

@tnozicka
Contributor

/retest

@tnozicka
Contributor

@mfojtik @smarterclayton @Kargakis I feel like we are missing proper deployment timeout in DC API object and closest to it is DC.Spec.Strategy.ActiveDeadlineSeconds. Timeout should be property of DC.Spec.Strategy (or DC.Spec) not part of a particular strategy. And CustomDeploymentStrategyParams is missing the timeout @mfojtik wants to use. I think it would be better to reshape activeDeadlineSeconds as that's common for all of them and also used as a timeout for deployer pod.

@mfojtik
Contributor Author

mfojtik commented Oct 24, 2017

flake: #17024

/retest

@mfojtik
Contributor Author

mfojtik commented Oct 24, 2017

/retest

if deployutil.RolloutExceededTimeoutSeconds(config, deployment) {
nextStatus = deployapi.DeploymentStatusFailed
updatedAnnotations[deployapi.DeploymentStatusReasonAnnotation] = deployapi.DeploymentFailedUnableToCreateDeployerPod
break
Contributor

@mfojtik this would be a good place to log

}
}
// For "custom" strategy use the default for recreate strategy.
if timeoutSeconds == 0 {
Contributor

I don't think this will work. If one of the strategies has TimeoutSeconds=0, this will overwrite it.

Contributor

you should probably switch on strategy type
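
A minimal sketch of what switching on the strategy type could look like, so that an explicit `TimeoutSeconds=0` is returned as-is instead of being overwritten by a default. The types below are simplified, hypothetical stand-ins for the real API objects, and the 600-second defaults are an assumption based on the "10 minutes is the cluster default" remark earlier in this thread.

```go
package main

import "fmt"

// Assumed defaults (10 minutes), named here only for this sketch.
const (
	defaultRollingTimeoutSeconds  int64 = 600
	defaultRecreateTimeoutSeconds int64 = 600
)

type strategyType string

const (
	rolling  strategyType = "Rolling"
	recreate strategyType = "Recreate"
	custom   strategyType = "Custom"
)

// params and strategy are hypothetical, trimmed-down versions of the
// deployment strategy types discussed in the review.
type params struct{ TimeoutSeconds *int64 }

type strategy struct {
	Type           strategyType
	RollingParams  *params
	RecreateParams *params
}

// timeoutSecondsFor picks the timeout by strategy type. A nil pointer means
// "not specified" and yields the default; an explicit 0 is preserved.
func timeoutSecondsFor(s strategy) int64 {
	switch s.Type {
	case rolling:
		if p := s.RollingParams; p != nil && p.TimeoutSeconds != nil {
			return *p.TimeoutSeconds
		}
		return defaultRollingTimeoutSeconds
	case recreate:
		if p := s.RecreateParams; p != nil && p.TimeoutSeconds != nil {
			return *p.TimeoutSeconds
		}
		return defaultRecreateTimeoutSeconds
	default:
		// The custom strategy has no timeout field in this sketch, so it
		// borrows the recreate default.
		return defaultRecreateTimeoutSeconds
	}
}

func int64ptr(v int64) *int64 { return &v }

func main() {
	zero := strategy{Type: rolling, RollingParams: &params{TimeoutSeconds: int64ptr(0)}}
	fmt.Println(timeoutSecondsFor(zero))
	fmt.Println(timeoutSecondsFor(strategy{Type: custom}))
}
```

Distinguishing nil from zero through the pointer is what keeps the default from clobbering a value the user set on purpose.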

@mfojtik mfojtik force-pushed the dc-timeout branch 3 times, most recently from f06e179 to 24c9a78 Compare October 25, 2017 11:45
// (like quota, etc...). In that case the deployer controller uses this function to
// measure if the created deployment (RC) exceeded the timeout.
func RolloutExceededTimeoutSeconds(config *deployapi.DeploymentConfig, latestRC *v1.ReplicationController) bool {
return int64(time.Since(latestRC.CreationTimestamp.Time).Seconds()) > GetTimeoutSecondsForStrategy(config)
Contributor

If none of the strategies matches, or if the timeout is explicitly set to 0, it will always fail the deployment.

Contributor Author

@tnozicka 1) can never happen because we guard in validation?, 2) isn't that expected?

Contributor

usually 0 means ignore the timeout as it wouldn't make sense otherwise

Contributor Author

timeoutSeconds is optional and, if not specified, defaults to 600s. Setting it explicitly to 0 means that I want a 0 timeout because I'm crazy ;-)

Contributor

Setting 0 means I want no timeout, because if I don't specify it, it defaults to 600.

Contributor Author

ok, you beat me to it.
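
The outcome of this exchange (0 disables the timeout) can be sketched roughly as follows. This is not the merged code: `rolloutExceededTimeout` is a hypothetical helper that takes plain timestamps instead of the real DeploymentConfig and ReplicationController objects, and it treats negative values like 0 as a defensive assumption.

```go
package main

import (
	"fmt"
	"time"
)

// rolloutExceededTimeout reports whether a rollout created at `created` has
// been pending longer than timeoutSeconds as of `now`. A timeout of zero (or
// less, as a defensive assumption) means "never time out".
func rolloutExceededTimeout(created, now time.Time, timeoutSeconds int64) bool {
	if timeoutSeconds <= 0 {
		return false
	}
	// Duration.Seconds() converts the elapsed time to seconds before comparing.
	return int64(now.Sub(created).Seconds()) > timeoutSeconds
}

// demo evaluates three cases against a fixed clock so results are stable.
func demo() [3]bool {
	now := time.Date(2017, time.October, 25, 12, 0, 0, 0, time.UTC)
	return [3]bool{
		rolloutExceededTimeout(now.Add(-11*time.Minute), now, 600), // older than the timeout
		rolloutExceededTimeout(now.Add(-5*time.Minute), now, 600),  // still within the timeout
		rolloutExceededTimeout(now.Add(-24*time.Hour), now, 0),     // timeout disabled
	}
}

func main() {
	fmt.Println(demo())
}
```

Passing `now` in explicitly, rather than calling `time.Since`, is a choice made here only to keep the example deterministic.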

@@ -586,3 +586,22 @@ func TestRemoveCondition(t *testing.T) {
}
}
}

func TestRolloutExceededTimeoutSeconds(t *testing.T) {
Contributor

I'd like to see a test table that mixes different strategies, specified and unspecified timeouts, and different creationTimestamp values.

Contributor Author

done.

@openshift-ci-robot openshift-ci-robot removed the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 25, 2017
@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 25, 2017
// deployer pod (quota, etc..) we should respect the timeoutSeconds in the
// config strategy and transition the rollout to failed instead of waiting for
// the deployment pod forever.
config, err := deployutil.DecodeDeploymentConfig(deployment, c.codec)
Contributor Author

Long term, we want to get rid of this. We should copy timeoutSeconds into an annotation so we don't have to decode here. This also applies to the other fields we check, I guess.

Contributor Author

func RolloutExceededTimeoutSeconds(config *deployapi.DeploymentConfig, latestRC *v1.ReplicationController) bool {
timeoutSeconds := GetTimeoutSecondsForStrategy(config)
// If user set the timeoutSeconds to 0, we assume there should be no timeout.
if timeoutSeconds == 0 {
Contributor

nit, defensive would be to check timeoutSeconds <= 0

nextStatus = deployapi.DeploymentStatusFailed
updatedAnnotations[deployapi.DeploymentStatusReasonAnnotation] = deployapi.DeploymentFailedUnableToCreateDeployerPod
c.emitDeploymentEvent(deployment, v1.EventTypeWarning, "RolloutTimeout", fmt.Sprintf("Rollout for %q failed to create deployer pod (timeoutSeconds: %ds)", deployutil.LabelForDeploymentV1(deployment), deployutil.GetTimeoutSecondsForStrategy(config)))
glog.V(4).Infof("Failing deployment %s/%s as we timeout out while waiting for the deployer pod to be created", deployment.Namespace, deployment.Name)
Contributor

s/timeout out/timed out

Contributor

: reached timeout while

now := time.Now()
tests := []struct {
name string
config func(int64) *deployapi.DeploymentConfig
Contributor

s/func(int64) *deployapi.DeploymentConfig/*deployapi.DeploymentConfig/

tests := []struct {
name string
config func(int64) *deployapi.DeploymentConfig
timeoutSeconds int64
Contributor

not needed

name string
config func(int64) *deployapi.DeploymentConfig
timeoutSeconds int64
deploymentCreationTime time.Time
Contributor

not needed

config: func(timeoutSeconds int64) *deployapi.DeploymentConfig {
config := deploytest.OkDeploymentConfig(1)
config.Spec.Strategy.RecreateParams.TimeoutSeconds = &timeoutSeconds
return config
Contributor

fill in timeout and creationTime

config := deploytest.OkDeploymentConfig(1)
config.Spec.Strategy.RecreateParams.TimeoutSeconds = &timeoutSeconds
return config
},
Contributor

call the lambda immediately == s/}/}()/

@mfojtik mfojtik force-pushed the dc-timeout branch 2 times, most recently from ab2eeea to 38cc230 Compare October 25, 2017 13:19
config: func(timeoutSeconds int64) *deployapi.DeploymentConfig {
config := deploytest.OkDeploymentConfig(1)
config.Spec.Strategy = deploytest.OkRollingStrategy()
config.Spec.Strategy.RollingParams.TimeoutSeconds = &timeoutSeconds
Contributor

I would assign int64ptr(10) directly here, without the whole "taking it as an argument" thing.

if tc.expectTimeout && !gotTimeout {
t.Errorf("[%s]: expected timeout, but got no timeout", tc.name)
}
}
Contributor

@mfojtik what if !tc.expectTimeout && gotTimeout ?
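
One way to cover both directions of mismatch is to compare the result with the expectation directly. The sketch below is a standalone illustration, not the PR's test: `exceeded`, `alwaysTimeout`, `testCase`, `checkCases`, and `defaultCases` are hypothetical names, and it collects failure messages in a slice where a real `_test.go` file would call `t.Errorf`.

```go
package main

import (
	"fmt"
	"time"
)

// exceeded is a simplified stand-in for the function under test.
func exceeded(age time.Duration, timeoutSeconds int64) bool {
	if timeoutSeconds <= 0 {
		return false
	}
	return int64(age.Seconds()) > timeoutSeconds
}

// alwaysTimeout is a deliberately broken implementation used to show that
// the check below also catches unexpected timeouts.
func alwaysTimeout(time.Duration, int64) bool { return true }

type testCase struct {
	name           string
	age            time.Duration
	timeoutSeconds int64
	expectTimeout  bool
}

func defaultCases() []testCase {
	return []testCase{
		{name: "older than timeout", age: 20 * time.Second, timeoutSeconds: 10, expectTimeout: true},
		{name: "younger than timeout", age: 5 * time.Second, timeoutSeconds: 10, expectTimeout: false},
		{name: "zero disables timeout", age: time.Hour, timeoutSeconds: 0, expectTimeout: false},
	}
}

// checkCases flags any mismatch, whether a timeout was expected and did not
// happen or was not expected and did.
func checkCases(fn func(time.Duration, int64) bool, cases []testCase) []string {
	var failures []string
	for _, tc := range cases {
		if got := fn(tc.age, tc.timeoutSeconds); got != tc.expectTimeout {
			failures = append(failures,
				fmt.Sprintf("[%s]: expected timeout=%t, got %t", tc.name, tc.expectTimeout, got))
		}
	}
	return failures
}

func main() {
	for _, f := range checkCases(alwaysTimeout, defaultCases()) {
		fmt.Println(f)
	}
}
```

Comparing `got != tc.expectTimeout` replaces two one-sided `if` checks with a single symmetric one.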

config := deploytest.OkDeploymentConfig(1)
config.Spec.Strategy.RecreateParams.TimeoutSeconds = nil
return config
}(0),
Contributor

Don't feel you have to fix this just because I said so, but note how the argument is pointless here :)

Contributor Author

🤷‍♂️

OwnerReferences: []metav1.OwnerReference{*controllerRef},
Labels: controllerLabels,
OwnerReferences: []metav1.OwnerReference{*controllerRef},
CreationTimestamp: defaultCreationTimestamp,
Contributor

@mfojtik why? we will now send objects to API server with CreationTimestamp set?

Contributor Author

huh, I didn't notice this is not just for test...

@tnozicka
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 25, 2017
@openshift-merge-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mfojtik, tnozicka

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@tnozicka tnozicka added the kind/bug Categorizes issue or PR as related to a bug. label Oct 25, 2017
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot
Contributor

Automatic merge from submit-queue (batch tested with PRs 17020, 17026, 17000, 17010).

@openshift-merge-robot openshift-merge-robot merged commit 58d89a4 into openshift:master Oct 25, 2017
@smarterclayton
Contributor

smarterclayton commented Oct 26, 2017 via email

@mfojtik mfojtik deleted the dc-timeout branch September 5, 2018 21:07
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. needs-api-review size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

8 participants