
Add MinReadySeconds to DCs #7114

Closed
0xmichalis opened this issue Feb 8, 2016 · 33 comments

@0xmichalis (Contributor)

Availability in Kubernetes deployments is dictated by MinReadySeconds. We consider pods available as soon as they become ready, w/o imposing any additional restriction.

    // Minimum number of seconds for which a newly created pod should be ready
    // without any of its container crashing, for it to be considered available.
    // Defaults to 0 (pod will be considered available as soon as it is ready)
    MinReadySeconds int32 `json:"minReadySeconds,omitempty"`

This issue is for tracking this gap.

@ironcladlou @smarterclayton @pweil-

@stevekuznetsov (Contributor)

@smarterclayton do we have executive direction on this gap?

@0xmichalis (Contributor, Author)

I was thinking that we may want to skip this and always default to 0 (consider available when ready).

@smarterclayton (Contributor)

I would like to have this - since I was the one who proposed it to Dan originally.

@0xmichalis (Contributor, Author)

ok

@0xmichalis (Contributor, Author)

@stevekuznetsov I would like to get #6233 in first, and then you could add this.

@0xmichalis (Contributor, Author)

@stevekuznetsov did you start working on this? If not, I would like @mfojtik to take a stab at it.

@stevekuznetsov (Contributor)

I haven't yet, and I'm OOO until next week. @mfojtik thanks!

@0xmichalis assigned mfojtik and unassigned stevekuznetsov on May 19, 2016
@mfojtik (Contributor) commented on May 19, 2016

@Kargakis @stevekuznetsov to make sure I understand this:

  1. We add MinReadySeconds to DeploymentConfig.
  2. When a new deployment is rolled out, every new pod that is started must become ready within ^^ seconds, otherwise we fail the deployment?

@0xmichalis (Contributor, Author)

MinReadySeconds takes effect after a pod is ready. If you have dc.Spec.MinReadySeconds == 0, the deployment process should consider pods available as soon as they become ready. If you have dc.Spec.MinReadySeconds == 5, the deployment process should consider pods available 5 seconds after they become ready. I think you will need changes in the rolling updater.
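
For illustration, the rule described here reduces to something like the following sketch in Go (the `available` helper and its signature are hypothetical; the actual rolling updater code will differ):

    package main

    import (
        "fmt"
        "time"
    )

    // available reports whether a pod that became ready at readySince should be
    // treated as available, given minReadySeconds. Sketch only, not updater code.
    func available(ready bool, readySince time.Time, minReadySeconds int32, now time.Time) bool {
        if !ready {
            return false
        }
        if minReadySeconds == 0 {
            // Default: available as soon as it is ready.
            return true
        }
        return now.Sub(readySince) >= time.Duration(minReadySeconds)*time.Second
    }

    func main() {
        readySince := time.Now().Add(-3 * time.Second)               // ready for ~3s
        fmt.Println(available(true, readySince, 0, time.Now()))      // true: minReadySeconds == 0
        fmt.Println(available(true, readySince, 5, time.Now()))      // false: not yet ready for 5s
    }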

@mfojtik (Contributor) commented on May 19, 2016

@Kargakis right, I get it now, I think.

@stevekuznetsov (Contributor)

What's the utility of this over an intelligent readiness probe?

@stevekuznetsov (Contributor)

This does seem less powerful and more frustrating to use than a readiness probe, anyway.

@0xmichalis (Contributor, Author)

@stevekuznetsov this doesn't replace readiness probes but complements them, since it will take effect as soon as the readiness probes in a pod succeed and the pod is considered ready.

@stevekuznetsov (Contributor)

Why is that useful? Shouldn't we want users to define readiness probes that are actually meaningful instead of "X and then it didn't crash for five seconds"?

@smarterclayton (Contributor)

Just because something is ready doesn't mean that you're ready to move on.


@stevekuznetsov (Contributor)

Isn't the point of the readiness probe to tell you when you're ready to use the pod, when it's available? I think I may be misunderstanding that feature.

@smarterclayton (Contributor)

Readiness probe tells you when to add it to a load balancer. Just because something says it's ready doesn't mean that it might not crash in the next few seconds. The delay allows you to add time for new pods to start accepting traffic in a stable fashion before continuing.


@stevekuznetsov (Contributor)

What's the use of adding it to the target pool of a load balancer if we're not sure it's booted to a stable state and able to service requests?

@smarterclayton (Contributor)

Because how are you going to take requests that tax the new process until you're added to the load balancer?


@stevekuznetsov (Contributor)

So we're dealing with a service that boots fine, and without load would signal ready & healthy forever, but might cave under load but only in the first X seconds after being ready, so we want to expose it to traffic so it has the chance of crashing? What if no traffic happens to be routed to that pod in the first X seconds and it crashes on getting the first packet after X+1 seconds? Sorry, not trying to be obtuse here, but this just seems like a very limited use case to be adding to the API. The decision making part of this PR has already happened, so you can ignore me, but when I'm back in the office I'd really appreciate some help understanding this use case.

@stevekuznetsov (Contributor)

I can see the value of marking the deployment as complete after the pods have been ready, healthy, and have serviced requests, but a seconds-based timeout seems like the most fragile way of doing it. Seems like what you'd really want is a Jenkins pod or whatever running an e2e suite, but the canonical action there is to label the image as successful, not the deployment.

@ironcladlou (Contributor)

I think @stevekuznetsov has convinced me that MinReadySeconds is redundant.

@smarterclayton (Contributor)

It's very, very common for services to crumple under load, or to manifest failures that would result in aborting the rollout.

Imagine the simplest possible deployment - a 2-pod rolling deployment. Dev pushes a change and the rollout starts. The first pod comes up, reports ready, and starts taking traffic. The next time that pod will be checked for readiness is configurable by the user, but defaults to 1m. Imagine the developer has a bug in his caching code - the first user hits it and immediately causes the error to manifest. The readiness check for that pod won't have a chance to report that error for one minute.

In the meantime, the other pod is scaled up and hits the same scenario. As soon as that pod is ready, the deployment is marked successful. 30s later, both pods have failing readiness checks, both are out of rotation, and the app is down. The deployment failed to guard against failure.

MinReadySeconds is necessary to allow the app author to design well-written readiness checks that ensure deployments succeed. Without it there is no way to force deployments to give the readiness check the chance to remain stable.
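
As a concrete sketch of the remedy argued for here - assuming the field lands on DC specs as proposed, a hypothetical dc/myapp, and a 10s readiness probe period - something like the following would set a stabilization window of a few probe periods:

    # Hypothetical once minReadySeconds is exposed on DeploymentConfigs:
    # read the readiness probe period, then set the stabilization window to
    # a few multiples of it (e.g. 3 x 10s = 30s).
    oc get dc/myapp -o jsonpath='{.spec.template.spec.containers[0].readinessProbe.periodSeconds}'
    oc patch dc/myapp -p '{"spec":{"minReadySeconds":30}}'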


@stevekuznetsov (Contributor)

Do we want to be failing this at the deployment level?

I'm not a sophisticated user of OpenShift but I thought that a mature pipeline would look like dev->testing->prod, so these checks should be happening prior to you deploying something to a "production" environment where real traffic is being directed.

Furthermore, there is no guarantee that the deployment you're making will be taking any load in the MinReadySeconds interval, making the check fragile.

As a consumer of OpenShift, I would be very hesitant to use this feature because of that fragility. Hell, if I threw in a

oc start-build my-build
# wait for completion
sleep 100

into our Bash scripts, I'd get smacked and someone would tell me to use os::build::wait_for_start.

@smarterclayton (Contributor)

Yes, we do.


@stevekuznetsov (Contributor)

Can't argue with that :)

@smarterclayton (Contributor)

Think about this as theory vs practice. In theory, you'd have this magic system where everything worked based on edge-driven changes. But the reality is that there's an infinite amount of complexity within the app, and so we ultimately care about levels. One way to help stabilize levels is to ensure that you give the system time to burn in.


@stevekuznetsov (Contributor)

Sure, I just didn't think we were going to be endorsing rolling deployments into production without prior testing as a good idea. This feature simply doesn't make sense if your rolling deployment into production happens only after the success of some e2e tests labels an image as production-ready - with OpenShift it should be trivial to keep your dev/test/prod deployment specs in sync and be confident that your testing will be relevant in that sense.

@smarterclayton (Contributor)

Testing in dev != verification in prod. Testing can only reduce the impact, but if you have no difference between stage and prod, then stage is prod. For instance, stage and prod usually have different secrets. Your readiness check might verify that you can contact a db with your secrets. But your readiness check can't verify that all the code that uses the secret behaves the same when real users are hitting the service, which is why you want to set a stabilization window.

Imagine this case. Your readiness check simulates a user login (expensive, but possible). But can your readiness check simulate 1000 simultaneous users hitting your brand new pod? No.


@stevekuznetsov (Contributor)

I really hope you're not doing your thousand-user stress test in production, and especially that you're not doing it by hoping one thousand users log in in the X seconds after the ready check succeeds...

As to your points about tangible differences in production environments, those make sense. We should doc the usefulness boundaries of this field and maybe even give warnings in oc status if users set the field to be shorter than a minute, or five minutes, or something; otherwise it's just fragile and could give a false sense of security when its utility is really minimal.

@smarterclayton (Contributor)

Definitely the description should include a use case for why you would use this - i.e., it only makes sense to set this to > your readiness interval, and in rough multiples.


@0xmichalis (Contributor, Author)

@mfojtik #6233 is merged so you can start working on this. Note that it's part of https://trello.com/c/Mljldox7/643-8-deployments-downstream-support-for-upstream-features

@mfojtik (Contributor) commented on Jun 27, 2016

UPSTREAM: kubernetes/kubernetes#28111
