This repository has been archived by the owner on Apr 17, 2019. It is now read-only.

[ingress/controllers/nginx] Use Service Virtual IP instead of maintaining Pod list #1140

Closed
edouardKaiser opened this issue Jun 6, 2016 · 75 comments

Comments

@edouardKaiser

Is there a way to tell the NGINX Ingress controller to use the Service virtual IP address instead of maintaining the pod IP addresses in the upstream configuration?

I couldn't find it. If not, I think it would be a good addition, because with the current situation, when we scale down a service, the Ingress controller does not work in harmony with the Replication Controller of the service.

That means some requests to the Ingress Controller will fail while waiting for the Ingress Controller to be updated.

If we use the Service virtual IP address, we can let kube-proxy do its job in harmony with the replication controller and we get seamless downscaling.

@edouardKaiser
Author

I guess it has been implemented that way for session stickiness, but for applications that don't need it, this could be a good option.

@aledbf
Contributor

aledbf commented Jun 16, 2016

when we scale down a service, the Ingress controller does not work in harmony with the Replication Controller of the service.

What do you mean?
After a change in the number of replicas in an RC, it takes a couple of seconds to receive the update from the API server.

In the case of scaling down the number of replicas, you need to tune the upstream check defaults (see the docs).

Besides this, I'm testing the lua-upstream-nginx-module to avoid reloads and be able to add/remove servers in an upstream.
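For reference, a rough sketch of what tuning those defaults could look like, assuming the controller is pointed at a ConfigMap (the name below is a placeholder) and that the keys are the upstream-check settings described in the linked docs:

```yaml
# Hypothetical ConfigMap for the contrib nginx ingress controller; the key
# names are taken from the upstream-check documentation and should be
# verified against the controller version in use.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-ingress-config   # placeholder: use the ConfigMap the controller is started with
  namespace: kube-system
data:
  upstream-max-fails: "1"      # mark an endpoint as failed after one error
  upstream-fail-timeout: "10"  # keep it out of rotation for 10s before retrying
```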

@edouardKaiser
Author

Ok, I'll try to explain with another example:

When you update a Deployment resource (like changing the Docker image), depending on your configuration (rollingUpdate strategy, max surge, max unavailable), the Deployment controller will bring down some pods and create new ones, all in a fashion where there is no downtime if you use the Service VIP to communicate with the pods.

First, when it wants to kill a pod, it removes the pod IP address from the service to avoid any new connections, and it follows the termination grace period of the pod to drain the existing connections. Meanwhile, it also creates a new pod with the new Docker image, waits for the pod to be ready, and adds the pod behind the Service VIP.
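For illustration, a minimal Deployment sketch (name and image are placeholders) with the rollingUpdate knobs mentioned above; bumping the container image on such a Deployment triggers exactly this surge/drain cycle:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp                  # placeholder
spec:
  replicas: 4
  selector:
    matchLabels:
      app: myapp
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1              # at most one extra pod during the rollout
      maxUnavailable: 0        # never drop below the desired replica count
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: registry.example.com/myapp:v2   # change this to roll the pods
```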

By maintaining the pod list yourself in the Ingress Controller, at some point during a Deployment update some requests will be redirected to pods which are shutting down, because the Ingress Controller does not know a rolling update is happening. It will know maybe one second later, but for services with a lot of connections per second, that's potentially a lot of failing requests.

I personally don't want to tune the upstream to handle this scenario. Kubernetes is already doing an amazing job of updating pods with no downtime, but only if you use the Service VIP.

Did I miss something? If it's still not clear, or there is something I'm not understanding, please don't hesitate to say so.

@edouardKaiser
Author

The NGINX Ingress Controller (https://github.com/nginxinc/kubernetes-ingress) used to use the Service VIP, but they recently changed to a system like yours (pod list in the upstream).

Before they changed this behaviour, I did some tests. I was continuously sending requests to the Ingress Controller (5/sec); meanwhile, I updated the Deployment resource backing those requests (new Docker image):

  • Kubernetes/Contrib controller: you can clearly see some requests failing at the time of the update.
  • NGINX controller: it looks like nothing happened behind the scenes, a perfect deployment with no downtime (all because Kubernetes is doing a great job behind the scenes).

@aledbf
Contributor

aledbf commented Jun 17, 2016

@edouardKaiser how are you testing this? Are the requests GET or POST? Can you provide a description of the testing scenario?

@aledbf
Contributor

aledbf commented Jun 17, 2016

I personally don't want to tune the upstream to handle this scenario.

I understand that, but your request contradicts what other users requested (control over the upstream checks). It is hard to find a balance in the configuration that satisfies all the user scenarios.

@edouardKaiser
Author

I understand some people might want to tweak the upstream configuration, but on the other hand Kubernetes does a better job of managing deployments without downtime when pods are reached through the Service VIP.

To reproduce, I just used the Postman Chrome app and its Runner feature (you can specify requests to run against a particular endpoint, with a number of iterations, a delay, etc.). While the runner was running, I updated the Deployment resource and watched the runner's behaviour.

When a GET request fails, NGINX automatically passes the request to the next server. But for non-idempotent methods like POST, it does not (and I think that's the right behaviour), and then we have failures.

@aledbf
Contributor

aledbf commented Jun 17, 2016

But for non-idempotent methods like POST, it does not

This is a documented scenario: https://github.com/kubernetes/contrib/tree/master/ingress/controllers/nginx#retries-in-no-idempotent-methods
NGINX changed this behavior in 1.9.13.

Please add the option retry-non-idempotent=true to the nginx configmap to restore the old behavior.
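A minimal sketch of that setting, assuming the controller already reads an nginx ConfigMap (the name below is a placeholder for whatever ConfigMap your controller is started with):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-ingress-config   # placeholder
  namespace: kube-system
data:
  # Restore the pre-1.9.13 behaviour: retry POST/PUT/etc. on the next upstream
  retry-non-idempotent: "true"
```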

@edouardKaiser
Author

edouardKaiser commented Jun 17, 2016

But it doesn't change the root of the problem: Ingress Controller and Deployment Controller don't work together.

Your pod might have accepted the connection and started processing it, but what the Ingress Controller does not know is that this pod is going to get killed the next second by the Deployment Controller.

  • For the Deployment Controller, it's fine: it did its job, removed the pod from the service, and waited for the termination grace period.
  • For the Ingress Controller, it's not fine: your connection will suddenly be aborted because the pod died. If you're lucky, the pod got removed by the Ingress Controller before any request came in. If you're not, you'll experience some failures. If it's a POST request, most of the time you really don't want to retry. If it's a GET request but the pod gets killed in the middle of transferring a response, NGINX won't retry.

I know this is not a perfect world, and we need to embrace failure. Here, we have a way to potentially avoid that failure by using the Service VIP.

I'm not saying it should be the default behaviour, but an option to use the Service VIP instead of the pod endpoints would be awesome.

@glerchundi

I'm with @edouardKaiser because:

  • You (as a DevOps or operations person) cannot guarantee that the final developer will follow best practices regarding keeping/dropping connections when the app gets a SIGTERM, for example. If services were used, however, the responsibility for site reliability would fall entirely on a concrete person/team.
  • Although retry-non-idempotent can mitigate some problems, others could arise, and choosing to retry those requests is not an option in many cases.

IMO the controller should expose a parameter or something to choose between final endpoints and services; that would cover all the use cases.

@edouardKaiser
Author

I couldn't have explained it better. Thanks @glerchundi

@thockin
Contributor

thockin commented Jun 28, 2016

If you go through the Service VIP you can't ever do session affinity. It also incurs some overhead, such as conntrack entries for iptables DNAT. I think this is not ideal.

To answer the questions about "coordination", this is what readiness and grace period are for. What is supposed to happen is:

  1. RC creates 5 pods A, B, C, D, E
  2. all 5 pods become ready
  3. endpoints controller adds all 5 pods to the Endpoints structure
  4. ingress controller sees the Endpoints update
  5. ... serving ...
  6. RC deletes 2 pods (scaling down)
  7. pods D and E are marked unready
  8. kubelet notifies pods D and E
  9. endpoints controller sees readiness change, removes D and E from endpoints
  10. ingress controller sees the Endpoints update and removes D and E
  11. termination grace period ends
  12. kubelet kills pods D and E

It is possible that your ingress controller falls so far behind that the grace period expires before it has a chance to remove endpoints, but that is the nature of distributed systems. It's equally possible that kube-proxy falls behind - they literally use the same API.
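For reference, a minimal sketch of the pod-side pieces that sequence relies on: a readiness probe plus a termination grace period (probe path, port, and timings are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod              # placeholder
  labels:
    app: myapp
spec:
  terminationGracePeriodSeconds: 30       # time between SIGTERM and SIGKILL
  containers:
  - name: myapp
    image: registry.example.com/myapp:v1   # placeholder
    readinessProbe:                        # failing this pulls the pod out of Endpoints
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 2
      failureThreshold: 2
```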


@edouardKaiser
Author

edouardKaiser commented Jun 28, 2016

I do understand this is not ideal for everyone; this is why I was talking about an option for this behaviour.

@thockin
Contributor

thockin commented Jun 28, 2016

But I don't see how it is better for anyone. It buys you literally nothing, but you lose the potential for affinity and incur a performance hit for no actual increase in stability.


@edouardKaiser
Author

edouardKaiser commented Jun 28, 2016

Correct me if I'm wrong; I probably misunderstood something in the pod termination flow:

When scaling down, the pod is removed from the endpoints list for the service and, at the same time, a TERM signal is sent.

So, for me, at this exact moment there is an open window. Potentially, this pod (which is shutting down gracefully) might still get some requests forwarded by the nginx ingress controller (for the time it takes the ingress controller to notice the change, update and reload the conf).

@thockin
Contributor

thockin commented Jun 28, 2016

On Mon, Jun 27, 2016 at 9:53 PM, Edouard Kaiser
notifications@github.com wrote:

Correct me if I'm wrong, I probably misunderstood something in the termination of pods flow:

When scaling down, the pod is removed from endpoints list for service and, at the same time, a TERM signal is sent.

Pedantically, "at the same time" has no real meaning. It happens asynchronously. It might happen before or after or at the same time.

So, for me, at this exact moment, there is an opened window. Potentially, this pod (which is shutting down gracefully), might still get some requests forwarded by the nginx ingress-controller (just the time it needs for the ingress-controller to notify the changes, update and reload the conf).

The pod can take as long as it needs to shut down. Typically O(seconds) is sufficient time to finish or cleanly terminate open connections and ensure no new connections arrive. So, for example, you could request 1 minute grace, keep accepting connections for max(5 seconds since last connection, 30 seconds), drain any open connections, and then terminate.

Note that the exact same thing can happen with the Service VIP. kube-proxy is just an API watcher. It could happen that kube-proxy sees the pod delete after kubelet does, in which case it would still be routing Service VIP traffic to the pod that had already been signalled. There's literally no difference. That's my main point :)
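As a minimal sketch of the "1 minute grace" part (the accept/drain logic itself has to live in the application):

```yaml
# Pod spec fragment: ask for 60s between SIGTERM and SIGKILL so the app can
# stop accepting new connections and drain the open ones before it exits.
spec:
  terminationGracePeriodSeconds: 60
```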

@edouardKaiser
Author

True, "at the same time" doesn't mean that much here, it's more like those operations are triggered in parallel.

I wanted to point out that possibility because I ran some tests before opening this issue (continuously sending requests to an endpoint backed by multiple pods while scaling down). When the ingress controller was using the VIP, the downscaling happened more smoothly (no failures, no requests passed to the next server by nginx), contrary to when the ingress controller maintains the endpoint list (I noticed some requests failing during that short time window, and being passed to the next server for GET, PUT, etc.).

I'm surprised the same thing can happen with the Service VIP. I assumed that the kubelet would start the shutdown only once the pod was removed from the iptables entries, but I was wrong.

So your point is, I got lucky during my tests, because depending on the timing, I might have ended up in the same situation even with the Service VIP.

@thockin
Contributor

thockin commented Jun 28, 2016

On Mon, Jun 27, 2016 at 10:33 PM, Edouard Kaiser
notifications@github.com wrote:

I'm surprised the same thing can happen with the service VIP. I supposed that Kubelet would start the shutdown only once the pod was removed from the iptable entries, but I was wrong.

Nope. kube-proxy is replaceable, so we can't really couple except to the API.

So your point is, I got lucky during my tests, because depending the timing, I might have ended up with the same situation even with Service VIP.

I'd say you got UNlucky - it's always better to see the errors :)

If termination doesn't work as I described (roughly, I may get some details wrong), we should figure out why.
@edouardKaiser
Author

Thanks for the explanation Tim, I guess I can close this one.

@thockin
Contributor

thockin commented Jun 28, 2016

Not to impose too much, but since this is a rather frequent topic, I wonder if you want to write a doc or an example or something? A way to demonstrate the end-to-end config for this? I've been meaning to do it, but it means so much more when non-core-team people document stuff (fewer bad assumptions :).

I'll send you a t-shirt...


@edouardKaiser
Author

edouardKaiser commented Jun 28, 2016

Happy to write something.

Were you thinking about updating the README of the Ingress Controllers (https://github.com/kubernetes/contrib/tree/master/ingress/controllers/nginx)?

We could add a new paragraph about the choice of using the endpoint list instead of the Service VIP (advantages like upstream tuning, session affinity, etc.) and show that there is no guarantee of synchronisation even when using the Service VIP.

@glerchundi

@thockin thanks for the explanation, it's crystal clear now.

@edouardKaiser
Author

I'm glad I have a better understanding of how it works; it makes sense if you think of kube-proxy as just an API watcher.

But to be honest, now I'm kind of stuck. Some of our applications don't handle SIGTERM very well (no graceful stop). Even if the application is in the middle of a request, a SIGTERM shuts the app down immediately.

Using Kubernetes, I'm not sure how to deploy without downtime now. My initial understanding was this flow when scaling down/deploying a new version:

  1. Remove the pod from the endpoint list
  2. Wait for the terminationGracePeriod (to wait for any request in progress to finish)
  3. Then shut down with SIGTERM

We need to rethink how we deploy, or see if we can adapt our applications to handle SIGTERM.

@thockin
Contributor

thockin commented Jun 29, 2016

Re: writing something, I was thinking of a doc or a blog post, or even an example with YAML and a README.


@thockin
Contributor

thockin commented Jun 29, 2016

You also have labels at your disposal.

If you make your Service select app=myapp,active=true, then you can start all your pods with that set of labels. When you want to do your own termination, you can remove the active=true label from the pod, which will update the Endpoints object, and that will stop sending traffic. Wait however long you think you need, then delete the pod.
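A sketch of that pattern (names are placeholders); dropping the label is what removes the pod from the Endpoints object and stops new traffic:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp                  # placeholder
spec:
  selector:
    app: myapp
    active: "true"             # only explicitly "active" pods receive traffic
  ports:
  - port: 80
    targetPort: 8080

# To drain a pod before deleting it:
#   kubectl label pod <pod-name> active-     # take it out of the selector
#   ...wait for in-flight requests to finish...
#   kubectl delete pod <pod-name>
```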

Or you could teach your apps to handle SIGTERM.

Or you could make an argument for a configurable signal rather than SIGTERM
(if you can make a good argument)

Or ... ? other ideas welcome


@edouardKaiser
Author

Thanks for the advice, I tend to forget how powerful labels can be.

Regarding writing something, I can write a blog post to explain why using an endpoint list is better. But I'm not sure what kind of example (YAML) you are talking about.

@ababkov

ababkov commented Apr 2, 2017

@timoreimann I agree, though I'd ask that you take the lead on that, given that you're in the process of transitioning and can probably explain the use case more clearly. I fully understand your current situation (we're about to be in the same position) and am happy to contribute.

The only thing I'm not fully sure of from a proposal standpoint is whether the feature should/can try to address anything beyond a preliminary "draining" status that lasts for a fixed, configurable period of time. The alternative, of course, would be a solution where things that route traffic to the container register themselves as listeners (via state updates) and acknowledge (via state updates) when a container has gone into draining status; once all acknowledgements have been logged, the container transitions to terminating status and everything proceeds as usual.

@edouardKaiser
Author

@timoreimann we have the same issue: we have to ask every service owner to implement a proper SIGTERM handler to make sure deployments are transparent to users.

It's true that it would make things easier if the pod were just flagged as not-ready first: give time to remove it from behind the service, drain requests, and then send SIGTERM.

@timoreimann

timoreimann commented Apr 2, 2017

@ababkov I'd be happy to be the one who files the feature request.

My understanding is that any solution requiring some means of extended coordination will be much harder to push through. @thockin and other members of the community have expressed in the past that too much coupling is rather dangerous in a distributed system like Kubernetes, and it's the reason why Kubernetes was designed differently. Personally, I think that does make a lot of sense.

I think we will have time to delve into these and other implementation details on the new ticket. Will drop pointers here once I can find some time to file it (after I've made sure no one else has put forward a similar request yet).

Thanks for your input!

@ababkov

ababkov commented Apr 2, 2017

@timoreimann based on that, I'd suggest that the request be logged as the simpler alternative: allowing a configurable amount of time during which the container sits in an unready/draining state before the remainder of the termination flow takes place (maybe with a reference to this conversation).

That alone would make everything 10 times clearer and more straightforward.

@domino14

domino14 commented Apr 2, 2017

I think at the very least a few notes should be added to the documentation, as it's not clear that a special set of procedures is needed to get actual zero downtime.

@timoreimann

timoreimann commented Apr 2, 2017

@domino14 did you see the blog post @edouardKaiser referenced before?

@timoreimann

@ababkov sounds good to me!

@domino14

domino14 commented Apr 2, 2017

Yep, I did see the post and it was helpful in showing me what was going on behind the scenes, but it's still a bit tough to go from that to knowing we need delays in several places, etc., to get zero-downtime deploys. That's why I suggested notes in the documentation.

In any case, I'm a newcomer to K8s and I greatly appreciate the work done here. I'd submit a documentation patch if I knew the terminology better.

@thockin
Contributor

thockin commented Apr 3, 2017 via email

@keegancsmith

keegancsmith commented Apr 18, 2017

With regards to the original issue, could it be related to differences in connection strategies between the nginx ingress and kube-proxy? At least the userspace kube-proxy retries if dialling an endpoint fails: https://sourcegraph.com/github.com/kubernetes/kubernetes@v1.6.1/-/blob/pkg/proxy/userspace/proxysocket.go#L94:1-95:1
I'm not sure if the nginx ingress or the other kube-proxy modes have a similar strategy.

When gunicorn receives a SIGTERM, does it stop listening on the socket but continue to drain the open requests? In that case, it should drain gracefully with the userspace kube-proxy, since kube-proxy will move on to the next endpoint if gunicorn does not accept the connection.

@thockin
Contributor

thockin commented Apr 18, 2017 via email

@n1koo

n1koo commented Apr 20, 2017

Even if servers like (g)unicorn are doing the wrong thing, there's no guarantee which happens first: a) the endpoint update or b) the SIGTERM on the process. I think the endpoint update should be synchronous, and only after it's finished should we continue with SIGTERM (as discussed above). Any other solution has this timing problem.

Some notes about our testing and workarounds for now:

  • You can work around this by using the sleep hack in the case of services (see the sketch after this list). Magic sleeps are never really "real" patterns, hacks at best.
  • This problem is mitigated by nginx for GETs and the like because it retries on 50x.
  • There's another issue in nginx where a full upstream flip (e.g. pool [a,b] -> [c,d]) causes proxy_next_upstream to not work. Testable with e.g. a rolling update, replicas 3, maxSurge 25%, maxUnavailable 50%, where it ends up going from 4 pods (2 old in the pool, 2 new getting ready) to 2 new in the pool.
  • I haven't been able to reproduce this on GLBC. Any ideas what it does differently, since simply using a service VIP doesn't fix it? Does it do the same as proxy_next_upstream but better?
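For completeness, a minimal sketch of the sleep hack from the first bullet, assuming a preStop hook is used to delay SIGTERM until the endpoint removal has (probably) propagated; the durations are guesses, which is exactly why it's a hack:

```yaml
# Pod template fragment (names are placeholders).
spec:
  terminationGracePeriodSeconds: 40          # must exceed the preStop sleep
  containers:
  - name: myapp
    image: registry.example.com/myapp:v1
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 10"]  # delay SIGTERM while endpoints update
```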

@aledbf
Contributor

aledbf commented Apr 20, 2017

There's another issue in nginx where a full upstream flip (e.g. pool [a,b] -> [c,d]) causes proxy_next_upstream to not work. Testable with e.g. a rolling update, replicas 3, maxSurge 25%, maxUnavailable 50%, where it ends up going from 4 pods (2 old in the pool, 2 new getting ready) to 2 new in the pool.

The focus after the next nginx controller release (0.9) will be avoiding nginx reloads for upstream updates.

@warmchang
Contributor

👍

@Diggsey

Diggsey commented Mar 6, 2018

Just ran into this issue - at the very least this should be better documented. I'm sure the vast majority of people using rolling updates believe that Kubernetes waits until a pod is out of rotation before sending SIGTERM, and this is not the case!

I also think that even if it's not implemented right away, sequencing SIGTERM to happen after a pod has been taken out of rotation should be the eventual goal. I don't think saying "gunicorn is doing the wrong thing" is a useful response; that's just Kubernetes redefining what it thinks these signals mean. Gunicorn and other servers behave in a fairly standard way with respect to signals, and Kubernetes should aim to be compatible with them rather than offloading the problem onto users.

@simonkotwicz

simonkotwicz commented Apr 4, 2019

@thockin you mentioned in an earlier comment:

During that time window (30 seconds by default), the Pod is considered “terminating” and will be removed from any Services by a controller (async). The Pod itself only needs to catch the SIGTERM, and start failing any readiness probes.

Why do you need to start failing the readiness probe?

When a Pod is marked as Terminating, the Endpoint controller observes the change and removes the pod from Endpoints.

Since this already happens as a result of the pod terminating, why do we also need to start failing the readiness probe, which (after it fails a certain number of times) will also result in the pod being removed as an endpoint of the service?

@timoreimann

@simonkotwicz you do not need to start failing the readiness probe (these days, at least). I'm not sure if it used to be necessary, but it's certainly not the case anymore. Things work fine without that part.
