
Killing Python pods via rollout causes 502's #3726

Closed
mlissner opened this issue Feb 2, 2024 · 19 comments

Comments

@mlissner
Member

mlissner commented Feb 2, 2024

I think our problem lately is that we're doing a lot of code releases, and whenever we do a rollout as part of our CD pipeline, it takes out pods and things get nasty. I'm not sure what the fix is. Maybe a tweak in how we roll things out? Maybe a fix in the way gunicorn is killed in docker during the rollout?

Needs investigation and a fix. Hopefully an easy one.

@cweider
Collaborator

cweider commented Feb 14, 2024

The first piece of info to look up is the grace period for the pods (terminationGracePeriodSeconds, I think). My intuition is that long-running celery tasks are the cause. I assume that the 502's are HTTP responses from cl-django (not offline tasks?). Maybe Sentry makes it easy to get a handle on the running time of tasks. Otherwise, a list of suspects will have to be drawn up (I can't help with that, but seeing that list would be interesting!).

Maybe it's possible to restructure tasks to shift some work offline. More likely, for a select few tasks, the solution is to have them respect the termination signal and end their work early.

docker-entrypoint.sh is using exec, so signals should be received by celery - good! (But this is still a key thing to verify with a test, which I haven't done.)

@mlissner
Member Author

Well, the 502's are from the front end, so I don't think they're celery related. I guess a good place to look is how we handle termination in the web pods.

@mlissner
Member Author

Review shows about 1/200 requests gets a 502. Too high.

Next step is probably to look at logs some more and search for patterns. Problem could be at many layers.

@blancoramiro
Contributor

blancoramiro commented May 10, 2024

Hello. terminationGracePeriodSeconds is not set in the deployment so it's using the default value, 30 seconds.

We could set it to 90 seconds so it'll match the timeout set in the ALB.
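
For reference, this is roughly where that setting would live in the pod spec. This is only a sketch, and the names here are placeholders rather than our actual manifest:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: cl-django                        # placeholder name
    spec:
      template:
        spec:
          terminationGracePeriodSeconds: 90  # match the ALB timeout; the default is 30
          containers:
            - name: cl-django                # placeholder name
              image: example/cl-django       # placeholder image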

I can confirm that cl-django (gunicorn) receives the signals and will wait for pending tasks/connections.

Something else I noticed is that there is no readinessProbe in place, so some of the 502s might be from newly created pods. Since there is no readiness probe, requests are routed to a pod the moment the process has started.

We could test these two settings very easily, with little to no impact, and see if they make a difference when running a deployment.

If we still see high numbers, we can look a little more into the balancer or the application itself. Some interesting articles related to Celery and Django:

Graceful Termination of Django and Celery Worker Pods in Kubernetes

Zero Downtime Django (gunicorn) Deployments on GKE

Thank you!

@mlissner
Member Author

These are great fixes. Let's do one first, see if it helps. Your pick!

@blancoramiro
Contributor

Hey @mlissner,

I just applied the following readinessProbe; we should see if it makes any difference on the next deployment.

If you feel we should adjust any value, just let me know. The readinessProbe should be in there either way, so once we're comfortable with the settings I'll commit it to the kubernetes repo.

        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /monitoring/health-check/
            port: 8000
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 1

@mlissner
Member Author

Fantastic. I have a good feeling about this!

@blancoramiro
Contributor

I've been looking at the pods for quite a bit now and I can say without a doubt that the 502s come from the ALB not being able to reach the targets, and that they happen when the pods are being terminated.

  • readinessProbe didn't do the trick, but I think it should be included anyway, since some of the 502s might come from a pod receiving requests too early.
  • terminationGracePeriodSeconds didn't work either. The grace period starts when the pod receives the SIGTERM signal, most pods terminate on their own well before it ends, and the pod gets deleted right after that. Still, I do think it should be set to 90 seconds to match the balancer's timeout.

What did "work" was setting a preStop hook with a sleep. It seems that the ALB takes a little while to remove the target from the target group and keeps sending requests to that pod in the meantime.

This is less than optimal and hardly qualifies as a solution, but maybe we can find a sweet spot. I've set the sleep to 90 seconds just as a test; that's too extreme.

    lifecycle:
      preStop:
        exec:
          command:
          - sleep
          - "90"
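
One caveat, based on my reading of the Kubernetes docs (not verified against our cluster): the preStop sleep and the grace period interact, since the sleep has to finish before the grace period runs out or the kubelet kills the pod anyway. So if we keep a long sleep we probably also want a longer terminationGracePeriodSeconds, roughly like this (values are illustrative):

    spec:
      terminationGracePeriodSeconds: 120   # illustrative: should cover the preStop sleep plus gunicorn's drain time
      containers:
        - name: cl-django                  # placeholder container name
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "90"]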

In this graph you can easily see that whenever there is a drop in the number of targets, there is a spike in 502s from the ALB.

Here you can see on the chart that, after the preStop was added, there were no 502s when pods got terminated.

@mlissner
Member Author

Interesting.

It seems that the ALB takes a little while to remove the target from the target group and keeps sending requests to that pod in the meantime.

Does this possibly imply that the ALB is misconfigured?

What's the tradeoff of doing the 90s sleep in the preStop? Does it cause anything bad?

@blancoramiro
Contributor

Does this possibly imply that the ALB is misconfigured?

I don't think so. I found several issues open about this, and most of them got around it using a preStop hook: aws-load-balancer-controller

Also, from here:

Deregistration Lifecycle Alignment
During the deregistration process, we encountered a different alignment issue that seems to be an unresolved “bug.”

According to ALB documentation, the deregistration process in ALB begins by marking the target for deregistration and initiating the deregistration delay. This delay allows the target to complete processing any in-flight requests before it is ultimately removed. During this delay, the target should not receive any new traffic from the ALB.

However, our experience and an unresolved discussion on the ALB controller Github issue indicate that this is not the case: ALB still sends traffic even when a target is in the deregistration state. We are still uncertain about what is happening inside the ALB, but this behavior has been consistently reproducible to date.

To ensure a seamless shutdown process, we employed a preStop hook to delay the pod termination until after the ALB deregisters the target. Based on our observations, although the total time between the actual end of traffic acceptance and the start of the deregistration state varies, it generally falls within the range of 5 to 10 seconds. As a result, we set our preStop hook to 15 seconds to account for this delay.

I couldn't find a proper solution yet.

What's the tradeoff of doing the 90s sleep in the preStop? Does it cause anything bad?

Well... it means we add a delay to each pod, and it'll impact the deployment time a little. A lot of the pods terminate concurrently, and only 25% of them will be waiting at any given time, per the rolling update settings.
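
For context, this is the rolling update behavior I mean; these are the Kubernetes defaults, sketched here rather than copied from our manifest:

    strategy:
      type: RollingUpdate
      rollingUpdate:
        maxUnavailable: 25%   # at most a quarter of the pods are terminated (and sleeping) at once
        maxSurge: 25%         # up to a quarter extra pods can be created during the rollout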

I've set the sleep to 30 seconds, which seems more reasonable. I want to see if the 502s reappear.

@mlissner
Member Author

Sounds like you've probably got this fixed. Now we just need to do some deployments and see! If you want to force a few deployments as tests, that makes sense to me.

@mlissner
Member Author

Oh, sorry, one other thing. Whenever we do stuff like this, we should be thinking about bots.law too. It mostly borrows from the CL k8s stuff, so we should upgrade it too.

@blancoramiro
Contributor

I just saw that there was a deployment about 3 hours ago. No noticeable 502 spike can be seen. The rollout step did take 5m 29s, which is roughly 2 minutes longer than it used to take.

I've also manually triggered several rollout restarts for the deployment and nothing shows up.

Additionally, there are fewer 502s in general now, since these changes also cover autoscaling.


If these settings seem reasonable, I can go ahead and apply them to the bots-law deployment.

Thank you!

@mlissner
Member Author

Let's bring these changes to bots.law too. Thank you!

@mlissner
Member Author

And...while I'm thinking about this, would this also affect our other microservices? I could see how they might have start up or shutdown delays that'd cause these kinds of issues. We have issues with unreliable microservices...

@blancoramiro
Contributor

Hmm, well. The readinessProbe should be on all of the deployments for containers that will be accepting requests. It makes sense to give the process some time; how much depends on what the process actually does.

This issue in particular seems specific to the deployments that have an ALB as ingress.

For containers communicating with each other within the cluster through kubernetes svc resources, kube-proxy removes the pod as a target as soon as it goes into a terminating state, so it won't get any more requests sent to it.
It would still be very valuable to monitor this though, probably by looking at logs.
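
As a rough template for the other microservices, the probe would look something like this; the path, port, and timings are placeholders, since each service needs its own health endpoint and tuning:

    readinessProbe:
      httpGet:
        path: /health          # placeholder; depends on the service
        port: 8080             # placeholder; the service's container port
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 30
      timeoutSeconds: 1
      successThreshold: 1
      failureThreshold: 3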

BTW there was a 502 spike today, but it seems that there wasn't a deployment during that time and it matches a spike in the number of connections.

[Screenshots from 2024-05-15 showing the 502 spike alongside the spike in connections]

@mlissner
Member Author

The readinessProbe should be on all of the deployments for the containers that will be accepting requests.

Got it. Want to create a fresh issue to get that in place on all our microservices too?

@mlissner
Member Author

For the connection count and 502 spike, that feels like an issue we should also split into its own thing, and maybe do some load testing. Does that make sense to you?

@mlissner
Member Author

mlissner commented Jul 8, 2024

This is largely fixed. There are still a few 502's, but we'll address them in another issue, later.

mlissner closed this as completed Jul 8, 2024