Killing Python pods via rollout causes 502's #3726
I think our problem lately is that we're doing a lot of code releases, and whenever we do a rollout as part of our CD pipeline, it takes out pods and things get nasty. I'm not sure what the fix is. Maybe a tweak in how we roll things out? Maybe a fix in the way gunicorn is killed in Docker during the rollout?

Needs investigation and a fix. Hopefully an easy one.
First piece of info to look up is the grace period for the pods. Maybe it's possible to restructure tasks to shift some work offline. More likely, for a select few tasks, the solution is to have them respect the termination signal and end their work early.
Well, the 502's are from the front end, so I don't think they're Celery related. I guess a good place to look is at how we handle termination in the web pods.
A review shows about 1 in 200 requests gets a 502. Too high. The next step is probably to look at the logs some more and search for patterns. The problem could be at any of several layers.
Hello. We could set the grace period to 90 seconds so it'll match the timeout set in the ALB. I can confirm that cl-django (gunicorn) receives the signals and will wait for pending tasks/connections. Something that I noticed is that there is no …. We could test these two settings very easily, with little to no impact, and see if they make a difference when running a deployment. If we still see high numbers we can look a little more into the balancer or the application itself.

Some interesting articles related to Celery and Django:

- Graceful Termination of Django and Celery Worker Pods in Kubernetes
- Zero Downtime Django (gunicorn) Deployments on GKE

Thank you!
These are great fixes. Let's do one first, see if it helps. Your pick!
Hey @mlissner, I just applied the following: … If you feel that we should adjust any value, just let me know. The …
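The exact settings applied above were elided in this copy of the thread. As a rough sketch only, the grace-period change under discussion would look something like this in a Deployment, with the value matched to the 90-second ALB timeout (the names and image here are assumptions, not the actual manifest):

```yaml
# Hypothetical Deployment excerpt; names and values are assumptions,
# not the exact settings applied in this thread.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cl-django        # assumed name, taken from the comment above
spec:
  template:
    spec:
      # Give gunicorn time to finish in-flight requests after SIGTERM;
      # 90s matches the ALB timeout mentioned earlier in the thread.
      terminationGracePeriodSeconds: 90
      containers:
        - name: cl-django
          image: example/cl-django:latest   # placeholder image
```

Kubernetes defaults terminationGracePeriodSeconds to 30, so without a change like this the kubelet sends SIGKILL 30 seconds after SIGTERM regardless of pending connections.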
Fantastic. I have a good feeling about this!
I've been looking at the pods for quite a bit now and I can say without a doubt that the 502s come from the ALB not being able to reach the targets, and that they happen when the pods are being terminated.

What did "work" was setting a sleep in the preStop hook. This is less than optimal and hardly qualifies as a solution, but maybe we can find a sweet spot. I've set the sleep to 90 seconds just as a test; that's too extreme.

In this graph you can easily spot that when there is a drop in the number of targets, there is a spike in the 502s from the ALB. Here you can see on the chart when I added the preStop sleep.
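A minimal sketch of the preStop sleep being described, assuming a shell is available in the container image (the 90-second value was the test mentioned above and gets tuned down later in the thread):

```yaml
# Hypothetical container excerpt; the exact manifest wasn't preserved.
containers:
  - name: cl-django
    lifecycle:
      preStop:
        exec:
          # Keep the pod alive and reachable while the ALB deregisters
          # the target; SIGTERM is only sent after this hook finishes.
          command: ["/bin/sh", "-c", "sleep 90"]
```

Note that the preStop hook counts against terminationGracePeriodSeconds, so the grace period has to be at least as long as the sleep plus the time gunicorn needs to drain its connections.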
Interesting.
Does this possibly imply that the ALB is misconfigured? What's the tradeoff of doing the 90s sleep in the preStop? Is there anything bad that it causes?
I don't think so. I found several issues open about this; most of them got around it using a sleep in the preStop hook. Also, in here: …

I couldn't find a proper solution yet.

I've set the sleep to 30 seconds, which seems more reasonable. I want to see if the 502s reappear.
Sounds like you've probably got this fixed. Now we just need to do some deployments and see! If you want to force a few deployments as tests, that makes sense to me.
Oh, sorry, one other thing. Whenever we do stuff like this, we should be thinking about bots.law too. It mostly borrows from the CL k8s stuff, so we should upgrade it too.
I just saw that there was a deployment about 3 hours ago. No noticeable 502 spike can be seen. The rollout step did take 5m 29s, which is roughly 2 minutes more than it used to take. I've also manually triggered several rollout restarts for the deployment and nothing shows up. Additionally, there are fewer 502s in general now, since these changes also cover autoscaling (scale-downs terminate pods the same way). If these settings seem reasonable I can go ahead and apply them to the bots-law deployment. Thank you!
Let's bring these changes to bots.law too. Thank you!
And... while I'm thinking about this, would this also affect our other microservices? I could see how they might have startup or shutdown delays that'd cause these kinds of issues. We have issues with unreliable microservices...
Hmm, well. The … This issue in particular seems to be with the deployments that have an ALB as ingress. Containers communicating with each other within the cluster use Kubernetes svc resources, so a terminating pod is immediately removed as a target by ….

BTW, there was a 502 spike today, but it seems that there wasn't a deployment during that time, and it matches a spike in the number of connections.
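For reference, a sketch of the kind of ALB-backed ingress being described, using AWS Load Balancer Controller annotations (names and values are illustrative, not the project's actual config). With target-type "ip" the ALB routes straight to pod IPs, so a terminating pod remains a registered target until the controller deregisters it, unlike in-cluster svc traffic:

```yaml
# Hypothetical Ingress excerpt; values assumed for illustration.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: cl-django
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    # Pod IPs are registered directly as ALB targets.
    alb.ingress.kubernetes.io/target-type: ip
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: cl-django
                port:
                  number: 80
```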
Got it. Want to create a fresh issue to get that in place on all our microservices too?
For the connection count and 502 spike, that feels like an issue we should also split into its own thing, and maybe do some load testing. Does that make sense to you?
This is largely fixed. There are still a few 502's, but we'll address them in another issue, later.