Consider delay/sleep on graceful termination #2764
Comments
We already have …
That would be a very nice addition for us! Thanks @tsandall 👍 We'll get right to work then :)
Fixes open-policy-agent#2764 Signed-off-by: Björn Carlsson <bjorn.carlsson@bisnode.com>
When rolling out OPA at scale in our Kubernetes clusters, we have seen occasional errors reported by our microservices. We managed to track this down to their shutdown phase and the way they interact, or rather need to interact, with OPA running as the sidecar next to them.

When aiming for zero-downtime deployments in environments where external load balancers are involved (as opposed to purely internal traffic), these often need some time to register the fact that a pod is shutting down and update their list of available endpoints accordingly. This means that pods in a shutdown state may still receive traffic for (most often) a few seconds after Kubernetes has marked them as shutting down and sent them a SIGTERM. The common way of dealing with this is to introduce a delay/sleep of something like 15 seconds in the apps when receiving the SIGTERM signal before actually closing down. This could either be done inside the application logic, or more commonly by calling sleep in the container's preStop lifecycle configuration, like:
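A minimal sketch of such a preStop hook (assuming the container image ships a shell and a sleep binary):

```yaml
lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 15"]
```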
As this preStop hook is called before the SIGTERM is sent, load balancers and other components in the chain have a chance to delist the pod from the list of available endpoints before the app shuts down. This workaround is documented here, there and elsewhere.

This pattern, though simple, has proven to work really well. With OPA now running as a sidecar in many of our pods (and way more to come!) we're now facing the problem that OPA shuts down almost immediately when receiving the SIGTERM, while the app container continues to run and serve traffic for the duration of the sleep. This of course leads to a situation where our apps are requesting decisions from OPA sidecars that have already terminated.
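For context, a rough sketch of the sidecar layout in question, with hypothetical names and images; the app container carries the preStop delay described above, while OPA runs the plain official image:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app                  # hypothetical name
spec:
  containers:
    - name: app
      image: example/app:1.0         # hypothetical application image
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 15"]
    - name: opa
      image: openpolicyagent/opa:latest
      args: ["run", "--server"]
      # No equivalent preStop delay here: the image only contains the
      # OPA binary, so there is no shell or sleep to call.
```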
Since the OPA container does not include anything but the OPA binary, there's no way for us to call the sleep binary (or, I guess in some systems, built-in). We'd like to find a way of keeping our zero-downtime fix while running OPA as sidecars. Our options for making this work, I think, boil down to these:

- Build and maintain our own OPA image that also ships a sleep binary. This is our current workaround. While it works, it adds to the burden of running and maintaining OPA in our clusters.
- Add support for a delay/sleep on graceful termination to OPA itself, so the sidecar can be told to wait before shutting down (a sketch of what this could look like follows below).
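For illustration, a rough sketch of what the upstream option could look like from the deployment side. The flag name used below is hypothetical and not something confirmed here; it simply stands in for whatever knob OPA would expose for delaying shutdown:

```yaml
- name: opa
  image: openpolicyagent/opa:latest
  args:
    - "run"
    - "--server"
    # Hypothetical flag: keep serving for this many seconds after
    # SIGTERM before starting the actual shutdown, mirroring the
    # preStop sleep used for the app container.
    - "--shutdown-wait-period=15"
```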
I'd be curious to hear if others in the OPA community have dealt with this and if so how, and whether one of the proposed solutions for dealing with this upstream would be considered.