Consider delay/sleep on graceful termination #2764

Closed
anderseknert opened this issue Oct 8, 2020 · 2 comments · Fixed by #2834

Comments

@anderseknert
Member

When rolling out OPA at scale in our Kubernetes clusters, we have seen occasional errors reported by our microservices. We managed to track this down to their shutdown phase and the way they interact, or rather need to interact, with OPA running as the sidecar next to them. When aiming for zero-downtime deployments in environments where external load balancers are involved (as opposed to purely internal traffic), these often need some time to register the fact that a pod is shutting down and to update their list of available endpoints accordingly. This means that pods in a shutdown state may still receive traffic for (most often) a few seconds after Kubernetes has marked them as shutting down and sent them a SIGTERM.

The common way of dealing with this is to introduce a delay/sleep of something like 15 seconds in the apps when receiving the SIGTERM signal, before actually closing down. This can either be done inside the application logic, or more commonly by calling sleep in the container's preStop lifecycle configuration, like:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - name: nginx
      image: nginx
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "15"]

As this preStop hook is called before the SIGTERM is sent, load balancers and other components in the chain have a chance to delist the pod from the list of available endpoints before the app shuts down. This workaround is documented here, there and elsewhere.

This pattern, though simple, has proven to work really well. With OPA now running as a sidecar in many of our pods (and way more to come!) we're now facing the problem that OPA shuts down almost immediately when receiving the SIGTERM, while the app container continues to run and serve traffic for the duration of the sleep. This of course leads to a situation where our apps are requesting decisions from OPA sidecars that have already terminated.

Since the OPA container does not include anything but the OPA binary, there's no way for us to call the sleep binary (or, I guess, shell built-in on some systems). We'd like to find a way of keeping our zero-downtime fix while running OPA as a sidecar. Our options for making this work, I think, boil down to these:

  1. Build our own OPA images, including sleep. This is our current workaround (sketched below). While it works, it adds to the burden of running and maintaining OPA in our clusters.
  2. Have sleep included in the official image.
  3. Have OPA itself receive the SIGTERM but continue to serve for a configurable delay until initiating graceful shutdown.
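
For reference, a rough sketch of what option 1 looks like for us today. The names my-app and my-registry/opa-with-sleep are just placeholders for our application image and for a custom image that adds a sleep binary on top of the official OPA image:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - name: app
      image: my-app
    - name: opa
      # placeholder: the official OPA image rebuilt with a sleep binary added
      image: my-registry/opa-with-sleep
      args: ["run", "--server"]
      lifecycle:
        preStop:
          exec:
            # keep the OPA sidecar around for the same delay as the app
            command: ["sleep", "15"]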

I'd be curious to hear if others in the OPA community have dealt with this and if so how, and whether one of the proposed solutions for dealing with this upstream would be considered.

@tsandall
Member

We already have --shutdown-grace-period which is essentially the opposite: if the server doesn't shut down within the grace period, the process exits anyway. Adding an additional option like --shutdown-wait-period seems like the simplest solution: when a SIGINT or SIGTERM signal is received, wait for that period before commencing shutdown. The default should be zero.
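
As a rough sketch of how that could look from the deployment side once such an option exists (flag name per the proposal above, not final), the sidecar would simply pass the wait period and could drop the preStop hook entirely:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - name: opa
      image: openpolicyagent/opa
      # on SIGTERM, keep serving for 15 seconds before starting graceful shutdown
      args: ["run", "--server", "--shutdown-wait-period=15"]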

@anderseknert
Member Author

That would be a very nice addition for us! Thanks @tsandall 👍 We'll get right to work then :)

bcarlsson pushed a commit to Bisnode/opa that referenced this issue Oct 28, 2020
Fixes open-policy-agent#2764

Signed-off-by: Björn Carlsson <bjorn.carlsson@bisnode.com>
patrick-east pushed a commit that referenced this issue Oct 30, 2020
Fixes #2764

Signed-off-by: Björn Carlsson <bjorn.carlsson@bisnode.com>