
Fails mpi-operator early if access to list or watch objects is denied #619

Merged 1 commit on Feb 5, 2024

Conversation

@emsixteeen (Contributor)

Currently the mpi-operator waits for informer caches to sync before being ready to do work.

If access to any object fails due to permission issues, the failure is simply logged, and unless the access issue is resolved, the mpi-operator remains in this waiting state indefinitely.

This request adds an option to fatally fail the mpi-operator if it is unable to list or watch objects due to Forbidden or Unauthorized errors.

The option defaults to false so that the existing behavior is maintained.
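
As a rough sketch of the idea (the function and flag names below are illustrative, not necessarily the PR's exact code), the informer's watch error handler keeps the default client-go behavior and additionally exits the process on authorization failures:

import (
    apierrors "k8s.io/apimachinery/pkg/api/errors"
    "k8s.io/client-go/tools/cache"
    "k8s.io/klog"
)

// Sketch: install a handler that preserves the default logging/backoff
// behavior and, when fail-fast is enabled, terminates the process on
// Unauthorized/Forbidden list/watch errors.
func setWatchErrorHandler(informer cache.SharedIndexInformer, name string, failFast bool) error {
    return informer.SetWatchErrorHandler(func(r *cache.Reflector, err error) {
        cache.DefaultWatchErrorHandler(r, err)
        if failFast && (apierrors.IsUnauthorized(err) || apierrors.IsForbidden(err)) {
            klog.Fatalf("Unable to sync cache for informer %s: %s. Exiting.", name, err)
        }
    })
}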

@alculquicondor (Collaborator)

I wouldn't make it an option.
The kubelet (or deployment controller) would retry anyways, so the behavior is similar.

@emsixteeen (Contributor, Author)

The thought process for making it an option was to have the mpi-operator fail fast (upon the first permission error).

Right now the MPIJobController startup process is something like this:

  • Operator starts up, and spawns Goroutines to list/watch objects (ConfigMaps, Secrets, Services, etc.)
  • It waits for SharedInformer.HasSynced() to be true
  • Until SharedInformer.HasSynced() returns true, the mpi-operator remains in a "waiting" state (see the sketch below)
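
For reference, that waiting step boils down to a cache.WaitForCacheSync call along these lines (the informer names are illustrative, not the operator's exact ones); note that list/watch permission errors never surface through this call, they are only logged by the reflectors:

// Sketch: inside the controller's Run path, block until all informer
// caches have synced or stopCh is closed.
if ok := cache.WaitForCacheSync(stopCh,
    configMapInformer.HasSynced,
    secretInformer.HasSynced,
    serviceInformer.HasSynced,
    mpiJobInformer.HasSynced,
); !ok {
    return fmt.Errorf("failed to wait for caches to sync")
}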

Waiting for SharedInformer.HasSynced() to return true might be acceptable in some situations, such as when the permission issue is transient.

Making the "fail-fast" optional (and defaulting it to false) just leaves the current behavior as is.

The use case this request is trying to address is when it is known upfront that the permission errors are not transient, and you want the mpi-operator to fail fast.

Regarding kubelet or the Deployment Controller handling this, it seems that they would only handle this based on a readiness probe, which currently – once the leader is elected – will always come back as healthy.

Per my other comment, all of this could technically be handled in a more robust readiness/health probe, but that poses other challenges (namely that InstallPathHandler can only be called once).
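
For context, the InstallPathHandler mentioned here is from k8s.io/apiserver/pkg/server/healthz; a minimal sketch of a richer probe might look like the following (the "informer-permissions" check is hypothetical). Because InstallPathHandler wires the path into the mux in one call, additional checks cannot be registered for that path afterwards, which is the constraint mentioned above:

mux := http.NewServeMux()
healthz.InstallPathHandler(mux, "/healthz",
    healthz.PingHealthz,
    // Hypothetical extra check that would surface list/watch auth errors.
    healthz.NamedCheck("informer-permissions", func(r *http.Request) error {
        return lastWatchAuthError() // placeholder, not part of the operator
    }),
)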

Review thread on this code:

cache.DefaultWatchErrorHandler(r, err)

if errors.IsUnauthorized(err) || errors.IsForbidden(err) {
    klog.Fatalf("Unable to sync cache for informer %s: %s. Exiting.", name, err)
}

@emsixteeen (Contributor, Author)

Hmm... Doesn't seem like it's possible to return an error:

SetWatchErrorHandler takes a WatchErrorHandler – which does not return anything. If SetWatchErrorHandler itself fails, the code already returns an error for that.

With regard to Replacing Fatal Calls, klog.ErrorS and klog.FlushAndExit are in later versions of klog, but the operator is using v1, which doesn't have those methods ... 😕

@emsixteeen (Contributor, Author)

I made a mistake: SetWatchErrorHandler was calling klog.Fatalf() as well. Changed it to return an error if it fails.

Exploring some other ways a graceful shutdown can be initiated if the error handler does get invoked (without calling klog.Fatalf()) – see the sketch after this list:

  1. Upgrade to klog/v2
  2. Call kubeapiserver.RequestShutdown()
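
As a sketch of what a non-Fatalf shutdown could look like (purely illustrative; shutdownCh, name, and the surrounding wiring are hypothetical and not from this PR), the handler could log the error and signal a channel that the controller's Run loop watches:

var once sync.Once

regErr := informer.SetWatchErrorHandler(func(r *cache.Reflector, err error) {
    cache.DefaultWatchErrorHandler(r, err)
    if apierrors.IsUnauthorized(err) || apierrors.IsForbidden(err) {
        klog.Errorf("Unable to sync cache for informer %s: %s. Requesting controller to exit.", name, err)
        once.Do(func() { close(shutdownCh) }) // Run() selects on shutdownCh and returns
    }
})
if regErr != nil {
    return regErr // registration fails if the informer has already started
}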

@alculquicondor (Collaborator)

Regarding kubelet or the Deployment Controller handling this, it seems that they would only handle this based on a readiness probe, which currently – once the leader is elected – will always come back as healthy.

What I mean is that if you enable fail-fast, the kubelet would restart the container, or the Deployment controller would recreate the pod. So effectively, the end-to-end behavior is that the system keeps trying to start the manager. That is pretty similar to the existing behavior, except that today the retry happens inside the binary without exiting it.

Users are not losing anything. Instead, they are getting more visibility into any possible failures, as they will see pods being recreated or restarted, causing them to investigate.

@emsixteeen (Contributor, Author)

Users are not losing anything. Instead, they are getting more visibility into any possible failures, as they will see pods being recreated or restarted, causing them to investigate.

Oh, I see what you mean: in effect the pod is exiting and thus being forced into an unhealthy state ... Do you suggest then that the option be completely removed? This will cause the "current behavior" to change ...

@alculquicondor (Collaborator)

I don't expect anyone to be relying on the "current behavior". The new behavior just moves the retry to a higher level controller, as opposed to being retried within the binary.

So yes, I think we can just remove the option.

The google-oss-prow bot added the size/S label and removed the size/M label on Feb 5, 2024.
@emsixteeen (Contributor, Author)

I don't expect anyone to be relying on the "current behavior".

I'm just trying not to break stuff 😄

So yes, I think we can just remove the option.

Removed

Review thread on this code:

cache.DefaultWatchErrorHandler(r, err)

if errors.IsUnauthorized(err) || errors.IsForbidden(err) {
    klog.Fatalf("Unable to sync cache for informer %s: %s. Requesting controller to exit.", name, err)
}
@alculquicondor (Collaborator)

return error here too.

@emsixteeen (Contributor, Author)

This is where an error can't be returned!

SetWatchErrorHandler takes a WatchErrorHandler which is defined as:

type WatchErrorHandler func(r *Reflector, err error)

I.e., its args are (*Reflector, error) – and it returns nothing!


@emsixteeen (Contributor, Author)

To follow this guide, klog needs to be upgraded to klog/v2. The methods klog.ErrorS() and klog.FlushAndExit() are not available in klog/v1.0.0.

Should klog be updated to klog/v2 for this change?
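
For reference, the klog/v2 pattern for replacing a Fatal call looks roughly like this (not part of this PR, since the operator is still on klog v1.0.0):

// klog/v2: log a structured error, then flush buffers and exit,
// instead of calling klog.Fatalf.
klog.ErrorS(err, "Unable to sync cache for informer", "informer", name)
klog.FlushAndExit(klog.ExitFlushTimeout, 1)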

@alculquicondor (Collaborator)

:(

Let's leave that for a follow up

/lgtm
/approve

@alculquicondor (Collaborator) left a comment

Please squash

@alculquicondor (Collaborator)

I'm just trying not to break stuff

Absolutely, I appreciate the amount of consideration for backwards-compatibility.


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

The google-oss-prow bot merged commit 4c9ac06 into kubeflow:master on Feb 5, 2024.
10 checks passed