
Fails mpi-operator early if access to list or watch objects is denied #619

Merged 1 commit on Feb 5, 2024

Conversation

@emsixteeen (Contributor)

Currently the mpi-operator waits for informer caches to sync before being ready to do work.

If access to any object fails due to permission issues, the failure is simply logged, and unless the access issue is resolved, the mpi-operator remains in this waiting state indefinitely.

This request adds an option to fatally fail the mpi-operator if it is unable to list or watch objects due to Forbidden or Unauthorized errors.

The option defaults to false so that the existing behavior is maintained.
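
As a rough sketch of the idea (the function and flag names below are illustrative, not necessarily the PR's exact code), the informer's watch error handler keeps the default client-go behavior and additionally exits the process on authorization failures:

import (
    apierrors "k8s.io/apimachinery/pkg/api/errors"
    "k8s.io/client-go/tools/cache"
    "k8s.io/klog"
)

// Sketch: install a handler that preserves the default logging/backoff
// behavior and, when fail-fast is enabled, terminates the process on
// Unauthorized/Forbidden list/watch errors.
func setWatchErrorHandler(informer cache.SharedIndexInformer, name string, failFast bool) error {
    return informer.SetWatchErrorHandler(func(r *cache.Reflector, err error) {
        cache.DefaultWatchErrorHandler(r, err)
        if failFast && (apierrors.IsUnauthorized(err) || apierrors.IsForbidden(err)) {
            klog.Fatalf("Unable to sync cache for informer %s: %s. Exiting.", name, err)
        }
    })
}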

@alculquicondor (Collaborator)

I wouldn't make it an option.
The kubelet (or deployment controller) would retry anyways, so the behavior is similar.

@emsixteeen (Contributor, Author)

The thought process for making it an option was to have the mpi-operator fail fast (upon the first permission error).

Right now the MPIJobController startup process is something like this:

  • Operator starts up, and spawns Goroutines to list/watch objects (ConfigMaps, Secrets, Services, etc.)
  • It waits for SharedInformer.HasSynced() to be true
  • Until SharedInformer.HasSynced() returns true, the mpi-operator remains in a "waiting" state (see the sketch below)
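
For reference, that waiting step boils down to a cache.WaitForCacheSync call along these lines (the informer names are illustrative, not the operator's exact ones); note that list/watch permission errors never surface through this call, they are only logged by the reflectors:

// Sketch: inside the controller's Run path, block until all informer
// caches have synced or stopCh is closed.
if ok := cache.WaitForCacheSync(stopCh,
    configMapInformer.HasSynced,
    secretInformer.HasSynced,
    serviceInformer.HasSynced,
    mpiJobInformer.HasSynced,
); !ok {
    return fmt.Errorf("failed to wait for caches to sync")
}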

Waiting for SharedInformer.HasSynced() to return true might be acceptable in some situations, such as when the permission issue is transient.

Making the "fail-fast" optional (and defaulting it to false) just leaves the current behavior as is.

The use case this request is trying to address is when it is known upfront that the permission errors are not transient, and you want the mpi-operator to fail fast.

Regarding kubelet or the Deployment Controller handling this, it seems that they would only handle this based on a readiness probe, which currently – once the leader is elected – will always come back as healthy.

Per my other comment, all of this could technically be handled in a more robust readiness/health probe, but that poses other challenges (namely that InstallPathHandler can only be called once).
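
For context, the InstallPathHandler mentioned here is from k8s.io/apiserver/pkg/server/healthz; a minimal sketch of a richer probe might look like the following (the "informer-permissions" check is hypothetical). Because InstallPathHandler wires the path into the mux in one call, additional checks cannot be registered for that path afterwards, which is the constraint mentioned above:

mux := http.NewServeMux()
healthz.InstallPathHandler(mux, "/healthz",
    healthz.PingHealthz,
    // Hypothetical extra check that would surface list/watch auth errors.
    healthz.NamedCheck("informer-permissions", func(r *http.Request) error {
        return lastWatchAuthError() // placeholder, not part of the operator
    }),
)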

Review thread on this code:

cache.DefaultWatchErrorHandler(r, err)

if errors.IsUnauthorized(err) || errors.IsForbidden(err) {
    klog.Fatalf("Unable to sync cache for informer %s: %s. Exiting.", name, err)
}

@emsixteeen (Contributor, Author)

Hmm... Doesn't seem like it's possible to return an error:

SetWatchErrorHandler takes a WatchErrorHandler – which does not return anything. If SetWatchErrorHandler itself fails, the code already returns an error for that.

With regard to Replacing Fatal Calls, klog.ErrorS and klog.FlushAndExit are in later versions of klog, but the operator is using v1, which doesn't have those methods ... 😕

@emsixteeen (Contributor, Author)

I made a mistake: SetWatchErrorHandler was calling klog.Fatalf() as well. Changed it to return an error if it fails.

Exploring some other ways a graceful shutdown can be initiated if the error handler does get invoked (without calling klog.Fatalf()) – see the sketch after this list:

  1. Upgrade to klog/v2
  2. Call kubeapiserver.RequestShutdown()
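
As a sketch of what a non-Fatalf shutdown could look like (purely illustrative; shutdownCh, name, and the surrounding wiring are hypothetical and not from this PR), the handler could log the error and signal a channel that the controller's Run loop watches:

var once sync.Once

regErr := informer.SetWatchErrorHandler(func(r *cache.Reflector, err error) {
    cache.DefaultWatchErrorHandler(r, err)
    if apierrors.IsUnauthorized(err) || apierrors.IsForbidden(err) {
        klog.Errorf("Unable to sync cache for informer %s: %s. Requesting controller to exit.", name, err)
        once.Do(func() { close(shutdownCh) }) // Run() selects on shutdownCh and returns
    }
})
if regErr != nil {
    return regErr // registration fails if the informer has already started
}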

@alculquicondor (Collaborator)

Regarding kubelet or the Deployment Controller handling this, it seems that they would only handle this based on a readiness probe, which currently – once the leader is elected – will always come back as healthy.

What I mean is that if you enable fail-fast, the kubelet would restart the container, or the Deployment controller would recreate the pod. So effectively, the end-to-end behavior is that the system keeps trying to start the manager. That is pretty similar to the existing behavior, except that today the retry happens inside the binary without exiting it.

Users are not losing anything. Instead, they are getting more visibility into any possible failures, as they will see pods being recreated or restarted, causing them to investigate.

@emsixteeen (Contributor, Author)

Users are not losing anything. Instead, they are getting more visibility into any possible failures, as they will see pods being recreated or restarted, causing them to investigate.

Oh, I see what you mean: in effect the pod is exiting and thus being forced into an unhealthy state ... Do you suggest then that the option be completely removed? This will cause the "current behavior" to change ...

@alculquicondor (Collaborator)

I don't expect anyone to be relying on the "current behavior". The new behavior just moves the retry to a higher level controller, as opposed to being retried within the binary.

So yes, I think we can just remove the option.

The google-oss-prow bot added the size/S label and removed the size/M label on Feb 5, 2024.
@emsixteeen (Contributor, Author)

I don't expect anyone to be relying on the "current behavior".

I'm just trying not to break stuff 😄

So yes, I think we can just remove the option.

Removed

Review thread on this code:

cache.DefaultWatchErrorHandler(r, err)

if errors.IsUnauthorized(err) || errors.IsForbidden(err) {
    klog.Fatalf("Unable to sync cache for informer %s: %s. Requesting controller to exit.", name, err)
}
@alculquicondor (Collaborator)

return error here too.

@emsixteeen (Contributor, Author)

This is where an error can't be returned!

SetWatchErrorHandler takes a WatchErrorHandler which is defined as:

type WatchErrorHandler func(r *Reflector, err error)

I.e., its args are (*Reflector, error) – and it returns nothing!


@emsixteeen (Contributor, Author)

To follow this guide, klog needs to be upgraded to klog/v2. The methods klog.ErrorS() and klog.FlushAndExit() are not available in klog/v1.0.0.

Should klog be updated to klog/v2 for this change?
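
For reference, the klog/v2 pattern for replacing a Fatal call looks roughly like this (not part of this PR, since the operator is still on klog v1.0.0):

// klog/v2: log a structured error, then flush buffers and exit,
// instead of calling klog.Fatalf.
klog.ErrorS(err, "Unable to sync cache for informer", "informer", name)
klog.FlushAndExit(klog.ExitFlushTimeout, 1)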

@alculquicondor (Collaborator)

:(

Let's leave that for a follow up

/lgtm
/approve

@alculquicondor (Collaborator) left a comment

Please squash

@alculquicondor (Collaborator)

I'm just trying not to break stuff

Absolutely, I appreciate the amount of consideration for backwards-compatibility.


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

The google-oss-prow bot merged commit 4c9ac06 into kubeflow:master on Feb 5, 2024.
10 checks passed