Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

⚠️ Refactor manager to avoid race conditions and provide clean shutdown #1695

Merged
merged 1 commit into from
Nov 4, 2021

Commits on Nov 3, 2021

  1. 🐛 Refactor manager to avoid race conditions and provide clean shutdown

    This changeset provides a series of improvements and refactors how the
    manager starts and stops. During testing with Cluster API (a user of
    controller runtime), folks noticed that the manager which runs a series
    of components can deadlock itself when using conversion webhooks, or
    health checks, or won't cleanly shutdown and cleanup all the running
    controller, runnables, caches, webhooks, and http servers.
    
    In particular:
    - The Manager internal mutex didn't actually lock operations while the
      manager was in the process of starting up. The manager internal
      Start() implementation started off a series of goroutines internally
      and then waits. Concurrent operations on the manager, like
      AddHealthzCheck or AddReadyzCheck or AddMetricsExtraHandler modified
      the internals map while or after their respective servers were being
      configured, causing potential races or being ineffective.
    
    - Unclear ordering of the manager caused deadlock when the caches would
      start up. Upon startup, conversion webhooks are required when
      waiting for the cache initial List() call, which warms the internal
      caches. If a webook server or a healthz/readyz probe didn't start in
      time, the cache List() call fails because the webhooks would be
      unavailable.
    
    - Manager would say it was Elected() (note: this is used regardless if
      leader election is enabled or not) without waiting for all the caches
      to warm up, which could result in failed client calls.
    
    - Stop proceduce cancelled everything at once regardless of ordering.
      Previously, the context cancelled all the runnables regardless of
      ordering which can also cause dependencies issues. With these changes,
      if graceful shutdown is set, we try to cancel and wait for runnable
      groups to be done in a strict order before proceeding to exit the
      program.
    
    - Stop procedure cancelled leader election only if graceful shutdown was
      set. This was probably an oversight, now we're cancelling leader
      election regardless if graceful timeout is set or not.
    
    - The http.Server used throughout the codebase now properly sets idle
      and read header timeout to match the api-server.
    
    Signed-off-by: Vince Prignano <vincepri@vmware.com>
    vincepri committed Nov 3, 2021
    Configuration menu
    Copy the full SHA
    612e9b2 View commit details
    Browse the repository at this point in the history