Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set up controllers using goroutines to start the manager quickly #1869

Merged
merged 1 commit into from
Jul 24, 2023

Conversation

tenzen-y
Copy link
Member

What this PR does / why we need it:
The training-operator has many controllers, and each controller needs access to the kube-apiserver while launching.
So, in the heavy load cluster, the training operator can not start by timeout period.

Related to: #1841

By this PR, the training operator starts the manager not to wait for completed to register all controllers.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing

@coveralls
Copy link

coveralls commented Jul 22, 2023

Pull Request Test Coverage Report for Build 5631716043

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 6 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.06%) to 40.336%

Files with Coverage Reduction New Missed Lines %
pkg/controller.v1/mpi/mpijob_controller.go 6 80.12%
Totals Coverage Status
Change from base Build 5625056103: -0.06%
Covered Lines: 3861
Relevant Lines: 9572

💛 - Coveralls

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
@johnugeorge
Copy link
Member

Great. This is a needed fix which will reduce the start up time. However, timeout issue can still happen right?

@tenzen-y
Copy link
Member Author

tenzen-y commented Jul 22, 2023

Great. This is a needed fix which will reduce the start up time. However, timeout issue can still happen right?

Yea, the potential issue still is left. However, I believe this improvement is effective for many situations since the manager starts without waiting for registering controllers.

  1. Register controllers to the manager in another thread.
  2. Register probe server to manager.
  3. Start the manager without waiting for register controllers to the manager. (probe servers are ready)

@johnugeorge
Copy link
Member

Are there side effects in declaring readiness before controllers are ready?

@tenzen-y
Copy link
Member Author

Are there side effects in declaring readiness before controllers are ready?

I think there aren't exist side effects even if users create XXXJobs before all controllers are registered to the manager.
Because controllers reconcile against XXXJobs after started.

@johnugeorge
Copy link
Member

Thanks @tenzen-y
/lgtm
/approve

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit fdf6b83 into kubeflow:master Jul 24, 2023
25 checks passed
@tenzen-y tenzen-y deleted the improve-startup-time branch July 24, 2023 09:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants