Fix XGBoost conditions bug #1737

Merged 1 commit on Jan 22, 2023

Member

@tenzen-y tenzen-y commented Jan 22, 2023

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

What this PR does / why we need it:
Currently, the xgboostjob-controller adds a Running condition to the status.conditions field of XGBoostJob even when some containers have not yet been scheduled to Nodes.

However, other controllers (e.g., mxjob-controller and pytorchjob-controller) do not add a Running condition to the status.conditions field of their custom jobs in that situation.

For example, when we deploy an XGBoostJob with coscheduling to a K8s cluster with insufficient resources, the xgboostjob-controller adds a Running condition to the XGBoostJob, regardless of the Pods' status.
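
To illustrate the expected behavior, here is a minimal, self-contained sketch. It is not the code merged in this PR: `PodPhase`, `Condition`, `countRunning`, `appendRunningIfActive`, and the `JobRunning` reason string are all hypothetical stand-ins, and the guard shown is only one plausible way to avoid reporting Running while every Pod is still unscheduled.

```go
package main

import "fmt"

// PodPhase is a simplified, hypothetical stand-in for the Pod phase
// reported by Kubernetes; it is not the training-operator's own type.
type PodPhase string

const (
	PodPending PodPhase = "Pending" // e.g. not yet scheduled to a Node
	PodRunning PodPhase = "Running"
)

// Condition is a hypothetical, minimal job condition.
type Condition struct {
	Type   string
	Reason string
}

// countRunning returns how many Pods are actually in the Running phase.
func countRunning(phases []PodPhase) int {
	n := 0
	for _, p := range phases {
		if p == PodRunning {
			n++
		}
	}
	return n
}

// appendRunningIfActive adds a Running condition only when at least one Pod
// is running. Appending it unconditionally would reproduce the behavior
// described above, where a job with only unscheduled Pods is marked Running.
func appendRunningIfActive(conds []Condition, phases []PodPhase) []Condition {
	if countRunning(phases) == 0 {
		return conds
	}
	// "JobRunning" is an assumed reason string, used for illustration only.
	return append(conds, Condition{Type: "Running", Reason: "JobRunning"})
}

func main() {
	// All Pods pending, as with coscheduling on a cluster lacking resources:
	// no Running condition is added.
	pending := []PodPhase{PodPending, PodPending}
	fmt.Println(len(appendRunningIfActive(nil, pending)))

	// One Pod has started, so the Running condition is added.
	started := []PodPhase{PodRunning, PodPending}
	fmt.Println(appendRunningIfActive(nil, started))
}
```

With this kind of guard, a job whose Pods are all Pending keeps its existing conditions, which matches what the other controllers are described as doing.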

P.S. I found this issue in #1736

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
@coveralls

coveralls commented Jan 22, 2023

Pull Request Test Coverage Report for Build 3980353017

  • 0 of 8 (0.0%) changed or added relevant lines in 1 file are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.08%) to 38.993%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/controller.v1/xgboost/xgboostjob_controller.go | 0 | 8 | 0.0% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| pkg/controller.v1/xgboost/xgboostjob_controller.go | 1 | 0% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 3975942918: | 0.08% |
| Covered Lines: | 2680 |
| Relevant Lines: | 6873 |

💛 - Coveralls

@tenzen-y
Member Author

@terrytangyuan @johnugeorge This bug confuses users, so I would like to include this patch in the next training-operator release.

@johnugeorge
Member

Thanks @tenzen-y

@tenzen-y tenzen-y mentioned this pull request Jan 22, 2023
1 task
Member

@terrytangyuan terrytangyuan left a comment

/lgtm
/approve

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y, terrytangyuan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tenzen-y
Member Author

Could you restart the failed CI?

@google-oss-prow google-oss-prow bot merged commit 83a6f33 into kubeflow:master Jan 22, 2023
@tenzen-y tenzen-y deleted the fix-xgboostjob-controller branch January 22, 2023 17:53
4 participants