
Update deployment.yaml #1668

Merged: 2 commits merged into kubeflow:master on Sep 27, 2022

Conversation

omrishiv
Contributor

What this PR does / why we need it:
Training operator crashes continuously OOMKilled in certain circumstances. This should help out

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #1661

Checklist:

  • Docs included if any changes are user facing
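For context, the operator deployment at the time shipped a very small memory limit (30Mi, per the discussion below), and a container that exceeds its memory limit gets OOMKilled. A minimal sketch of such a resources stanza in deployment.yaml; only the 30Mi limit comes from this thread, while the container name, image, and request values are illustrative assumptions:

```yaml
# Sketch only: the 30Mi memory limit is taken from the discussion below;
# the container name, image, and request/cpu values are assumptions.
spec:
  template:
    spec:
      containers:
        - name: training-operator
          image: kubeflow/training-operator:latest
          resources:
            requests:
              cpu: 100m
              memory: 20Mi
            limits:
              cpu: 100m
              memory: 30Mi   # too small for some environments -> OOMKilled
```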

@google-cla

google-cla bot commented Sep 22, 2022

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up-to-date status, view the checks section at the bottom of the pull request.

@coveralls

coveralls commented Sep 23, 2022

Pull Request Test Coverage Report for Build 3114234558

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.03%) to 39.785%

Files with Coverage Reduction | New Missed Lines | %
pkg/controller.v1/mpi/mpijob_controller.go | 2 | 77.65%
Totals Coverage Status
Change from base Build 3108713079: -0.03%
Covered Lines: 2329
Relevant Lines: 5854

💛 - Coveralls

@johnugeorge
Member

Do we need resource limits/requests in the manifests? This is specific to the deployment environment.
Should we remove them?

/cc @kubeflow/wg-training-leads

@google-oss-prow google-oss-prow bot requested a review from a team September 23, 2022 06:20
@tenzen-y
Member

tenzen-y commented Sep 23, 2022

> Do we need resource limits/requests in the manifests? This is specific to the deployment environment. Should we remove them?
>
> /cc @kubeflow/wg-training-leads

It might be better to keep the current minimal settings, since the appropriate values depend on the size of the K8s cluster.
But we could add comments about this to the YAML file.

@johnugeorge
Member

But it is a bad experience if the operator goes OOM. If default limits are set, they should work for all medium-sized deployments. The current 30Mi looks too low to me.

Note: the Katib controller has no resource limits/requests.

@tenzen-y
Member

tenzen-y commented Sep 23, 2022

> But it is a bad experience if the operator goes OOM. If default limits are set, they should work for all medium-sized deployments. The current 30Mi looks too low to me.

I see. That sounds good.
I agree with removing the limits/requests; users can configure them with a kustomize patch.
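For illustration, a minimal sketch of how a user could set their own limits with a kustomize strategic-merge patch after this change; the base path, deployment name/namespace, and resource values below are assumptions, not part of this PR:

```yaml
# kustomization.yaml (user overlay; the base path is an assumption)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - github.com/kubeflow/training-operator/manifests/overlays/standalone
patchesStrategicMerge:
  - patch-resources.yaml
```

```yaml
# patch-resources.yaml (illustrative values; tune for your cluster size)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: training-operator
  namespace: kubeflow
spec:
  template:
    spec:
      containers:
        - name: training-operator
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
```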

Remove resource limits/requests to mirror Katib
@omrishiv
Contributor Author

Thank you for the Katib hint; I removed the resources block to follow that pattern.
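A rough sketch of the resulting container spec with the resources block dropped; only the removal itself reflects this PR, while the container name, image, and command are assumptions:

```yaml
# deployment.yaml after the change (sketch): no resources block, so no
# default memory ceiling; environments that need one add it via a patch.
spec:
  template:
    spec:
      containers:
        - name: training-operator
          image: kubeflow/training-operator:latest
          command:
            - /manager
          # resources: removed in this PR; set per environment if needed
```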

@johnugeorge
Member

/lgtm
/approve

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge, OmriShiv

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit e5f372f into kubeflow:master Sep 27, 2022
Linked issue (closed by merging this PR): Training Operator pod failed to start on OCP 4.10.30 with error "memory limit too low" (#1661)
4 participants