
Update deployment.yaml #1668

Merged: 2 commits merged into kubeflow:master on Sep 27, 2022

Conversation

omrishiv
Contributor

What this PR does / why we need it:
Training operator crashes continuously OOMKilled in certain circumstances. This should help out

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #1661

Checklist:

  • Docs included if any changes are user facing
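For context, the operator deployment at the time shipped a very small memory limit (30Mi, per the discussion below), and a container that exceeds its memory limit gets OOMKilled. A minimal sketch of such a resources stanza in deployment.yaml; only the 30Mi limit comes from this thread, while the container name, image, and request values are illustrative assumptions:

```yaml
# Sketch only: the 30Mi memory limit is taken from the discussion below;
# the container name, image, and request/cpu values are assumptions.
spec:
  template:
    spec:
      containers:
        - name: training-operator
          image: kubeflow/training-operator:latest
          resources:
            requests:
              cpu: 100m
              memory: 20Mi
            limits:
              cpu: 100m
              memory: 30Mi   # too small for some environments -> OOMKilled
```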

@google-cla

google-cla bot commented Sep 22, 2022

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up-to-date status, view the checks section at the bottom of the pull request.

@coveralls

coveralls commented Sep 23, 2022

Pull Request Test Coverage Report for Build 3114234558

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.03%) to 39.785%

Files with Coverage Reduction | New Missed Lines | %
pkg/controller.v1/mpi/mpijob_controller.go | 2 | 77.65%
Totals Coverage Status
Change from base Build 3108713079: -0.03%
Covered Lines: 2329
Relevant Lines: 5854

💛 - Coveralls

@johnugeorge
Member

Do we need resource limits/requests in the manifests? This is specific to the deployment environment.
Should we remove them?

/cc @kubeflow/wg-training-leads

@google-oss-prow google-oss-prow bot requested a review from a team September 23, 2022 06:20
@tenzen-y
Member

tenzen-y commented Sep 23, 2022

> Do we need resource limits/requests in the manifests? This is specific to the deployment environment. Should we remove them?
>
> /cc @kubeflow/wg-training-leads

It might be better to keep the current minimal settings, since the appropriate values depend on the size of the K8s cluster.
But we could add comments about this to the YAML file.

@johnugeorge
Member

But it is a bad experience if the operator goes OOM. If default limits are set, they should work for all medium-sized deployments. The current 30Mi looks too low to me.

Note: the Katib controller has no resource limits/requests.

@tenzen-y
Member

tenzen-y commented Sep 23, 2022

> But it is a bad experience if the operator goes OOM. If default limits are set, they should work for all medium-sized deployments. The current 30Mi looks too low to me.

I see. That sounds good.
I agree with removing the limits/requests; users can configure them with a kustomize patch.
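For illustration, a minimal sketch of how a user could set their own limits with a kustomize strategic-merge patch after this change; the base path, deployment name/namespace, and resource values below are assumptions, not part of this PR:

```yaml
# kustomization.yaml (user overlay; the base path is an assumption)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - github.com/kubeflow/training-operator/manifests/overlays/standalone
patchesStrategicMerge:
  - patch-resources.yaml
```

```yaml
# patch-resources.yaml (illustrative values; tune for your cluster size)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: training-operator
  namespace: kubeflow
spec:
  template:
    spec:
      containers:
        - name: training-operator
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
```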

Remove resource limits/requests to mirror Katib
@omrishiv
Contributor Author

Thank you for the Katib hint; I removed the resources block to follow that pattern.
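A rough sketch of the resulting container spec with the resources block dropped; only the removal itself reflects this PR, while the container name, image, and command are assumptions:

```yaml
# deployment.yaml after the change (sketch): no resources block, so no
# default memory ceiling; environments that need one add it via a patch.
spec:
  template:
    spec:
      containers:
        - name: training-operator
          image: kubeflow/training-operator:latest
          command:
            - /manager
          # resources: removed in this PR; set per environment if needed
```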

@johnugeorge
Member

/lgtm
/approve

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge, OmriShiv

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit e5f372f into kubeflow:master Sep 27, 2022
Linked issue (closed by merging this PR): Training Operator pod failed to start on OCP 4.10.30 with error "memory limit too low" (#1661)
4 participants