Increase resource limits #367

Merged
merged 2 commits into from
Oct 20, 2023

Contributor

@chimanjain chimanjain commented Oct 11, 2023

Description

Increase resource limits to fix OOMKilled error.

GitHub Issues

List the GitHub issues impacted by this PR:

GitHub Issue #
dell/csm#982

Checklist:

  • I have performed a self-review of my own code to ensure there are no formatting, vetting, linting, or security issues
  • I have verified that new and existing unit tests pass locally with my changes
  • I have not allowed coverage numbers to degenerate
  • I have maintained at least 90% code coverage
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • I have maintained backward compatibility

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Please also list any relevant details for your test configuration

  • Sanity Test

nitesh3108 previously approved these changes Oct 12, 2023
Contributor

@jooseppi-luna jooseppi-luna left a comment

Have you tested this with the customer configuration (Unity w/ health monitor on OCP) as well as a heavy install of any driver (e.g. PFlex with health monitor and sdc monitor enabled along with multiple modules)?

alikdell previously approved these changes Oct 12, 2023
@@ -928,8 +928,8 @@ spec:
   periodSeconds: 10
   resources:
     limits:
-      cpu: 200m
-      memory: 256Mi
+      cpu: 400m
Collaborator

Is there any specific data on why we are incrementing to these values?

Contributor

@bharathsreekanth AFAIK we don't have any specific data. When Ninjas found the limits to be too low last time, we doubled them, and now that a customer has run into the same issue, we are doubling them again. It's not unreasonable that we are running out of memory now, since many features have been added to the operator since the v0.1.0 release. If we double the memory limit now (to about 500 MB), we probably won't have to touch it again for a long time. My impression is that in other areas of CS it is not unusual to double memory when you run out (e.g. with dynamic arrays). I tried to find resources on sizing memory for containers but wasn't able to find anything.

I do think we should test whether increasing the CPU limit actually helps. For example: run 10 installs of a driver with some sidecars under the old CPU limit, then increase the limit, reinstall the operator, repeat the 10 installs, and compare the approximate time the csm object takes to reach the ready state with 200m vs. 400m. If there isn't a significant difference, I don't think it makes sense to increase the CPU limit. I documented this suggestion in the Jira defect for this PR as well.
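The comparison proposed above could be scored with something like the following. This is only a sketch: the function names, the 10% threshold for "significant", and the sample timings are all assumptions made for illustration, and the timings are placeholders rather than measurements from this PR.

```python
from statistics import mean


def relative_speedup(old_seconds: list[float], new_seconds: list[float]) -> float:
    """Fractional reduction in mean time-to-ready when moving from the
    old CPU limit to the new one (positive means the new limit is faster)."""
    old_mean = mean(old_seconds)
    new_mean = mean(new_seconds)
    return (old_mean - new_mean) / old_mean


def cpu_increase_is_worthwhile(
    old_seconds: list[float],
    new_seconds: list[float],
    threshold: float = 0.10,  # assumed cutoff for a "significant" difference
) -> bool:
    """True if the mean time-to-ready improved by at least `threshold`."""
    return relative_speedup(old_seconds, new_seconds) >= threshold


# Placeholder timings (seconds) for 10 installs each; NOT real measurements.
old_limit_runs = [100.0] * 10  # hypothetical runs with cpu: 200m
new_limit_runs = [98.0] * 10   # hypothetical runs with cpu: 400m

print(cpu_increase_is_worthwhile(old_limit_runs, new_limit_runs))  # False
```

With real timings in place of the placeholders, a `False` result would support leaving the CPU limit at 200m.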

@chimanjain chimanjain force-pushed the increase-resource-limits branch from 57cbae2 to cf9ee59 Compare October 16, 2023 11:22
Contributor Author

chimanjain commented Oct 17, 2023

> Have you tested this with the customer configuration (Unity w/ health monitor on OCP) as well as a heavy install of any driver (e.g. PFlex with health monitor and sdc monitor enabled along with multiple modules)?

I tried to replicate the issue with a heavy install, but the install completed without any problems.
The limit should still be increased because:

  • The customer is hitting an OOMKilled failure.
  • Many modules have been introduced since we last defined the resources.
  • We are increasing only the resource limits, not the requests, so the initial resource request is unaffected; the change matters only in edge cases such as a heavy install.
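The last point can be illustrated with a small sketch that doubles only the `limits` of a container's resources stanza and leaves `requests` alone. The old limit values come from the diff in this PR; the request values and all function names here are hypothetical, chosen only for illustration.

```python
import copy
import re

# Old limits are from the diff; the request values are assumed for
# illustration and are not taken from the PR.
OLD_RESOURCES = {
    "limits": {"cpu": "200m", "memory": "256Mi"},
    "requests": {"cpu": "100m", "memory": "192Mi"},
}

# Matches integer-valued quantities such as "200m" or "256Mi".
_QUANTITY = re.compile(r"^(\d+)([A-Za-z]*)$")


def double_quantity(quantity: str) -> str:
    """Double an integer-valued Kubernetes quantity, keeping its suffix."""
    match = _QUANTITY.match(quantity)
    if match is None:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    value, suffix = match.groups()
    return f"{int(value) * 2}{suffix}"


def double_limits(resources: dict) -> dict:
    """Return a copy with every limit doubled and requests left unchanged."""
    updated = copy.deepcopy(resources)
    updated["limits"] = {
        name: double_quantity(value)
        for name, value in resources["limits"].items()
    }
    return updated


NEW_RESOURCES = double_limits(OLD_RESOURCES)
print(NEW_RESOURCES["limits"])    # {'cpu': '400m', 'memory': '512Mi'}
print(NEW_RESOURCES["requests"])  # {'cpu': '100m', 'memory': '192Mi'}
```

Because the scheduler places pods based on requests, a change like this should not affect where or whether the pod is scheduled; it only raises the ceiling at which the container is throttled (CPU) or OOM-killed (memory).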

@chimanjain chimanjain force-pushed the increase-resource-limits branch from cf9ee59 to 82d6f83 Compare October 17, 2023 13:02
@chimanjain chimanjain force-pushed the increase-resource-limits branch from 82d6f83 to 0710659 Compare October 18, 2023 12:46
@chimanjain chimanjain dismissed stale reviews from nitesh3108 and alikdell via f9701d2 October 18, 2023 12:57
Contributor Author

PTAL

Contributor

@HarishH-DELL HarishH-DELL left a comment

LGTM

@chimanjain chimanjain merged commit e60a6d1 into main Oct 20, 2023
@chimanjain chimanjain deleted the increase-resource-limits branch October 20, 2023 06:07
ChristianAtDell added a commit that referenced this pull request Oct 15, 2024