Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enforce memory usage for agent #2671

Merged
merged 4 commits into from
Oct 6, 2022

Conversation

nagworld9
Copy link
Contributor

Description

Instead enforcing memory with cgroup, agent itself monitor the current memory usage at regular intervals and do the safe exit when memory reaches the limit. This approach avoids OOM kills on system.

By default, memory usage check set to false for now. We will enable once we know the agent memory limit value.

Issue #


PR information

  • The title of the PR is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For information on cleaning up the commits in your pull request, see this page.
  • If applicable, the PR references the bug/issue that it fixes in the description.
  • New Unit tests were added for the changes made

Quality of Code and Contribution Guidelines

@codecov
Copy link

codecov bot commented Sep 27, 2022

Codecov Report

Merging #2671 (2b179f3) into develop (aeda2c1) will increase coverage by 0.04%.
The diff coverage is 88.88%.

❗ Current head 2b179f3 differs from pull request most recent head 503474b. Consider uploading reports for the commit 503474b to get more accurate results

@@             Coverage Diff             @@
##           develop    #2671      +/-   ##
===========================================
+ Coverage    71.98%   72.02%   +0.04%     
===========================================
  Files          103      103              
  Lines        15698    15740      +42     
  Branches      2245     2254       +9     
===========================================
+ Hits         11300    11337      +37     
  Misses        3881     3881              
- Partials       517      522       +5     
Impacted Files Coverage Δ
azurelinuxagent/common/cgroupconfigurator.py 73.65% <80.00%> (+0.06%) ⬆️
azurelinuxagent/ga/update.py 89.36% <90.90%> (+0.02%) ⬆️
azurelinuxagent/common/conf.py 79.83% <100.00%> (+0.34%) ⬆️
azurelinuxagent/common/event.py 85.39% <100.00%> (+0.03%) ⬆️
azurelinuxagent/common/exception.py 96.07% <100.00%> (+0.11%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.


NOTE: This option is experimental and may be removed in later versions of the Agent.
"""
return conf.get_int("Debug.AgentMemoryQuota", 30 * 1024 ** 2)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a placeholder value. We will update once we have concrete value.


NOTE: This option is experimental and may be removed in later versions of the Agent.
"""
return conf.get_switch("Debug.EnableAgentMemoryUsageCheck", False)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feature set to false for now

"""
try:
if conf.get_enable_agent_memory_usage_check() and self._extensions_summary.converged:
if self._last_check_memory_usage == datetime.min or datetime.utcnow() >= (self._last_check_memory_usage + UpdateHandler.CHECK_MEMORY_USAGE_PERIOD):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would check if goal state completed and also, cgroup monitor period.

Note: I'm using same cgroup monitor period to run this check assuming that we evaluate and calculate memory limit based on the values we get from monitoring thread.

msg = "Check on agent memory usage:\n{0}".format(ustr(exception))
logger.info(msg)
add_event(AGENT_NAME, op=WALAEventOperation.CGroupsInfo, is_success=True, message=msg)
raise ExitException("Agent {0} is reached memory limit -- exiting".format(CURRENT_AGENT))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ExitException and rest of the logic of sys.exit already handled in main thread for safe exit.

add_event(AGENT_NAME, op=WALAEventOperation.CGroupsInfo, is_success=True, message=msg)
raise ExitException("Agent {0} is reached memory limit -- exiting".format(CURRENT_AGENT))
except Exception as exception:
if self._check_memory_usage_last_error_report == datetime.min or (self._check_memory_usage_last_error_report + timedelta(hours=6)) > datetime.now():
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to avoid flooding error msgs in agent log

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. Could also add a short string such as "[Won't report the same error for 6 hours]"?

Copy link
Member

@narrieta narrieta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the long, long delay. Added a couple of comments/requests.

current_usage += metric.value

if current_usage > conf.get_agent_memory_quota():
raise CGroupsException("The agent memory limit {0} bytes exceeded. The current reported usage is {1} bytes.".format(conf.get_agent_memory_quota(), current_usage))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use a new exception type for this (AgentMemoryExceededException?). CGroupsException is too generic, and also in general the code handles cgroups issues in such a way that the correspoding cgroup task becomes a no-op.

@@ -186,6 +187,7 @@ def load_conf_from_file(conf_file_path, conf=__conf__):
"Debug.CgroupCheckPeriod": 300,
"Debug.AgentCpuQuota": 50,
"Debug.AgentCpuThrottledTimeThreshold": 120,
"Debug.AgentMemoryQuota": 31457280,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use the same notation as below? (30 * 1024 ** 2)

except CGroupsException as exception:
msg = "Check on agent memory usage:\n{0}".format(ustr(exception))
logger.info(msg)
add_event(AGENT_NAME, op=WALAEventOperation.CGroupsInfo, is_success=True, message=msg)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should have a new WALAEventOperation for this event

add_event(AGENT_NAME, op=WALAEventOperation.CGroupsInfo, is_success=True, message=msg)
raise ExitException("Agent {0} is reached memory limit -- exiting".format(CURRENT_AGENT))
except Exception as exception:
if self._check_memory_usage_last_error_report == datetime.min or (self._check_memory_usage_last_error_report + timedelta(hours=6)) > datetime.now():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. Could also add a short string such as "[Won't report the same error for 6 hours]"?

@nagworld9 nagworld9 merged commit faa8b14 into Azure:develop Oct 6, 2022
@nagworld9 nagworld9 deleted the agent-memory-enforce branch October 6, 2022 23:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants