Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Monitor RAM usage for VMAgent #2597

Merged
merged 19 commits into from
Jun 2, 2022
Merged

Conversation

nagworld9
Copy link
Contributor

@nagworld9 nagworld9 commented May 20, 2022

Description

Following cgroup counters monitor and report for VMAgent RAM usage.

Total Memory Usage: Collect RSS+Cache counters from memory.stat
Max Memory Usage: Collect from memory.max_usage_in_bytes
Total Swap Memory Usage: Collect SWAP counter from memory.stat

Note: Max Memory and SWAP Memory reported less frequently which is every hour or when data changes.

2022-05-20T22:30:37.076634Z INFO MonitorHandler ExtHandler Event: name=WALinuxAgent, op=METRIC, message=Metric Memory/Total Memory Usage [walinuxagent.service] = 30240768.0, duration=0
2022-05-20T22:30:37.078927Z INFO MonitorHandler ExtHandler Event: name=WALinuxAgent, op=METRIC, message=Metric Memory/Max Memory Usage [walinuxagent.service] = 35487744.0, duration=0
2022-05-20T22:30:37.079692Z INFO MonitorHandler ExtHandler Event: name=WALinuxAgent, op=METRIC, message=Metric Memory/Total Swap Memory Usage [walinuxagent.service] = 0.0, duration=0

Issue #


PR information

  • The title of the PR is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For information on cleaning up the commits in your pull request, see this page.
  • If applicable, the PR references the bug/issue that it fixes in the description.
  • New Unit tests were added for the changes made

Quality of Code and Contribution Guidelines

@codecov
Copy link

codecov bot commented May 20, 2022

Codecov Report

Merging #2597 (c01ab48) into develop (48158a8) will increase coverage by 0.02%.
The diff coverage is 87.05%.

@@             Coverage Diff             @@
##           develop    #2597      +/-   ##
===========================================
+ Coverage    71.89%   71.92%   +0.02%     
===========================================
  Files          103      103              
  Lines        15515    15577      +62     
  Branches      2467     2478      +11     
===========================================
+ Hits         11155    11203      +48     
- Misses        3860     3864       +4     
- Partials       500      510      +10     
Impacted Files Coverage Δ
azurelinuxagent/ga/monitor.py 74.43% <80.00%> (+0.01%) ⬆️
azurelinuxagent/common/cgroup.py 87.74% <86.36%> (-3.34%) ⬇️
azurelinuxagent/common/cgroupconfigurator.py 73.68% <92.85%> (+0.15%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 48158a8...c01ab48. Read the comment docs.

MetricValue(MetricsCategory.MEMORY_CATEGORY, MetricsCounter.MAX_MEM_USAGE, self.name,
self.get_max_memory_usage(), logger.EVERY_HOUR),
MetricValue(MetricsCategory.MEMORY_CATEGORY, MetricsCounter.TOTAL_SWAP_MEM_USAGE, self.name,
self.get_swap_memory_usage(), logger.EVERY_HOUR)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rationale behind choosing every hour based on the fact that this would give us enough data points (roughly 1400+) over 60 day retention period of Kusto if vm is up and agent is running all the time.

Note: we would report If data changes during that window too.

@@ -482,7 +503,6 @@ def enable(self):
"Attempted to enable cgroups, but they are not supported on the current platform")
self._agent_cgroups_enabled = True
self._extensions_cgroups_enabled = True
self.__set_cpu_quota(conf.get_agent_cpu_quota())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now Agent enable should be independent of cpu quota. We should enable even if any one of the controller (cpu or memory) is mounted/enabled.


PollResourceUsage().run()
self.assertEqual(0, patch_periodic_warn.call_count)
self.assertEqual(0, patch_add_metric.call_count) # No metrics should be sent.

def test_generate_extension_metrics_telemetry_dictionary(self, *args): # pylint: disable=unused-argument
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duplicate test which is already tested in cgrouptelemtry class and it's appropriate to test there.

azurelinuxagent/common/event.py Outdated Show resolved Hide resolved
@@ -27,7 +27,8 @@

AGENT_NAME_TELEMETRY = "walinuxagent.service" # Name used for telemetry; it needs to be consistent even if the name of the service changes

MetricValue = namedtuple('Metric', ['category', 'counter', 'instance', 'value'])
MetricValue = namedtuple('Metric', ['category', 'counter', 'instance', 'value', 'report_period'])
MetricValue.__new__.__defaults__ = (None,) # namedtuple() assigns the values in the defaults iterable to the rightmost fields
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the motivation for defaulting to None? explicit initialization may be a better option

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This default only for report_period field. A very fee candidates (Metrics) have frequency set and rest don't have it. I thought explicit initialization with default on every metric is overhead when compared to this. This is like a class with default values for instance variables.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defaulting to None makes it easier to introduces bugs. let's do explicit initialization

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quite not sure If I followed, are you suggesting explicit initialization means other than None?

narrieta
narrieta previously approved these changes May 23, 2022
@@ -40,6 +41,7 @@ class MetricsCounter(object):
TOTAL_MEM_USAGE = "Total Memory Usage"
MAX_MEM_USAGE = "Max Memory Usage"
THROTTLED_TIME = "Throttled Time"
TOTAL_SWAP_MEM_USAGE = "Total Swap Memory Usage"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "Total" needed in the name? are we adding multiple properties?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe not If I think now as this just one property

azurelinuxagent/common/cgroup.py Outdated Show resolved Hide resolved
azurelinuxagent/common/cgroup.py Show resolved Hide resolved
azurelinuxagent/common/event.py Outdated Show resolved Hide resolved
@narrieta narrieta self-requested a review May 23, 2022 17:23
Copy link
Member

@narrieta narrieta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

posted some comment

Copy link
Member

@narrieta narrieta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, a couple of minor comments

from azurelinuxagent.common.exception import CGroupsException
from azurelinuxagent.common.future import ustr
from azurelinuxagent.common.osutil import get_osutil
from azurelinuxagent.common.utils import fileutil

REPORT_EVERY_HOUR = timedelta(hours=1)
DEFAULT_REPORT_PERIOD = timedelta(seconds=conf.get_cgroup_check_period())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these 2 should be private

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bump on this


return tracked


class MemoryCgroup(CGroup):

def _get_memory_stat_counter(self, counter_name, raise_error_if_counter_not_present=True):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'raise_error_if_counter_not_present' is a little odd... can't the caller just ignore the exception?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean consider '0' if counter not present and don't throw exception for all counters. I'm ignoring the exception for SWAP as this is an optional value and may not be present in stat file but others RSS, cache should be there all the time and returning exception if that's not the case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, what i meant is that passing that argument to suppress the exception is odd... normally the caller would just handle the exception.

for example, _get_memory_stat_counter could always raise FileNotFoundError (or similar) and then get_swap_memory_usage would do

        try:
            return self._get_memory_stat_counter("swap")
        except FileNotFoundError:
            return 0

cache = self._get_memory_stat_counter("cache")
rss = self._get_memory_stat_counter("rss")
return cache + rss
except (IOError, OSError) as e:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we avoid the repetition of the exception handling logic in _get_memory_stat_counter, _get_memory_stat_counter and get_swap_memory_usage?

"""

def __init__(self, category, counter, instance, value, report_period=DEFAULT_REPORT_PERIOD):
self.__category = category
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These would probably be better as starting with a single underscore, not two. See this stack overflow question and its answers for some interesting discussion.

"""
try:
return self._get_memory_stat_counter("swap")
except CounterNotFound:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_memory_usage above looks like it will possibly raise a CounterNotFound exception, whereas this code will handle it. I think it makes more sense to have similar convention (i.e. both get*_memory_usage methods handle CounterNotFound, or neither do).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SWAP is on optional field, not necessarily present in the stat file whereas other counters should be found otherwise something got messed up in cgroups, in that case I want to raise exception.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the convention the code base follows is that get_* can raise, and try_* returns a default value.

Maybe we can switch to use that to make the current behavior more explicit to the caller?

key = metric.category + metric.counter + metric.instance
if key not in self.__periodic_metrics or (self.__periodic_metrics[key] + metric.report_period) <= datetime.datetime.now():
report_metric(metric.category, metric.counter, metric.instance, metric.value, log_event=self.__log_metrics)
self.__periodic_metrics[key] = datetime.datetime.now()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the bounds on the growth of this map? These entries are one to one with the number of extensions, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, one to one with extensions

test_mem_cg.get_memory_usage()

swap_memory_usage = test_mem_cg.get_swap_memory_usage()
self.assertEqual(0, swap_memory_usage)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is indicative of what I mention above. The caller needs to know which method returns a zero on error and which throws an exception.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's exact behavior I want to implement

@nagworld9 nagworld9 merged commit 8320fee into Azure:develop Jun 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants