-
Notifications
You must be signed in to change notification settings - Fork 372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Monitor RAM usage for VMAgent #2597
Conversation
Codecov Report
@@ Coverage Diff @@
## develop #2597 +/- ##
===========================================
+ Coverage 71.89% 71.92% +0.02%
===========================================
Files 103 103
Lines 15515 15577 +62
Branches 2467 2478 +11
===========================================
+ Hits 11155 11203 +48
- Misses 3860 3864 +4
- Partials 500 510 +10
Continue to review full report at Codecov.
|
azurelinuxagent/common/cgroup.py
Outdated
MetricValue(MetricsCategory.MEMORY_CATEGORY, MetricsCounter.MAX_MEM_USAGE, self.name, | ||
self.get_max_memory_usage(), logger.EVERY_HOUR), | ||
MetricValue(MetricsCategory.MEMORY_CATEGORY, MetricsCounter.TOTAL_SWAP_MEM_USAGE, self.name, | ||
self.get_swap_memory_usage(), logger.EVERY_HOUR) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rationale behind choosing every hour based on the fact that this would give us enough data points (roughly 1400+) over 60 day retention period of Kusto if vm is up and agent is running all the time.
Note: we would report If data changes during that window too.
@@ -482,7 +503,6 @@ def enable(self): | |||
"Attempted to enable cgroups, but they are not supported on the current platform") | |||
self._agent_cgroups_enabled = True | |||
self._extensions_cgroups_enabled = True | |||
self.__set_cpu_quota(conf.get_agent_cpu_quota()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now Agent enable should be independent of cpu quota. We should enable even if any one of the controller (cpu or memory) is mounted/enabled.
|
||
PollResourceUsage().run() | ||
self.assertEqual(0, patch_periodic_warn.call_count) | ||
self.assertEqual(0, patch_add_metric.call_count) # No metrics should be sent. | ||
|
||
def test_generate_extension_metrics_telemetry_dictionary(self, *args): # pylint: disable=unused-argument |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
duplicate test which is already tested in cgrouptelemtry class and it's appropriate to test there.
azurelinuxagent/common/cgroup.py
Outdated
@@ -27,7 +27,8 @@ | |||
|
|||
AGENT_NAME_TELEMETRY = "walinuxagent.service" # Name used for telemetry; it needs to be consistent even if the name of the service changes | |||
|
|||
MetricValue = namedtuple('Metric', ['category', 'counter', 'instance', 'value']) | |||
MetricValue = namedtuple('Metric', ['category', 'counter', 'instance', 'value', 'report_period']) | |||
MetricValue.__new__.__defaults__ = (None,) # namedtuple() assigns the values in the defaults iterable to the rightmost fields |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the motivation for defaulting to None? explicit initialization may be a better option
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This default only for report_period field. A very fee candidates (Metrics) have frequency set and rest don't have it. I thought explicit initialization with default on every metric is overhead when compared to this. This is like a class with default values for instance variables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
defaulting to None makes it easier to introduces bugs. let's do explicit initialization
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quite not sure If I followed, are you suggesting explicit initialization means other than None?
azurelinuxagent/common/cgroup.py
Outdated
@@ -40,6 +41,7 @@ class MetricsCounter(object): | |||
TOTAL_MEM_USAGE = "Total Memory Usage" | |||
MAX_MEM_USAGE = "Max Memory Usage" | |||
THROTTLED_TIME = "Throttled Time" | |||
TOTAL_SWAP_MEM_USAGE = "Total Swap Memory Usage" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is "Total" needed in the name? are we adding multiple properties?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe not If I think now as this just one property
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
posted some comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, a couple of minor comments
azurelinuxagent/common/cgroup.py
Outdated
from azurelinuxagent.common.exception import CGroupsException | ||
from azurelinuxagent.common.future import ustr | ||
from azurelinuxagent.common.osutil import get_osutil | ||
from azurelinuxagent.common.utils import fileutil | ||
|
||
REPORT_EVERY_HOUR = timedelta(hours=1) | ||
DEFAULT_REPORT_PERIOD = timedelta(seconds=conf.get_cgroup_check_period()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these 2 should be private
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bump on this
azurelinuxagent/common/cgroup.py
Outdated
|
||
return tracked | ||
|
||
|
||
class MemoryCgroup(CGroup): | ||
|
||
def _get_memory_stat_counter(self, counter_name, raise_error_if_counter_not_present=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'raise_error_if_counter_not_present' is a little odd... can't the caller just ignore the exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean consider '0' if counter not present and don't throw exception for all counters. I'm ignoring the exception for SWAP as this is an optional value and may not be present in stat file but others RSS, cache should be there all the time and returning exception if that's not the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, what i meant is that passing that argument to suppress the exception is odd... normally the caller would just handle the exception.
for example, _get_memory_stat_counter could always raise FileNotFoundError (or similar) and then get_swap_memory_usage would do
try:
return self._get_memory_stat_counter("swap")
except FileNotFoundError:
return 0
azurelinuxagent/common/cgroup.py
Outdated
cache = self._get_memory_stat_counter("cache") | ||
rss = self._get_memory_stat_counter("rss") | ||
return cache + rss | ||
except (IOError, OSError) as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we avoid the repetition of the exception handling logic in _get_memory_stat_counter, _get_memory_stat_counter and get_swap_memory_usage?
azurelinuxagent/common/cgroup.py
Outdated
""" | ||
|
||
def __init__(self, category, counter, instance, value, report_period=DEFAULT_REPORT_PERIOD): | ||
self.__category = category |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These would probably be better as starting with a single underscore, not two. See this stack overflow question and its answers for some interesting discussion.
azurelinuxagent/common/cgroup.py
Outdated
""" | ||
try: | ||
return self._get_memory_stat_counter("swap") | ||
except CounterNotFound: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get_memory_usage
above looks like it will possibly raise a CounterNotFound
exception, whereas this code will handle it. I think it makes more sense to have similar convention (i.e. both get*_memory_usage
methods handle CounterNotFound
, or neither do).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SWAP is on optional field, not necessarily present in the stat file whereas other counters should be found otherwise something got messed up in cgroups, in that case I want to raise exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the convention the code base follows is that get_*
can raise, and try_*
returns a default value.
Maybe we can switch to use that to make the current behavior more explicit to the caller?
key = metric.category + metric.counter + metric.instance | ||
if key not in self.__periodic_metrics or (self.__periodic_metrics[key] + metric.report_period) <= datetime.datetime.now(): | ||
report_metric(metric.category, metric.counter, metric.instance, metric.value, log_event=self.__log_metrics) | ||
self.__periodic_metrics[key] = datetime.datetime.now() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the bounds on the growth of this map? These entries are one to one with the number of extensions, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, one to one with extensions
test_mem_cg.get_memory_usage() | ||
|
||
swap_memory_usage = test_mem_cg.get_swap_memory_usage() | ||
self.assertEqual(0, swap_memory_usage) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is indicative of what I mention above. The caller needs to know which method returns a zero on error and which throws an exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's exact behavior I want to implement
Description
Following cgroup counters monitor and report for VMAgent RAM usage.
Total Memory Usage: Collect RSS+Cache counters from memory.stat
Max Memory Usage: Collect from memory.max_usage_in_bytes
Total Swap Memory Usage: Collect SWAP counter from memory.stat
Note: Max Memory and SWAP Memory reported less frequently which is every hour or when data changes.
Issue #
PR information
Quality of Code and Contribution Guidelines