Do not use --unit with systemd-cgls #1910
command_name = command[0] if isinstance(command, list) and len(command) > 0 else command
return "'{0}' failed: {1}".format(command_name, returncode)
return "'{0}' failed: {1} ({2})".format(command_name, return_code, stderr.rstrip())
debugging failures with just the error code in the exception message can be hard; added stderr
Can there be a case where stderr is None? If so, stderr.rstrip() would throw.
no, it'd be an empty string
Awesome (Y)
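A minimal sketch of the message formatting discussed in this thread, written defensively so that a None stderr cannot raise. The function name `format_command_error` and the None guard are assumptions for illustration only; the PR calls `stderr.rstrip()` directly because, per the reply above, stderr is an empty string in the worst case.

```python
def format_command_error(command, return_code, stderr):
    """Build an error message with the command name, exit code, and stderr."""
    # Use the first element when the command is a non-empty list; otherwise use it as-is.
    command_name = command[0] if isinstance(command, list) and len(command) > 0 else command
    # Hypothetical guard (not in the PR): treat a missing stderr as an empty string.
    stderr_text = (stderr or "").rstrip()
    return "'{0}' failed: {1} ({2})".format(command_name, return_code, stderr_text)
```

With the guard, a None stderr simply produces empty parentheses instead of an AttributeError.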
message = "The agent's cgroup includes unexpected processes: {0}".format(error)
logger.info(message)
add_event(op=WALAEventOperation.CGroupsDebug, message=message)
processes_check_error = None
any exception in the code to check processes should not prevent us from reporting metrics
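The intent of that comment can be sketched as follows: the process check runs inside a try/except, the failure is kept as a string, and metrics are reported regardless. The names `run_monitoring_cycle`, `check_processes`, and `report_metrics` are hypothetical and do not come from the agent's code.

```python
def run_monitoring_cycle(check_processes, report_metrics):
    """Run the process check, then always report metrics.

    Returns the error text from the check, or None if it succeeded.
    """
    processes_check_error = None
    try:
        check_processes()
    except Exception as e:
        # Keep the error for later reporting instead of letting it propagate.
        processes_check_error = str(e)
    # Metrics reporting is outside the try block, so a failed check cannot skip it.
    report_metrics()
    return processes_check_error
```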
Codecov Report
@@ Coverage Diff @@
## develop #1910 +/- ##
===========================================
- Coverage 69.49% 69.47% -0.02%
===========================================
Files 85 85
Lines 11864 11870 +6
Branches 1666 1667 +1
===========================================
+ Hits 8245 8247 +2
- Misses 3249 3252 +3
- Partials 370 371 +1
processes_check_error = ustr(e)

# Report a small sample of errors
if processes_check_error != self._last_error and self._error_count < 5:
I missed this in the previous PR, but I noticed we're not resetting the error count ever. I think we should reset it once a day or something to also get newer errors that might occur
it was intentional; there is no need for that, i just want a sample of possible errors
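The sampling behaviour debated here can be illustrated with a small sketch: an error is reported only when it differs from the previous one and fewer than a fixed number of errors have been reported, and the counter is never reset. The class name `ErrorSampler`, the `max_reports` parameter, and the explicit None check are assumptions for illustration, not the agent's actual code.

```python
class ErrorSampler(object):
    """Report only a small, fixed-size sample of distinct consecutive errors."""

    def __init__(self, max_reports=5):
        self._last_error = None
        self._error_count = 0
        self._max_reports = max_reports

    def should_report(self, error):
        # Assumption: None means the check succeeded, so there is nothing to report.
        if error is None:
            return False
        if error != self._last_error and self._error_count < self._max_reports:
            self._error_count += 1
            self._last_error = error
            return True
        return False
```

Because the count is never reset, the sampler goes permanently quiet after `max_reports` reports, which matches the stated goal of collecting only a sample of possible errors.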
@@ -140,7 +139,8 @@ def test_run_command_should_raise_an_exception_when_the_command_fails(self):
shellutil.run_command(command)

exception = context_manager.exception
self.assertEquals(str(exception), "'ls' failed: 2")
self.assertIn("'ls' failed: 2", str(exception))
btw - python 2.6 doesn't have an assert to match a regex; I need to add that to the test utilities.
I'll do that in a separate PR; in the meantime I split the check into 2 asserts
LGTM
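A regex assertion of the kind mentioned above could look roughly like this. It is only a sketch: the helper name `assert_matches_regex` is hypothetical and is not the utility that was later added to the project's test tools.

```python
import re


def assert_matches_regex(text, pattern):
    """Raise AssertionError if `pattern` is not found anywhere in `text`."""
    if re.search(pattern, text) is None:
        raise AssertionError("'{0}' does not match the pattern '{1}'".format(text, pattern))
```

A test could then check the whole message in one call, for example `assert_matches_regex(str(exception), r"'ls' failed: 2 \(.+\)")`, rather than splitting the check across several asserts.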
@@ -223,8 +225,9 @@ def get_processes_in_agent_cgroup(self):
The return value can be None if cgroups are not enabled or if an error occurs during the operation.
"""
def __impl():
agent_unit = self._cgroups_api.get_agent_unit_name()
return self._cgroups_api.get_processes_in_cgroup(agent_unit)
if self._agent_cpu_cgroup_path is None:
Wouldn't it be better to use the memory cgroup here since we know CPU is not mounted by default in some distros, whereas memory is?
No, it is CPU that we are interested in.
Why CPU specifically? Aren't we only using the cgroup path to get the PIDs? They are also stored in the memory cgroup.
We want to enforce CPU, so it is the CPU cgroup that we need to check.
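To illustrate why a controller-specific path is enough to enumerate processes, here is a sketch that reads the PIDs listed in a cgroup directory's `cgroup.procs` file. This is an assumption-laden illustration, not the PR's implementation: the PR description indicates that the cgroup path is used with systemd-cgls, and the function name `get_pids_in_cgroup` is hypothetical.

```python
import os


def get_pids_in_cgroup(cgroup_path):
    """Return the PIDs listed in <cgroup_path>/cgroup.procs as a list of ints."""
    procs_file = os.path.join(cgroup_path, "cgroup.procs")
    with open(procs_file, "r") as f:
        # The file holds one PID per line; skip any blank lines.
        return [int(line) for line in f if line.strip()]
```

Whichever controller's path is passed in, the result is the set of processes attached to that controller's cgroup, which is why the choice between the CPU and memory paths matters when CPU is the resource being enforced.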
azurelinuxagent/ga/monitor.py
if processes_check_error != self._last_error and self._error_count < 5:
self._error_count += 1
self._last_error = processes_check_error
message = "The agent's cgroup includes unexpected processes: {0}".format(processes_check_error)
The error message no longer matches the intent when processes_check_error just contains the stack trace of an exception that occurred while we were trying to check the processes in the agent cgroup. I know you are only using this event to gather diagnostics, so it's up to you if you want to make it clearer.
thanks; fixed
LGTM
systemd-cgls doesn't support --unit on Ubuntu 16; using the cgroup path instead.
Also improved error handling and reporting.