Extension telemetry pipeline #1918

Merged: larohra merged 35 commits into Azure:develop from extension-telemetry-pipeline on Aug 17, 2020

@larohra (Contributor) commented Jun 24, 2020

Description

This PR contains the major changes that enable the extension telemetry pipeline.
It adds a new thread to the agent that wakes up every 5 minutes, reads each extension's events directory, and sends those events to Wireserver.
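
As a rough illustration of the loop described above, here is a minimal sketch. It is not the code from this PR: the directory layout (`<log_root>/<extension>/events`), the function names, and the `send` callback are all assumptions made for the example.

```python
import os
import threading

# The PR description says the thread wakes up every 5 minutes.
EVENTS_PERIOD_SECONDS = 5 * 60


def collect_extension_events(log_root):
    """Return {extension_name: [event file paths]} for each extension that has an events directory.

    Assumed layout (hypothetical): <log_root>/<extension_name>/events/<event files>
    """
    collected = {}
    for extension_name in sorted(os.listdir(log_root)):
        events_dir = os.path.join(log_root, extension_name, "events")
        if not os.path.isdir(events_dir):
            continue  # this extension has not written any events
        collected[extension_name] = sorted(
            os.path.join(events_dir, name) for name in os.listdir(events_dir)
        )
    return collected


def run_periodically(stop_event, log_root, send, period=EVENTS_PERIOD_SECONDS):
    """Collect and send events until stop_event is set.

    `send` is a caller-supplied callable standing in for the upload to Wireserver.
    """
    while not stop_event.is_set():
        for extension_name, event_files in collect_extension_events(log_root).items():
            send(extension_name, event_files)
        # Sleeps for `period` seconds, but returns early if a stop is requested.
        stop_event.wait(period)
```

In this sketch the loop would be started with something like `threading.Thread(target=run_periodically, args=(stop_event, log_root, send), daemon=True).start()`; waiting on a `threading.Event` instead of calling `time.sleep` lets the thread shut down promptly.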

Issue #


PR information

  • The title of the PR is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For information on cleaning up the commits in your pull request, see this page.
  • Except for special cases involving multiple contributors, the PR is started from a fork of the main repository, not a branch.
  • If applicable, the PR references the bug/issue that it fixes in the description.
  • New Unit tests were added for the changes made and Travis.CI is passing.

Quality of Code and Contribution Guidelines

codecov bot commented Jun 24, 2020

Codecov Report

Merging #1918 into develop will increase coverage by 0.45%.
The diff coverage is 92.88%.

@@             Coverage Diff             @@
##           develop    #1918      +/-   ##
===========================================
+ Coverage    70.69%   71.14%   +0.45%     
===========================================
  Files           85       86       +1     
  Lines        12055    12285     +230     
  Branches      1685     1728      +43     
===========================================
+ Hits          8522     8740     +218     
- Misses        3152     3159       +7     
- Partials       381      386       +5     
Impacted Files                               Coverage Δ
azurelinuxagent/ga/extension_telemetry.py    91.42% <91.42%> (ø)
azurelinuxagent/common/event.py              86.61% <100.00%> (+0.08%) ⬆️
azurelinuxagent/common/exception.py          98.91% <100.00%> (+0.07%) ⬆️
azurelinuxagent/ga/env.py                    65.00% <100.00%> (+1.20%) ⬆️
azurelinuxagent/ga/exthandlers.py            87.72% <100.00%> (+0.22%) ⬆️
azurelinuxagent/ga/monitor.py                78.45% <100.00%> (+0.48%) ⬆️
azurelinuxagent/ga/update.py                 88.43% <100.00%> (+0.30%) ⬆️
azurelinuxagent/pa/deprovision/default.py    67.85% <100.00%> (ø)
... and 2 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a4d6404...79a47d3.

@narrieta (Member) left a comment

A few initial comments. I still need to look at extension_telemetry.py and tests.

@@ -60,7 +61,8 @@

_VALID_HANDLER_STATUS = ['Ready', 'NotReady', "Installing", "Unresponsive"]

-HANDLER_NAME_PATTERN = re.compile(_HANDLER_PATTERN + r'$', re.IGNORECASE)
+HANDLER_NAME_PATTERN = re.compile(_HANDLER_NAME_PATTERN, re.IGNORECASE)
+HANDLER_COMPLETE_NAME_PATTERN = re.compile(_HANDLER_PATTERN + r'$', re.IGNORECASE)
Member

I missed this in the previous PR, but is_extension_telemetry_pipeline_enabled should be defined elsewhere (extension_telemetry.py?)

Contributor Author

It does make more sense for this flag to live in extension_telemetry.py, but I didn't put it there because that creates a circular dependency (I need to read HANDLER_NAME from exthandlers.py). Moving these globals (HANDLER_NAME, etc.) elsewhere doesn't make logical sense either, since they're defined for the extension handlers.

To avoid all this confusion I just left is_extension_telemetry_pipeline_enabled in exthandlers, because that's where we read the flag. If you feel strongly about it, I can move it to a common place (maybe AgentGlobals.py or version.py) to avoid the circular dependency.

@narrieta (Member) left a comment

A few more comments; I still need to look at the tests.

# Limits
MAX_NUMBER_OF_EVENTS_PER_EXTENSION_PER_PERIOD = 300
EXTENSION_EVENT_FILE_MAX_SIZE = 4 * 1024 * 1024 # 4 MB = 4 * 1,048,576 Bytes
Member

Do we have similar limits for the agent? Should they be the same?

Contributor Author

We have some limiters for agent events, but they're more relaxed than these. These limits are mainly needed to ensure that extensions don't abuse the pipeline. I don't think they're needed for agent events.
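
To make the two limits concrete, here is a minimal sketch of how they might be enforced. It is an illustration, not the PR's implementation: `select_event_files` is a hypothetical name, `convert_to_mb` is a stand-in for the helper of the same name visible in the diff, and the sketch assumes one event per file for simplicity.

```python
import os

# Values quoted from the diff above.
MAX_NUMBER_OF_EVENTS_PER_EXTENSION_PER_PERIOD = 300
EXTENSION_EVENT_FILE_MAX_SIZE = 4 * 1024 * 1024  # 4 MB


def convert_to_mb(size_in_bytes):
    """Stand-in helper: bytes to megabytes, rounded for log messages."""
    return round(size_in_bytes / (1024.0 * 1024.0), 2)


def select_event_files(event_file_paths,
                       max_events=MAX_NUMBER_OF_EVENTS_PER_EXTENSION_PER_PERIOD,
                       max_size=EXTENSION_EVENT_FILE_MAX_SIZE):
    """Return (accepted_paths, warnings) for one extension and one period.

    Assumption for this sketch: each file holds exactly one event.
    """
    accepted, warnings = [], []
    for path in event_file_paths:
        if len(accepted) >= max_events:
            warnings.append(
                "Reached the limit of {0} events per period; dropping the rest".format(max_events))
            break
        size = os.path.getsize(path)
        if size > max_size:
            warnings.append("Skipping {0}: {1} MB exceeds the {2} MB limit".format(
                path, convert_to_mb(size), convert_to_mb(max_size)))
            continue
        accepted.append(path)
    return accepted, warnings
```

Returning the warnings instead of logging them directly keeps the sketch independent of the agent's logger; the caller can decide where each message goes.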

event_file_path, convert_to_mb(event_file_size),
convert_to_mb(self.EXTENSION_EVENT_FILE_MAX_SIZE))
logger.warn(msg)
add_log_event(level=logger.LogLevel.WARNING, message=msg, forced=True)
Member

Why add_log_event (instead of add_event)?

Contributor Author

add_event goes to the ExtensionEvents table and add_log_event goes to the GenericLogs table. I figured that for extension-related data it would make sense to keep everything (including errors) in one table, so that it's easier to debug.

Contributor

But above you are using add_event(op=WALAEventOperation.ExtensionTelemetryEventProcessing, message=msg, is_success=False)

Contributor Author

The idea is that any error thrown by the agent while processing extension telemetry is logged in the ExtensionEvents table, and any error caused by the telemetry events themselves (malformed events, exceeded limits, etc.) is logged in the GenericLogs table for easier lookup.
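
The routing rule from this thread can be sketched as follows. This is only an illustration under assumptions: `BadEventFileError` and `report_failure` are hypothetical names, and `add_event` / `add_log_event` are passed in as plain callables standing in for the agent's functions of the same name, so the keyword arguments used here are not a statement about their real signatures.

```python
class BadEventFileError(Exception):
    """Hypothetical: an extension's event file is malformed or exceeds a limit."""


def report_failure(error, add_event, add_log_event):
    """Send a failure to the table described in the thread and return that table's name.

    add_event and add_log_event are caller-supplied stand-ins so the sketch
    runs without the agent installed.
    """
    message = str(error)
    if isinstance(error, BadEventFileError):
        # Problem caused by the extension's own events: GenericLogs table.
        add_log_event(message=message)
        return "GenericLogs"
    # Problem inside the agent while processing telemetry: ExtensionEvents table.
    add_event(message=message, is_success=False)
    return "ExtensionEvents"
```

Keying the decision on the exception type keeps the rule in one place, which is one way to avoid the kind of mixed usage pointed out in the review.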

@@ -210,6 +210,28 @@ def __init__(self, msg=None, inner=None):
super(ResourceGoneError, self).__init__(msg, inner)


class RemoteAccessError(AgentError):
Member

Seems like you need to update your branch.

Member

There's still something funny with your branch. This exception was removed by #1935

Contributor Author

Yeah, I don't think the merge happened correctly. Even Travis is failing with some weird error. I'll try to fix it properly today if I get time, and will let you know once it's ready for review!

Contributor Author

Not sure what went wrong here; I tried the merge again, but somehow this is still there. I'll just cherry-pick that commit into my branch to get this change in. Apart from this, I don't see anything out of the ordinary.

narrieta previously approved these changes Aug 3, 2020
@pgombar (Contributor) left a comment

Left some comments, still haven't looked at the test code.

# EventName maps to HandlerName + '-' + Version from event file
expected_mapping = {
GuestAgentGenericLogsSchema.EventName: ExtensionEventSchema.Version,
Contributor

I'm confused by this line and the comment above it; the definitions don't seem to match?

Contributor Author

Oh yeah, sorry for the confusion. In the code below I'm actually mapping it to "Name-Version"; this dict is just supposed to be a placeholder.

Contributor Author

It's happening on line 433 of this file.
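
For readers without the file at hand, the "Name-Version" mapping mentioned in this thread amounts to something like the sketch below. The function name and the dict-shaped event with a `"Version"` key are assumptions for illustration; the actual code referenced on line 433 is not shown in this conversation.

```python
def event_name_for(handler_name, extension_event):
    """Build the EventName value as "<handler name>-<version>".

    `extension_event` is assumed to be a dict parsed from an extension's
    event file, with the version stored under the "Version" key.
    """
    return "{0}-{1}".format(handler_name, extension_event["Version"])
```

For example, `event_name_for("Microsoft.Example.Handler", {"Version": "1.2.3"})` would give `"Microsoft.Example.Handler-1.2.3"`.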

pgombar previously approved these changes Aug 14, 2020
@pgombar (Contributor) left a comment

Minor comments on the tests, but otherwise LGTM. I think it's important to re-enable the special characters test since that scenario has been problematic so far.

…try-pipeline

# Conflicts:
#	azurelinuxagent/ga/exthandlers.py
#	azurelinuxagent/ga/update.py
#	tests/ga/test_extension.py
pgombar previously approved these changes Aug 14, 2020
@larohra larohra merged commit 5d2ac9f into Azure:develop Aug 17, 2020
@larohra larohra deleted the extension-telemetry-pipeline branch August 17, 2020 17:14
4 participants