Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Populate telemetry events at creation time #1791

Merged
merged 16 commits into from
Mar 2, 2020
Merged

Populate telemetry events at creation time #1791

merged 16 commits into from
Mar 2, 2020

Conversation

narrieta
Copy link
Member

@narrieta narrieta commented Feb 26, 2020

Replaces #1757

Updates:

  • Instead of patching old agent events at startup, it patches them at collection time
  • To distinguish between events from old agents and current agent, saves telemetry files as *.waagent.tld instead of *.tld
  • Initializes common fields in the daemon (they were initialized only on the extension handler)
  • Merges SysInfo into EventLogger
  • Uses a static member (GoalState.ContainerID) instead of an environment variable to keep the latest container ID
  • Simplifies MonitorHandle.collect_and_send_events and improves error handling collecting events
  • Removes the call to update_goal_state() from the intialization of the MonitorHandle
  • Internal refactoring of the logic that adds common parameters to events; merged a few methods together
  • Many test improvements to avoid testing via private implementation details

This change is Reviewable

@codecov
Copy link

codecov bot commented Feb 26, 2020

Codecov Report

Merging #1791 into develop will decrease coverage by 0.08%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #1791      +/-   ##
===========================================
- Coverage    68.95%   68.87%   -0.09%     
===========================================
  Files           81       81              
  Lines        11322    11360      +38     
  Branches      1597     1601       +4     
===========================================
+ Hits          7807     7824      +17     
- Misses        3187     3204      +17     
- Partials       328      332       +4     
Impacted Files Coverage Δ
azurelinuxagent/common/telemetryevent.py 72.41% <0.00%> (-27.59%) ⬇️
azurelinuxagent/ga/monitor.py 83.65% <0.00%> (-5.31%) ⬇️
azurelinuxagent/ga/update.py 88.24% <0.00%> (+0.01%) ⬆️
azurelinuxagent/common/logger.py 90.90% <0.00%> (+0.11%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ce8e736...57a9735. Read the comment docs.

@@ -53,15 +57,13 @@

MAX_NUMBER_OF_EVENTS = 1000

EVENT_FILE_EXTENSION = '.waagent.tld'
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Events produced by >= 2.2.47 will be saved as *.waagent.tld

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: Rename this to AGENT_EVENT_FILE_EXTENSION

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, thanks!

@staticmethod
def add_default_parameters_to_event(event_parameters, set_values_for_agent=True):
def _trim_extension_event_parameters(event):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comes from #1757; we need to improve the way we handle parameters to avoid all those sequential traversals of the array

def _update_old_event_schema(self, event, event_creation_time):
# Ensure that if an agent event is missing a field from the schema defined since 2.2.47, the missing fields
# will be appended, ensuring the event schema is complete before the event is reported.
new_event = TelemetryEvent()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comes from #1757. I discussed this @pgombar - we should not just add any missing fields since we may hide bugs elsewhere or populate with invalid values. Some missing fields must be added, though. For example, OpCode (timestamp) will be in 2.2.46, but not on previous agents.


# Checking if the particular param name is in the TelemetryEvent.
def __contains__(self, param_name):
return param_name in [param.name for param in self.parameters]

def is_extension_event(self):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From #1757 - Needs improvement to avoid traversing the array

Copy link
Contributor

@pgombar pgombar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed agent code, will look at tests later.

azurelinuxagent/common/event.py Outdated Show resolved Hide resolved
azurelinuxagent/common/event.py Show resolved Hide resolved
azurelinuxagent/common/event.py Show resolved Hide resolved
azurelinuxagent/common/event.py Show resolved Hide resolved
version=version,
message=random_generator(size) if size != 0 else message)
event_file = os.path.join(self.event_dir, "{0}.tld".format(int(time.time() * 1000000)))
with open(event_file, 'wb+') as fd:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use save_event here? It handles the naming and event_dir creation if needed.

Copy link
Member Author

@narrieta narrieta Feb 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

save_event should be a private of EventLogger, we should not introduce dependencies on private members

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, see my comment below

fd.write(event_data.encode('utf-8'))

@staticmethod
def _get_event_data(duration, is_success, message, name, op, version, eventId=1):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PyCharm identified that this method is nearly identical to tests.protocol.test_wire.get_event. In fact, the one in test_wire is outdated since we are also adding the IsInternal parameter which is removed with this change. Can you please merge the two methods?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to deal with so many test breaks that I did not cleanup all the tests and some of them (like this one) I only modified to allow them to pass. The original code was creating a weird event that did not follow the schema for an extension or an agent event, but was saving it using the "*.tld" extension so it looked to the new code as an extension event, causing many of these tests to break. Had it been done using the public interface (add_event, for example) the internal refactoring would have not caused any issues with it.

As we do improvements on the existing code we are going to have to decide at what point to stop. I had already done a fair amount of work investigating test failures and rewriting many of them and I did not go further on this one on specific. Each one of us will have to make that call. I stopped here.

When writing new code though, we should make every effort to avoid this kind of issues, though. I believe these tests are relatively recent, but to be fair we have not been paying enough attention to design as a team. Let's try to improve from now on.

self._create_test_event_file("custom_script_utf-16.tld")
self._create_test_event_file("custom_script_invalid_json.tld")
os.chmod(self._create_test_event_file("custom_script_no_read_access.tld"), 0o200)
self._create_test_event_file("custom_script_valid_json.tld")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the difference between custom_script_valid_json.tld and custom_script.tld?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

none, i just want 2 valid events -- i'll rename to custom-script-1 & 2

Comment on lines 374 to 375
# expected_warn_message = r'Failed to process event file.*custom_script_with_invalid_json.tld.*Error parsing event'
# self.assertIsNotNone(re.search(expected_warn_message, str(mock_warn.call_args[0])))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually it is an assert i forgot to put back. will do

vrdmr
vrdmr previously approved these changes Feb 28, 2020
azurelinuxagent/common/event.py Show resolved Hide resolved
Comment on lines 645 to 649
def test_collect_events_should_add_all_the_parameters_in_the_telemetry_schema_to_extension_events(self):
self._assert_extension_event_includes_all_parameters_in_the_telemetry_schema('custom_script.tld')

def test_collect_events_should_ignore_extra_parameters_in_extension_events(self):
self._assert_extension_event_includes_all_parameters_in_the_telemetry_schema('custom_script_extra_parameters.tld')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also test extension events that are XML, not just JSON. The original need for this PR was because we were incorrectly handling XML extension events, because they already had ContainerId present, but empty, which we honored instead of overriding the value.

There is an existing (real) VMAccess XML event in \tests\data\ext\event_from_extension.xml.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The filename event_from_extension.xml should be updated too, since extension events are .tld, regardless of if they're XML or JSON.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline - this is covered in the test case using custom_script_extra_parameters.tld

Copy link
Contributor

@pgombar pgombar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added review of tests

larohra
larohra previously approved these changes Feb 29, 2020
Copy link
Contributor

@larohra larohra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 small comment else LGTM

@@ -53,15 +57,13 @@

MAX_NUMBER_OF_EVENTS = 1000

EVENT_FILE_EXTENSION = '.waagent.tld'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: Rename this to AGENT_EVENT_FILE_EXTENSION

@@ -53,15 +58,13 @@

MAX_NUMBER_OF_EVENTS = 1000

EVENT_FILE_EXTENSION = '.waagent.tld'
EVENT_FILE_REGEX = re.compile(r'(?P<agent_event>\.waagent)?\.tld$')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same, rename this to AGENT_EVENT_FILE_REGEX for clarity

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one is different: it matches all event files (new agent, old agent and extensions)

@narrieta narrieta dismissed stale reviews from larohra and vrdmr via 57a9735 March 2, 2020 18:25
Copy link
Contributor

@pgombar pgombar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for all these improvements, especially the tests!

@narrieta narrieta merged commit 148bf48 into develop Mar 2, 2020
@narrieta narrieta deleted the telemetry branch March 2, 2020 19:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants