Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add telemetry for vmSettings #2454

Merged
merged 7 commits into from
Jan 4, 2022
Merged

Conversation

narrieta
Copy link
Member

@narrieta narrieta commented Dec 31, 2021

Adding telemetry for interesting errors in vmSettings (see _VmSettingsErrorReporter in wire.py for a description of the counters we keep track of)

Also, if the vmSettings API is not supported stop using it for 6 hours.

@codecov
Copy link

codecov bot commented Dec 31, 2021

Codecov Report

Merging #2454 (6443708) into develop (a576a50) will increase coverage by 0.04%.
The diff coverage is 93.40%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #2454      +/-   ##
===========================================
+ Coverage    71.89%   71.93%   +0.04%     
===========================================
  Files          101      101              
  Lines        14970    15020      +50     
  Branches      2370     2384      +14     
===========================================
+ Hits         10762    10805      +43     
- Misses        3728     3731       +3     
- Partials       480      484       +4     
Impacted Files Coverage Δ
azurelinuxagent/common/utils/restutil.py 84.03% <33.33%> (-1.35%) ⬇️
azurelinuxagent/common/protocol/wire.py 81.56% <97.61%> (+0.68%) ⬆️
azurelinuxagent/common/event.py 85.36% <100.00%> (+0.03%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a576a50...6443708. Read the comment docs.

@@ -354,7 +354,14 @@ def http_request(method,
max_retry=None,
retry_codes=None,
retry_delay=DELAY_IN_SECONDS,
redact_data=False):
redact_data=False,
return_raw_response=False):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error handling done by http_request mixes logic for several APIs, and makes hard (or impossible) to handle specific errors for new APIs (in this case vmSettings). Our mock infrastructure is based on this method, though, so I decided to add a new parameter to bypass all the error handling here.

Eventually this method needs to be rewritten/replaced to do better error handling.


with patch("azurelinuxagent.common.utils.restutil._http_request", side_effect=http_get_vm_settings):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was mocking the internal method to bypass the error handling done by the public method, but that is no longer needed.

# log the HostGAPlugin version when we fetch the vmSettings for the first time
is_first_vm_settings = self._vm_settings_goal_state is None
if is_first_vm_settings:
# log the HostGAPlugin version
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@larohra this fixes the issue you pointed out in my previous PR (always logging this when vmSettings is not supported)

except Exception as exception:
error = textutil.format_exception(exception) # since this is a generic error, we format the exception to include the traceback
raise Exception("Error fetching vmSettings [correlation ID: {0} eTag: {1}]: {2}".format(correlation_id, etag, error))
if isinstance(exception, IOError) and "timed out" in ustr(exception):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like this string comparision but I couldn't find a version agnostic way of testing for this. Looks like the classing heirarchy is different for 2 and 3. I tried checking errno, but that was None on my python 3.8 install...

Any other possibilities? Have you looked deeply into this yet?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't like it either, but i found no way other than this check

the existing code in restapi.py handles IOError. telemetry shows that the exception is indeed an IOError and that the message is "timed out"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from what I could find, the exception is a socket.timeout error, but from what I could tell, there's differences between the definition of that class in 2 vs. 3.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to telemetry it's an IOError (derived from OSError after 3.3, I think). These are the errors we have got for the last week

[HttpError] [HTTP Retry] GET http://168.63.129.16:32526/vmSettings -- Status Code 502
[HttpError] [HTTP Retry] GET http://168.63.129.16:32526/vmSettings -- Status Code 500
[HttpError] [HTTP Retry] GET http://168.63.129.16:32526/vmSettings -- Status Code 403
[HttpError] [HTTP Failed] GET http://168.63.129.16:32526/vmSettings -- IOError timed out
[HttpError] [HTTP Failed] GET http://168.63.129.16:32526/vmSettings -- IOError [Errno 113] No route to host
[HttpError] [HTTP Failed] GET http://168.63.129.16:32526/vmSettings -- IOError [Errno 111] Connection refused
[HttpError] [HTTP Failed] GET http://168.63.129.16:32526/vmSettings -- IOError [Errno 110] Connection timed out
[HttpError] [HTTP Failed] GET http://168.63.129.16:32526/vmSettings -- IOError [Errno 104] Connection reset by peer
[HttpError] [HTTP Failed] GET http://168.63.129.16:32526/vmSettings -- IOError [Errno 101] Network is unreachable

with patch("azurelinuxagent.common.protocol.wire.add_event") as add_event:
mock_response = None # stop producing errors
protocol.client._vm_settings_error_reporter._next_period = datetime.now()
protocol.client.update_goal_state()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can't be replaced by calling the private _reset function on the error reporter, can it? That would have the benefit of avoiding any potential side effects of the update_goal_state call and removing the need for the + 2 down on the definition for expected['requests'].

Copy link
Member Author

@narrieta narrieta Jan 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, _reset sets the counters to 0; this actually outputs the summary

kevinclark19a
kevinclark19a previously approved these changes Jan 3, 2022
@@ -1271,7 +1271,7 @@ def http_get_handler(url, *_, **__):
test_error_in_http_request("Internal error in the HostGAPlugin", MockHttpResponse(httpclient.BAD_GATEWAY), "[Internal error in HostGAPlugin] [HTTP Failed] [502: None]")
test_error_in_http_request("Arbitrary error in the request (BAD_REQUEST)", MockHttpResponse(httpclient.BAD_REQUEST), "[HTTP Failed] [400: None]")
test_error_in_http_request("Generic error in the request", Exception("GENERIC REQUEST ERROR"), "GENERIC REQUEST ERROR")
test_error_in_http_request("Response headers with no Etag", MockHttpResponse(200, b""), "The vmSettings do no include an Etag")
test_error_in_http_request("Response headers with no Etag", MockHttpResponse(200, b""), "The vmSettings response does no include an Etag header")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: is The vmSettings response does no include supposed to be ...does not include?

kevinclark19a
kevinclark19a previously approved these changes Jan 4, 2022
@narrieta narrieta merged commit 677cb30 into Azure:develop Jan 4, 2022
@narrieta narrieta deleted the fast-track-telemetry branch January 4, 2022 01:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants