-
Notifications
You must be signed in to change notification settings - Fork 372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add telemetry for vmSettings #2454
Conversation
Codecov Report
@@ Coverage Diff @@
## develop #2454 +/- ##
===========================================
+ Coverage 71.89% 71.93% +0.04%
===========================================
Files 101 101
Lines 14970 15020 +50
Branches 2370 2384 +14
===========================================
+ Hits 10762 10805 +43
- Misses 3728 3731 +3
- Partials 480 484 +4
Continue to review full report at Codecov.
|
@@ -354,7 +354,14 @@ def http_request(method, | |||
max_retry=None, | |||
retry_codes=None, | |||
retry_delay=DELAY_IN_SECONDS, | |||
redact_data=False): | |||
redact_data=False, | |||
return_raw_response=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error handling done by http_request mixes logic for several APIs, and makes hard (or impossible) to handle specific errors for new APIs (in this case vmSettings). Our mock infrastructure is based on this method, though, so I decided to add a new parameter to bypass all the error handling here.
Eventually this method needs to be rewritten/replaced to do better error handling.
|
||
with patch("azurelinuxagent.common.utils.restutil._http_request", side_effect=http_get_vm_settings): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was mocking the internal method to bypass the error handling done by the public method, but that is no longer needed.
# log the HostGAPlugin version when we fetch the vmSettings for the first time | ||
is_first_vm_settings = self._vm_settings_goal_state is None | ||
if is_first_vm_settings: | ||
# log the HostGAPlugin version |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@larohra this fixes the issue you pointed out in my previous PR (always logging this when vmSettings is not supported)
except Exception as exception: | ||
error = textutil.format_exception(exception) # since this is a generic error, we format the exception to include the traceback | ||
raise Exception("Error fetching vmSettings [correlation ID: {0} eTag: {1}]: {2}".format(correlation_id, etag, error)) | ||
if isinstance(exception, IOError) and "timed out" in ustr(exception): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like this string comparision but I couldn't find a version agnostic way of testing for this. Looks like the classing heirarchy is different for 2 and 3. I tried checking errno
, but that was None
on my python 3.8 install...
Any other possibilities? Have you looked deeply into this yet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't like it either, but i found no way other than this check
the existing code in restapi.py handles IOError. telemetry shows that the exception is indeed an IOError and that the message is "timed out"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from what I could find, the exception is a socket.timeout
error, but from what I could tell, there's differences between the definition of that class in 2 vs. 3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to telemetry it's an IOError (derived from OSError after 3.3, I think). These are the errors we have got for the last week
[HttpError] [HTTP Retry] GET http://168.63.129.16:32526/vmSettings -- Status Code 502
[HttpError] [HTTP Retry] GET http://168.63.129.16:32526/vmSettings -- Status Code 500
[HttpError] [HTTP Retry] GET http://168.63.129.16:32526/vmSettings -- Status Code 403
[HttpError] [HTTP Failed] GET http://168.63.129.16:32526/vmSettings -- IOError timed out
[HttpError] [HTTP Failed] GET http://168.63.129.16:32526/vmSettings -- IOError [Errno 113] No route to host
[HttpError] [HTTP Failed] GET http://168.63.129.16:32526/vmSettings -- IOError [Errno 111] Connection refused
[HttpError] [HTTP Failed] GET http://168.63.129.16:32526/vmSettings -- IOError [Errno 110] Connection timed out
[HttpError] [HTTP Failed] GET http://168.63.129.16:32526/vmSettings -- IOError [Errno 104] Connection reset by peer
[HttpError] [HTTP Failed] GET http://168.63.129.16:32526/vmSettings -- IOError [Errno 101] Network is unreachable
with patch("azurelinuxagent.common.protocol.wire.add_event") as add_event: | ||
mock_response = None # stop producing errors | ||
protocol.client._vm_settings_error_reporter._next_period = datetime.now() | ||
protocol.client.update_goal_state() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can't be replaced by calling the private _reset
function on the error reporter, can it? That would have the benefit of avoiding any potential side effects of the update_goal_state
call and removing the need for the + 2
down on the definition for expected['requests']
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, _reset sets the counters to 0; this actually outputs the summary
tests/protocol/test_wire.py
Outdated
@@ -1271,7 +1271,7 @@ def http_get_handler(url, *_, **__): | |||
test_error_in_http_request("Internal error in the HostGAPlugin", MockHttpResponse(httpclient.BAD_GATEWAY), "[Internal error in HostGAPlugin] [HTTP Failed] [502: None]") | |||
test_error_in_http_request("Arbitrary error in the request (BAD_REQUEST)", MockHttpResponse(httpclient.BAD_REQUEST), "[HTTP Failed] [400: None]") | |||
test_error_in_http_request("Generic error in the request", Exception("GENERIC REQUEST ERROR"), "GENERIC REQUEST ERROR") | |||
test_error_in_http_request("Response headers with no Etag", MockHttpResponse(200, b""), "The vmSettings do no include an Etag") | |||
test_error_in_http_request("Response headers with no Etag", MockHttpResponse(200, b""), "The vmSettings response does no include an Etag header") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: is The vmSettings response does no include
supposed to be ...does not include
?
Adding telemetry for interesting errors in vmSettings (see _VmSettingsErrorReporter in wire.py for a description of the counters we keep track of)
Also, if the vmSettings API is not supported stop using it for 6 hours.