Add context timeout for checkaccess requests and fix metrics #350
@@ -321,31 +364,33 @@ func (a *AccessInfo) sendCheckAccessRequest(checkAccessURL url.URL, checkAccessB
klog.V(10).Infof("binary data:%s", binaryData)
}

- req, err := http.NewRequest(http.MethodPost, checkAccessURL.String(), buf)
+ req, err := http.NewRequestWithContext(ctx, http.MethodPost, checkAccessURL.String(), buf)
Should we add a client request ID to each call and log it along with the error code?
Adding it.
resp, err := a.client.Do(req)
duration := time.Since(start).Seconds()
Instead of emitting checkAccessDuration in multiple places, should we emit the duration metric once, right after the call returns?
I know we had discussed this, but I realized now that we can't do that: the status code is a dimension of the checkaccess duration metric, so we can emit the metric only after we know the code.
authz/providers/azure/rbac/rbac.go
Outdated
return
checkAccessTotal.WithLabelValues(internalServerCode).Inc()
checkAccessDuration.WithLabelValues(internalServerCode).Observe(duration)
return errutils.WithCode(errors.Wrap(err, "error in check access request execution"), http.StatusInternalServerError)
Will gateway errors and client timeouts be part of this error? If so, should we try to retrieve the error code?
resp and error cannot be non-nil at the same time
util/azure/utils.go
Outdated
@@ -359,9 +416,12 @@ func fetchDataActionsList(settings *DiscoverResourcesSettings) ([]Operation, err
}

if resp.StatusCode != http.StatusOK {
counterGetOperationsResources.WithLabelValues(ConvertIntToString(resp.StatusCode)).Inc()
this should be moved to line 417.
As discussed, this cannot be moved.
Also, I removed this metric because we would never receive it in the error case: the endpoints of the guard service are assigned only once the readiness probe has succeeded, so if the service is not ready, Prometheus will not scrape the metrics anyway.
🕐
LGTM
Add Checkaccess latency metrics Fix checkaccess counters Signed-off-by: Anumita <ansheno@microsoft.com>
Add metrics for discover resources Signed-off-by: Anumita <ansheno@microsoft.com>
Signed-off-by: Anumita <ansheno@microsoft.com>
Hey @tamalsaha, could you please release an image for guard? We have merged both PRs. Thanks!
Hey @tamalsaha, could you please release the image today if possible? We want to get this fix out ASAP, hence the urgency. Thanks!
Context timeout:
Switched to https://pkg.go.dev/golang.org/x/sync/errgroup, a wrapper around wait groups that can propagate errors. Added a context timeout of 23 seconds.
Also added a context timeout metric.
Discover Resources metrics:
Added metrics for -
Fix metrics:
The SAR status was returning a 200 code regardless of whether there were any errors. Used the existing withCode struct to make sure we send an appropriate error code. The error codes are divided as follows:
a. If checkaccess fails, we send back the response status code as the error code.
b. For any other error, the code is StatusBadRequest if the error is client related, and StatusInternalServerError otherwise.
Fixed the checkaccess requests total and failed metrics to include the status code as a dimension. This will help us get the success rate.
Also added metrics for checkaccess latency.