Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add context timeout for checkaccess requests and fix metrics #350

Merged
merged 6 commits into from
Jan 23, 2023

Conversation

Anumita
Copy link
Contributor

@Anumita Anumita commented Jan 15, 2023

Context timeout -
Change to using https://pkg.go.dev/golang.org/x/sync/errgroup which can propagate error , it is a wrapper on waitgroups. Added a context timeout of 23 seconds.

Added contexttimeout metric as well

Discover Resources metrics:
Added metrics for -

  1. Total duration for discovering resources which includes both apiserver and get operations call
  2. Duration for apiserver call
  3. Duration for get operations call

Fix metrics:

  1. SAR status was returning 200 code regardless of whether there were any errors or not. Utilized existing withCode struct to make sure we send an appropriate errorcode. The divisions of errorcode are:
    a. if checkaccess fails , we will send back errorcode which will be the response status code.
    b. if it's any other error it will either be statusbadrequest if client related or statusInternalServerError otherwise

  2. Fixed checkaccess requests total and failed metrics to included statuscode as a dimension. This will help us get the success rate

  3. Added metrics for checkaccess latency as well

@Anumita Anumita requested review from a team as code owners January 15, 2023 11:51
@Anumita Anumita force-pushed the ansheno/fixesandtimeout branch 2 times, most recently from 84dcd0a to 2954694 Compare January 16, 2023 13:38
@@ -321,31 +364,33 @@ func (a *AccessInfo) sendCheckAccessRequest(checkAccessURL url.URL, checkAccessB
klog.V(10).Infof("binary data:%s", binaryData)
}

req, err := http.NewRequest(http.MethodPost, checkAccessURL.String(), buf)
req, err := http.NewRequestWithContext(ctx, http.MethodPost, checkAccessURL.String(), buf)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we add client request id in each call and log that with error code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adding

resp, err := a.client.Do(req)
duration := time.Since(start).Seconds()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of emitting checkAccessDuration on multiple places, should we emit metric of duration just after call is returned?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i know we had discussed this but i realized now that we can't do that - we have code as dimension for checkaccess duration. so we can emit metric only after we know the code

return
checkAccessTotal.WithLabelValues(internalServerCode).Inc()
checkAccessDuration.WithLabelValues(internalServerCode).Observe(duration)
return errutils.WithCode(errors.Wrap(err, "error in check access request execution"), http.StatusInternalServerError)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will gate way error & client time out be part of this error? If so, should we try to retrieve error code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resp and error cannot be non-nil at the same time

@@ -359,9 +416,12 @@ func fetchDataActionsList(settings *DiscoverResourcesSettings) ([]Operation, err
}

if resp.StatusCode != http.StatusOK {
counterGetOperationsResources.WithLabelValues(ConvertIntToString(resp.StatusCode)).Inc()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be moved to line 417.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as discussed , cannot move this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also i removed this metric as we will never receive this metric in error case. Since, the endpoints of the guard service are assigned once the readiness probe has succeeded. So if it's not ready, prom will not be scrape the metrics anyway

Copy link
Contributor

@krdhruva krdhruva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🕐

@Anumita Anumita force-pushed the ansheno/fixesandtimeout branch 2 times, most recently from c849ea6 to 68bcc63 Compare January 20, 2023 07:23
krdhruva
krdhruva previously approved these changes Jan 23, 2023
weinong
weinong previously approved these changes Jan 23, 2023
Copy link
Contributor

@weinong weinong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Anumita Anumita dismissed stale reviews from weinong and krdhruva via 0685e5d January 23, 2023 18:15
Anumita and others added 6 commits January 23, 2023 23:46
Add Checkaccess latency metrics
Fix  checkaccess counters

Signed-off-by: Anumita <ansheno@microsoft.com>
Add metrics for discover resources

Signed-off-by: Anumita <ansheno@microsoft.com>
Signed-off-by: Anumita <ansheno@microsoft.com>
Signed-off-by: Anumita <ansheno@microsoft.com>
Signed-off-by: Anumita <ansheno@microsoft.com>
Signed-off-by: Anumita <ansheno@microsoft.com>
@Anumita Anumita force-pushed the ansheno/fixesandtimeout branch from 0685e5d to 8520d3e Compare January 23, 2023 18:16
@krdhruva krdhruva added the automerge Kodiak will auto merge PRs that have this label label Jan 23, 2023
@kodiakhq kodiakhq bot merged commit 09ab161 into kubeguard:master Jan 23, 2023
@Anumita
Copy link
Contributor Author

Anumita commented Jan 23, 2023

Hey @tamalsaha , could you please release image for guard? We have merged both the PR's Thanks!

@Anumita
Copy link
Contributor Author

Anumita commented Jan 24, 2023

Hey @tamalsaha , could you please release the image today if possible? We want to get this fix out asap, hence the urgency. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
automerge Kodiak will auto merge PRs that have this label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants