feat: add prom metrics for connector->server requests #3200
17bbd0e to d51b3b2
The change seems reasonable to me. I guess someone can use this to solve #2664 by building an alert that checks for the time since the last 200 response code from this metric?
I think this metric is valuable for other purposes, but generally how I would expect to monitor something like #2664 would be with a gauge metric that emits the last time the reconciler ran successfully. If the reconcile loop is failing for any reason (not just calls to the infra API), that's a signal that the role bindings are out of date. I think monitoring the success of the reconcile operation is the most accurate signal. I'd use the gauge metric in a monitor that checks when the value is older than some threshold, and send an alert. Not a blocker, just something to consider.
@dnephin you're right that this does not capture the sync status of the cluster's roles, but that was never the intent. My understanding of #2664 is that it's specifically concerned with the connectivity between the server and the connector, which tracking requests will address. The sync status of cluster roles is out of scope.
d51b3b2 to 65cd5c0
api/client.go
Outdated
if client.ObserveFunc != nil {
	client.ObserveFunc(start, req, resp)
}
Nice! I really like this approach for being able to instrument the api client.
e000c63 to e914921
internal/connector/connector.go
Outdated
ObserveFunc: func(start time.Time, request *http.Request, response *http.Response, err error) {
	errorLabel := ""
	if err != nil {
		errorLabel = err.Error()
Won't this be too high cardinality, since these are effectively infinite error strings?
I was thinking of just a `true` or `false` value for the error label.
Technically yes. My concern with setting just `true` or `false` is that it provides no information as to why it failed. With the status code, there's at least some context. Setting a `true` or `false` error is almost like setting status to `-1` on error instead of a meaningful HTTP code.
That's true. We could attempt to detect the error and reduce it to a few well-known constant values, but I'm not sure it's worth the effort right now. We can always add that later as a separate `errorClass` or some other label.
Generally I would not expect metrics to tell me about the error. If I see the number of errors increasing I would go to logs to figure out more details about the errors.
That's true. It's nice when metrics tell you what the problem is, but it's not necessary. In this case, I'd prefer setting `status` to `-1`, since `status` and `error=true` are (probably) mutually exclusive; adding a separate label seems unnecessary.
e914921 to 34c248c
34c248c to e7b8344
Summary
Observe the requests the connector makes to the server and report them as Prometheus metrics.
Checklist
Related Issues
Resolves #2664