[metrics] Dependency resolution metrics #1657
Conversation
Nice work, a few nits.
Awesome work @anik120 - a few nits and design suggestions.
res := resolver.NewOperatorsV1alpha1Resolver(lister, crClient, opClient.KubernetesInterface())
success := metrics.RegisterDependencyResolutionSuccess
failure := metrics.RegisterDependencyResolutionFailure
My preference would be to inline these on L144 so that readers don't need to jump around to find out what these parameters are, but that's a nit and I can agree to disagree.
I think Ben's suggestion would make this clearer.
@benluddy @awgreene the reason I chose not to go inline is that the line would then read as
resolver: resolver.NewInstrumentedResolver(resolver.NewOperatorsV1alpha1Resolver(lister, crClient, opClient.KubernetesInterface()), metrics.RegisterDependencyResolutionSuccess, metrics.RegisterDependencyResolutionFailure),
which goes way beyond the 120-character vertical ruler I keep for myself.
I am changing the variable names to operatorsV1alpha1Resolver, successMetricsEmitter, and failureMetricsEmitter so the reader hopefully doesn't have to jump around too much. Let me know if that looks good once I've pushed those changes.
pkg/metrics/metrics.go (Outdated)
@@ -269,3 +286,11 @@ func UpdateSubsSyncCounterStorage(sub *olmv1alpha1.Subscription) {
	counterValues.channel = sub.Spec.Channel
	}
}

func RegisterDependencyResolutionSuccess(time *float64) {
Why is the argument a pointer? Can it be a time.Duration value instead? Converting to float64 and throwing away type information at the measurement site seems premature.
@benluddy I had this argument as just float64 to begin with, but switched over to a pointer when I realized implementing a fake metricEmitter func would be easier if I could modify a value in place.
Re: "Converting to float64 and throwing away type information at the measurement site seems premature." My argument against that is that for this particular function, RegisterDependencyResolutionSuccess, which registers a value for a summary, we will only ever need a float64, because Observe only takes a float64.
The contract of the function is to record a metric corresponding to the duration of a dependency resolution success. Why does the caller of this function need to adapt the input to the Prometheus client? If the underlying metrics reporter were to be swapped out with a client that expects a different value type, this function's signature should not need to change.
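For illustration, a minimal sketch of what the reviewer's suggestion could look like. The help text, variable names, and exact wiring are assumptions made for this sketch, not the PR's final code:

package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Stand-in for the summary discussed in this thread; the help text is illustrative.
var dependencyResolutionSucceeded = prometheus.NewSummary(prometheus.SummaryOpts{
	Name: "dependency_resolution_succeeded",
	Help: "Time taken by successful dependency resolution attempts.",
})

// RegisterDependencyResolutionSuccess accepts a time.Duration, so the caller
// records a duration and the conversion to float64 seconds happens only at
// the Prometheus client boundary.
func RegisterDependencyResolutionSuccess(duration time.Duration) {
	dependencyResolutionSucceeded.Observe(duration.Seconds())
}

A fake emitter for tests can then be a plain closure that appends the observed durations to a slice, so nothing needs to be modified through a pointer.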
pkg/metrics/metrics.go (Outdated)
@@ -170,6 +170,21 @@ var (
	[]string{NAMESPACE_LABEL, NAME_LABEL, VERSION_LABEL, PHASE_LABEL, REASON_LABEL},
)

dependencyResolutionSucceeded = prometheus.NewSummary(
	prometheus.SummaryOpts{
		Name: "dependency_resolution_succeeded",
It would be good to at least include the unit suffix in the name (https://prometheus.io/docs/practices/naming/#metric-names).
We may also want to consider that these metrics will be available on Telemeter - it's somewhat difficult to tell that this is an "OLM metric" from this name alone. We had questions on Slack about whether OLM exposes any operator metrics, because the existing metric names didn't make it clear.
If we were to incorporate both points, olm_dependency_metrics_succeeded_summary is the name that comes to mind. Isn't that a tad long?
I would pick something like olm_resolution_duration_seconds.
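As a rough sketch, the rename could read like this in the declaration shown below (the help string is an assumption for illustration):

dependencyResolutionSucceeded = prometheus.NewSummary(
	prometheus.SummaryOpts{
		Name: "olm_resolution_duration_seconds",
		Help: "The time taken to resolve dependencies, in seconds.",
	},
)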
pkg/metrics/metrics.go (Outdated)
@@ -194,6 +209,8 @@ func RegisterCatalog() {
	prometheus.MustRegister(subscriptionCount)
	prometheus.MustRegister(catalogSourceCount)
	prometheus.MustRegister(SubscriptionSyncCount)
	prometheus.MustRegister(dependencyResolutionSucceeded)
Can success/failure be two values for a single label instead of having two completely separate metrics?
@benluddy if we want two separate percentile brackets for the two scenarios, I think having two separate metrics is cleaner.
On the server side we can register values into the individual metrics cleanly, and on the client side (whatever visualization tool we use) we can query each metric separately without jumping through too many hoops.
Each combination of label values gets its own time series. They're designed to be queried this way.
@benluddy changed it to a SummaryVec, which allows labels.
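A hedged sketch of what the SummaryVec variant might look like; the label name, help text, and emitter signatures here are assumptions rather than the PR's final code:

package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// A single summary with an "outcome" label replaces the two separate
// success/failure metrics; each label value gets its own time series.
var dependencyResolution = prometheus.NewSummaryVec(
	prometheus.SummaryOpts{
		Name: "olm_resolution_duration_seconds",
		Help: "The time taken to resolve dependencies, in seconds.",
	},
	[]string{"outcome"},
)

func RegisterDependencyResolutionSuccess(duration time.Duration) {
	dependencyResolution.WithLabelValues("succeeded").Observe(duration.Seconds())
}

func RegisterDependencyResolutionFailure(duration time.Duration) {
	dependencyResolution.WithLabelValues("failed").Observe(duration.Seconds())
}

On the query side, the two series can then be selected by the outcome label value, which is the point of the reviewer's suggestion.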
/test e2e-aws-console-olm
@benluddy updated the PR, PTAL. Thanks!
/test e2e-gcp-upgrade
Only had super minor nits, nice work.
}

func TestInstrumentedResolver(t *testing.T) {

nit: Please remove the newline
Looking good. I only had a few minor comments.
/approve 👍
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: anik120, awgreene, benluddy.
/retest Please review the full test history for this PR and help us cut down flakes.
7 similar comments
Tagging @openshift/openshift-team-monitoring to review the new metric we plan to expose. If you have any complaints, please let us know and we will update the metric.
@smarterclayton OLM would like to add this metric to Telemeter, can you please review?
Description of the change:
This PR introduces two histogram metrics to help in analysing the time taken by dependency resolution requests.
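As a rough, assumption-based sketch of the instrumentation pattern discussed in this PR (the real OLM resolver interface, constructor, and method signatures differ; all names below are illustrative):

package resolver

import "time"

// Resolver is a stand-in for OLM's resolver interface; the real method
// signature is different, this only illustrates the timing wrapper.
type Resolver interface {
	Resolve() error
}

// InstrumentedResolver wraps a Resolver and reports how long each
// resolution took to a success or failure metrics emitter.
type InstrumentedResolver struct {
	resolver       Resolver
	successEmitter func(time.Duration)
	failureEmitter func(time.Duration)
}

func NewInstrumentedResolver(r Resolver, success, failure func(time.Duration)) *InstrumentedResolver {
	return &InstrumentedResolver{resolver: r, successEmitter: success, failureEmitter: failure}
}

func (i *InstrumentedResolver) Resolve() error {
	start := time.Now()
	err := i.resolver.Resolve()
	if err != nil {
		i.failureEmitter(time.Since(start))
		return err
	}
	i.successEmitter(time.Since(start))
	return nil
}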
Motivation for the change:
Reviewer Checklist
/docs