feat(metrics) Limit Cardinality of CSV metrics #1099
Conversation
Force-pushed from fba91bb to f8c01b9
Example cases (sample output sketched below):
- When a CSV is updated but not in the succeeded phase
- When a CSV reaches the succeeded phase
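A hedged sketch of what the scrape output for these two cases might look like, using the final metric names and labels from the squashed commit message later in this thread; the CSV, phase, and reason values are invented for illustration:

```
# Case 1: CSV updated but not in the succeeded phase -> reported via csv_abnormal.
csv_abnormal{name="etcdoperator.v0.9.4",version="0.9.4",phase="Pending",reason="RequirementsNotMet"} 1

# Case 2: CSV reaches the succeeded phase -> csv_succeeded reports 1.
csv_succeeded{name="etcdoperator.v0.9.4",version="0.9.4"} 1
```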
@ecordell it should be noted that this data will get messy if multiple instances of the same operator are installed.
Force-pushed from 37278d8 to c2adb1c
/test e2e-gcp-upgrade
Force-pushed from c2adb1c to 571ec70
Force-pushed from 571ec70 to 288425b
/retest
/lgtm
/hold while I spin up a cluster to look at this :)
Sure - let me know if you find anything interesting!
Force-pushed from 288425b to 2a93602
This commit introduces a change that limits the number of metrics that an OLM cluster reports at any given time for a CSV.

The first metric introduced is called csv_up, which tracks CSVs that have reached the succeeded phase. The following information is provided about the CSV via labels: namespace, name, version. The value of this metric will always be 0 or 1.

The second metric introduced is called csv_abnormal, which is reported whenever the CSV is updated and has not reached the succeeded phase. The following information is provided about the CSV via labels: namespace, name, version, phase, reason. Whenever a CSV is updated, the existing timeseries is deleted and replaced by an updated version.
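For readers skimming the thread, here is a minimal, non-authoritative sketch of how gauges shaped like this could be wired up with Prometheus client_golang. The `csvState` struct and the `emitCSVMetric` helper are assumptions for illustration, not the PR's actual code:

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
)

// csvState is a simplified stand-in (an assumption for this sketch) for the
// ClusterServiceVersion fields the metrics care about.
type csvState struct {
	Namespace, Name, Version, Phase, Reason string
}

var (
	// csv_up: 1 while the CSV is in the Succeeded phase, 0 otherwise.
	csvUp = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "csv_up",
			Help: "1 if the CSV is in the Succeeded phase, 0 otherwise.",
		},
		[]string{"namespace", "name", "version"},
	)
	// csv_abnormal: emitted while the CSV sits in any non-succeeded phase.
	csvAbnormal = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "csv_abnormal",
			Help: "CSVs that have not reached the Succeeded phase.",
		},
		[]string{"namespace", "name", "version", "phase", "reason"},
	)
)

func init() {
	prometheus.MustRegister(csvUp, csvAbnormal)
}

// emitCSVMetric (a hypothetical helper name) deletes the timeseries recorded
// for the CSV's previous state and records one for its current state.
func emitCSVMetric(old, cur csvState) {
	// Drop the series that described the previous state of this CSV.
	csvAbnormal.DeleteLabelValues(old.Namespace, old.Name, old.Version, old.Phase, old.Reason)

	if cur.Phase == "Succeeded" {
		csvUp.WithLabelValues(cur.Namespace, cur.Name, cur.Version).Set(1)
		return
	}
	csvUp.WithLabelValues(cur.Namespace, cur.Name, cur.Version).Set(0)
	csvAbnormal.WithLabelValues(cur.Namespace, cur.Name, cur.Version, cur.Phase, cur.Reason).Set(1)
}
```

Deleting the old series before writing the new one is what bounds cardinality: each CSV contributes at most one csv_abnormal series at a time, no matter how many phase transitions it goes through.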
Force-pushed from 2a93602 to 5884308
@ecordell Maybe this is a silly question, but what value do we get from seeing every step here rather than just seeing "it's in a good state" or "it's not in a good state"? Is there some unique knowledge we would gain from seeing all of those steps? I was thinking that we care more about "what step are we stuck in" in the same way that I can learn that by [...]. My concern is that being aware of the state of these things and deleting them later seems to add a medium amount of complexity to managing the metric endpoint's state for the value that it adds.
That's a good point, and I do think it would be reasonable to merge this as-is and accept that not all states will necessarily get picked up by Prometheus. It would be nice to see the full lifecycle of operators reported if possible though. Perhaps we do that as a follow up.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: awgreene, ecordell

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
/retest
/retest |
/retest |
This commit introduces a change that limits the number of metrics that
an OLM cluster reports at any given time for a CSV.
The first metric introduced is called csv_succeeded, which tracks CSVs that
have reached the succeeded phase. The following information is
provided about the CSV via labels: name, version. The value of this
metric will always be 0 or 1.
The second metric introduced is called csv_abnormal, which is reported
whenever the CSV is updated and has not reached the succeeded phase. The
following information is provided about the CSV via labels: name,
version, phase, reason. Whenever a CSV is updated, the existing
timeseries is deleted and replaced by an updated version.
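As a usage sketch (not part of the PR itself), metrics shaped like this lend themselves to simple PromQL checks; the label values below are invented for illustration:

```
# All operators currently reporting a non-succeeded state.
csv_abnormal == 1

# Alert-style expression: a specific CSV has entered the Failed phase.
csv_abnormal{phase="Failed"} == 1

# CSVs known to the cluster that are not up.
csv_succeeded == 0
```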