add metrics for TemplateInstance controller #16455

jim-minter · 2017-09-20T17:34:06Z

https://trello.com/c/wDxqVOqy/1181-5-prometheus-metrics-for-template-broker-techdebt

jim-minter · 2017-09-20T17:47:49Z

@smarterclayton @bparees there are 2 commits here: the 1st does things approximately like the existing build metrics; however I think the way in the 2nd commit is a better approach for both sets of metrics. Example output for discussion:

TemplateInstanceController_TemplateInstances_active_waiting_time_seconds_bucket{le="1200"} 1
TemplateInstanceController_TemplateInstances_active_waiting_time_seconds_bucket{le="300"} 1
TemplateInstanceController_TemplateInstances_active_waiting_time_seconds_bucket{le="3600"} 1
TemplateInstanceController_TemplateInstances_active_waiting_time_seconds_bucket{le="60"} 0
TemplateInstanceController_TemplateInstances_active_waiting_time_seconds_bucket{le="600"} 1
TemplateInstanceController_TemplateInstances_active_waiting_time_seconds_bucket{le="+Inf"} 1
TemplateInstanceController_TemplateInstances_active_waiting_time_seconds_count 1
TemplateInstanceController_TemplateInstances_active_waiting_time_seconds_sum 74
TemplateInstanceController_TemplateInstances_total{status="False",type="Ready"} 1
TemplateInstanceController_TemplateInstances_total{status="",type=""} 1

Notes:

providing a histogram of durations for objects that are currently waiting allows an end user to rationalise about count, mean and general distribution as a point in time snapshot and over time ranges. I think they are clearer to work with than unix timestamps. Histograms are explicitly recommended in the prometheus docs for handling duration information.
I think using namespace and name as labels is a bad idea:
a) because of cardinality (see https://prometheus.io/docs/practices/naming/ : "Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.")
b) because of information leakage
I suggest reporting on total numbers categorised by condition rather than "phase" here to stick closer to kubernetes best practice. I'm additionally suggesting {status="",type=""} to report on the total number of objects independent of conditions.

jim-minter · 2017-09-20T18:45:42Z

flake #16414
/retest

gabemontero · 2017-09-20T19:32:07Z

Quick note: on the use of namespace/name/ etc. labels with the active build metric, those stemmed from @smarterclayton 's desire to have a "constant metric" where the value was the actual start time in unix domain time. That approach was derived from a similar metric that cadvisor had.

My original approach to those active builds was a histogram similar to what I see in this PR @jim-minter for active/waiting template instances. But through iterating with @smarterclayton this was removed, at least for now. We did talk about revisiting/returning some of the histogram based metrics later on.

Although not an exact apples to apples comparison, it is conceivable that your active/waiting templates instances metric could follow a similar "evolutionary path".

bparees · 2017-09-21T03:20:46Z

1+2) what @gabemontero said. Essentially there was a desire to have a data point representation of each active build. I agree from a cardinality perspective that seems problematic, but that's the direction we got and should probably stick with for templateinstances

well templateinstances don't even have phases, so no argument there. Similarly builds don't have conditions, which is why build metrics are reported by phase, not condition. For objects that have conditions, reporting how many objects are in each terminal condition, as well as how many objects exist in terminal conditions in total, seems reasonable.

i'm also a little confused by the histogram behavior in your example... you've got one data point with a duration of 74s and it seems to be represented in all the buckets except "60s". The only conclusion I can reach is that prometheus counts it in the bucket as long as the value is "less than" the bucket value? Seems strange. ( would have expected it to put it in exactly one bucket, the one with the smallest value that's larger than the datapoint's value, ie 300 in this case)

smarterclayton · 2017-09-21T03:27:33Z

Please follow prometheus conventions regarding labels (i.e. no upper case, match with other resources people have created in terms of general name ordering).

…

On Wed, Sep 20, 2017 at 11:20 PM, Ben Parees ***@***.***> wrote: 1+2) what @gabemontero <https://github.com/gabemontero> said. Essentially there was a desire to have a data point representation of each active build. I agree from a cardinality perspective that seems problematic, but that's the direction we got and should probably stick with for templateinstances 1. well templateinstances don't even have phases, so no argument there. Similarly builds don't have conditions, which is why build metrics are reported by phase, not condition. For objects that have conditions, reporting how many objects are in each terminal condition, as well as how many objects exist in terminal conditions in total, seems reasonable. i'm also a little confused by the histogram behavior in your example... you've got one data point with a duration of 74s and it seems to be represented in all the buckets except "60s". The only conclusion I can reach is that prometheus counts it in the bucket as long as the value is "less than" the bucket value? Seems strange. ( would have expected it to put it in exactly one bucket, the one with the smallest value that's larger than the datapoint's value, ie 300 in this case) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#16455 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABG_p0JKL8nPOJqOVk31G90zpjpWXzSRks5skdYYgaJpZM4PeOAe> .

gabemontero · 2017-09-21T13:27:03Z

pkg/template/controller/metrics.go

+func newTemplateInstancesTotal() *prometheus.GaugeVec {
+	return prometheus.NewGaugeVec(
+		prometheus.GaugeOpts{
+			Name: "TemplateInstanceController_TemplateInstances_total",


I would suggest openshift_template_instance_total

if we rename the controller, I'm supposing that templateinstance_controller_templateinstances_total might work, for example.

gabemontero · 2017-09-21T13:29:37Z

pkg/template/controller/metrics.go

+func newTemplateInstancesWaiting() prometheus.Histogram {
+	return prometheus.NewHistogram(
+		prometheus.HistogramOpts{
+			Name:    "TemplateInstanceController_TemplateInstances_active_waiting_time_seconds",


Whether this remains a histogram or becomes a constant metric per the PR discussion thread most likely would NOT affect the name.

So still offering name suggestions prior to reaching consensus on that point seems OK :-)

I think openshift_template_active_wait_time_seconds would be good.

gabemontero · 2017-09-21T13:34:36Z

pkg/template/controller/metrics.go

+		templateInstancesTotal.WithLabelValues("", "").Inc()
+
+		for _, cond := range templateInstance.Status.Conditions {
+			templateInstancesTotal.WithLabelValues(string(cond.Type), string(cond.Status)).Inc()


Heads up, during the build metrics review, @smarterclayton was big on following the code stylings of kube_state_metrics for actually registering the metrics.

See addCountGauge and addTimeGauge in pkg/build/metrics/prometheus/metrics.go

I'm reluctant to follow this coding approach because it duplicates logic that sits in the prometheus client library. My approach uses the client library logic to populate all the histogram buckets before sending them off. Although it matters less than the Histogram case, func (bc *buildCollector) Collect is basically reimplementing the prometheus client Gauge.Inc() when I don't believe it needs to.

jim-minter · 2017-09-25T16:31:06Z

i'm also a little confused by the histogram behavior in your example... you've got one data point with a duration of 74s and it seems to be represented in all the buckets except "60s". The only conclusion I can reach is that prometheus counts it in the bucket as long as the value is "less than" the bucket value? Seems strange. ( would have expected it to put it in exactly one bucket, the one with the smallest value that's larger than the datapoint's value, ie 300 in this case)

@bparees prometheus stores cumulative histograms.

jim-minter · 2017-09-25T16:48:19Z

Please follow prometheus conventions regarding labels (i.e. no upper case,
match with other resources people have created in terms of general name
ordering).

@smarterclayton I named these TemplateInstanceController_* because automatic controller metrics already exist under this name because that's what the controller name is defined as. Should I rename the controller? To templateinstance_controller or template_instance_controller or template_controller or something else?

jim-minter · 2017-09-25T16:55:25Z

Although TemplateInstanceController isn't the only one: APIServiceRegistrationController, AvailableConditionController, DiscoveryController

smarterclayton · 2017-09-25T17:36:20Z

Those are wrong, and must also be fixed. openshift_template_instance_* is acceptable On Sep 25, 2017, at 12:55 PM, Jim Minter <notifications@github.com> wrote: Although TemplateInstanceController isn't the only one: APIServiceRegistrationController, AvailableConditionController, DiscoveryController — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#16455 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABG_p4Fi-lYvcXBxsUi1B9Nc28c7O7wvks5sl9sBgaJpZM4PeOAe> .

jim-minter · 2017-09-26T15:21:14Z

updated:

openshift_template_instance_active_start_time_seconds{name="a71f7ab8-e448-4826-8f05-32a185222dd7",namespace="demo"} 1.50652884e+09
openshift_template_instance_controller_adds 4
openshift_template_instance_controller_depth 0
openshift_template_instance_controller_queue_latency_count 4
openshift_template_instance_controller_queue_latency{quantile="0.5"} 18
openshift_template_instance_controller_queue_latency{quantile="0.9"} 20
openshift_template_instance_controller_queue_latency{quantile="0.99"} 20
openshift_template_instance_controller_queue_latency_sum 227
openshift_template_instance_controller_retries 2
openshift_template_instance_controller_work_duration_count 4
openshift_template_instance_controller_work_duration{quantile="0.5"} 20250
openshift_template_instance_controller_work_duration{quantile="0.9"} 80237
openshift_template_instance_controller_work_duration{quantile="0.99"} 80237
openshift_template_instance_controller_work_duration_sum 395409
openshift_template_instance_total{status="False",type="Ready"} 1
openshift_template_instance_total{status="",type=""} 1

jim-minter · 2017-09-26T20:51:16Z

/retest

jim-minter · 2017-09-27T16:15:49Z

@smarterclayton @gabemontero @bparees ptal - looking for sign-off on this so that it can merge today

gabemontero · 2017-09-27T17:27:18Z

Stylistically, I like what @jim-minter has done here.

As I previously noted, to some degree it breaks from some patterns that were imposed from our previous build work; but as generally acknowledged, we are very much in an iteration / evaluation cycle. No reason that can't apply to the way the metrics are coded, in addition to the specifics of the metrics.

As to the specifics of the metrics, certainly and apples to apples comparison between templates and builds is not viable. Aside from there intrinsic differences, templates are not something the ops team has been monitoring with zabbix. Existing ops metrics in part have driven the path of build metrics.

With that preamble, similar to the questions @smarterclayton posed on build metrics, I'd be curious @jim-minter how you might envision using the template metrics in online to gauge health of that component.

Perhaps you can update the readme with some example queries, where even if we don't explicitly proclaim it yet, those queries might be of a flavor of something we might run in online to get a sense of the health of the component. For example, I'm assuming the status label will have some indication of success / failure / problems.

Or template instance activation ... how long we expect those to be running ...

Those type of elaborations would help me better review the precise contents of the metric. Of course if all that has been discussed previously and I've missed it, just point me to that or summarize, whatever is easier.

thanks

bparees · 2017-09-27T18:49:04Z

talked with @jim-minter a bit about the merits of the empty string label mechanism for showing the total count.. for now we agreed to leave it, but may want to introduce an explicit total count metric in the future since the empty string label feels a bit hacky.

Agree with @gabemontero that contributions to the readme for some sample queries that use these metrics would be good, that can be done as a follow up though, so i'm going to lgtm this as it currently stands, I think it covers the fundamental metrics we'd be interested in seeing for template instance usage.

/lgtm

openshift-merge-robot · 2017-09-27T18:49:22Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bparees, jim-minter

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

~~pkg/build/OWNERS~~ [bparees]
~~pkg/template/OWNERS~~ [bparees]
~~test/integration/OWNERS~~ [bparees]

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

gabemontero · 2017-09-27T19:03:12Z

On Wed, Sep 27, 2017 at 2:49 PM, Ben Parees ***@***.***> wrote: talked with @jim-minter <https://github.com/jim-minter> a bit about the merits of the empty string label mechanism for showing the total count.. for now we agreed to leave it, but may want to introduce an explicit total count metric in the future since the empty string label feels a bit hacky. Agree with @gabemontero <https://github.com/gabemontero> that contributions to the readme for some sample queries that use these metrics would be good, that can be done as a follow up though, so i'm going to lgtm this as it currently stands, I think it covers the fundamental metrics we'd be interested in seeing for template instance usage. /lgtm

For what it is worth I'm good with this.

…

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#16455 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADbadEnDxbPI1ASktvm4mz8XMG4DCiFHks5smpi6gaJpZM4PeOAe> .

smarterclayton · 2017-09-27T20:24:01Z

Don't use empty string labels. We have summarization in prometheus for that.

smarterclayton · 2017-09-27T20:25:40Z

https://prometheus.io/docs/practices/naming/#metric-names last paragraph alludes to it, but if prometheus sum(metric) captures what you are trying to report here, you don't need it, because sum can do it for you.

bparees · 2017-09-27T20:26:57Z

https://prometheus.io/docs/practices/naming/#metric-names last paragraph alludes to it, but if prometheus sum(metric) captures what you are trying to report here, you don't need it, because sum can do it for you.

sum would double count items that have multiple conditions.

smarterclayton · 2017-09-27T20:31:19Z

Then split the metrics out so they don't as KSM has done

smarterclayton · 2017-09-27T20:31:32Z

When in doubt look at how KSM does it.

openshift-merge-robot · 2017-09-27T20:59:38Z

Automatic merge from submit-queue (batch tested with PRs 16293, 16455)

smarterclayton · 2017-09-27T21:09:06Z

Please fix the issue I mentioned.

bparees · 2017-09-27T21:11:14Z

Then split the metrics out so they don't as KSM has done

a separate metric for each condition?

(and another separate metric for "total"?)

smarterclayton · 2017-09-27T21:23:37Z

You don't need total because you can sum the metrics. But yes, a separate one. I recommend looking at KSM. Or you can look at us-west-1 prometheus to see what it outputs.

…

On Wed, Sep 27, 2017 at 5:11 PM, Ben Parees ***@***.***> wrote: Then split the metrics out so they don't as KSM has done a separate metric for each condition? (and another separate metric for "total"?) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#16455 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABG_pwKzja96vT-XGQlADJkxpEuG6H55ks5smrn0gaJpZM4PeOAe> .

bparees · 2017-09-27T21:26:32Z

You don't need total because you can sum the metrics.

we can't sum the metrics, you'd be double counting items which have two or more conditions associated with them.

smarterclayton · 2017-09-27T23:06:21Z

Sorry, I wasn't clear. I wanted one series "_total" per condition type, with different labels for the readiness state. When I said "like KSM" i meant "they do one metric per condition" but not "one series per template instance"

…

On Wed, Sep 27, 2017 at 5:26 PM, Ben Parees ***@***.***> wrote: You don't need total because you can sum the metrics. we can't sum the metrics, you'd be double counting items which have two or more conditions associated with them. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#16455 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABG_pwvS_A6qgVnwqte2r49b9q2TZR89ks5smr2MgaJpZM4PeOAe> .

bparees · 2017-09-28T00:19:21Z

"they do one metric per condition" but not "one series per template
instance"

and when @smarterclayton and I talked further, we would not do one metric per condition either. We'd do one series per condition. (one metric per condition means dynamically defining new metrics if new conditions are introduced).

so one metric that's just a constant total(summed in the collector).

and one metric that's instanceByCondition with {condition,value} labels, with the value being the count of instances that have that condition/value combination.

smarterclayton · 2017-09-28T04:19:25Z

Yeah, we don't need two metrics because you can sum by condition if necessary.

bparees · 2017-09-28T04:26:05Z

Yeah, we don't need two metrics because you can sum by condition if
necessary.

i feel like we're going in circles. we can't sum by condition because templateinstances can potentially have multiple conditions, so summing by condition is going to double count things.

so we still need two metrics.

smarterclayton · 2017-09-28T14:01:48Z

sum by (condition) YOUR_METRIC is not going to double count things

…

On Thu, Sep 28, 2017 at 12:26 AM, Ben Parees ***@***.***> wrote: Yeah, we don't need two metrics because you can sum by condition if necessary. i feel like we're going in circles. we can't sum by condition because templateinstances can potentially have multiple conditions, so summing by condition is going to double count things. so we still need two metrics. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#16455 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABG_p9PtgQqVHPMCvB2GhZJ5oBio9Am6ks5smx_fgaJpZM4PeOAe> .

bparees · 2017-09-28T14:05:21Z

sum by (condition) YOUR_METRIC is not going to double count things

if i specify an explicit condition, sure... it's also not going to give me the total in the system.

smarterclayton · 2017-09-28T23:32:09Z

Which is a separate metric for total. I was referring to separate metrics per condition, which are unnecessary. Originally I was talking about one metric for total and one metric per condition. On Sep 28, 2017, at 10:05 AM, Ben Parees <notifications@github.com> wrote: sum by (condition) YOUR_METRIC is not going to double count things if i specify an explicit condition, sure... it's also not going to give me the total in the system. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#16455 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABG_p494f3Frw3Nx4EFIPL2JN0P7gucvks5sm6eogaJpZM4PeOAe> .

@smarterclayton

Automatic merge from submit-queue. separate openshift_template_instance_status_condition_total and openshift_template_instance_total metrics follow-up from #16455 @smarterclayton @bparees ptal @gabemontero fyi

openshift-merge-robot assigned soltysh and smarterclayton Sep 20, 2017

openshift-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Sep 20, 2017

jim-minter assigned bparees and unassigned soltysh and smarterclayton Sep 20, 2017

gabemontero suggested changes Sep 21, 2017

View reviewed changes

jim-minter force-pushed the trello138-tsb-prometheus-2 branch from 61798ff to 86e0679 Compare September 26, 2017 15:20

implement prometheus metrics for the TemplateInstance controller

b3451c8

jim-minter force-pushed the trello138-tsb-prometheus-2 branch from 86e0679 to b3451c8 Compare September 26, 2017 18:57

openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 26, 2017

jim-minter mentioned this pull request Sep 27, 2017

prometheus alerts for openshift build subsystem #16495

Merged

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 27, 2017

openshift-merge-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 27, 2017

openshift-merge-robot merged commit 6d590d6 into openshift:master Sep 27, 2017

jim-minter mentioned this pull request Sep 27, 2017

separate openshift_template_instance_status_condition_total and openshift_template_instance_total metrics #16588

Merged

add metrics for TemplateInstance controller #16455

add metrics for TemplateInstance controller #16455

Conversation

jim-minter commented Sep 20, 2017

jim-minter commented Sep 20, 2017

jim-minter commented Sep 20, 2017

gabemontero commented Sep 20, 2017

bparees commented Sep 21, 2017

smarterclayton commented Sep 21, 2017 via email

gabemontero Sep 21, 2017

Choose a reason for hiding this comment

jim-minter Sep 25, 2017 • edited Loading

Choose a reason for hiding this comment

gabemontero Sep 21, 2017

Choose a reason for hiding this comment

gabemontero Sep 21, 2017

Choose a reason for hiding this comment

jim-minter Sep 25, 2017

Choose a reason for hiding this comment

jim-minter commented Sep 25, 2017

jim-minter commented Sep 25, 2017

jim-minter commented Sep 25, 2017

smarterclayton commented Sep 25, 2017 via email

jim-minter commented Sep 26, 2017 • edited Loading

jim-minter commented Sep 26, 2017

jim-minter commented Sep 27, 2017

gabemontero commented Sep 27, 2017

bparees commented Sep 27, 2017

openshift-merge-robot commented Sep 27, 2017

gabemontero commented Sep 27, 2017 via email

smarterclayton commented Sep 27, 2017

smarterclayton commented Sep 27, 2017

bparees commented Sep 27, 2017

smarterclayton commented Sep 27, 2017

smarterclayton commented Sep 27, 2017

openshift-merge-robot commented Sep 27, 2017

smarterclayton commented Sep 27, 2017

bparees commented Sep 27, 2017

smarterclayton commented Sep 27, 2017 via email

bparees commented Sep 27, 2017

smarterclayton commented Sep 27, 2017 via email

bparees commented Sep 28, 2017

smarterclayton commented Sep 28, 2017 via email

bparees commented Sep 28, 2017

smarterclayton commented Sep 28, 2017 via email

bparees commented Sep 28, 2017

smarterclayton commented Sep 28, 2017 via email

jim-minter Sep 25, 2017 •

edited

Loading

jim-minter commented Sep 26, 2017 •

edited

Loading