add metrics for TemplateInstance controller #16455

Merged

Conversation

jim-minter
Contributor

@openshift-ci-robot openshift-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Sep 20, 2017
@jim-minter
Contributor Author

@smarterclayton @bparees there are two commits here: the first does things approximately like the existing build metrics; however, I think the approach in the second commit is better for both sets of metrics. Example output for discussion:

TemplateInstanceController_TemplateInstances_active_waiting_time_seconds_bucket{le="60"} 0
TemplateInstanceController_TemplateInstances_active_waiting_time_seconds_bucket{le="300"} 1
TemplateInstanceController_TemplateInstances_active_waiting_time_seconds_bucket{le="600"} 1
TemplateInstanceController_TemplateInstances_active_waiting_time_seconds_bucket{le="1200"} 1
TemplateInstanceController_TemplateInstances_active_waiting_time_seconds_bucket{le="3600"} 1
TemplateInstanceController_TemplateInstances_active_waiting_time_seconds_bucket{le="+Inf"} 1
TemplateInstanceController_TemplateInstances_active_waiting_time_seconds_count 1
TemplateInstanceController_TemplateInstances_active_waiting_time_seconds_sum 74
TemplateInstanceController_TemplateInstances_total{status="False",type="Ready"} 1
TemplateInstanceController_TemplateInstances_total{status="",type=""} 1

Notes:

  1. providing a histogram of durations for objects that are currently waiting allows an end user to reason about count, mean and general distribution, both as a point-in-time snapshot and over time ranges (see the sketch after this list). I think histograms are clearer to work with than unix timestamps, and they are explicitly recommended in the Prometheus docs for handling duration information.
  2. I think using namespace and name as labels is a bad idea:
    a) because of cardinality (see https://prometheus.io/docs/practices/naming/ : "Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.")
    b) because of information leakage
  3. I suggest reporting on total numbers categorised by condition rather than "phase" here, to stick closer to Kubernetes best practice. I'm additionally suggesting {status="",type=""} to report the total number of objects independent of conditions.
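
For concreteness, a minimal sketch of the two metric constructors described above (the names match the snippets reviewed further down in this thread; the Help strings are illustrative, and the bucket bounds are taken from the example output):

func newTemplateInstancesTotal() *prometheus.GaugeVec {
    return prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "TemplateInstanceController_TemplateInstances_total",
            Help: "Counts TemplateInstances by condition type and status.",
        },
        []string{"type", "status"},
    )
}

func newTemplateInstancesWaiting() prometheus.Histogram {
    return prometheus.NewHistogram(
        prometheus.HistogramOpts{
            Name:    "TemplateInstanceController_TemplateInstances_active_waiting_time_seconds",
            Help:    "Waiting time, in seconds, of TemplateInstances that are currently active.",
            Buckets: []float64{60, 300, 600, 1200, 3600},
        },
    )
}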

@jim-minter jim-minter assigned bparees and unassigned soltysh and smarterclayton Sep 20, 2017
@jim-minter
Contributor Author

flake #16414
/retest

@gabemontero
Contributor

Quick note: on the use of namespace/name etc. labels with the active build metric, those stemmed from @smarterclayton's desire to have a "constant metric" whose value was the actual start time as a unix timestamp. That approach was derived from a similar metric that cadvisor had.

My original approach to those active builds was a histogram similar to what I see in this PR @jim-minter for active/waiting template instances. But through iterating with @smarterclayton this was removed, at least for now. We did talk about revisiting/restoring some of the histogram-based metrics later on.

Although not an exact apples-to-apples comparison, it is conceivable that your active/waiting template instances metric could follow a similar "evolutionary path".
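
For context, a rough sketch of that "constant metric" shape, where each active object is exported as its own series whose value is the start time. The metric name matches the one that appears in the updated output further down; the collector plumbing and the list() helper are assumptions, not the actual build or template code:

package metrics

import "github.com/prometheus/client_golang/prometheus"

var activeStartTimeDesc = prometheus.NewDesc(
    "openshift_template_instance_active_start_time_seconds",
    "Start time of active TemplateInstances, as a unix timestamp.",
    []string{"namespace", "name"},
    nil,
)

type active struct {
    namespace, name string
    startTime       float64 // unix seconds
}

type activeCollector struct {
    list func() []active // in real code this would be backed by an informer/lister
}

func (c *activeCollector) Describe(ch chan<- *prometheus.Desc) {
    ch <- activeStartTimeDesc
}

func (c *activeCollector) Collect(ch chan<- prometheus.Metric) {
    for _, a := range c.list() {
        // One constant gauge series per active object; the value is its start time.
        ch <- prometheus.MustNewConstMetric(
            activeStartTimeDesc, prometheus.GaugeValue, a.startTime, a.namespace, a.name,
        )
    }
}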

@bparees
Contributor

bparees commented Sep 21, 2017

1+2) What @gabemontero said. Essentially there was a desire to have a data-point representation of each active build. I agree that from a cardinality perspective that seems problematic, but that's the direction we were given and should probably stick with for templateinstances.

  3) Well, templateinstances don't even have phases, so no argument there. Similarly, builds don't have conditions, which is why build metrics are reported by phase, not condition. For objects that have conditions, reporting how many objects are in each terminal condition, as well as how many objects exist in terminal conditions in total, seems reasonable.

I'm also a little confused by the histogram behavior in your example... you've got one data point with a duration of 74s and it seems to be represented in all the buckets except "60s". The only conclusion I can reach is that Prometheus counts it in a bucket as long as the value is less than the bucket value? Seems strange. (I would have expected it to put it in exactly one bucket, the one with the smallest value that's larger than the data point's value, i.e. 300 in this case.)

@smarterclayton
Contributor

smarterclayton commented Sep 21, 2017 via email

func newTemplateInstancesTotal() *prometheus.GaugeVec {
    return prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "TemplateInstanceController_TemplateInstances_total",
Contributor

I would suggest openshift_template_instance_total

Contributor Author

@jim-minter jim-minter Sep 25, 2017

If we rename the controller, I suppose templateinstance_controller_templateinstances_total might work, for example.

func newTemplateInstancesWaiting() prometheus.Histogram {
    return prometheus.NewHistogram(
        prometheus.HistogramOpts{
            Name: "TemplateInstanceController_TemplateInstances_active_waiting_time_seconds",
Contributor

Whether this remains a histogram or becomes a constant metric per the PR discussion thread most likely would NOT affect the name.

So still offering name suggestions prior to reaching consensus on that point seems OK :-)

I think openshift_template_active_wait_time_seconds would be good.

templateInstancesTotal.WithLabelValues("", "").Inc()

for _, cond := range templateInstance.Status.Conditions {
    templateInstancesTotal.WithLabelValues(string(cond.Type), string(cond.Status)).Inc()
Contributor

Heads up: during the build metrics review, @smarterclayton was big on following the coding style of kube_state_metrics for actually registering the metrics.

See addCountGauge and addTimeGauge in pkg/build/metrics/prometheus/metrics.go

Contributor Author

I'm reluctant to follow this coding approach because it duplicates logic that sits in the prometheus client library. My approach uses the client library's logic to populate all the histogram buckets before sending them off. Although it matters less than in the Histogram case, func (bc *buildCollector) Collect is basically reimplementing the prometheus client's Gauge.Inc() when I don't believe it needs to.
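
To illustrate the point, a sketch (not the PR's actual code) of a collector that lets the client library's Histogram do the bucket bookkeeping on each scrape; the waitingDurations helper is an assumption standing in for however the controller enumerates currently-active objects:

package metrics

import "github.com/prometheus/client_golang/prometheus"

func waitingHistogramOpts() prometheus.HistogramOpts {
    return prometheus.HistogramOpts{
        Name:    "TemplateInstanceController_TemplateInstances_active_waiting_time_seconds",
        Help:    "Waiting time, in seconds, of TemplateInstances that are currently active.",
        Buckets: []float64{60, 300, 600, 1200, 3600},
    }
}

type templateInstanceCollector struct {
    // waitingDurations returns the current waiting time, in seconds, of each
    // active TemplateInstance (hypothetical; e.g. derived from a lister).
    waitingDurations func() []float64
}

func (c *templateInstanceCollector) Describe(ch chan<- *prometheus.Desc) {
    prometheus.NewHistogram(waitingHistogramOpts()).Describe(ch)
}

func (c *templateInstanceCollector) Collect(ch chan<- prometheus.Metric) {
    h := prometheus.NewHistogram(waitingHistogramOpts())
    for _, d := range c.waitingDurations() {
        h.Observe(d) // the library maintains the cumulative buckets, _sum and _count
    }
    h.Collect(ch) // hand the fully populated snapshot to the scraper
}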

@jim-minter
Contributor Author

I'm also a little confused by the histogram behavior in your example... you've got one data point with a duration of 74s and it seems to be represented in all the buckets except "60s". The only conclusion I can reach is that Prometheus counts it in a bucket as long as the value is less than the bucket value? Seems strange. (I would have expected it to put it in exactly one bucket, the one with the smallest value that's larger than the data point's value, i.e. 300 in this case.)

@bparees Prometheus histograms are cumulative: each le bucket counts every observation less than or equal to its bound, which is why the single 74s observation shows up in every bucket except le="60".
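
A toy example (hypothetical, purely to illustrate the cumulative-bucket semantics behind the output quoted above):

package main

import (
    "fmt"

    "github.com/prometheus/client_golang/prometheus"
    dto "github.com/prometheus/client_model/go"
)

func main() {
    h := prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "waiting_time_seconds",
        Help:    "toy example",
        Buckets: []float64{60, 300, 600, 1200, 3600},
    })
    h.Observe(74) // the single 74s data point from the example output

    var m dto.Metric
    _ = h.Write(&m) // snapshot the histogram's current state
    for _, b := range m.Histogram.Bucket {
        // le="60" reports 0 and every later bucket reports 1, because each
        // bucket counts every observation <= its upper bound.
        fmt.Printf("le=%v count=%v\n", b.GetUpperBound(), b.GetCumulativeCount())
    }
}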

@jim-minter
Contributor Author

Please follow prometheus conventions regarding labels (i.e. no upper case,
match with other resources people have created in terms of general name
ordering).

@smarterclayton I named these TemplateInstanceController_* because automatic controller metrics already exist under this name, since that's what the controller name is defined as. Should I rename the controller? To templateinstance_controller or template_instance_controller or template_controller or something else?

@jim-minter
Contributor Author

Although TemplateInstanceController isn't the only one: APIServiceRegistrationController, AvailableConditionController, DiscoveryController

@smarterclayton
Contributor

smarterclayton commented Sep 25, 2017 via email

@jim-minter
Contributor Author

jim-minter commented Sep 26, 2017

updated:

openshift_template_instance_active_start_time_seconds{name="a71f7ab8-e448-4826-8f05-32a185222dd7",namespace="demo"} 1.50652884e+09
openshift_template_instance_controller_adds 4
openshift_template_instance_controller_depth 0
openshift_template_instance_controller_queue_latency_count 4
openshift_template_instance_controller_queue_latency{quantile="0.5"} 18
openshift_template_instance_controller_queue_latency{quantile="0.9"} 20
openshift_template_instance_controller_queue_latency{quantile="0.99"} 20
openshift_template_instance_controller_queue_latency_sum 227
openshift_template_instance_controller_retries 2
openshift_template_instance_controller_work_duration_count 4
openshift_template_instance_controller_work_duration{quantile="0.5"} 20250
openshift_template_instance_controller_work_duration{quantile="0.9"} 80237
openshift_template_instance_controller_work_duration{quantile="0.99"} 80237
openshift_template_instance_controller_work_duration_sum 395409
openshift_template_instance_total{status="False",type="Ready"} 1
openshift_template_instance_total{status="",type=""} 1

@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 26, 2017
@jim-minter
Contributor Author

/retest

@jim-minter
Contributor Author

@smarterclayton @gabemontero @bparees ptal - looking for sign-off on this so that it can merge today

@gabemontero
Contributor

Stylistically, I like what @jim-minter has done here.

As I previously noted, to some degree it breaks from some patterns that were imposed by our previous build work; but as generally acknowledged, we are very much in an iteration / evaluation cycle. No reason that can't apply to the way the metrics are coded, in addition to the specifics of the metrics.

As to the specifics of the metrics, certainly an apples-to-apples comparison between templates and builds is not viable. Aside from their intrinsic differences, templates are not something the ops team has been monitoring with Zabbix. Existing ops metrics have in part driven the path of build metrics.

With that preamble, and similar to the questions @smarterclayton posed on build metrics, I'd be curious @jim-minter how you might envision using the template metrics in Online to gauge the health of that component.

Perhaps you can update the README with some example queries, where even if we don't explicitly proclaim it yet, those queries might be of the flavor of something we might run in Online to get a sense of the health of the component. For example, I'm assuming the status label will have some indication of success / failure / problems.

Or template instance activation ... how long we expect those to be running ...

Those types of elaborations would help me better review the precise contents of the metric. Of course, if all that has been discussed previously and I've missed it, just point me to that or summarize, whichever is easier.

thanks

@bparees
Contributor

bparees commented Sep 27, 2017

Talked with @jim-minter a bit about the merits of the empty-string label mechanism for showing the total count. For now we agreed to leave it, but we may want to introduce an explicit total count metric in the future since the empty-string label feels a bit hacky.

Agree with @gabemontero that contributions to the README with some sample queries that use these metrics would be good; that can be done as a follow-up though, so I'm going to lgtm this as it currently stands. I think it covers the fundamental metrics we'd be interested in seeing for template instance usage.

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 27, 2017
@openshift-merge-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bparees, jim-minter

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 27, 2017
@gabemontero
Contributor

gabemontero commented Sep 27, 2017 via email

@smarterclayton
Contributor

Don't use empty-string labels. We have summarization in Prometheus for that.

@smarterclayton
Contributor

The last paragraph of https://prometheus.io/docs/practices/naming/#metric-names alludes to it, but if a Prometheus sum(metric) captures what you are trying to report here, you don't need it, because sum can do it for you.

@bparees
Contributor

bparees commented Sep 27, 2017

The last paragraph of https://prometheus.io/docs/practices/naming/#metric-names alludes to it, but if a Prometheus sum(metric) captures what you are trying to report here, you don't need it, because sum can do it for you.

sum would double-count items that have multiple conditions (an instance with two conditions set would contribute twice).

@smarterclayton
Contributor

Then split the metrics out so they don't, as KSM (kube-state-metrics) has done.

@smarterclayton
Contributor

When in doubt, look at how KSM does it.

@openshift-merge-robot
Contributor

Automatic merge from submit-queue (batch tested with PRs 16293, 16455)

@openshift-merge-robot openshift-merge-robot merged commit 6d590d6 into openshift:master Sep 27, 2017
@smarterclayton
Contributor

Please fix the issue I mentioned.

@bparees
Contributor

bparees commented Sep 27, 2017

Then split the metrics out so they don't as KSM has done

a separate metric for each condition?

(and another separate metric for "total"?)

@smarterclayton
Contributor

smarterclayton commented Sep 27, 2017 via email

@bparees
Contributor

bparees commented Sep 27, 2017

You don't need total because you can sum the metrics.

We can't sum the metrics; you'd be double-counting items which have two or more conditions associated with them.

@smarterclayton
Contributor

smarterclayton commented Sep 27, 2017 via email

@bparees
Contributor

bparees commented Sep 28, 2017

"they do one metric per condition" but not "one series per template
instance"

And when @smarterclayton and I talked further, we agreed we would not do one metric per condition either. We'd do one series per condition (one metric per condition would mean dynamically defining new metrics if new conditions are introduced).

So: one metric that's just a constant total (summed in the collector).

And one metric that's instanceByCondition with {condition,value} labels, where the metric's value is the count of instances that have that condition/value combination.
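
A sketch of how that split could look (the metric names are the ones used by the follow-up PR referenced at the bottom of this thread; the label names and help text are illustrative):

package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
    // A plain total, summed in the collector, so it never double-counts
    // instances that carry more than one condition.
    templateInstancesTotal = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "openshift_template_instance_total",
        Help: "Counts TemplateInstance objects.",
    })

    // One series per (condition, value) combination; an instance with two
    // conditions contributes to two series here, which is fine because this
    // metric is never summed to produce the overall total.
    templateInstancesByCondition = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "openshift_template_instance_status_condition_total",
            Help: "Counts TemplateInstance objects by condition type and status.",
        },
        []string{"condition", "value"},
    )
)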

@smarterclayton
Contributor

smarterclayton commented Sep 28, 2017 via email

@bparees
Contributor

bparees commented Sep 28, 2017

Yeah, we don't need two metrics because you can sum by condition if necessary.

I feel like we're going in circles. We can't sum by condition because templateinstances can potentially have multiple conditions, so summing by condition is going to double-count things.

So we still need two metrics.

@smarterclayton
Contributor

smarterclayton commented Sep 28, 2017 via email

@bparees
Contributor

bparees commented Sep 28, 2017

sum by (condition) YOUR_METRIC is not going to double count things

If I specify an explicit condition, sure... but it's also not going to give me the total in the system.

@smarterclayton
Contributor

smarterclayton commented Sep 28, 2017 via email

openshift-merge-robot added a commit that referenced this pull request Oct 4, 2017
Automatic merge from submit-queue.

separate openshift_template_instance_status_condition_total and openshift_template_instance_total metrics

follow-up from #16455

@smarterclayton @bparees ptal
@gabemontero fyi