Add a proposal for high level volume metrics #809

gnufied · 2017-07-13T18:05:06Z

This document adds a proposal for gathering metrics
at operation level for volume operations. This ensures
metrics can be captured regardless of individual volume
plugin implementation.

xref kubernetes/enhancements#349

gnufied · 2017-07-13T18:05:18Z

cc @saad-ali @childsb @piosz

gnufied · 2017-07-13T18:05:26Z

/sig storage

gnufied · 2017-07-13T18:05:38Z

@kubernetes/sig-storage-api-reviews

piosz · 2017-07-13T20:01:35Z

cc @kubernetes/sig-instrumentation-misc

brancz

Just a small nit otherwise looks ok at first glance.

brancz · 2017-07-17T09:45:26Z

contributors/design-proposals/volume-metrics.md

+Similarly errors will be named:
+
+```
+storage_volume_attach_errors { plugin = "aws-ebs" }


The suffix should be _errors_total.

fabiand · 2017-07-17T10:52:13Z

What about "pro-active" monitoring, i.e. performing reads on a volume while it is attached, to identify issues at runtime?

Is this interesting in general, and should it be a separate issue?

gnufied · 2017-07-17T13:12:23Z

@fabiand can you elaborate? What sort of metrics are we talking about?

But it does sound like something out of scope for this proposal since this proposal is more about metric collection at controller level.

fabiand · 2017-07-17T14:23:21Z

Ah yes, you can capture the events at the controller level.

I was thinking of active monitoring of attached storage, e.g. general connectivity checks or read/write/seek latency checks while storage is attached, to ensure that the storage is not malfunctioning.

piosz

I skipped the Implementation Detail part, as I'm not familiar with the volume handling code. Please make sure that someone reviews it.

piosz · 2017-07-18T08:37:42Z

contributors/design-proposals/volume-metrics.md

+
+
+The metrics will be emitted using [Prometheus format](https://prometheus.io/docs/instrumenting/exposition_formats/) and available for collection
+from `/metrics` HTTP endpoint of kubelet, controller etc. All Kubernetes core components already emit


Can you please clarify which components will expose those metrics?
Will this be implemented in a common library?

These would be available from both kubelet and controller-manager, depending on the component where a particular operation was performed. This isn't implemented as a common library; it is more like a hook into the place where volume operations are executed.

So please update this in the text.

from /metrics HTTP endpoint of kubelet, controller etc.

suggests that there is more

piosz · 2017-07-18T08:38:50Z

contributors/design-proposals/volume-metrics.md

+Any collector which can parse Prometheus metric format should be able to collect
+metrics from these endpoints.
+
+A more detailed description of monitoring pipeline can be found in [Monitoring architecture] (https://github.com/kubernetes/community/blob/master/contributors/design-proposals/monitoring_architecture.md#monitoring-pipeline) document.


Broken link. Remove space between ] and (.

piosz · 2017-07-18T08:44:29Z

contributors/design-proposals/volume-metrics.md

+emitting these metrics.
+
+We will be using `HistogramVec` type so as we can attach dimensions at runtime. Name of operation will become
+part of metric name and at minimum name of volume plugin will be emitted as a dimension.


Why not have operation as a label as well? @brancz @fabxc wdyt?

minimum name of volume plugin will be emitted as a dimension

Sounds weird to me, do you mean that all metrics will be labeled with plugin? What is the definition of plugin?

looking at the examples, maybe provider would be clearer than plugin?

Why not have operation as a label as well? @brancz @fabxc wdyt?

Because name of operation is already in the metric name.

Also, a volume plugin is typically a third-party service that provides the actual volume services, such as EBS, GCE-PD or Openstack-Cinder. The idea behind labeling the metrics with the plugin name is that a cluster may have more than one volume plugin configured. Using the plugin name as a dimension lets users isolate the operation timings of one plugin from another.

I wasn't questioning including it, rather the naming 🙂 .

@brancz yeah I was answering to @piosz's question above. I think naming the label to provider is fine.

I have renamed plugin label to volume_plugin - hopefully this would make it clearer.

volume_plugin sounds good to me as well

@gnufied I mean having one metric, where the operation is specified through a label (not as part of the name). This would, for example, let you see metrics for all operations more easily (with the current approach you need to sum across multiple metrics).

cc @loburm
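To make the "sum across multiple metrics" point concrete, here is a minimal sketch in plain Go. It does not use the Prometheus client; the `sample` type, the `sumMetric` helper and all counts are made up for illustration. The metric name is the one settled on later in this thread, with the `_count` suffix that Prometheus histograms expose.

```go
package main

import "fmt"

// sample is a simplified stand-in for one scraped series (hypothetical type,
// not part of any Prometheus client library).
type sample struct {
	name   string
	labels map[string]string
	value  float64
}

// sumMetric totals every series of one metric, optionally restricted to
// series whose labels contain all key/value pairs in match.
func sumMetric(samples []sample, name string, match map[string]string) float64 {
	total := 0.0
	for _, s := range samples {
		if s.name != name {
			continue
		}
		ok := true
		for k, v := range match {
			if s.labels[k] != v {
				ok = false
				break
			}
		}
		if ok {
			total += s.value
		}
	}
	return total
}

func main() {
	const metric = "storage_operation_duration_seconds_count"
	// Made-up operation counts, with the operation carried as a label.
	samples := []sample{
		{metric, map[string]string{"volume_plugin": "aws-ebs", "operation_name": "volume_attach"}, 3},
		{metric, map[string]string{"volume_plugin": "aws-ebs", "operation_name": "mount_device"}, 5},
		{metric, map[string]string{"volume_plugin": "cinder", "operation_name": "volume_attach"}, 2},
	}

	// All operations, all plugins: one metric name, no per-operation list.
	fmt.Println(sumMetric(samples, metric, nil)) // 10
	// Narrow down by label when needed.
	fmt.Println(sumMetric(samples, metric, map[string]string{"operation_name": "volume_attach"})) // 5
}
```

With the operation in the metric name instead, the caller would have to know and enumerate every metric name, and update that list whenever a new operation is added.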

saad-ali · 2017-07-18T17:51:06Z

contributors/design-proposals/volume-metrics.md

+## Motivation
+
+Currently we don't have high level metrics that captures time taken
+for operations like volume attach or volume mounting etc.


..and success/failure rates of these operations...

saad-ali · 2017-07-19T15:51:00Z

+@bowei @msau42

saad-ali · 2017-07-19T15:51:44Z

CC @kubernetes/sig-storage-feature-requests @kubernetes/sig-instrumentation-feature-requests

msau42 · 2017-07-19T17:56:10Z

contributors/design-proposals/volume-metrics.md

+
+
+```
+storage_volume_attach_seconds { volume_plugin = "aws-ebs" }


Can we have operation as a parameter instead of in the metric name? That's more similar to what we did for cloud provider metrics.

What advantage are we looking to gain from doing that? The reason we chose the same name in cloudprovider metrics is that we were interested in metrics such as how many cloudprovider API calls kubernetes makes per minute in total. That metric is useful because it gives users a good idea of whether they are within the API quota or not.

Are we looking to do similar aggregation for these metrics? I think not, and hence a different metric name might be better. Is aggregating, say, the volume_attach and mount_device metrics useful in any sense?

It makes it easier to add the metrics to any querying/display system, especially if we want to add more volume operations in the future. Then all the consumers of these metrics don't need to update every time. My understanding of the Prometheus format is that you can filter by the labels, so you could implement a display by iterating through all the ops, instead of hardcoding every op.

Right, but moving the operation name to a label introduces more cardinality to the same metric. It can go either way tbh. There is another advantage of moving the operation name to a dimension: it makes code maintenance a bit easier, since each new metric (for an operation) has to be registered separately, whereas using a label means we need to register just one metric.

By default, if a metric has too many dimensions, some dimensions are elided in dashboards until you apply filters. Let's ask what @brancz and @piosz think on this one.

This is what I suggested in #809 (comment)

As this is rather controlled information, I think this is generally fine to do. Where it's important to look out for these problems is when the label values can be completely arbitrary, let's say if you gave a request an ID, that would make the time-series created explode and naturally put high time-series churn on any tsdb. If I understand correctly the "operation" is controlled by us implementing said operations, so that generally sounds sane to me.
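The cardinality argument above can be made concrete with a rough upper bound: the number of label combinations a metric can produce is at most the product of the distinct values of each label. The sketch below is plain Go, and the label counts (5 plugins, 8 operations, 100000 request IDs) are invented purely for illustration. A histogram additionally exposes several series per combination (buckets, sum, count), so the real series count is a multiple of this bound.

```go
package main

import "fmt"

// labelCombinations returns an upper bound on the number of distinct label
// combinations for one metric: the product of the number of distinct values
// each label can take.
func labelCombinations(distinctValues map[string]int) int {
	bound := 1
	for _, n := range distinctValues {
		bound *= n
	}
	return bound
}

func main() {
	// Controlled labels: both value sets are fixed by the code that emits them.
	controlled := map[string]int{"volume_plugin": 5, "operation_name": 8}
	fmt.Println(labelCombinations(controlled)) // 40

	// Adding an arbitrary-valued label (hypothetical request_id) explodes the bound.
	unbounded := map[string]int{"volume_plugin": 5, "operation_name": 8, "request_id": 100000}
	fmt.Println(labelCombinations(unbounded)) // 4000000
}
```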

ack, I will move the operation name to a label. Wondering what would be a good generic name for the metric itself then; `storage_operation_duration_seconds` is what I am thinking.

storage_operation_duration_seconds sounds good to me

+1 but I'll let the other sig-storage folks decide on the actual naming.

msau42 · 2017-07-19T17:57:53Z

contributors/design-proposals/volume-metrics.md

+
+   ```go
+   GenerateMountVolumeFunc(waitForAttachTimeout time.Duration,
+       volumeToMount VolumeToMount,


Can the plugin name be returned in VolumeToMount instead?

VolumeToMount contains the volume spec, and at that point the plugin name is not yet known. Resolving which plugin will perform mounting, attaching or some other volume operation usually happens inside the GenXXX functions of the operationGenerator module - https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/util/operationexecutor/operation_generator.go#L360

So returning the plugin that was chosen to perform a given operation avoids having to modify all these internal structs.
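As a rough sketch of the idea described above (not the actual operation_generator.go code), a generator function can return the resolved plugin name next to the operation closure, and a wrapper can then time the closure and record the outcome. Every name here (`generateMountVolumeFunc`, `recorder`, `instrument`, the hard-coded `aws-ebs`, and the `volume_mount` operation name) is hypothetical, and the in-memory `recorder` stands in for the `HistogramVec` and error counter the proposal would use.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// observation is what the hook records for one finished operation.
type observation struct {
	plugin    string
	operation string
	duration  time.Duration
	failed    bool
}

// recorder is an in-memory stand-in for the metrics sink (hypothetical).
type recorder struct{ seen []observation }

// generateMountVolumeFunc mimics a GenXXX function: the plugin is resolved
// inside it and returned together with the operation closure.
func generateMountVolumeFunc(volumeName string) (func() error, string) {
	pluginName := "aws-ebs" // hypothetical: real code resolves this from the volume spec
	op := func() error {
		if volumeName == "" {
			return errors.New("empty volume name")
		}
		return nil
	}
	return op, pluginName
}

// instrument wraps op so that its duration and outcome are recorded
// under the given plugin and operation labels.
func (r *recorder) instrument(op func() error, plugin, operation string) func() error {
	return func() error {
		start := time.Now()
		err := op()
		r.seen = append(r.seen, observation{plugin, operation, time.Since(start), err != nil})
		return err
	}
}

func main() {
	r := &recorder{}
	op, plugin := generateMountVolumeFunc("vol-1")
	err := r.instrument(op, plugin, "volume_mount")()
	fmt.Println(err, r.seen[0].plugin, r.seen[0].operation, r.seen[0].failed)
}
```

The caller never needs to inspect VolumeToMount for the plugin; it only passes through whatever name the generator returned.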

gnufied · 2017-07-20T15:16:44Z

@msau42 @brancz @piosz addressed most of the concerns on this document. ptal.

msau42 · 2017-07-20T15:57:03Z

/lgtm

piosz · 2017-07-21T05:28:01Z

/lgtm
please squash commits before merge

piosz · 2017-07-21T05:28:35Z

@bowei @saad-ali do you want to review it?

brancz · 2017-07-21T07:42:33Z

from metric design and Prometheus perspective this looks good to me, but can't comment on the implementation details

wongma7 · 2017-07-24T20:34:13Z

contributors/design-proposals/volume-metrics.md

+storage_operation_duration_seconds { volume_plugin = "iscsi" , operation_name = "volume_unmount" }
+storage_operation_duration_seconds { volume_plugin = "aws-ebs", operation_name = "mount_device" }
+storage_operation_duration_seconds { volume_plugin = "aws-ebs", operation_name = "unmount_device" }
+storage_operation_duration_seconds { volume_plugin = "cinder" , operation_name = "verify_volume" }


There may be some trouble implementing metrics for VerifyVolumesAreAttached using the proposed method, because it is done on a per-node basis, not per-plugin. If the plugin supports bulk verification, then it's fine.

Anyway, we can discuss this offline, i.e. how the verify_volume implementation detail may deviate from the others.

If the plugin name is not available, we can emit the volume_plugin dimension as <n/a> where applicable. Usually people don't mix plugins, so users would know which volume plugin it is, and we will at least have some metrics in that case. I will update the proposal.
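The fallback described above amounts to a small label-normalizing helper. A minimal sketch in Go, where the helper name `pluginLabel` and the constant name are assumptions and only the `<n/a>` value comes from the comment:

```go
package main

import "fmt"

// unknownPlugin is the label value emitted when no plugin name is available.
const unknownPlugin = "<n/a>"

// pluginLabel returns the value to use for the volume_plugin label,
// substituting the placeholder when the plugin name is empty.
func pluginLabel(name string) string {
	if name == "" {
		return unknownPlugin
	}
	return name
}

func main() {
	fmt.Println(pluginLabel("cinder")) // cinder
	fmt.Println(pluginLabel(""))       // <n/a>
}
```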

gnufied · 2017-07-24T20:49:12Z

@saad-ali @bowei please have a look when you get a chance. This design is just waiting for approval from one of #sig-storage members.

bowei · 2017-07-25T06:17:57Z

lgtm


saad-ali · 2017-07-28T01:57:56Z

/lgtm
/approve

@gnufied

Automatic merge from submit-queue

Add volume operation metrics to operation executor and PV controller

This PR implements the proposal for high level volume metrics kubernetes/community#809

**Special notes for your reviewer**:

~Differences from proposal:~ all resolved

~"verify_volume" is now "verify_volumes_are_attached" + "verify_volumes_are_attached_per_node" + "verify_controller_attached_volume." Which of them do we want?~

~There is no "mount_device" metric because the MountVolume operation combines MountDevice and mount (plugin.Setup). Do we want to extract the mount_device metric or is it okay to keep mountvolume as one? For attachable volumes, MountDevice is the actual mount and Setup is a bindmount + setvolumeownership. For unattachable, mountDevice does not occur and Setup is an actual mount + setvolumeownership.~

~PV controller metrics I did not implement following the proposal at all. I did not change goroutinemap nor scheduleOperation. Because provisionClaimOperation does not return an error, so it's impossible for the caller to know if there is actually a failure worth reporting. So I manually create a new metric inside the function according to some conditions.~

@gnufied I have tested the operationexecutor metrics but not provision & delete.

Sample: ![screen shot 2017-08-02 at 15 01 08](https://user-images.githubusercontent.com/13111288/28889980-a7093526-7793-11e7-9aa9-ad7158be76fa.png)

**Release note**:

```release-note
Add error count and time-taken metrics for storage operations such as mount and attach, per-volume-plugin.
```


k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 13, 2017

gnufied mentioned this pull request Jul 13, 2017

Add support for high level volume operation metrics kubernetes/enhancements#349

Closed

piosz assigned brancz, saad-ali and piosz Jul 13, 2017

gnufied changed the title ~~Add a proposal for high level metrics~~ Add a proposal for high level volume metrics Jul 13, 2017

dvonthenen mentioned this pull request Jul 14, 2017

Formalize Plan for Selected Instrumentation Stack thecodeteam/roadmap#175

Closed


brancz reviewed Jul 17, 2017


piosz reviewed Jul 18, 2017


dvonthenen mentioned this pull request Jul 19, 2017

PR: Implement Prometheus/OpenTracing Instrumentation in k8s Storage Shim thecodeteam/roadmap#180

Closed

saad-ali reviewed Jul 19, 2017


saad-ali assigned bowei and msau42 Jul 19, 2017

msau42 reviewed Jul 19, 2017


k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 20, 2017

wongma7 reviewed Jul 24, 2017


Add a proposal for high level metrics

4a80e58


gnufied force-pushed the high-level-volume-metrics branch from c8fd9e8 to 4a80e58 Compare July 25, 2017 20:17

saad-ali merged commit 724d653 into kubernetes:master Jul 28, 2017

wongma7 mentioned this pull request Aug 2, 2017

Add volume operation metrics to operation executor and PV controller kubernetes/kubernetes#50036

Merged

MadhavJivrajani pushed a commit to MadhavJivrajani/community that referenced this pull request Nov 30, 2021

Merge pull request kubernetes#809 from gnufied/high-level-volume-metrics

86cacc2

Add a proposal for high level volume metrics


