Implement SBD watchdog and msgwait metrics #174

MalloZup · 2020-09-01T18:59:12Z

todo:

add documentation for metrics

Note:

I have implemented the metrics like this:

ha_cluster_sbd_devices{device="/dev/vdd",status="healthy"} 1
ha_cluster_sbd_devices_watchdog_timeout{device="/dev/vdc"} 60
ha_cluster_sbd_devices_msgwait_timeout{device="/dev/vdd"} 120

I didn't find any useful documentation about the loop and allocate metrics, so I even wondered whether we should expose them. (In fact I haven't exposed them, but if they are meant to be useful, we should definitely document this in the suse-doc.)
See:
https://documentation.suse.com/sle-ha/15-SP2/html/SLE-HA-all/cha-ha-storage-protect.html#sec-ha-storage-protect-watchdog-timings

@diegoakechi cc
@nick-wang cc

I personally think that if we are exposing only 2 metrics, we can create a separate metric for each one, without a label.

If we want to expose 4, then I will refactor accordingly with a type label. I am OK with both directions.

diegoakechi · 2020-09-01T19:27:38Z

@MalloZup How are these kinds of metrics formatted in other cases, or in other exporters? I think using labels is more flexible and elegant, but at the same time it might make things harder for the consumer. Imagine that we use a label and implement only 2 metrics. When a third label value is introduced, we might break dashboards that don't have consistent filters, whereas a new metric would simply be ignored.

With that, my vote is for separate metrics.
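The concern about unfiltered consumers can be made concrete with a small sketch. Everything in it is an assumption for illustration: the `sample` type and both helper functions are hypothetical, and a sum over timeouts is only a stand-in for whatever aggregation a dashboard might run.

```go
package main

import "fmt"

// sample is a hypothetical, simplified stand-in for one scraped series;
// only the "type" label and the value are modelled.
type sample struct {
	timeoutType string
	seconds     int
}

// sumAll mimics a dashboard query that aggregates without a type filter.
func sumAll(samples []sample) int {
	total := 0
	for _, s := range samples {
		total += s.seconds
	}
	return total
}

// sumOfType mimics a query that pins the type label to one value.
func sumOfType(samples []sample, timeoutType string) int {
	total := 0
	for _, s := range samples {
		if s.timeoutType == timeoutType {
			total += s.seconds
		}
	}
	return total
}

func main() {
	before := []sample{{"watchdog", 5}, {"msgwait", 10}}
	// A later release starts exposing a third type value.
	after := append(append([]sample{}, before...), sample{"loop", 1})

	// The unfiltered aggregate shifts once the new type appears.
	fmt.Println(sumAll(before), sumAll(after))
	// The filtered aggregate is unaffected.
	fmt.Println(sumOfType(before, "msgwait"), sumOfType(after, "msgwait"))
}
```

A consumer that always filters on the type label is not affected by new label values; only unfiltered aggregations are.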

stefanotorresi

@MalloZup as I mentioned in the JIRA issue, the way I would go about this is to add a single metric with both a device and a type label, and add all the timeouts.
There is no real downside to also exposing the other two types that you have left out.

The final result should be something like this:

# TYPE ha_cluster_sbd_device_timeout gauge
ha_cluster_sbd_device_timeout{type="watchdog",device="/dev/vdc"} 5
ha_cluster_sbd_device_timeout{type="allocate",device="/dev/vdc"} 2
ha_cluster_sbd_device_timeout{type="loop",device="/dev/vdc"} 1
ha_cluster_sbd_device_timeout{type="msgwait",device="/dev/vdc"} 10
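As a rough illustration of the single-metric layout, the following stdlib-only sketch renders one sample per timeout type. It is not the exporter's implementation: `renderTimeouts` is a hypothetical helper, and `%q` is used as a simple approximation of label-value quoting that is adequate for plain device paths.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// metricName is the name proposed in the comment above.
const metricName = "ha_cluster_sbd_device_timeout"

// renderTimeouts is a hypothetical helper (not part of the exporter). It
// formats one gauge sample per timeout type for a single device.
func renderTimeouts(device string, timeouts map[string]int) string {
	// Sort the type names so the output order is deterministic.
	types := make([]string, 0, len(timeouts))
	for t := range timeouts {
		types = append(types, t)
	}
	sort.Strings(types)

	var b strings.Builder
	fmt.Fprintf(&b, "# TYPE %s gauge\n", metricName)
	for _, t := range types {
		// %q approximates label escaping; fine for simple values like "/dev/vdc".
		fmt.Fprintf(&b, "%s{type=%q,device=%q} %d\n", metricName, t, device, timeouts[t])
	}
	return b.String()
}

func main() {
	fmt.Print(renderTimeouts("/dev/vdc", map[string]int{
		"watchdog": 5, "allocate": 2, "loop": 1, "msgwait": 10,
	}))
}
```

The samples come out sorted by type name, so the order differs from the listing above; the order of samples does not matter to a scraper.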

stefanotorresi · 2020-09-02T11:10:08Z

@diegoakechi for this kind of use case, the best practice is exactly the opposite of using separate metrics. See https://prometheus.io/docs/practices/instrumentation/#use-labels

collector/sbd/sbd.go

diegoakechi · 2020-09-02T11:23:50Z

> @diegoakechi for this kind of use case, the best practice is exactly the opposite of using separate metrics. See https://prometheus.io/docs/practices/instrumentation/#use-labels

Sure, in that case let's go this way. Also, make sure to include all the metrics, so we avoid any filtering problems in the future. Thanks @stefanotorresi

MalloZup · 2020-09-02T12:34:23Z

@stefanotorresi I'm OK with using the type label even for 2 metrics.

ha_cluster_sbd_device_timeout{type="watchdog",device="/dev/vdc"} 1
ha_cluster_sbd_device_timeout{type="msgwait",device="/dev/vdc"} 1

As a follow-up to the discussion with @yan-gao: we don't need to expose the other 2 metrics, since they are more confusing than helpful.
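A minimal sketch of restricting the exposed types to watchdog and msgwait could look like the following. The input line format (`Timeout (watchdog) : 5`) is an assumption made for illustration and is not taken from this thread; `parseExposedTimeouts` and the `exposed` allow-list are hypothetical names, not the collector's actual code.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// timeoutLine matches lines such as "Timeout (watchdog) : 5".
// ASSUMPTION: this line format is illustrative only.
var timeoutLine = regexp.MustCompile(`(?m)^Timeout \((\w+)\)[ \t]*:[ \t]*(\d+)[ \t]*$`)

// exposed lists the only timeout types that should become metrics.
var exposed = map[string]bool{"watchdog": true, "msgwait": true}

// parseExposedTimeouts is a hypothetical helper. It extracts the timeouts
// from the given text and drops every type that is not in the allow-list.
func parseExposedTimeouts(dump string) (map[string]int, error) {
	result := make(map[string]int)
	for _, m := range timeoutLine.FindAllStringSubmatch(dump, -1) {
		if !exposed[m[1]] {
			continue // allocate and loop are deliberately skipped
		}
		v, err := strconv.Atoi(m[2])
		if err != nil {
			return nil, fmt.Errorf("parsing %s timeout: %w", m[1], err)
		}
		result[m[1]] = v
	}
	return result, nil
}

func main() {
	sample := "Timeout (watchdog) : 5\n" +
		"Timeout (allocate) : 2\n" +
		"Timeout (loop)     : 1\n" +
		"Timeout (msgwait)  : 10\n"
	timeouts, err := parseExposedTimeouts(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(timeouts["watchdog"], timeouts["msgwait"], len(timeouts))
}
```

Keeping the allow-list in one place means that exposing another type later is a one-line change, at the cost of the filtering concern raised earlier in the thread.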

thx for reviews. I will rework this during the day.

stefanotorresi · 2020-09-02T14:18:03Z

> As a follow-up to the discussion with @yan-gao: we don't need to expose the other 2 metrics, since they are more confusing than helpful.

okay, sounds good to me!

stefanotorresi

Looks good! I suggest just a few minor tweaks to the labels and names.

collector/sbd/sbd.go

doc/metrics.md

test/sbd.metrics

MalloZup

thx should be ok now

MalloZup added 5 commits September 1, 2020 18:32

First skeletron of metric sbd timeout

35e2ea1

split design in 2 different metrics, add test file

fe8ba26

Implement test and metric watchdog_timeout

bb59640

implement msg_wait metric and tests

c758011

lint

6172624

MalloZup requested a review from stefanotorresi September 1, 2020 18:59

stefanotorresi requested changes Sep 2, 2020

stefanotorresi reviewed Sep 2, 2020

collector/sbd/sbd.go Outdated

Refactor metric to and use label type

b320f5b

MalloZup changed the title ~~[WIP] Implement SBD watchdog and msgwait metrics~~ Implement SBD watchdog and msgwait metrics Sep 2, 2020

MalloZup requested a review from stefanotorresi September 2, 2020 20:53

stefanotorresi requested changes Sep 3, 2020

Add documentation

428cb9f

MalloZup force-pushed the sbd-timeouts branch from 5c0ec06 to 428cb9f Compare September 3, 2020 11:00

MalloZup requested a review from stefanotorresi September 3, 2020 11:00

MalloZup commented Sep 3, 2020

stefanotorresi approved these changes Sep 3, 2020

stefanotorresi merged commit 8912739 into ClusterLabs:master Sep 3, 2020
