Add Linux HA ClusterLabs exporterr to prometheus doc #1540
Thanks for your PR. What exactly is the name of the piece of software that this exporter is for? It seems a bit odd to mention "HA" or a company name in here, unless it's actually required to disambiguate the name. This also looks like it's actually several exporters in one, which is not a pattern we encourage. There's possibly some overlap with the node exporter too, on DRBD at least. Looking through the code, I'm not sure that the split of counters and gauges is correct, and all counters (and only counters) should end in _total. Similarly, fail_count sounds like a gauge rather than part of a summary/histogram, so it shouldn't end in _count.
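The suffix rules in the comment above can be sketched as a small check. This is only an illustration of the stated convention (counters, and only counters, end in `_total`; a gauge should not end in a suffix used by summaries/histograms); the function name and the metric names are hypothetical and are not taken from the exporter's code.

```python
from typing import List

# Suffixes that summaries/histograms use for their series.
RESERVED_SUFFIXES = ("_count", "_sum", "_bucket")


def check_metric_name(name: str, metric_type: str) -> List[str]:
    """Return a list of naming problems for one metric (empty if none)."""
    problems = []
    if metric_type == "counter":
        if not name.endswith("_total"):
            problems.append(f"counter {name!r} should end in _total")
    elif name.endswith("_total"):
        problems.append(f"{metric_type} {name!r} should not end in _total")
    if metric_type == "gauge" and name.endswith(RESERVED_SUFFIXES):
        problems.append(
            f"gauge {name!r} ends in a suffix used by summaries/histograms"
        )
    return problems


if __name__ == "__main__":
    # Hypothetical metric names, for illustration only.
    samples = [
        ("example_resource_fail_count", "gauge"),
        ("example_ring_errors", "counter"),
        ("example_ring_errors_total", "counter"),
    ]
    for name, kind in samples:
        print(name, kind, check_metric_name(name, kind) or "ok")
```

A check like this could run in a unit test over an exporter's metric descriptors; it deliberately covers only the rules mentioned in this thread.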
@brian-brazil thanks for taking the time to vet this project! I'm co-maintainer of it together with @MalloZup. The word "HA" refers to what historically was the Linux HA project (yep, not a great name), all the components of which are now maintained under the ClusterLabs org. The DRBD collector is actually mutually exclusive with the one provided by node_exporter, because that one only supports the DRBD version bundled with the Linux kernel (8.x), while ours only supports the version provided within SUSE (9.x). In hindsight, I realise that we could have named it something different, but you know how it goes with naming things in software... 😉 That said, we'll look into renaming the counters and gauges according to your indications, thank you very much! We looked at https://prometheus.io/docs/practices/naming/ but it wasn't immediately understandable that
Thanks @brian-brazil and @stefanotorresi for the answers. JFYI, IMHO it would disambiguate the name, because "HA" on its own is too ambiguous.
The question here on the naming is what would an existing user of this software look for if they were on this page? If a piece of software has gone through several names and is still commonly known by some of them, we may end up including a few of them. At first glance, HA doesn't seem useful here, unless Linux is in front of it.
That sounds like there may be an acceptable exception in this case, which I think would be the first time that would have happened. Would every machine run all of them? Do they expose per-machine data, cluster data, or a mix? I note that DRBD is not in that list however, and those are machine metrics which would almost certainly not fall into that.
In such cases it'd be better to look at adding support to the node exporter, rather than creating a brand new exporter every time there's a new protocol version.
That's a very fair point. I guess that would be "pacemaker" or "clusterlabs", which is why we added these words as the text of the link. Would something like "Linux HA ClusterLabs exporter" be more satisfactory to you?
Yes, a mix. 😉
You're right that DRBD is not part of the ClusterLabs stack, but it's very commonly associated with it. In fact, I will cite the official docs saying "Using DRBD in conjunction with the Pacemaker cluster stack is arguably DRBD’s most frequently found use case". I hope you understand that we're doing everything in our power to find a compromise, but there is more to it than you might think!
I'm concerned about what we write on this page so users can find your exporter within it. What your git repo is called is not for me to determine. I'm here to provide a curated list of exporters while improving their quality in passing, not to make busywork.
That's a potential issue, as semantically cluster-level views should not be mixed with machine/process-level views of stuff (but each process providing its view as to what the cluster state currently is is okay) and instead should be separate exporters, or separate endpoints on the same exporter. Put another way, don't mix metrics that represent the current official quorum with metrics that are not from quorum. For example, for Kubernetes you have the /metrics for the kubelet process, the cadvisor metrics exposed by the kubelet, the /metrics for the API process, and then kube-state-metrics for the user-facing data Kubernetes stores. This is a bit of uncharted territory here. It's not at all required that exporters listed here perfectly meet the guidelines, however we do require that they generally make sense in the first place (e.g. not something that's simply not going to work out cardinality-wise outside of very niche circumstances).
Only systemd really (and there's talk of moving that out), the rest is all Unix kernels.
This case is kinda weird. Tradeoffs are a fundamental part of designing an exporter, however "uberexporters" are a particularly bad smell when it comes to exporters. Overlapping with an existing exporter is also something I try to avoid. What I'd hope is that there'd at least be a good faith attempt to move the DRBD 9 support into the node exporter or otherwise clean this situation up at some undefined point in the future.
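The "separate endpoints on the same exporter" option mentioned above could look roughly like the following. This is a minimal sketch using only the Python standard library; the paths `/metrics` and `/cluster-metrics`, the metric names, and the sample values are all hypothetical, and it is not how the exporter in this PR is implemented.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Optional


def node_metrics() -> str:
    """Machine/process-level view (hypothetical metric and value)."""
    return (
        "# TYPE example_node_ring_errors_total counter\n"
        "example_node_ring_errors_total 0\n"
    )


def cluster_view_metrics() -> str:
    """This node's current view of cluster state (hypothetical metric)."""
    return (
        "# TYPE example_cluster_nodes gauge\n"
        "example_cluster_nodes 3\n"
    )


# Each kind of view gets its own path so they are scraped separately.
ROUTES: Dict[str, Callable[[], str]] = {
    "/metrics": node_metrics,
    "/cluster-metrics": cluster_view_metrics,
}


def render(path: str) -> Optional[str]:
    """Return the exposition text for a path, or None if unknown."""
    handler = ROUTES.get(path)
    return None if handler is None else handler()


class Handler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        body = render(self.path)
        if body is None:
            self.send_error(404)
            return
        data = body.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def make_server(port: int) -> HTTPServer:
    """Build (but do not start) an HTTP server on localhost."""
    return HTTPServer(("127.0.0.1", port), Handler)
```

Calling `make_server(9999).serve_forever()` would start serving both paths (the port is arbitrary); each path can then be configured as its own scrape target.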
Signed-off-by: dmaiocchi <dmaiocchi@suse.com>
Apologies, I understood that you didn't approve of the actual name of the project (and you would be right in doing so 😀). Well, then, we have settled on "Linux HA ClusterLabs exporter" as the link text. That should work.
That's exactly what's happening: the Pacemaker state is always cluster-wide.
Absolutely. We certainly don't want the project to grow out of proportion.
That's not quite right then, as all the non-Pacemaker state is not cluster-wide. What happens if one of the machines lags behind and doesn't have the latest state? Are you always talking to a leader of some form, or getting each machine's current view?
Well, the cluster state is cluster-wide; all the rest is only relevant to the individual nodes.
There is no such concept of "leader" in this stack.
By the way, we have also given thought to the fact that we export the same metrics from multiple nodes, but the RabbitMQ folks have faced the same issue and we are evaluating the same approach.
Not quite. Are you saying it's impossible for a node to have stale data, even briefly or via configuration/operator error? That seems a bit implausible to me. We'd usually have each process expose what it currently knows, as if it differs across the processes that's interesting. Leaderness (if any) would be a boolean gauge, that you could join with in PromQL.
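To illustrate the "boolean gauge that you join with" idea, here is a small Python sketch that mimics the join outside of Prometheus. Every instance reports both its own view and a 0/1 leader gauge; the consumer then selects the leader's view. The instance names and values are hypothetical. In PromQL, a roughly analogous selection would be `example_value and on (instance) (example_is_leader == 1)`, with both metric names being placeholders.

```python
from typing import Dict


def leader_view(
    is_leader: Dict[str, int], values: Dict[str, float]
) -> Dict[str, float]:
    """Keep only values reported by instances whose leader gauge is 1."""
    return {
        instance: value
        for instance, value in values.items()
        if is_leader.get(instance, 0) == 1
    }


if __name__ == "__main__":
    # Hypothetical scrape results, keyed by instance label.
    is_leader = {"node1:9999": 0, "node2:9999": 1, "node3:9999": 0}
    values = {"node1:9999": 3.0, "node2:9999": 3.0, "node3:9999": 2.0}
    print(leader_view(is_leader, values))
```

Because every instance still exposes what it currently knows, disagreements between instances remain visible, and the leader gauge is only used at query time.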
That approach would not be good practice. Firstly, the right way to do that would be to have a singleton target to scrape for this information, as
Hi @brian-brazil and @stefanotorresi, I will try to help here...
Not really. Even though the state is cluster-wide and should be the same on all the nodes, following the principles @stefanotorresi has mentioned, Pacemaker achieves that by replicating the metadata between all the cluster nodes, and this can run into problems for many reasons, such as network problems, misconfiguration, etc.
This is exactly what the exporter is currently doing. It exposes what the node currently knows about the cluster status, plus local metrics for the other components that compose the cluster (Corosync as the communication layer, SBD as the fencing mechanism, etc). If the cluster metrics differ between nodes, those are exactly the situations that indicate some problem has occurred. In that sense, even if it is called a cluster metric, from my POV it is a local node metric, since it is the current cluster status present in the local node's metadata replica. For example, there is the so-called cluster partition situation, where some nodes lose membership of the cluster and form a new cluster. The way to identify that is by comparing the cluster membership reported by each node and checking how the cluster-wide information differs between them. So, in that sense, I don't know how we could tackle such situations without comparing the cluster status reported by each node. Also, just for completeness, it is not true to say that there is no leader in a Pacemaker cluster. What we don't have is a Controller/Master node. Instead, Pacemaker has a concept of Hope it makes a little bit clearer...
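The comparison described above (checking whether nodes disagree about cluster membership) can be sketched as follows. This is an illustrative sketch only: the data would in practice come from each node's scraped metrics, and the function names, node names, and input shape are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List


def group_by_view(
    views: Dict[str, Iterable[str]]
) -> Dict[FrozenSet[str], List[str]]:
    """Group nodes by the membership set each one reports."""
    groups: Dict[FrozenSet[str], List[str]] = defaultdict(list)
    for node, members in views.items():
        groups[frozenset(members)].append(node)
    return {view: sorted(nodes) for view, nodes in groups.items()}


def is_partitioned(views: Dict[str, Iterable[str]]) -> bool:
    """True if the nodes do not all report the same membership."""
    return len(group_by_view(views)) > 1


if __name__ == "__main__":
    # Hypothetical per-node views: node3 believes it is alone.
    views = {
        "node1": {"node1", "node2"},
        "node2": {"node1", "node2"},
        "node3": {"node3"},
    }
    print(is_partitioned(views))
    for view, nodes in group_by_view(views).items():
        print(sorted(view), "reported by", nodes)
```

The point of the sketch is that the check is only possible when each node exposes its own local view; a single cluster-wide answer would hide the disagreement.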
That all sounds fine then. These distinctions can be a bit subtle.
@brian-brazil I am glad I was able to clarify that. Thanks for your support on this.
Thanks for merging, and for the discussion.