[DO-NOT-MERGE] Separate NetworkAttachments lifecycle latency Prometheus metrics #111

Closed
matte21 wants to merge 1 commit

Conversation

@matte21 (Collaborator) commented May 5, 2020

This PR breaks down the Prometheus metrics that track the latencies from creation of NetworkAttachments (NA) to relevant lifecycle events (e.g. assignment of a virtual IP address, creation of a network interface on the node the NA is bound to).

Latencies are broken down along the following dimensions:

  1. whether the NA was created before the IPAM controller was running;
  2. whether the NA was given an address before its local Connection Agent (LCA) was running;
  3. whether the NA became implementable (virtual IP and MAC addresses written in status) on the remote nodes before the remote Connection Agents (RCA) on those nodes were running;
  4. whether the NA became implementable on the remote nodes before it was relevant there (i.e. before the first NA with the same VNI was created on those nodes).

Besides tracking metrics for the happy cases (i.e. all controllers always running), this PR currently comprises metrics that measure the latencies from NA creation to relevant lifecycle events even in the case of failures/downtimes of controllers (e.g. there's a metric that tracks the latency from NA creation to virtual IP assignment for the case where the IPAM controller was not running when the NA was created).
That was indeed my initial idea and is still in this PR for discussion ([WIP] will be removed after that discussion), but I now think that tracking latencies for those unhappy cases is not worth it: those latencies are only observed when rare(ish), undesired events happen, and we are not really interested in their values; we only care about the values for the happy cases. What's more, measuring the latencies for the unhappy cases is difficult and costly. I now favor a PR which is smaller than this one: keep all the "infrastructure" (e.g. the NA's status fields) that allows discerning the happy cases from the unhappy ones, and use that ability to record latencies for the happy cases while ignoring latencies for the unhappy ones.
An example follows, illustrating why tracking latencies for the unhappy cases is difficult and costly.
Consider the histogram (part of a HistogramVec) that tracks the latency from NA creation to virtual IP assignment for the case where the NA is created before the IPAM controller starts. When the IPAM controller starts, it processes all pre-existing NAs in a burst (among which are all those without a virtual IP). This means that ALL the observations on that histogram take place in a very short amount of time, shortly after the IPAM controller starts and likely BEFORE the first Prometheus scrape. Prometheus' first scrape will therefore already see the histogram with all of its observations, so PromQL's rate() will completely miss the increment, and so will the Grafana plots.

Currently this PR uses a workaround: it makes the metrics for the unhappy cases "visible" to Prometheus with a call to .With(<histogram labels>), and then sleeps for slightly longer than the Prometheus scrape period immediately after the metrics start being served, to make sure Prometheus' first scrape sees the histogram with no observations. Besides slowing the controllers' boot time and not being resilient to changes in the Prometheus scrape interval, this solution could challenge Prometheus' scalability: all the metrics for the unhappy cases are defined (hence scraped and stored) eagerly, each CA has a fair number of them, and the vast majority of these metrics will never see an observation and need not be defined (nor scraped and stored).
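
Roughly, the workaround looks like the following sketch (Prometheus Go client; the metric name, label name, buckets, port, and extra 5s margin are made up for illustration and are not the actual ones in the PR):

package metrics

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical HistogramVec tracking NA-creation-to-address-assignment latency,
// partitioned on whether the IPAM controller started after the NA was created.
var attCreateToAddrLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "kos_att_create_to_address_latency_seconds", // illustrative name
		Help:    "Latency from NetworkAttachment creation to virtual IP assignment.",
		Buckets: prometheus.ExponentialBuckets(0.1, 2, 12),
	},
	[]string{"ipam_started_last"},
)

// serveMetricsWithWorkaround exposes the metrics and then sleeps for slightly
// longer than one scrape period, so that Prometheus' first scrape sees the
// histograms with zero observations before any observation is recorded.
func serveMetricsWithWorkaround(scrapePeriod time.Duration) {
	prometheus.MustRegister(attCreateToAddrLatency)

	// Touch every label combination so the (empty) histograms are exposed eagerly.
	attCreateToAddrLatency.With(prometheus.Labels{"ipam_started_last": "true"})
	attCreateToAddrLatency.With(prometheus.Labels{"ipam_started_last": "false"})

	http.Handle("/metrics", promhttp.Handler())
	go http.ListenAndServe(":9295", nil)

	time.Sleep(scrapePeriod + 5*time.Second)
}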

This PR does not further break down the NA lifecycle latency metrics on whether the NA had to wait for a virtual IP address because its Subnet became valid after the NA was created, nor does it break down the metrics for the latency from Subnet creation to Subnet validation on whether the Subnet was created before the validator was running. These two additions are left for a follow-up PR.

Note: this PR does not update the Grafana dashboards in this repo, so some of those dashboards are now broken. There's a parallel PR in kos-test that updates the dashboards there. Soon there will be an effort to unify the dashboards here and in kos-test under a single repo; that should fix the broken dashboards.

Edit: Notice that the PR comprises small bug fixes and tweaks that are unrelated to the main goal.

Also, small code improvements here and there.
// The NA became implementable before its VN became relevant on this node:
// the time spent waiting for VN relevance is excluded from the reported latency.
if naImplWriteTime.Before(vnRelevanceTime) {
	pl[remoteCAStartedLastLabel] = strconv.FormatBool(ca.startTime.After(vnRelevanceTime))
	pl[naReadyBeforeVNRelevanceLabel] = "true"
	attCreateToRemIfcDt = tAfter.Sub(vnRelevanceTime).Seconds() + naImplWriteTime.Sub(naSpecWriteTime).Seconds()
}
@MikeSpreitzer (Owner):

This is excluding some latency that actually happened. I think I have a preference for not excluding any latency but rather adding an explanation of it.

@matte21 (Collaborator, Author):

I was (and still am) undecided on what's best.

I feel uncomfortable with excluding latency too.

OTOH, there are good reasons to do that:

  1. At a high level, that metric tracks the latency it takes KOS to fulfill a user desire expressed via a NetAtt API object (say X) and satisfied with a remote network interface for X on a remote node (say N). But during the time interval that goes from the moment X's status is written with its local implementation information (e.g. MAC address) to the moment X becomes relevant on N, the user does not have any desire for a remote network interface for X on N, so it is logical not to count that time interval.
  2. The portion of latency which is currently excluded depends in no way on KOS or the health of the underlying infrastructure (except for the API servers and etcd); it only depends on the timing with which NetAtts are created and on the nodes on which they are created.
  3. Because of 2., if we didn't exclude that latency, the observations for that metric would have an incredibly high variance under normal circumstances: observations of less than a second would be about as likely as latencies of tens of seconds. We would need many, linearly-distributed buckets, and even then it would be hard to derive meaningful information from the plots.

@MikeSpreitzer (Owner) commented May 12, 2020:

Let's think about the two common cases, using the example and names in https://github.com/MikeSpreitzer/kube-examples/blob/add-kos/staging/kos/doc/KOS-causaility.pdf . In the common cases all the process starts precede the creations of the objects involved and no object deletes are involved. That leaves two cases for a remote NA implementation, suggested by the horizontal positioning in that diagram: (1) the VN is relevant on the node before the remote NA is created, and (2) the VN becomes relevant to the node after the remote NA is created. In case 1, it is meaningful to report the latency from NA creation to remote implementation. In case 2, no number is meaningful for that latency; that is about a story that simply did not happen. In case 2 the CA should instead report the latency from when the VN became relevant to when the NA's remote implementation was done.
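
Roughly, in code (a sketch only; the function and timestamp names are hypothetical stand-ins for whatever the CA actually tracks):

package metrics

import "time"

// remoteImplLatencySeconds returns the latency to report for a remote NA
// implementation, plus a flag saying whether it was measured from VN relevance
// (case 2) rather than from NA creation (case 1).
func remoteImplLatencySeconds(naCreateTime, vnRelevanceTime, implTime time.Time) (float64, bool) {
	if vnRelevanceTime.Before(naCreateTime) {
		// Case 1: the VN was already relevant on the node when the NA was created.
		return implTime.Sub(naCreateTime).Seconds(), false
	}
	// Case 2: the VN became relevant after the NA was created; the latency from
	// NA creation is about a story that did not happen, so measure from relevance.
	return implTime.Sub(vnRelevanceTime).Seconds(), true
}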

@MikeSpreitzer (Owner) commented May 8, 2020

I agree, this multiplies the histograms by too much --- a factor of 16 in some cases. I am also not a fan of the way the information is put into metrics.

I have another idea for how to focus. The idea is, roughly: for each action by the implementation, look at the last client write that preceded it. Look at https://github.com/MikeSpreitzer/kube-examples/blob/add-kos/staging/kos/doc/KOS-causaility.pdf ; it shows an example of two NetworkAttachments on the same virtual network, each placed on a different node. For each action by the implementation we can identify the latest client precursor. For a Subnet validation, it is the Subnet creation. For an IP address assignment, it is either the Subnet creation or the NetworkAttachment creation. The same for a local NA implementation. For a remote NA implementation, it is one of four things: the creation time of the Subnet of the triggering local NA, the creation time of the triggering local NA, the creation time of the Subnet of the remote NA, or the creation time of the remote NA; here the triggering NA is the one that first made the VN relevant (and, as observed in this PR, there is a difficulty knowing that). The time from last client precursor to implementation action reflects the performance delivered to clients, and is what we want to evaluate. The metric for this is a one-dimensional HistogramVec, whose length is the number of possible last client precursors (i.e., 1 for Subnet validation, 2 for IP address assignment and local NA impl, 4 for remote NA impl).

There are also implementation precursors: start of the Subnet validator, start of the IPAM controller, and start of a Connection Agent. When the last implementation precursor is later than the last client precursor, there has been a delay due to a process not running for a while. This is part of the implementation's responsibility; the client does not know or really care to hear excuses about it. So we do not try to discount latencies by these delays. But, for our own understanding, we add instrumentation for it. For each implementation action there is a last client precursor and a last implementation precursor. We introduce a HistogramVec for each sort of implementation action that gets an observation of ((last impl precursor time) - (last client precursor time)) whenever that is positive, with labels identifying the sort of impl precursor and client precursor involved. That means the length of the HistogramVec is the product of the sizes of the two label value sets: 1x1 for Subnet validation, 2x2 for IP address assignment, 3x2 for NA local impl, 6x4 for NA remote impl (hmm, that's a lot; maybe remove the client precursor factor in this case). Just the sum and count of these histograms are probably most valuable; they reveal the average of this delay. But the buckets also have useful information.
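
A minimal sketch of these two HistogramVecs with the Prometheus Go client, for the remote-NA-implementation case; the metric names, label names, label values and buckets below are hypothetical, for illustration only:

package metrics

import "github.com/prometheus/client_golang/prometheus"

// Latency from the last client precursor to the implementation action,
// labeled by which sort of client precursor was the last one.
var remoteImplLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "kos_remote_impl_latency_seconds",
		Help:    "Latency from last client precursor to remote NA implementation.",
		Buckets: prometheus.ExponentialBuckets(0.25, 2, 12),
	},
	[]string{"last_client_precursor"},
)

// Delay attributable to the implementation: (last impl precursor time) -
// (last client precursor time), observed only when positive, labeled by the
// sorts of precursors involved.
var remoteImplDelay = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "kos_remote_impl_delay_seconds",
		Help:    "Positive (last impl precursor time) - (last client precursor time) for remote NA implementations.",
		Buckets: prometheus.ExponentialBuckets(0.25, 2, 12),
	},
	[]string{"last_impl_precursor", "last_client_precursor"},
)

func init() {
	prometheus.MustRegister(remoteImplLatency, remoteImplDelay)
}

// recordRemoteImpl would be called when the CA finishes a remote implementation.
// sinceClientPrecursor and implMinusClient are durations in seconds.
func recordRemoteImpl(clientPrecursor, implPrecursor string, sinceClientPrecursor, implMinusClient float64) {
	remoteImplLatency.With(prometheus.Labels{"last_client_precursor": clientPrecursor}).Observe(sinceClientPrecursor)
	if implMinusClient > 0 {
		remoteImplDelay.With(prometheus.Labels{
			"last_impl_precursor":   implPrecursor,
			"last_client_precursor": clientPrecursor,
		}).Observe(implMinusClient)
	}
}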

Every implementation action other than remote NA implementation involves a write. The write can record the last client precursor and the last implementation precursor in the object being written. We can use a generic representation for a precursor: a string and a timestamp. We cannot do this for the remote NA implementations.
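
A sketch of such a generic representation; the type, field and example precursor names are made up for illustration:

package types

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// Precursor identifies one precursor of an implementation action: a short
// string saying what it was (e.g. "subnet-create", "ipam-start") and when it happened.
type Precursor struct {
	What string      `json:"what"`
	Time metav1.Time `json:"time"`
}

// Fields like these could be written into the status of a Subnet or
// NetworkAttachment by the implementation action's write.
type PrecursorInfo struct {
	LastClientPrecursor *Precursor `json:"lastClientPrecursor,omitempty"`
	LastImplPrecursor   *Precursor `json:"lastImplPrecursor,omitempty"`
}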

We could make this open and extensible: let the Spec of Subnet and of NetworkAttachment have optional fields where the client supplies the last client precursor and last implementation precursor from whatever higher-level stuff is using these objects. But with a mixture of clients, this makes for a mixture in the metrics. I think I favor not opening this and keeping the focus on the KOS implementation.

In the CA, for a given relevant virtual network (VN), the first local NA seen is not necessarily the one with the earliest last client precursor. In fact, with our weak assumptions about timing and eventual consistency, the CA is not even guaranteed to ever be notified about that NA. This suggests that the best way to solve this problem is to ship observations off to some biggish data collection and analysis system and let that thing accurately determine the triggering NAs and the consequent latency metrics. I think that is a bigger thing than we want to attempt here. For KOS, let's just do the best we can with what the CA sees.

@MikeSpreitzer (Owner):

Oh, I had overlooked the fact that deletions can be relevant precursors. I have updated the picture. There are interesting knowledge problems in both cases (delete Subnet, delete NetworkAttachment).

The knowledge problem is most stark in the case of NetworkAttachment deletion. The IPAM controller does not even notice any relationship between a NetworkAttachment deletion and a consequent assignment of an address. The Subnet validator comes closer --- it will requeue references to Subnets in consequence of being notified of a deletion. But there can be multiple relevant deletions. And when a queue worker processes a reference to a Subnet, it has no access to any of the deletions --- if any --- that caused the Subnet reference to be in the queue.

@MikeSpreitzer (Owner):

I suggest not trying to discern the last precursor delete, because that is so problematic and it is hopefully relatively rare. Instead let's provide information about it. The HistogramVecs for Subnet validations and IP address assignments can be doubled by a label carrying a bit saying whether there was a delay due to a conflicting Subnet or address exhaustion. We can also add metrics revealing the rate at which the queue workers punt due to such conflicts.
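
A sketch of that, with hypothetical names: the IP-assignment HistogramVec gains a boolean label, and a CounterVec counts the punts.

package metrics

import "github.com/prometheus/client_golang/prometheus"

// IP-assignment latency, doubled by a label saying whether the assignment was
// delayed by a conflicting Subnet or by address exhaustion. Names are illustrative.
var addressAssignLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "kos_ipam_assign_latency_seconds",
		Help:    "Latency from last client precursor to IP address assignment.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"last_client_precursor", "delayed_by_conflict"},
)

// Rate at which queue workers punt because of such conflicts, e.g.
// workerPunts.With(prometheus.Labels{"reason": "address_exhaustion"}).Inc()
// at the point where the worker requeues.
var workerPunts = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "kos_ipam_worker_punts_total",
		Help: "Number of times a queue worker punted due to a conflicting Subnet or address exhaustion.",
	},
	[]string{"reason"},
)

func init() {
	prometheus.MustRegister(addressAssignLatency, workerPunts)
}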

@matte21 (Collaborator, Author) commented May 11, 2020

Overall, I really like the proposed approach.

We introduce a HistogramVec for each sort of implementation action that gets an observation of ((last impl precursor time) - (last client precursor time)) whenever that is positive, with labels identifying the sort of impl precursor and client precursor involved. That means the length of the HistogramVec is the product of the sizes of the two label value sets: 1x1 for Subnet validation, 2x2 for IP address assignment, 3x2 for NA local impl, 6x4 for NA remote impl (hmm, that's a lot; maybe remove the client precursor factor in this case). Just the sum and count of these histograms are probably most valuable; they reveal the average of this delay. But the buckets also have useful information.

I think we can remove the client precursor factor for every HistogramVec.
Notice, however, that there's still the problem that in some cases these histograms receive all their observations in a short burst right after a controller's start, and that this can cause PromQL's rate() to completely miss the change.

This suggests that the best way to solve this problem is to ship observations off to some biggish data collection and analysis system and let that thing accurately determine the triggering NAs and the consequent latency metrics. I think that is a bigger thing than we want to attempt here. For KOS, let's just do the best we can with what the CA sees.

I agree. You touched on a topic that keeps appearing occasionally, that is what monitoring/tracing model/system would be best for a K8s-like control planes. I remember you advocating against distributed tracing based on spans because spans do not compose. Another angle: all the distributed traces I've seen focused on HTTP calls and local method invocations. That is not where the actual, semantically relevant interaction happens in K8s (and akin systems), in K8s interaction happens via API objects (yes, at a lower abstraction layer HTTP is still used to communicate with the API server but that's only an implementation detail). I think one interesting approach to explore is to model/describe interactions like you did in https://github.com/MikeSpreitzer/kube-examples/blob/add-kos/staging/kos/doc/KOS-causaility.pdf . I also wonder whether that would scale (but I also think that in that regard that's no worse than current distributed tracing solutions, and maybe only a fraction of relevant events could be tracked like that).

@matte21 (Collaborator, Author) commented May 11, 2020

Closing this since the proposed changes are radical; they will be implemented in a new PR.

@matte21 matte21 closed this May 11, 2020
@matte21 changed the title from "[WIP] Separate NetworkAttachments lifecycle latency Prometheus metrics" to "[DO-NOT-MERGE] Separate NetworkAttachments lifecycle latency Prometheus metrics" May 11, 2020
MikeSpreitzer added a commit that referenced this pull request May 11, 2020
A bug fix and an indentation optimization noticed in PR #111
@MikeSpreitzer (Owner):

Regarding the length of the HistogramVec of ((last impl precursor time) - (last client precursor time)) for remote NA implementations: the natural length is (3 local + 3 remote) x (2 local + 2 remote), which is 24. The problem with removing the label for the last client precursor is that this makes it hard to know which latencies in the HistogramVec of (impl - (last client precursor)) were affected and by how much. Here is another idea for how to reduce that length: make it (3 local + 1 remote) x (2 local + 1 remote), which is 12; that is, use just one label value for all remote impl precursors and just one label value for all remote client precursors. Note that the natural (3 remote) x (2 remote) combinations are already being represented by other observations in this HistogramVec, so summarizing here is sufficient. For the natural (3 remote) x (2 local) combinations, only one of those concerns a common case, so collapsing the two local into one will rarely matter --- and when it does, we will be getting information about that from other metrics. For the natural (3 local) x (2 remote) combinations a similar story applies.

@matte21 (Collaborator, Author) commented May 18, 2020

I can't understand most of the last message:

Regarding the length of the HistogramVec of ((last impl precursor time) - (last client precursor time)) for remote NA implementations: the natural length is (3 local + 3 remote) x (2 local + 2 remote), which is 24.

Isn't the total (1 local + 1 remote + 2 shared) x (2 local + 2 remote) = 16 ?

@MikeSpreitzer (Owner):

I was thinking that in general there are multiple starts of the subnet validator. The start relevant to validating subnet S might be different from the start relevant to the validation of subnet T. Same for the IPAM controller.

But it is true that for the HistogramVec of positive ((last impl precursor time) - (last client precursor time)) for remote NA implementations, there can be only one last subnet validator start, and only one last IPAM controller start. So, yes, the natural number of histograms is 16. Maybe we should be satisfied with that.

@matte21 (Collaborator, Author) commented May 20, 2020

there can be only one last subnet validator start, and only one last IPAM controller start. So, yes, the natural number of histograms is 16. Maybe we should be satisfied with that.

Yes, the model you outlined focuses on the last client write and the last implementation precursor, and IMO that greatly simplifies things without losing useful information.
Notice, however, that it's not an improvement in terms of number of metrics (but we have the capability to scrape only a subset of the CAs), and that the histogram on positive ((last impl precursor time) - (last client precursor time)) is still very likely to get all its observations before the first Prometheus scrape.

@MikeSpreitzer (Owner):

BTW, when people speak of "number of metrics", I am not sure whether they mean the number of metric names or the number of (metric name, label map) tuples (i.e., the number of time series) (https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels).

Cutting down on the number of label maps that apply to a given metric name (i.e., the length of a HistogramVec or other sort of metric Vec) cuts down on the number of time series. I am suggesting that we start with the HistogramVec of positive ((last impl precursor time) - (last client precursor time)) for remote NA implementation having length (1 local + 1 remote + 2 shared) x (2 local + 2 remote) = 16.

@matte21 (Collaborator, Author) commented May 25, 2020

when people speak of "number of metrics", I am not sure whether they mean the number of metric names or the number of (metric name, label map) tuples (i.e., the number of time series)

From https://prometheus.io/docs/practices/naming/ :

CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.

I am suggesting that we start with the HistogramVec of positive ((last impl precursor time) - (last client precursor time)) for remote NA implementation having length (1 local + 1 remote + 2 shared) x (2 local + 2 remote) = 16.

I agree, will do.
