Task/use etcd metrics endpoint #11280

Merged
merged 25 commits into from
Mar 26, 2019

odacremolbap
Contributor

@odacremolbap odacremolbap commented Mar 16, 2019

Add etcd metrics endpoint for Etcd V3 as a new metricset

Fixes #9093

@odacremolbap odacremolbap added the in progress, Metricbeat, and Team:Integrations labels Mar 16, 2019
@odacremolbap odacremolbap requested review from a team as code owners March 16, 2019 19:17
@odacremolbap
Contributor Author

I still need to wrap my head around the unit tests and docs.
I'm opening the PR now to get some help while I read up on how other modules handle them.

@ruflin
Member

ruflin commented Mar 18, 2019

@odacremolbap Happy to help, let me know where :-)

@odacremolbap
Contributor Author

Received some out-of-band feedback on metricset naming:

  • Etcd is a single binary that exposes both V2 and V3.
  • Clients choose which one to use. Since V3 was released, all clients are expected to use V3.
  • Each version (V2/V3) keeps most endpoints and storage separate. No data saved as Vx can be read from Vy.
  • Monitoring is also kept separate for each version: V2 uses the current Beats metricsets (self, leader, store), while V3 exposes a Prometheus-formatted metrics endpoint. Metrics read for each version are specific to that version, although I have some doubts regarding memory and disk.
  • It is not easy to discover which version is being used, because all endpoints are up.

From a user perspective:

  • An admin knows which version is being used, and it is expected to be V3.
  • Although it is possible to use V2 and V3 at the same time, that's an anti-pattern to be avoided.

At this moment, we are exposing

  • self
  • store
  • disk
  • metrics

The first three are for V2, the last one is for V3.
We need a way to make clear which one to use.

Member

@jsoriano jsoriano left a comment

Thanks for taking this! I have added some comments, mainly about fields and default configs.

metricbeat/docs/modules/etcd/metrics.asciidoc (outdated, resolved)
metricbeat/metricbeat.reference.yml (outdated, resolved)
"1.024": 6,
"2.048": 6,
"4.096": 6,
"8.192": 6
Member

Dots in field names are problematic because Elasticsearch stores them as nested objects. Scale the buckets to milliseconds instead (take a look at this conversation and this PR).

// Disk
"etcd_mvcc_db_total_size_in_bytes": prometheus.Metric("disk.mvcc_db_total_size_in_bytes"),
"etcd_disk_wal_fsync_duration_seconds": prometheus.Metric("disk.wal_fsync_duration_seconds"),
"etcd_disk_backend_commit_duration_seconds": prometheus.Metric("disk.backend_commit_duration_seconds"),
Member

To scale it to milliseconds it'd be something like

Suggested change
"etcd_disk_backend_commit_duration_seconds": prometheus.Metric("disk.backend_commit_duration_seconds"),
"etcd_disk_backend_commit_duration_seconds": prometheus.Metric("disk.backend_commit_duration_seconds", prometheus.OpMultiplyBuckets(1000)),

description: >
Write ahead logs latency sum

- name: disk.backend_commit_duration_seconds.bucket
Member

I think a wildcard is needed here (and the same for other histograms)

Suggested change
- name: disk.backend_commit_duration_seconds.bucket
- name: disk.backend_commit_duration_seconds.bucket.*

if err := mbtest.WriteEventsReporterV2(f, t, ""); err != nil {
t.Fatal("write", err)
}
}
Member

WriteEventsReporterV2 already checks for errors and non-empty events, so this method should be just:

func TestData(t *testing.T) {
	compose.EnsureUp(t, "etcd")

	f := mbtest.NewReportingMetricSetV2(t, getConfig())
	if err := mbtest.WriteEventsReporterV2(f, t, ""); err != nil {
		t.Fatal("write", err)
	}
}

metricbeat/module/etcd/metrics/metrics_integration_test.go (outdated, resolved)
grpc_server_handled_total{grpc_code="Internal",grpc_method="UserGrantRole",grpc_service="etcdserverpb.Auth",grpc_type="unary"} 0
grpc_server_handled_total{grpc_code="Internal",grpc_method="UserList",grpc_service="etcdserverpb.Auth",grpc_type="unary"} 0
grpc_server_handled_total{grpc_code="Internal",grpc_method="UserRevokeRole",grpc_service="etcdserverpb.Auth",grpc_type="unary"} 0
grpc_server_handled_total{grp
Member

It would probably be nice to expose some of these per-method metrics, but let's leave that for future changes.

@jsoriano
Member

Regarding v2 vs v3 storage: I think we should keep enabling the old metricsets by default, and at some point also enable this new one so that everything is covered by default.

I saw that some operations on v2 storage also affect the metrics exposed by the new endpoint. Maybe at some point the new endpoint will cover both storages; I'm not sure about their plans for this. If that is the case, we could deprecate the old metricsets at some point, but not before Metricbeat 8.

@ruflin
Member

ruflin commented Mar 20, 2019

I think there are two kinds of users. The admin who deploys Metricbeat should know whether v2 or v3 is in use. From a data consumer's perspective, it should not matter which one is used. My understanding so far is that v3 is mostly a superset of v2, so if a user upgrades from v2 to v3, the resulting events should still look the same but carry more data.

The above assumes that we like the current data structure. If we don't, I'm also OK with introducing a new, better data structure for v3.

@odacremolbap
Contributor Author

Regarding module/metricsets:

The current schema is:

  • module etcd
  • metricsets store, self, leader, metrics

The main thing to improve is letting users know which etcd version matches which metrics. The current V2 metricsets (store, self, leader) are GA and in use, so IMHO we should keep them as they are for now. The metrics metricset is V3. Its name follows the endpoint the metrics are retrieved from, which is not descriptive from the user's point of view.

Choices:

  1. keep the metricsets as they are
  2. change metrics to something like metricsV3
  3. refactor the V2 metricsets so they become storeV2, selfV2, leaderV2, while also keeping the current names for backwards compatibility
  4. create etcdV3 as a new module, separate from the current `etcd`

My preference would be 2.
Comments, @ruflin @jsoriano?

@ruflin
Member

ruflin commented Mar 20, 2019

As mentioned before, I prefer not to mention V3 in the final docs, since for the consumer it should not matter, so just metrics should be fine. What I don't like about the metrics prefix is that it's a bit meaningless, as everything here is metrics.

One other idea, triggered by your comment that this is a huge migration from A to B and it's unlikely that both will be used at the same time: what happens if we don't use the metrics prefix and put all the metrics directly under etcd? Would it conflict with v2? It's not a typical thing we do for metricsets, but if I understand the change to Prometheus correctly, this will be the only endpoint available and it's unlikely that more endpoints will be added in v3? It's just that more metrics will be added? Based on your example event we would have etcd.disk.* etc., and none of these seem to conflict with the current ones?

@odacremolbap
Contributor Author

Generally speaking, adding a version to metric names is not a good idea. I agree with you: there is no point in creating metricsets for fooV1 and a new set for fooV2. The problem with etcd is that it sort of bundles 2 products in 1 binary.

As an example, here is how you use the V2 client with any etcd release:

etcdctl get myKey

and here is how you use V3:

ETCDCTL_API=3 etcdctl get myKey

Admins, devs, users, anyone dealing with etcd must know beforehand whether they are targeting V2 or V3. Internally, the command is redirected to a different set of functions based on the environment variable.
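
The redirection described above can be pictured roughly as follows. This is only a sketch: `selectAPI` is a made-up name rather than etcdctl's actual code, and it assumes that an unset variable means V2, as in the example commands.

```go
package main

import (
	"fmt"
	"os"
)

// selectAPI maps the value of the ETCDCTL_API environment variable to the
// command set that should handle the request.
// Assumption: an empty (unset) value selects V2, matching the plain
// `etcdctl get myKey` example.
func selectAPI(apiVersion string) (string, error) {
	switch apiVersion {
	case "", "2":
		return "v2", nil
	case "3":
		return "v3", nil
	default:
		return "", fmt.Errorf("unsupported ETCDCTL_API value %q", apiVersion)
	}
}

func main() {
	api, err := selectAPI(os.Getenv("ETCDCTL_API"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("dispatching to the", api, "command set")
}
```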

So, if I understood it correctly, the proposal would be to keep the current V2 metricsets:

  • etcd.store
  • etcd.self
  • etcd.leader

and add the new V3 metrics, without a V3 reference, at that same level:

  • etcd.memory
  • etcd.network
  • etcd.server
  • etcd.disk

I'm OK with that, since it avoids the extremely generic term metrics. I'm just wondering if we can come up with something better to highlight that whenever you use V3 you should be using the new ones, just like etcdctl requires the ETCDCTL_API=3 environment variable.

We might not add any V3 reference to the metrics, but as an etcd user in my past life, I would be confused to see etcd.server and etcd.self without knowing which one is pre-V3 and which one is post-V3.

If you feel that our best move is to use those names and clarify them in the docs, I'm OK with that. I guess there is no perfect solution, since this problem comes from upstream etcd.

@odacremolbap
Contributor Author

As discussed out of band with @ruflin

We will add a field to the metricsets that indicates the API version used when retrieving the metrics.
All V2 metrics will have apiVersion: 2
All V3 metrics will have apiVersion: 3

Although users will still need to check the docs to see which metricset applies to which apiVersion, this solution has a number of advantages:

  • if some V2 metrics are still of use to V3 users, they will be around
  • V2 and V3 will still be distinguishable and available for filtering in ES
  • current V2 users won't be affected

@jsoriano feedback?

@jsoriano
Member

OK to add an apiVersion field, but take into account that the "v3" metrics also contain metrics for the v2 store and endpoint.

@odacremolbap
Contributor Author

@jsoriano @ruflin
I've pushed changes that:

  • add apiVersion to both V2 and V3 etcd metrics
  • change the namespace for V3 metrics so that they are consistent with V2 metrics placement

A Namespace field had to be added to prometheus.MetricsMapping for this to work.

I won't include the updated JSON until we sort out why the agent group field is missing.

Member

@jsoriano jsoriano left a comment

This is looking good

"etcd_network_client_grpc_sent_bytes_total": prometheus.Metric("network.client_grpc_sent.bytes"),
"etcd_network_client_grpc_received_bytes_total": prometheus.Metric("network.client_grpc_received.bytes"),
},
ExtraFields: map[string]string{"apiVersion": "3"},
Member

Please don't use camel case for field names.

Suggested change
ExtraFields: map[string]string{"apiVersion": "3"},
ExtraFields: map[string]string{"api_version": "3"},

r.Event(mb.Event{
MetricSetFields: event,
Namespace: mapping.Namespace,
})
Member

Nice. I think this should be moved to its own PR, with a note in the developers changelog.

Member

+1 on having a separate PR. I think @sayden will also be happy to see this.

Contributor Author

thanks,
created #11423
pushed #11424

Member

@jsoriano jsoriano left a comment

LGTM. One more thing I just thought of: could you add this metricset to the list of the ones tested here?

@odacremolbap odacremolbap merged commit bf8ebaf into elastic:master Mar 26, 2019
@odacremolbap odacremolbap deleted the task/use-etcd-metrics-endpoint branch March 26, 2019 12:07
}

func TestData(t *testing.T) {
compose.EnsureUp(t, "etcd")
Member

Since we already have the new test setup for this, I think we don't need this method.

5 participants