*: promote debugging metrics #9764

Closed
wants to merge 10 commits into from
60 changes: 57 additions & 3 deletions CHANGELOG-3.4.md
@@ -102,6 +102,33 @@ See [code changes](https://github.com/coreos/etcd/compare/v3.3.0...v3.4.0) and [
- `"github.com/coreos/etcd/rafthttp"` to `"github.com/coreos/etcd/etcdserver/api/rafthttp"`.
- `"github.com/coreos/etcd/snap"` to `"github.com/coreos/etcd/etcdserver/api/snap"`.
- `"github.com/coreos/etcd/store"` to `"github.com/coreos/etcd/etcdserver/api/v2store"`.
- Promote all [`etcd_debugging_*` Prometheus metrics to `etcd_*`](https://github.com/coreos/etcd/pull/9764).
- `etcd_debugging_*` has been marked as experimental.
- `etcd_debugging_mvcc_range_total` to `etcd_mvcc_range_total`.
- `etcd_debugging_mvcc_put_total` to `etcd_mvcc_put_total`.
- `etcd_debugging_mvcc_delete_total` to `etcd_mvcc_delete_total`.
- `etcd_debugging_mvcc_txn_total` to `etcd_mvcc_txn_total`.
- `etcd_debugging_mvcc_keys_total` to `etcd_mvcc_keys_total`.
- `etcd_debugging_mvcc_watch_stream_total` to `etcd_mvcc_watch_stream_total`.
- `etcd_debugging_mvcc_watcher_total` to `etcd_mvcc_watcher_total`.
- `etcd_debugging_mvcc_slow_watcher_total` to `etcd_mvcc_slow_watcher_total`.
- `etcd_debugging_mvcc_events_total` to `etcd_mvcc_events_total`.
- `etcd_debugging_mvcc_pending_events_total` to `etcd_mvcc_pending_events_total`.
- `etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds` to `etcd_mvcc_index_compaction_pause_duration_milliseconds`.
- `etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds` to `etcd_mvcc_db_compaction_pause_duration_milliseconds`.
- `etcd_debugging_mvcc_db_compaction_total_duration_milliseconds` to `etcd_mvcc_db_compaction_total_duration_milliseconds`.
- `etcd_debugging_mvcc_db_compaction_keys_total` to `etcd_mvcc_db_compaction_keys_total`.
- `etcd_debugging_mvcc_db_total_size_in_bytes` to `etcd_mvcc_db_total_size_in_bytes`.
- `etcd_debugging_mvcc_db_total_size_in_use_in_bytes` to `etcd_mvcc_db_total_size_in_use_in_bytes`.
- `etcd_debugging_snap_save_marshalling_duration_seconds` to `etcd_snap_save_marshalling_duration_seconds`.
- `etcd_debugging_snap_save_total_duration_seconds` to `etcd_snap_save_total_duration_seconds`.
- `etcd_debugging_server_lease_expired_total` to `etcd_server_lease_expired_total`.
- v2 store `etcd_debugging_store_reads_total` to `etcd_store_reads_total`.
- v2 store `etcd_debugging_store_writes_total` to `etcd_store_writes_total`.
- v2 store `etcd_debugging_store_reads_failed_total` to `etcd_store_reads_failed_total`.
- v2 store `etcd_debugging_store_expires_total` to `etcd_store_expires_total`.
- v2 store `etcd_debugging_store_watch_requests_total` to `etcd_store_watch_requests_total`.
- v2 store `etcd_debugging_store_watchers` to `etcd_store_watchers`.

### Dependency

@@ -118,10 +145,37 @@ See [code changes](https://github.com/coreos/etcd/compare/v3.3.0...v3.4.0) and [

### Metrics, Monitoring

- Promote all [`etcd_debugging_*` Prometheus metrics to `etcd_*`](https://github.com/coreos/etcd/pull/9764).
- `etcd_debugging_*` has been marked as experimental.
- `etcd_debugging_mvcc_range_total` to `etcd_mvcc_range_total`.
- `etcd_debugging_mvcc_put_total` to `etcd_mvcc_put_total`.
- `etcd_debugging_mvcc_delete_total` to `etcd_mvcc_delete_total`.
- `etcd_debugging_mvcc_txn_total` to `etcd_mvcc_txn_total`.
- `etcd_debugging_mvcc_keys_total` to `etcd_mvcc_keys_total`.
- `etcd_debugging_mvcc_watch_stream_total` to `etcd_mvcc_watch_stream_total`.
- `etcd_debugging_mvcc_watcher_total` to `etcd_mvcc_watcher_total`.
- `etcd_debugging_mvcc_slow_watcher_total` to `etcd_mvcc_slow_watcher_total`.
- `etcd_debugging_mvcc_events_total` to `etcd_mvcc_events_total`.
- `etcd_debugging_mvcc_pending_events_total` to `etcd_mvcc_pending_events_total`.
- `etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds` to `etcd_mvcc_index_compaction_pause_duration_milliseconds`.
- `etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds` to `etcd_mvcc_db_compaction_pause_duration_milliseconds`.
- `etcd_debugging_mvcc_db_compaction_total_duration_milliseconds` to `etcd_mvcc_db_compaction_total_duration_milliseconds`.
- `etcd_debugging_mvcc_db_compaction_keys_total` to `etcd_mvcc_db_compaction_keys_total`.
- `etcd_debugging_mvcc_db_total_size_in_bytes` to `etcd_mvcc_db_total_size_in_bytes`.
- `etcd_debugging_mvcc_db_total_size_in_use_in_bytes` to `etcd_mvcc_db_total_size_in_use_in_bytes`.
- `etcd_debugging_snap_save_marshalling_duration_seconds` to `etcd_snap_save_marshalling_duration_seconds`.
- `etcd_debugging_snap_save_total_duration_seconds` to `etcd_snap_save_total_duration_seconds`.
- `etcd_debugging_server_lease_expired_total` to `etcd_server_lease_expired_total`.
- v2 store `etcd_debugging_store_reads_total` to `etcd_store_reads_total`.
- v2 store `etcd_debugging_store_writes_total` to `etcd_store_writes_total`.
- v2 store `etcd_debugging_store_reads_failed_total` to `etcd_store_reads_failed_total`.
- v2 store `etcd_debugging_store_expires_total` to `etcd_store_expires_total`.
- v2 store `etcd_debugging_store_watch_requests_total` to `etcd_store_watch_requests_total`.
- v2 store `etcd_debugging_store_watchers` to `etcd_store_watchers`.
- Increase [`etcd_network_peer_round_trip_time_seconds`](https://github.com/coreos/etcd/pull/9762) Prometheus metric histogram upper-bound.
- Previously, the highest bucket only collected requests taking 0.8192 seconds or more.
- Now, the highest buckets collect requests taking 0.8192 seconds, 1.6384 seconds, and 3.2768 seconds or more.
- Increase [`etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds`](https://github.com/coreos/etcd/pull/9762) Prometheus metric histogram upper-bound.
- Increase [`etcd_mvcc_index_compaction_pause_duration_milliseconds`](https://github.com/coreos/etcd/pull/9762) Prometheus metric histogram upper-bound.
- Previously, the highest bucket only collected requests taking 1.024 seconds or more.
- Now, the highest buckets collect requests taking 1.024 seconds, 2.048 seconds, and 4.096 seconds or more.
- Add [`etcd_server_is_leader`](https://github.com/coreos/etcd/pull/9587) Prometheus metric.
@@ -131,7 +185,7 @@ See [code changes](https://github.com/coreos/etcd/compare/v3.3.0...v3.4.0) and [
- Add [`etcd_disk_backend_defrag_duration_seconds`](https://github.com/coreos/etcd/pull/9761) Prometheus metric.
- Add [`etcd_mvcc_hash_duration_seconds`](https://github.com/coreos/etcd/pull/9761) Prometheus metric.
- Add [`etcd_mvcc_hash_rev_duration_seconds`](https://github.com/coreos/etcd/pull/9761) Prometheus metric.
- Add [`etcd_debugging_mvcc_db_total_size_in_use_in_bytes`](https://github.com/coreos/etcd/pull/9256) Prometheus metric.
- Add [`etcd_mvcc_db_total_size_in_use_in_bytes`](https://github.com/coreos/etcd/pull/9256) Prometheus metric.
- Add [`etcd_network_active_peers`](https://github.com/coreos/etcd/pull/9762) Prometheus metric.
- Let's say `"7339c4e5e833c029"` server `/metrics` returns `etcd_network_active_peers{Local="7339c4e5e833c029",Remote="729934363faa4a24"} 1` and `etcd_network_active_peers{Local="7339c4e5e833c029",Remote="b548c2511513015"} 1`. This indicates that the local node `"7339c4e5e833c029"` currently has two active remote peers `"729934363faa4a24"` and `"b548c2511513015"` in a 3-node cluster. If the node `"b548c2511513015"` is down, the local node `"7339c4e5e833c029"` will show `etcd_network_active_peers{Local="7339c4e5e833c029",Remote="729934363faa4a24"} 1` and `etcd_network_active_peers{Local="7339c4e5e833c029",Remote="b548c2511513015"} 0`.
- Add [`etcd_network_disconnected_peers_total`](https://github.com/coreos/etcd/pull/9762) Prometheus metric.
@@ -140,7 +194,7 @@ See [code changes](https://github.com/coreos/etcd/compare/v3.3.0...v3.4.0) and [
- e.g. `etcd_network_server_stream_failures_total{API="lease-keepalive",Type="receive"} 1`
- e.g. `etcd_network_server_stream_failures_total{API="watch",Type="receive"} 1`
- Add missing [`etcd_network_peer_sent_failures_total` count](https://github.com/coreos/etcd/pull/9437).
- Fix [`etcd_debugging_server_lease_expired_total`](https://github.com/coreos/etcd/pull/9557) Prometheus metric.
- Fix [`etcd_server_lease_expired_total`](https://github.com/coreos/etcd/pull/9557) Prometheus metric.
- Fix [race conditions in v2 server stat collecting](https://github.com/coreos/etcd/pull/9562).
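
The histogram upper-bound changes listed above can be illustrated with a small sketch. It assumes the round-trip-time histogram uses exponential buckets that start at 0.0001 seconds and double at each step; that layout is an assumption chosen because it reproduces the 0.8192, 1.6384, and 3.2768 second bounds, and `expBuckets` is a hypothetical stand-alone helper, not etcd code.

```go
package main

import "fmt"

// expBuckets returns count upper bounds, starting at start and
// multiplying by factor at each step. Hypothetical helper for
// illustration only.
func expBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, 0, count)
	v := start
	for i := 0; i < count; i++ {
		buckets = append(buckets, v)
		v *= factor
	}
	return buckets
}

func main() {
	// Assumed layout: first bound 0.0001s, doubling each time.
	before := expBuckets(0.0001, 2, 14) // highest bound: 0.8192s
	after := expBuckets(0.0001, 2, 16)  // highest bound: 3.2768s

	fmt.Printf("before: %d buckets, highest %.4fs\n", len(before), before[len(before)-1])
	fmt.Printf("after:  %d buckets, highest %.4fs\n", len(after), after[len(after)-1])
}
```

Under this assumption, adding two buckets is enough to distinguish requests in the 0.8192 to 3.2768 second range, which previously all fell into the overflow bucket.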

### Security, Authentication
2 changes: 1 addition & 1 deletion Documentation/op-guide/grafana.json
@@ -291,7 +291,7 @@
"stack": false,
"steppedLine": false,
"targets": [{
"expr": "etcd_debugging_mvcc_db_total_size_in_bytes{job=\"etcd\"}",
"expr": "etcd_mvcc_db_total_size_in_bytes{job=\"etcd\"}",
"hide": false,
"interval": "",
"intervalFactor": 2,
2 changes: 2 additions & 0 deletions Documentation/op-guide/maintenance.md
@@ -151,6 +151,8 @@ OK

The metric `etcd_debugging_mvcc_db_total_size_in_use_in_bytes` indicates the actual database usage after a history compaction, while `etcd_debugging_mvcc_db_total_size_in_bytes` shows the database size including free space waiting for defragmentation. The latter increases only when the former is close to it, so when both metrics are close to the quota, a history compaction is required to avoid triggering the space quota.

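As of v3.4, `etcd_debugging_mvcc_db_total_size_in_use_in_bytes` is renamed to `etcd_mvcc_db_total_size_in_use_in_bytes`, and `etcd_debugging_mvcc_db_total_size_in_bytes` is renamed to `etcd_mvcc_db_total_size_in_bytes`.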
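
The relationship between the two size metrics and the quota can be sketched as a simple check. This is a minimal illustration, not part of etcd: `dbAdvice`, the 0.9 threshold, and the 2 GiB quota are all assumptions picked for the example.

```go
package main

import "fmt"

// dbAdvice is a hypothetical helper. Given the in-use size, the total
// size, and the quota (all in bytes), it suggests a maintenance action.
// ratio is an assumed "close to the quota" threshold, for example 0.9.
func dbAdvice(inUse, total, quota int64, ratio float64) string {
	limit := ratio * float64(quota)
	switch {
	case float64(inUse) >= limit:
		// Live data itself is near the quota: history must be compacted.
		return "compact"
	case float64(total) >= limit:
		// Only the file is large: free space is waiting for defragmentation.
		return "defragment"
	default:
		return "ok"
	}
}

func main() {
	const quota = 2 << 30 // example quota of 2 GiB, chosen for illustration

	fmt.Println(dbAdvice(1900<<20, 2000<<20, quota, 0.9))
	fmt.Println(dbAdvice(200<<20, 1900<<20, quota, 0.9))
	fmt.Println(dbAdvice(200<<20, 400<<20, quota, 0.9))
}
```

The inputs would come from the in-use and total size metrics described above, under whichever name the running etcd version exposes.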

## Snapshot backup

Snapshotting the `etcd` cluster on a regular basis serves as a durable backup for an etcd keyspace. By taking periodic snapshots of an etcd member's backend database, an `etcd` cluster can be recovered to a point in time with a known good state.
87 changes: 87 additions & 0 deletions Documentation/upgrades/upgrade_3_4.md
@@ -166,7 +166,94 @@ curl -L http://localhost:2379/v3/kv/put \

Requests to `/v3beta` endpoints will redirect to `/v3`, and `/v3beta` will be removed in 3.5 release.

#### Promote all `etcd_debugging_` Prometheus metrics

v3.4 promotes all [`etcd_debugging_*` Prometheus metrics to `etcd_*`](https://github.com/coreos/etcd/pull/9764).

`etcd_debugging_*` has been marked as experimental.

```diff
-etcd_debugging_mvcc_range_total
+etcd_mvcc_range_total

-etcd_debugging_mvcc_put_total
+etcd_mvcc_put_total

-etcd_debugging_mvcc_delete_total
+etcd_mvcc_delete_total

-etcd_debugging_mvcc_txn_total
+etcd_mvcc_txn_total

-etcd_debugging_mvcc_keys_total
+etcd_mvcc_keys_total

-etcd_debugging_mvcc_watch_stream_total
+etcd_mvcc_watch_stream_total

-etcd_debugging_mvcc_watcher_total
+etcd_mvcc_watcher_total

-etcd_debugging_mvcc_slow_watcher_total
+etcd_mvcc_slow_watcher_total

-etcd_debugging_mvcc_events_total
+etcd_mvcc_events_total

-etcd_debugging_mvcc_pending_events_total
+etcd_mvcc_pending_events_total

-etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds
+etcd_mvcc_index_compaction_pause_duration_milliseconds

-etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds
+etcd_mvcc_db_compaction_pause_duration_milliseconds

-etcd_debugging_mvcc_db_compaction_total_duration_milliseconds
+etcd_mvcc_db_compaction_total_duration_milliseconds

-etcd_debugging_mvcc_db_compaction_keys_total
+etcd_mvcc_db_compaction_keys_total

-etcd_debugging_mvcc_db_total_size_in_bytes
+etcd_mvcc_db_total_size_in_bytes

-etcd_debugging_mvcc_db_total_size_in_use_in_bytes
+etcd_mvcc_db_total_size_in_use_in_bytes

-etcd_debugging_snap_save_marshalling_duration_seconds
+etcd_snap_save_marshalling_duration_seconds

-etcd_debugging_snap_save_total_duration_seconds
+etcd_snap_save_total_duration_seconds

-etcd_debugging_server_lease_expired_total
+etcd_server_lease_expired_total

# v2 store
-etcd_debugging_store_reads_total
+etcd_store_reads_total

# v2 store
-etcd_debugging_store_writes_total
+etcd_store_writes_total

# v2 store
-etcd_debugging_store_reads_failed_total
+etcd_store_reads_failed_total

# v2 store
-etcd_debugging_store_expires_total
+etcd_store_expires_total

# v2 store
-etcd_debugging_store_watch_requests_total
+etcd_store_watch_requests_total

# v2 store
-etcd_debugging_store_watchers
+etcd_store_watchers
```
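
For dashboards and alert rules that still reference the old names, the rename can be applied mechanically. The following is a minimal sketch, assuming every `etcd_debugging_` occurrence in a query expression refers to one of the promoted metrics; `promoteExpr` is a hypothetical helper, and any expression it rewrites should still be checked against the list above.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	oldPrefix = "etcd_debugging_"
	newPrefix = "etcd_"
)

// promoteExpr rewrites every occurrence of the old metric prefix in a
// query expression to the new prefix. Hypothetical helper: it does not
// verify that the resulting metric name actually exists.
func promoteExpr(expr string) string {
	return strings.ReplaceAll(expr, oldPrefix, newPrefix)
}

func main() {
	old := `etcd_debugging_mvcc_db_total_size_in_bytes{job="etcd"}`
	fmt.Println(promoteExpr(old))
}
```

Expressions that already use the `etcd_*` names pass through unchanged, so the rewrite is safe to run more than once.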

### Server upgrade checklists

4 changes: 2 additions & 2 deletions clientv3/integration/watch_test.go
@@ -887,15 +887,15 @@ func TestWatchCancelOnServer(t *testing.T) {
}

// get max watches; proxy tests have leadership watches, so total may be >numWatches
maxWatches, _ := cluster.Members[0].Metric("etcd_debugging_mvcc_watcher_total")
maxWatches, _ := cluster.Members[0].Metric("etcd_mvcc_watcher_total")

// cancel all and wait for cancels to propagate to etcd server
for i := 0; i < numWatches; i++ {
cancels[i]()
}
time.Sleep(time.Second)

minWatches, err := cluster.Members[0].Metric("etcd_debugging_mvcc_watcher_total")
minWatches, err := cluster.Members[0].Metric("etcd_mvcc_watcher_total")
if err != nil {
t.Fatal(err)
}
4 changes: 2 additions & 2 deletions etcdserver/api/snap/metrics.go
@@ -18,7 +18,7 @@ import "github.com/prometheus/client_golang/prometheus"

var (
snapMarshallingSec = prometheus.NewHistogram(prometheus.HistogramOpts{
Namespace: "etcd_debugging",
Namespace: "etcd",
Subsystem: "snap",
Name: "save_marshalling_duration_seconds",
Help: "The marshalling cost distributions of save called by snapshot.",
@@ -29,7 +29,7 @@ var (
})

snapSaveSec = prometheus.NewHistogram(prometheus.HistogramOpts{
Namespace: "etcd_debugging",
Namespace: "etcd",
Subsystem: "snap",
Name: "save_total_duration_seconds",
Help: "The total latency distributions of save called by snapshot.",
18 changes: 8 additions & 10 deletions etcdserver/api/v2store/metrics.go
@@ -24,64 +24,62 @@ import "github.com/prometheus/client_golang/prometheus"
var (
readCounter = prometheus.NewCounterVec(
prometheus.CounterOpts{
Namespace: "etcd_debugging",
Namespace: "etcd",
Subsystem: "store",
Name: "reads_total",
Help: "Total number of reads action by (get/getRecursive), local to this member.",
}, []string{"action"})

writeCounter = prometheus.NewCounterVec(
prometheus.CounterOpts{
Namespace: "etcd_debugging",
Namespace: "etcd",
Subsystem: "store",
Name: "writes_total",
Help: "Total number of writes (e.g. set/compareAndDelete) seen by this member.",
}, []string{"action"})

readFailedCounter = prometheus.NewCounterVec(
prometheus.CounterOpts{
Namespace: "etcd_debugging",
Namespace: "etcd",
Subsystem: "store",
Name: "reads_failed_total",
Help: "Failed read actions by (get/getRecursive), local to this member.",
}, []string{"action"})

writeFailedCounter = prometheus.NewCounterVec(
prometheus.CounterOpts{
Namespace: "etcd_debugging",
Namespace: "etcd",
Subsystem: "store",
Name: "writes_failed_total",
Help: "Failed write actions (e.g. set/compareAndDelete), seen by this member.",
}, []string{"action"})

expireCounter = prometheus.NewCounter(
prometheus.CounterOpts{
Namespace: "etcd_debugging",
Namespace: "etcd",
Subsystem: "store",
Name: "expires_total",
Help: "Total number of expired keys.",
})

watchRequests = prometheus.NewCounter(
prometheus.CounterOpts{
Namespace: "etcd_debugging",
Namespace: "etcd",
Subsystem: "store",
Name: "watch_requests_total",
Help: "Total number of incoming watch requests (new or reestablished).",
})

watcherCount = prometheus.NewGauge(
prometheus.GaugeOpts{
Namespace: "etcd_debugging",
Namespace: "etcd",
Subsystem: "store",
Name: "watchers",
Help: "Count of currently active watchers.",
})
)

const (
GetRecursive = "getRecursive"
)
const GetRecursive = "getRecursive"

func init() {
if prometheus.Register(readCounter) != nil {
2 changes: 1 addition & 1 deletion etcdserver/metrics.go
@@ -79,7 +79,7 @@ var (
Help: "The total number of failed proposals seen.",
})
leaseExpired = prometheus.NewCounter(prometheus.CounterOpts{
Namespace: "etcd_debugging",
Namespace: "etcd",
Subsystem: "server",
Name: "lease_expired_total",
Help: "The total number of expired leases.",
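
Changing only the `Namespace` field is enough to rename the exported metric, because the full name is composed from `Namespace`, `Subsystem`, and `Name`. The sketch below illustrates that convention with a hypothetical stand-alone `fqName` helper; it is similar in spirit to `prometheus.BuildFQName` in client_golang but is not a copy of it.

```go
package main

import (
	"fmt"
	"strings"
)

// fqName joins the non-empty parts with underscores, mirroring how the
// fully qualified metric name is composed. Hypothetical helper for
// illustration only.
func fqName(namespace, subsystem, name string) string {
	parts := make([]string, 0, 3)
	for _, p := range []string{namespace, subsystem, name} {
		if p != "" {
			parts = append(parts, p)
		}
	}
	return strings.Join(parts, "_")
}

func main() {
	// Before and after the namespace change in this hunk.
	fmt.Println(fqName("etcd_debugging", "server", "lease_expired_total"))
	fmt.Println(fqName("etcd", "server", "lease_expired_total"))
}
```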
12 changes: 6 additions & 6 deletions integration/metrics_test.go
@@ -30,7 +30,7 @@ func TestMetricDbSizeBoot(t *testing.T) {
clus := NewClusterV3(t, &ClusterConfig{Size: 1})
defer clus.Terminate(t)

v, err := clus.Members[0].Metric("etcd_debugging_mvcc_db_total_size_in_bytes")
v, err := clus.Members[0].Metric("etcd_mvcc_db_total_size_in_bytes")
if err != nil {
t.Fatal(err)
}
@@ -63,7 +63,7 @@ func TestMetricDbSizeDefrag(t *testing.T) {
time.Sleep(500 * time.Millisecond)

expected := numPuts * len(putreq.Value)
beforeDefrag, err := clus.Members[0].Metric("etcd_debugging_mvcc_db_total_size_in_bytes")
beforeDefrag, err := clus.Members[0].Metric("etcd_mvcc_db_total_size_in_bytes")
if err != nil {
t.Fatal(err)
}
@@ -74,7 +74,7 @@ func TestMetricDbSizeDefrag(t *testing.T) {
if bv < expected {
t.Fatalf("expected db size greater than %d, got %d", expected, bv)
}
beforeDefragInUse, err := clus.Members[0].Metric("etcd_debugging_mvcc_db_total_size_in_use_in_bytes")
beforeDefragInUse, err := clus.Members[0].Metric("etcd_mvcc_db_total_size_in_use_in_bytes")
if err != nil {
t.Fatal(err)
}
@@ -98,7 +98,7 @@ func TestMetricDbSizeDefrag(t *testing.T) {
}
time.Sleep(500 * time.Millisecond)

afterCompactionInUse, err := clus.Members[0].Metric("etcd_debugging_mvcc_db_total_size_in_use_in_bytes")
afterCompactionInUse, err := clus.Members[0].Metric("etcd_mvcc_db_total_size_in_use_in_bytes")
if err != nil {
t.Fatal(err)
}
@@ -113,7 +113,7 @@ func TestMetricDbSizeDefrag(t *testing.T) {
// defrag should give freed space back to fs
mc.Defragment(context.TODO(), &pb.DefragmentRequest{})

afterDefrag, err := clus.Members[0].Metric("etcd_debugging_mvcc_db_total_size_in_bytes")
afterDefrag, err := clus.Members[0].Metric("etcd_mvcc_db_total_size_in_bytes")
if err != nil {
t.Fatal(err)
}
@@ -125,7 +125,7 @@ func TestMetricDbSizeDefrag(t *testing.T) {
t.Fatalf("expected less than %d, got %d after defrag", bv, av)
}

afterDefragInUse, err := clus.Members[0].Metric("etcd_debugging_mvcc_db_total_size_in_use_in_bytes")
afterDefragInUse, err := clus.Members[0].Metric("etcd_mvcc_db_total_size_in_use_in_bytes")
if err != nil {
t.Fatal(err)
}
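
The tests above look metrics up by exact name, which is why each string has to change with the rename. A lookup of that kind can be sketched as a scan over Prometheus text-format output. `metricValue` below is a hypothetical helper written for illustration; it is not the `Metric` method used in these tests, and it assumes label values contain no spaces.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// metricValue scans Prometheus text-format output and returns the raw
// value of the first sample whose name equals name exactly.
// Assumption: label values contain no whitespace.
func metricValue(body, name string) (string, bool) {
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "#") {
			continue // skip HELP and TYPE comments
		}
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		// The sample name is everything before an optional label set.
		sample := fields[0]
		if i := strings.IndexByte(sample, '{'); i >= 0 {
			sample = sample[:i]
		}
		if sample == name {
			return fields[1], true
		}
	}
	return "", false
}

func main() {
	body := "# TYPE etcd_mvcc_watcher_total gauge\netcd_mvcc_watcher_total 3\n"

	v, ok := metricValue(body, "etcd_mvcc_watcher_total")
	fmt.Println(v, ok)

	// The old name is no longer present after the rename.
	_, ok = metricValue(body, "etcd_debugging_mvcc_watcher_total")
	fmt.Println(ok)
}
```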