
tikv should provide an option to return less metrics #12355

Open · 2 tasks
glorv opened this issue Apr 13, 2022 · 8 comments
Labels
affects-6.1: This bug affects the 6.1.x (LTS) versions.
type/feature-request: Categorizes issue or PR as related to a new feature.

Comments

@glorv
Contributor

glorv commented Apr 13, 2022

Feature Request

Is your feature request related to a problem? Please describe:

Currently, TiKV produces too many metrics, which can consume a lot of network bandwidth and storage space.

Describe the feature you'd like:

  • Allow the HTTP response body for metrics to be returned gzip-compressed (see the sketch below this list).
  • Provide an option that returns simplified metrics.
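
A rough sketch of how the compression benefit could be checked once such an option exists; the address below is TiKV's default status port (20180), and the exact behavior would depend on the implementation. Note that Prometheus itself already sends Accept-Encoding: gzip when scraping:

    # Hypothetical size comparison of the /metrics response with and
    # without compression (assumes the default status address 127.0.0.1:20180).
    curl -s -o metrics.txt http://127.0.0.1:20180/metrics
    curl -s -H 'Accept-Encoding: gzip' -o metrics.gz http://127.0.0.1:20180/metrics
    ls -lh metrics.txt metrics.gz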

Describe alternatives you've considered:

Teachability, Documentation, Adoption, Migration Strategy:

glorv added the type/feature-request label on Apr 13, 2022
glorv changed the title from "tikv should support gzip compression for metrics response" to "tikv should provide an option to return less metrics" on Apr 22, 2022
ti-chi-bot added a commit that referenced this issue May 25, 2022
ref #12355

Signed-off-by: glorv <glorvs@163.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
ti-chi-bot added a commit that referenced this issue May 25, 2022
ref #12355, ref #12417

Signed-off-by: glorv <glorvs@163.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
YuJuncen pushed a commit to YuJuncen/tikv that referenced this issue May 26, 2022
ref tikv#12355

Signed-off-by: glorv <glorvs@163.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
Signed-off-by: Yu Juncen <yujuncen@pingcap.com>
YuJuncen pushed a commit to YuJuncen/tikv that referenced this issue May 26, 2022
ref tikv#12355, ref tikv#12417

Signed-off-by: glorv <glorvs@163.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
Signed-off-by: Yu Juncen <yujuncen@pingcap.com>
@breezewish
Member

I have a few questions related to reducing the number of series:

  1. In what scenario do you need to reduce the storage size of TiKV's metrics? Does the user have too many (e.g. >100) nodes?

  2. What do we need to trade off for the storage space? Reducing the number of metric series? Reducing the histogram size? Something else?

@glorv
Contributor Author

glorv commented Jun 18, 2022

@breezewish

1. In what scenario do you need to reduce the storage size of TiKV's metrics? Does the user have too many (e.g. >100) nodes?

This is one step toward reducing the cost of TiDB Cloud. The goal is to reduce the total size to 1/10 of the current size, if possible.

2. What do we need to trade off for the storage space? Reducing the number of metric series? Reducing the histogram size? Something else?

We want to reduce the overall size of the metrics stored in Prometheus.

@breezewish
Member

breezewish commented Jun 19, 2022

@glorv Thanks for the additional information. If these metrics can be ignored, how about simply removing them? I'm worried about the usefulness of providing a complex level switch in TiKV, as other people may simply find it hard to configure.

IMO, there are also other options for reducing the metric cost:

  1. Drop Prometheus and use VictoriaMetrics

    No code change is necessary; VictoriaMetrics is simply N times faster, with lower cost.

  2. Reduce the cardinality of the histogram series

    For example, tikv_grpc_msg_duration_seconds_bucket produces 21 series (one per histogram bucket), and there are 44 gRPC kinds, so there are 44 * 21 = 924 series for the single metric tikv_grpc_msg_duration_seconds. If we reduce the histogram to 10 buckets, we cut that cost roughly in half (see the sketch after this list).

    I see that you have already adopted such optimizations in server: further simplify metrics #12732, which is great!

    BTW, VictoriaMetrics claims to resolve this issue without the need to touch any code base or lose precision: https://valyala.medium.com/improving-histogram-usability-for-prometheus-and-grafana-bc7e5df0e350 . I have not experimented with it, however.

  3. Maybe collect only whitelisted series?

    Actually, this can be a collector-side option without touching the TiKV code base at all, i.e. configured in prometheus.yaml.
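
To make option 2 concrete, here is a minimal Rust sketch (using the prometheus crate, the same client library TiKV uses) of how the bucket count drives the series count. The metric names and bucket boundaries below are made up for illustration and are not TiKV's actual definitions:

    use prometheus::{exponential_buckets, HistogramOpts, HistogramVec, Registry};

    fn main() {
        let registry = Registry::new();

        // Full-resolution variant: 21 buckets. With 44 values of the "type"
        // label (one per gRPC kind), this would yield 44 * 21 = 924 bucket
        // series, matching the estimate above.
        let full = HistogramVec::new(
            HistogramOpts::new(
                "demo_grpc_msg_duration_seconds",
                "Duration of handling gRPC messages (full resolution).",
            )
            .buckets(exponential_buckets(0.0005, 2.0, 21).unwrap()),
            &["type"],
        )
        .unwrap();

        // Coarser variant: 10 buckets covering a similar range, so the same
        // 44 gRPC kinds would produce only 44 * 10 = 440 bucket series,
        // roughly halving the storage for this one metric family.
        let coarse = HistogramVec::new(
            HistogramOpts::new(
                "demo_grpc_msg_duration_seconds_coarse",
                "Duration of handling gRPC messages (reduced resolution).",
            )
            .buckets(exponential_buckets(0.001, 4.0, 10).unwrap()),
            &["type"],
        )
        .unwrap();

        registry.register(Box::new(full.clone())).unwrap();
        registry.register(Box::new(coarse.clone())).unwrap();

        // Record one observation per family; encoding registry.gather() with
        // prometheus::TextEncoder would then show 21 vs. 10 "_bucket" lines
        // per label value.
        full.with_label_values(&["kv_get"]).observe(0.003);
        coarse.with_label_values(&["kv_get"]).observe(0.003);
    }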

I'm not sure whether these could be better alternatives to the current implementation in #12732, if they were not considered previously. They may be worth researching later. Anyway, it's good to have #12732 as a first iteration. Great job!

@glorv
Contributor Author

glorv commented Jun 20, 2022

@zhangjinpeng1987 @BusyJay @kevin-xianliu, maybe we can consider VictoriaMetrics. It provides memory, storage space, and performance improvements. If we replace Prometheus with VictoriaMetrics, it seems we can keep all metrics at a much lower storage cost (7x less compared to Prometheus). What do you think?

@glorv
Contributor Author

glorv commented Jun 20, 2022

Reduce the cardinality of the histogram series.

I also considered this, as histograms are the biggest reason there are so many metrics. In my test, we can reduce histogram size in 3 ways:

  1. Reduce the cardinality of the histogram series. Most of them are defined with about 20 buckets; maybe 15 or fewer is enough.
  2. Filter out the non-significant part of a histogram vector, as implemented in server: further simplify metrics #12732.
  3. Compact the bucket data, as implemented in server: add an option to support returning simplified metrics #12417. Since the sample count for most buckets is 0, removing the data for those buckets shrinks the result considerably regardless of the cardinality. I think VictoriaMetrics likely implements some kind of compaction like this as well.

Maybe collect only whitelisted series?

I have consulted some colleagues; they thought all of the metrics were useful in certain specific scenarios, so we still need to provide the ability to return all metrics dynamically.

@breezewish
Member

breezewish commented Jun 20, 2022

@glorv

I have consulted some colleagues; they thought all of the metrics were useful in certain specific scenarios, so we still need to provide the ability to return all metrics dynamically.

I just came up with a simpler way to "filter core metrics" without touching the TiKV code base: we can write allowlist rules in prometheus.yaml in TiDB Cloud and drop the rest. When we want to collect more metrics, we can update the Prometheus config file and reload it.

Advantages:

  • Metrics can be dynamically added back or removed on demand.
  • Compatible with all TiKV versions.
  • No need to introduce a "metric level" to the TiKV code base, resulting in less complexity.

What do you think?
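
For illustration, a minimal prometheus.yml fragment of the kind of allowlist rule being proposed; the job name, target address, and metric name patterns are placeholders, not TiDB Cloud's actual configuration:

    scrape_configs:
      - job_name: "tikv"
        static_configs:
          - targets: ["tikv-0:20180"]
        metric_relabel_configs:
          # Keep only an allowlist of metric families; everything else is
          # dropped before being written to storage.
          - source_labels: [__name__]
            regex: "tikv_grpc_msg_duration_seconds.*|tikv_engine_.*|tikv_thread_cpu_seconds_total"
            action: keep

With action: keep, any scraped series whose __name__ does not match the regex is discarded at collection time, so no TiKV-side change is needed.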

@glorv
Contributor Author

glorv commented Jun 20, 2022

@glorv

I have consulted some colleagues; they thought all of the metrics were useful in certain specific scenarios, so we still need to provide the ability to return all metrics dynamically.

I just came up with a simpler way to "filter core metrics" without touching the TiKV code base: we can write allowlist rules in prometheus.yaml in TiDB Cloud and drop the rest. When we want to collect more metrics, we can update the Prometheus config file and reload it.

Advantages:

  • Metrics can be dynamically added back or removed on demand.
  • Compatible with all TiKV versions.
  • No need to introduce a "metric level" to the TiKV code base, resulting in less complexity.

What do you think?

I would also like to change the TiKV source as little as possible.
I think putting the relabel_config rules in Prometheus may have the following drawbacks:

  • We may want to apply a certain rule to all components (or at least the components with a large metrics size), and maintaining these configs over time may not be easy.
  • We may want to set different rules for different clusters (e.g. keep more metrics for clusters with high load and more incidents).
  • When on-call issues arise, we may want to change the config temporarily to help investigate the root cause. There seems to be no convenient way to change the rules online.

These drawbacks aside, I'm also in favor of this approach, as it is more flexible. @BusyJay PTAL.
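
On the last point, for reference, Prometheus can reload its configuration without a restart, which may make online rule changes less painful. A minimal sketch, assuming Prometheus was started with --web.enable-lifecycle; the file path and address are placeholders:

    # Relax or tighten the allowlist in the scrape config, then hot-reload.
    vi /etc/prometheus/prometheus.yml

    # Option 1: trigger a reload via the lifecycle API
    # (requires the --web.enable-lifecycle flag).
    curl -X POST http://127.0.0.1:9090/-/reload

    # Option 2: send SIGHUP to the prometheus process instead.
    kill -HUP "$(pidof prometheus)"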

@BusyJay
Member

BusyJay commented Jun 20, 2022

👍 Both VictoriaMetrics and configuring prometheus.yaml LGTM. It also seems VictoriaMetrics can be configured as the storage backend of Prometheus, so it may reduce the complexity a lot.

On the other hand, the allowlist can also be used as a way to find the metrics that we think are important but turn out not to be in practice. We can remove them in the end.

glorv added a commit to glorv/tikv that referenced this issue Jun 22, 2022
ref tikv#12355

Signed-off-by: glorv <glorvs@163.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
glorv added a commit to glorv/tikv that referenced this issue Jun 22, 2022
ref tikv#12355, ref tikv#12417

Signed-off-by: glorv <glorvs@163.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
ti-chi-bot added a commit that referenced this issue Jun 23, 2022
ref #12355, ref #12356, ref #12417, ref #12671

Signed-off-by: glorv <glorvs@163.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
VelocityLight added the affects-6.1 label on Aug 17, 2022
ti-chi-bot added a commit that referenced this issue Aug 22, 2022
ref #12355, ref #12356

Signed-off-by: ti-srebot <ti-srebot@pingcap.com>

Co-authored-by: glorv <glorvs@163.com>
Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
ti-chi-bot added a commit that referenced this issue Aug 22, 2022
ref #12355, ref #12417, ref #12602

Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
Signed-off-by: glorv <glorvs@163.com>

Co-authored-by: glorv <glorvs@163.com>
Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>

4 participants