
tikv should provide an option to return less metrics #12355

Open · 2 tasks
glorv opened this issue Apr 13, 2022 · 8 comments
Labels
affects-6.1: This bug affects the 6.1.x (LTS) versions.
type/feature-request: Categorizes issue or PR as related to a new feature.

Comments

@glorv
Contributor

glorv commented Apr 13, 2022

Feature Request

Is your feature request related to a problem? Please describe:

Currently, TiKV produces too many metrics, which can consume a lot of network bandwidth and storage space.

Describe the feature you'd like:

  • Allow the HTTP response body for metrics to be returned gzip-compressed (see the sketch below this list).
  • Provide an option that returns simplified metrics.
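
A rough sketch of how the compression benefit could be checked once such an option exists; the address below is TiKV's default status port (20180), and the exact behavior would depend on the implementation. Note that Prometheus itself already sends Accept-Encoding: gzip when scraping:

    # Hypothetical size comparison of the /metrics response with and
    # without compression (assumes the default status address 127.0.0.1:20180).
    curl -s -o metrics.txt http://127.0.0.1:20180/metrics
    curl -s -H 'Accept-Encoding: gzip' -o metrics.gz http://127.0.0.1:20180/metrics
    ls -lh metrics.txt metrics.gz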

Describe alternatives you've considered:

Teachability, Documentation, Adoption, Migration Strategy:

glorv added the type/feature-request label on Apr 13, 2022
glorv changed the title from "tikv should support gzip compression for metrics response" to "tikv should provide an option to return less metrics" on Apr 22, 2022
ti-chi-bot added a commit that referenced this issue May 25, 2022
ref #12355

Signed-off-by: glorv <glorvs@163.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
ti-chi-bot added a commit that referenced this issue May 25, 2022
ref #12355, ref #12417

Signed-off-by: glorv <glorvs@163.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
YuJuncen pushed a commit to YuJuncen/tikv that referenced this issue May 26, 2022
ref tikv#12355

Signed-off-by: glorv <glorvs@163.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
Signed-off-by: Yu Juncen <yujuncen@pingcap.com>
YuJuncen pushed a commit to YuJuncen/tikv that referenced this issue May 26, 2022
ref tikv#12355, ref tikv#12417

Signed-off-by: glorv <glorvs@163.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
Signed-off-by: Yu Juncen <yujuncen@pingcap.com>
@breezewish
Member

I have a few questions related to reducing the number of series:

  1. In what scenario do you need to reduce the storage size of TiKV's metrics? Does the user have too many (e.g. >100) nodes?

  2. What do we need to trade off for the storage space? Reducing the number of metric series? Reducing the histogram size? Something else?

@glorv
Contributor Author

glorv commented Jun 18, 2022

@breezewish

1. In what scenario do you need to reduce the storage size of TiKV's metrics? Does the user have too many (e.g. >100) nodes?

This is one step toward reducing the cost of TiDB Cloud. The goal is to reduce the total size to 1/10 of the current size, if possible.

2. What do we need to trade off for the storage space? Reducing the number of metric series? Reducing the histogram size? Something else?

We want to reduce the overall size of the metrics stored in Prometheus.

@breezewish
Member

breezewish commented Jun 19, 2022

@glorv Thanks for the additional information. If these metrics can be ignored, how about simply removing them? I'm worried about the usefulness of providing a complex level switch in TiKV, as other people may simply find it hard to configure.

IMO, there are also other options for reducing the metric cost:

  1. Drop Prometheus and use VictoriaMetrics

    No code change is necessary; VictoriaMetrics is simply N times faster, with lower cost.

  2. Reduce the cardinality of the histogram series

    For example, tikv_grpc_msg_duration_seconds_bucket produces 21 series (one per histogram bucket), and there are 44 gRPC kinds, so there are 44 * 21 = 924 series for the single metric tikv_grpc_msg_duration_seconds. If we reduce the histogram to 10 buckets, we cut that cost roughly in half (see the sketch after this list).

    I see that you have already adopted such optimizations in server: further simplify metrics #12732, which is great!

    BTW, VictoriaMetrics claims to resolve this issue without the need to touch any code base or lose precision: https://valyala.medium.com/improving-histogram-usability-for-prometheus-and-grafana-bc7e5df0e350 . I have not experimented with it, however.

  3. Maybe collect only whitelisted series?

    Actually, this can be a collector-side option without touching the TiKV code base at all, i.e. configured in prometheus.yaml.
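
To make option 2 concrete, here is a minimal Rust sketch (using the prometheus crate, the same client library TiKV uses) of how the bucket count drives the series count. The metric names and bucket boundaries below are made up for illustration and are not TiKV's actual definitions:

    use prometheus::{exponential_buckets, HistogramOpts, HistogramVec, Registry};

    fn main() {
        let registry = Registry::new();

        // Full-resolution variant: 21 buckets. With 44 values of the "type"
        // label (one per gRPC kind), this would yield 44 * 21 = 924 bucket
        // series, matching the estimate above.
        let full = HistogramVec::new(
            HistogramOpts::new(
                "demo_grpc_msg_duration_seconds",
                "Duration of handling gRPC messages (full resolution).",
            )
            .buckets(exponential_buckets(0.0005, 2.0, 21).unwrap()),
            &["type"],
        )
        .unwrap();

        // Coarser variant: 10 buckets covering a similar range, so the same
        // 44 gRPC kinds would produce only 44 * 10 = 440 bucket series,
        // roughly halving the storage for this one metric family.
        let coarse = HistogramVec::new(
            HistogramOpts::new(
                "demo_grpc_msg_duration_seconds_coarse",
                "Duration of handling gRPC messages (reduced resolution).",
            )
            .buckets(exponential_buckets(0.001, 4.0, 10).unwrap()),
            &["type"],
        )
        .unwrap();

        registry.register(Box::new(full.clone())).unwrap();
        registry.register(Box::new(coarse.clone())).unwrap();

        // Record one observation per family; encoding registry.gather() with
        // prometheus::TextEncoder would then show 21 vs. 10 "_bucket" lines
        // per label value.
        full.with_label_values(&["kv_get"]).observe(0.003);
        coarse.with_label_values(&["kv_get"]).observe(0.003);
    }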

I'm not sure whether these could be better alternatives to the current implementation in #12732, if they were not considered previously. They may be worth researching later. Anyway, it's good to have #12732 as a first iteration. Great job!

@glorv
Contributor Author

glorv commented Jun 20, 2022

@zhangjinpeng1987 @BusyJay @kevin-xianliu, maybe we can consider VictoriaMetrics. It provides memory, storage space, and performance improvements. If we replace Prometheus with VictoriaMetrics, it seems we can keep all metrics at a much lower storage cost (7x less compared to Prometheus). What do you think?

@glorv
Contributor Author

glorv commented Jun 20, 2022

Reduce the cardinality of the histogram series.

I also considered this, as histograms are the biggest reason there are so many metrics. In my test, we can reduce histogram size in 3 ways:

  1. Reduce the cardinality of the histogram series. Most of them are defined with about 20 buckets; maybe 15 or fewer is enough.
  2. Filter out the non-significant part of a histogram vector, as implemented in server: further simplify metrics #12732.
  3. Compact the bucket data, as implemented in server: add an option to support returning simplified metrics #12417. Since the sample count for most buckets is 0, removing the data for those buckets shrinks the result considerably regardless of the cardinality. I think VictoriaMetrics likely implements some kind of compaction like this as well.

Maybe collect only whitelisted series?

I have consulted some colleagues; they thought all of the metrics were useful in certain specific scenarios, so we still need to provide the ability to return all metrics dynamically.

@breezewish
Member

breezewish commented Jun 20, 2022

@glorv

I have consulted some colleagues; they thought all of the metrics were useful in certain specific scenarios, so we still need to provide the ability to return all metrics dynamically.

I just came up with a simpler way to "filter core metrics" without touching the TiKV code base: we can write allowlist rules in prometheus.yaml in TiDB Cloud and drop the rest. When we want to collect more metrics, we can update the Prometheus config file and reload it.

Advantages:

  • Metrics can be dynamically added back or removed on demand.
  • Compatible with all TiKV versions.
  • No need to introduce a "metric level" to the TiKV code base, resulting in less complexity.

What do you think?
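
For illustration, a minimal prometheus.yml fragment of the kind of allowlist rule being proposed; the job name, target address, and metric name patterns are placeholders, not TiDB Cloud's actual configuration:

    scrape_configs:
      - job_name: "tikv"
        static_configs:
          - targets: ["tikv-0:20180"]
        metric_relabel_configs:
          # Keep only an allowlist of metric families; everything else is
          # dropped before being written to storage.
          - source_labels: [__name__]
            regex: "tikv_grpc_msg_duration_seconds.*|tikv_engine_.*|tikv_thread_cpu_seconds_total"
            action: keep

With action: keep, any scraped series whose __name__ does not match the regex is discarded at collection time, so no TiKV-side change is needed.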

@glorv
Contributor Author

glorv commented Jun 20, 2022

@glorv

I have consulted some colleagues; they thought all of the metrics were useful in certain specific scenarios, so we still need to provide the ability to return all metrics dynamically.

I just came up with a simpler way to "filter core metrics" without touching the TiKV code base: we can write allowlist rules in prometheus.yaml in TiDB Cloud and drop the rest. When we want to collect more metrics, we can update the Prometheus config file and reload it.

Advantages:

  • Metrics can be dynamically added back or removed on demand.
  • Compatible with all TiKV versions.
  • No need to introduce a "metric level" to the TiKV code base, resulting in less complexity.

What do you think?

I would also like to change the TiKV source as little as possible.
I think putting the relabel_config rules in Prometheus may have the following drawbacks:

  • We may want to apply a certain rule to all components (or at least the components with a large metrics size), and maintaining these configs over time may not be easy.
  • We may want to set different rules for different clusters (e.g. keep more metrics for clusters with high load and more incidents).
  • When on-call issues arise, we may want to change the config temporarily to help investigate the root cause. There seems to be no convenient way to change the rules online.

These drawbacks aside, I'm also in favor of this approach, as it is more flexible. @BusyJay PTAL.
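
On the last point, for reference, Prometheus can reload its configuration without a restart, which may make online rule changes less painful. A minimal sketch, assuming Prometheus was started with --web.enable-lifecycle; the file path and address are placeholders:

    # Relax or tighten the allowlist in the scrape config, then hot-reload.
    vi /etc/prometheus/prometheus.yml

    # Option 1: trigger a reload via the lifecycle API
    # (requires the --web.enable-lifecycle flag).
    curl -X POST http://127.0.0.1:9090/-/reload

    # Option 2: send SIGHUP to the prometheus process instead.
    kill -HUP "$(pidof prometheus)"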

@BusyJay
Member

BusyJay commented Jun 20, 2022

👍 Both VictoriaMetrics and configuring prometheus.yaml LGTM. It also seems VictoriaMetrics can be configured as the storage backend of Prometheus, so it may reduce the complexity a lot.

On the other hand, the allowlist can also be used as a way to find the metrics that we think are important but turn out not to be in practice. We can remove them in the end.

glorv added a commit to glorv/tikv that referenced this issue Jun 22, 2022
ref tikv#12355

Signed-off-by: glorv <glorvs@163.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
glorv added a commit to glorv/tikv that referenced this issue Jun 22, 2022
ref tikv#12355, ref tikv#12417

Signed-off-by: glorv <glorvs@163.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
ti-chi-bot added a commit that referenced this issue Jun 23, 2022
ref #12355, ref #12356, ref #12417, ref #12671

Signed-off-by: glorv <glorvs@163.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
VelocityLight added the affects-6.1 label on Aug 17, 2022
ti-chi-bot added a commit that referenced this issue Aug 22, 2022
ref #12355, ref #12356

Signed-off-by: ti-srebot <ti-srebot@pingcap.com>

Co-authored-by: glorv <glorvs@163.com>
Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
ti-chi-bot added a commit that referenced this issue Aug 22, 2022
ref #12355, ref #12417, ref #12602

Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
Signed-off-by: glorv <glorvs@163.com>

Co-authored-by: glorv <glorvs@163.com>
Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>

4 participants