Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix/prometheus rules missing #1073

Closed
wants to merge 13 commits into from

Conversation

9547
Copy link
Contributor

@9547 9547 commented Jan 16, 2021

What problem does this PR solve?

I'm deploying a cluster with rule_dir set in topology.yaml:

  monitoring_servers:
    - host: 172.19.0.101
      rule_dir: /tmp/local/prometheus

After deploy, found the nodes.rules.yml, tikv.rules.yml and other rules under /home/tidb/deploy/prometheus/conf/ were missing.

Should be merged after #1075

What is changed and how it works?

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Code changes

  • Has exported function/method change
  • Has exported variable/fields change
  • Has interface methods change
  • Has persistent data change

Side effects

  • Possible performance regression
  • Increased code complexity
  • Breaking backward compatibility

Related changes

  • Need to cherry-pick to the release branch
  • Need to update the documentation

Release notes:

NONE

@ti-chi-bot ti-chi-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jan 16, 2021
@ti-chi-bot ti-chi-bot requested a review from lucklove January 16, 2021 02:03
@codecov-io
Copy link

codecov-io commented Jan 16, 2021

Codecov Report

Merging #1073 (d37b230) into master (7f94ef7) will increase coverage by 0.03%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1073      +/-   ##
==========================================
+ Coverage   56.65%   56.69%   +0.03%     
==========================================
  Files         283      283              
  Lines       20045    20044       -1     
==========================================
+ Hits        11357    11363       +6     
+ Misses       6935     6927       -8     
- Partials     1753     1754       +1     
Flag Coverage Δ
cluster 45.18% <83.33%> (+0.02%) ⬆️
dm 24.88% <100.00%> (-0.02%) ⬇️
integrate 51.09% <100.00%> (+0.03%) ⬆️
playground 20.89% <ø> (ø)
tiup 16.33% <0.00%> (-0.01%) ⬇️
unittest 22.98% <83.33%> (+0.18%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pkg/cluster/embed/autogen_pkger.go 100.00% <ø> (ø)
pkg/cluster/spec/grafana.go 68.75% <100.00%> (ø)
pkg/cluster/spec/prometheus.go 81.87% <100.00%> (+2.50%) ⬆️
pkg/cluster/api/dmapi.go 58.26% <0.00%> (-1.74%) ⬇️
pkg/cluster/manager/display.go 83.33% <0.00%> (+1.38%) ⬆️
pkg/cliutil/ssh.go 18.91% <0.00%> (+2.70%) ⬆️
components/dm/command/scale_out.go 88.23% <0.00%> (+11.76%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7f94ef7...d37b230. Read the comment docs.

@ti-chi-bot ti-chi-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 16, 2021
@lucklove
Copy link
Member

I'm a little confused, it seems what you changed has nothing to do with RuleDir, it's DashboardDir which is used to config grafana instead of prometheus.

Could you explain more about how it works?

@9547
Copy link
Contributor Author

9547 commented Jan 18, 2021

I'm a little confused, it seems what you changed has nothing to do with RuleDir, it's DashboardDir which is used to config grafana instead of prometheus.

Could you explain more about how it works?

Sorry for the confusing, It's both fixed for RuleDir and DashboardDir.

First of all, let's talk about the usage of rule_dir for Prometheus and dashboard_dir for Grafana. IMO, those fields are an addition to the package provided rules and dashboards, because the provided ones are sufficient for the normal use case.

But in the current implementation, if those fields were set, then It will only transfer the local files into the remote, without copy the provided ones too.

if spec.RuleDir != "" {
return i.TransferLocalConfigDir(e, spec.RuleDir, path.Join(paths.Deploy, "conf"), func(name string) bool {
return strings.HasSuffix(name, ".rules.yml")
})
}
and this way the testcases failed for #1074

@9547 9547 force-pushed the fix/prometheus-rules-missing branch from 32e801e to a772f6a Compare January 18, 2021 15:04
@lucklove
Copy link
Member

lucklove commented Jan 19, 2021

IMO, those fields are an addition to the package provided rules and dashboards, because the provided ones are sufficient for the normal use case.

But in the current implementation, if those fields were set, then It will only transfer the local files into the remote, without copy the provided ones too.

This is by designed, because the modify also includes remove some dashboards that not applicable for some users' case (for example, one user may not deploy tidb server at all, so he don't want see the tidb dashboard).

If we implement the dir as additional, the user can only add files and once the file added, he can never remove it with TiUP, it's not self-consistent for TiUP, I think.

@9547
Copy link
Contributor Author

9547 commented Jan 19, 2021

If the Grafana or Prometheus was deployed independently(without tidb or dm), then the default dashboards or rules seems redundant. But I 🤔 if someone uses TiUP, he'd prefer to deploy Prometheus with the full tidb or dm stack.

If the user want to delete the dashboards or rules, maybe he can replace it with an empty file(not tested)

@lucklove
Copy link
Member

If the Grafana or Prometheus was deployed independently(without tidb or dm), then the default dashboards or rules seems redundant. But I 🤔 if someone uses TiUP, he'd prefer to deploy Prometheus with the full tidb or dm stack.

If the user want to delete the dashboards or rules, maybe he can replace it with an empty file(not tested)

I have tried replacing node.json with empty file, the grafana can start, but with error logs:
屏幕快照 2021-01-20 下午12 32 09

@9547
Copy link
Contributor Author

9547 commented Jan 20, 2021

A file with empty JSON seems worked

image

@9547
Copy link
Contributor Author

9547 commented Jan 20, 2021

Let me see if there's a better solution or tradeoff.

@9547
Copy link
Contributor Author

9547 commented Jan 20, 2021

@lucklove Let's put aside the Grafana dashboards, and please look at the Prometheus issue.
Currently, the prometheus.yml template will always include those rules if any host is monitored:

{{- if .MonitoredServers}}
- 'node.rules.yml'
- 'blacker.rules.yml'
- 'bypass.rules.yml'
{{- end}}
but if the user specifies the rule_dir but without those rules, then the prometheus.yml is actually problematic and the rules specified by the user are not working.

For this bug, I'll try to introduce a new member CustomRules, and iterator the rule_dir and append CustomRules:

 rule_files:
{{- if .MonitoredServers}}
  - 'node.rules.yml'
  - 'blacker.rules.yml'
  - 'bypass.rules.yml'
{{- end}}
{{- range .CustomRules}}
  - '{{.}}'
{{- end}}

@lucklove
Copy link
Member

@lucklove Let's put aside the Grafana dashboards, and please look at the Prometheus issue.
Currently, the prometheus.yml template will always include those rules if any host is monitored:

{{- if .MonitoredServers}}
- 'node.rules.yml'
- 'blacker.rules.yml'
- 'bypass.rules.yml'
{{- end}}

but if the user specifies the rule_dir but without those rules, then the prometheus.yml is actually problematic and the rules specified by the user are not working.
For this bug, I'll try to introduce a new member CustomRules, and iterator the rule_dir and append CustomRules:

 rule_files:
{{- if .MonitoredServers}}
  - 'node.rules.yml'
  - 'blacker.rules.yml'
  - 'bypass.rules.yml'
{{- end}}
{{- range .CustomRules}}
  - '{{.}}'
{{- end}}

In this way, there may be two node.rules.yaml if the user provide it in his directory:

- node.rules.yml # one for {{- if .MonitoredServers}}
- node.rules.yml # one for {{- range .CustomRules}}

@9547
Copy link
Contributor Author

9547 commented Jan 21, 2021

Those rules will be ignored if duplicated:

  • 'node.rules.yml'
  • 'blacker.rules.yml'
  • 'bypass.rules.yml'

The user given ones has higher priority

@lucklove
Copy link
Member

Those rules will be ignored if duplicated

If duplicated rules will not trigger any error, I think it's OK

@lucklove
Copy link
Member

Sorry that there is something conflict after refactoring the context: #1069

@ti-chi-bot ti-chi-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 22, 2021
@9547 9547 force-pushed the fix/prometheus-rules-missing branch from 77e5dd0 to 3e3f7c3 Compare January 26, 2021 15:21
@ti-chi-bot ti-chi-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 26, 2021
@9547
Copy link
Contributor Author

9547 commented Jan 26, 2021

/retest

@9547
Copy link
Contributor Author

9547 commented Jan 26, 2021

Seems the bot can't respond to the /retest command

@9547
Copy link
Contributor Author

9547 commented Jan 26, 2021

/test all

@9547
Copy link
Contributor Author

9547 commented Jan 27, 2021

@lucklove PTAL, the current implementation will append the Grafana Dashboards or Prometheus ruls if dashbaoard_url or rule_dir specified.

@lucklove
Copy link
Member

@lucklove PTAL, the current implementation will append the Grafana Dashboards or Prometheus ruls if dashbaoard_url or rule_dir specified.

It seems we didn't solve the deletion problem yet: the user can only append rule/dashboard (we can use an empty file, but it will print error logs and it's not graceful)

@lucklove
Copy link
Member

lucklove commented Jan 28, 2021

IMO, it's not a good idea to change current behavior since it's documented and released long ago:

if some user already deployed a cluster with local directory and delete some rule/dashboard from that directory, after he upgrade tiup-cluster, the removed rule/dashboard will be back again. And the removed rule will fire some alert that he didn't care at all. This is unexpected.

@9547
Copy link
Contributor Author

9547 commented Jan 28, 2021

IMO, it's not a good idea to change current behavior since it's documented and released long ago:

if some user already deployed a cluster with local directory and delete some rule/dashboard from that directory, after he upgrade tiup-cluster, the removed rule/dashboard will be back again. And the removed rule will fire some alert that he didn't care at all. This is unexpected.

I will hold on to this PR for the moment and consider it when I have a suitable opportunity next time

@9547
Copy link
Contributor Author

9547 commented Jan 28, 2021

/hold

@ti-chi-bot ti-chi-bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jan 28, 2021
@ti-chi-bot
Copy link
Member

@9547: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@9547
Copy link
Contributor Author

9547 commented Jan 30, 2021

I'm just going to close this first.

@9547 9547 closed this Jan 30, 2021
@9547 9547 deleted the fix/prometheus-rules-missing branch April 6, 2021 23:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants