Skip to content

Commit

Permalink
Allow custom rule for promql/range_query
Browse files Browse the repository at this point in the history
Fixes #1064
  • Loading branch information
prymitive committed Aug 20, 2024
1 parent afc1fbf commit d642795
Show file tree
Hide file tree
Showing 13 changed files with 363 additions and 18 deletions.
1 change: 0 additions & 1 deletion cmd/pint/tests/0003_lint_workdir.txt
Original file line number Diff line number Diff line change
Expand Up @@ -155,4 +155,3 @@ rule {
strip = [ "instance" ]
}
}

28 changes: 28 additions & 0 deletions cmd/pint/tests/0181_range_query_max.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
env NO_COLOR=1
pint.ok --no-color lint --min-severity=info rules
! stdout .
cmp stderr stderr.txt

-- stderr.txt --
level=INFO msg="Loading configuration file" path=.pint.hcl
level=INFO msg="Finding all rules to check" paths=["rules"]
rules/0001.yaml:2 Warning: `errors[1h1s]` selector is trying to query Prometheus for 1h1s worth of metrics, but 1h is the maximum allowed range query. (promql/range_query)
2 | expr: sum(rate(errors[1h1s])) > 0.5

level=INFO msg="Problems found" Warning=1
-- rules/0001.yaml --
- alert: Error Rate
expr: sum(rate(errors[1h1s])) > 0.5

- alert: Error Rate
expr: sum(rate(errors[1h])) > 0.5

-- .pint.hcl --
parser {
relaxed = [".*"]
}
rule {
range_query {
max = "1h"
}
}
30 changes: 30 additions & 0 deletions cmd/pint/tests/0182_range_query_custom_severity.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
env NO_COLOR=1
pint.error --no-color lint --min-severity=info rules
! stdout .
cmp stderr stderr.txt

-- stderr.txt --
level=INFO msg="Loading configuration file" path=.pint.hcl
level=INFO msg="Finding all rules to check" paths=["rules"]
rules/0001.yaml:2 Bug: `errors[1h1s]` selector is trying to query Prometheus for 1h1s worth of metrics, but 1h is the maximum allowed range query. (promql/range_query)
2 | expr: sum(rate(errors[1h1s])) > 0.5

level=INFO msg="Problems found" Bug=1
level=ERROR msg="Fatal error" err="found 1 problem(s) with severity Bug or higher"
-- rules/0001.yaml --
- alert: Error Rate
expr: sum(rate(errors[1h1s])) > 0.5

- alert: Error Rate
expr: sum(rate(errors[1h])) > 0.5

-- .pint.hcl --
parser {
relaxed = [".*"]
}
rule {
range_query {
max = "1h"
severity = "bug"
}
}
2 changes: 2 additions & 0 deletions docs/changelog.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,8 @@

- [promql/regexp](checks/promql/regexp.md) check will now look for smelly regexp selectors.
See check docs for details.
- [promql/range_query](checks/promql/range_query.md) now allows to configure a custom maximum
duration for range queries - #1064.

## v0.64.1

Expand Down
56 changes: 50 additions & 6 deletions docs/checks/promql/range_query.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,8 +14,8 @@ By default Prometheus keeps [15 days of data](https://prometheus.io/docs/prometh
this can be customised by setting time or disk space limits.
There are two main ways of configuring retention limits in Prometheus:

* time based - Prometheus will keep last N days of metrics
* disk based - Prometheus will try to use up to N bytes of disk space.
- time based - Prometheus will keep last N days of metrics
- disk based - Prometheus will try to use up to N bytes of disk space.

Pint will ignore any disk space limits, since that doesn't tell us
what the effective time retention is.
Expand All @@ -34,13 +34,30 @@ getting results of a `avg_over_time(foo[40d])` you are getting the average
value of `foo` in the last 40 days, but in reality you're only getting
an average value in the last 30 days, and you cannot get any more than that.

You can also configure your own maximum allowed range duration if you want
to ensure that all queries are never requesting more than allowed range.
This can be done by adding a configuration rule as below.

## Configuration

This check doesn't have any configuration options.
Syntax:

```js
range_query {
max = "2h"
comment = "..."
severity = "bug|warning|info"
}
```

- `max` - duration for the maximum allowed query range.
- `comment` - set a custom comment that will be added to reported problems.
- `severity` - set custom severity for reported issues, defaults to `warning`.

## How to enable it

This check is enabled by default for all configured Prometheus servers.
This check is enabled by default for all configured Prometheus servers and will
validate that queries don't use ranges longer than configured Prometheus retention.

Example:

Expand All @@ -64,6 +81,19 @@ prometheus "dev" {
}
```

Additionally you can configure an extra rule that will enforce a custom maximum
query range duration:

```js
rule {
range_query {
max = "4h"
comment = "You cannot use range queries with range more than 4h"
severity = "bug"
}
}
```

## How to disable it

You can disable this check globally by adding this config block:
Expand Down Expand Up @@ -102,6 +132,20 @@ Example:
# pint disable promql/range_query(prod)
```

To disable a custom maximum range duration rule use:

```yaml
# pint disable promql/range_query($duration)
```

Where `$duration` is the value of `max` option in `range_query` rule.

Example:

```yaml
# pint disable promql/range_query(4h)
```

## How to snooze it

You can disable this check until given time by adding a comment to it. Example:
Expand All @@ -111,6 +155,6 @@ You can disable this check until given time by adding a comment to it. Example:
```

Where `$TIMESTAMP` is either use [RFC3339](https://www.rfc-editor.org/rfc/rfc3339)
formatted or `YYYY-MM-DD`.
Adding this comment will disable `promql/range_query` *until* `$TIMESTAMP`, after that
formatted or `YYYY-MM-DD`.
Adding this comment will disable `promql/range_query` _until_ `$TIMESTAMP`, after that
check will be re-enabled.
40 changes: 31 additions & 9 deletions internal/checks/promql_range_query.go
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@ import (
promParser "github.com/prometheus/prometheus/promql/parser"

"github.com/cloudflare/pint/internal/discovery"
"github.com/cloudflare/pint/internal/output"
"github.com/cloudflare/pint/internal/parser"
"github.com/cloudflare/pint/internal/promapi"
)
Expand All @@ -17,12 +18,15 @@ const (
RangeQueryCheckName = "promql/range_query"
)

func NewRangeQueryCheck(prom *promapi.FailoverGroup) RangeQueryCheck {
return RangeQueryCheck{prom: prom}
func NewRangeQueryCheck(prom *promapi.FailoverGroup, limit time.Duration, comment string, severity Severity) RangeQueryCheck {
return RangeQueryCheck{prom: prom, limit: limit, comment: comment, severity: severity}
}

type RangeQueryCheck struct {
prom *promapi.FailoverGroup
prom *promapi.FailoverGroup
comment string
limit time.Duration
severity Severity
}

func (c RangeQueryCheck) Meta() CheckMeta {
Expand All @@ -38,6 +42,9 @@ func (c RangeQueryCheck) Meta() CheckMeta {
}

func (c RangeQueryCheck) String() string {
if c.limit > 0 {
return fmt.Sprintf("%s(%s)", RangeQueryCheckName, output.HumanizeDuration(c.limit))
}
return fmt.Sprintf("%s(%s)", RangeQueryCheckName, c.prom.Name())
}

Expand All @@ -47,11 +54,26 @@ func (c RangeQueryCheck) Reporter() string {

func (c RangeQueryCheck) Check(ctx context.Context, _ discovery.Path, rule parser.Rule, _ []discovery.Entry) (problems []Problem) {
expr := rule.Expr()

if expr.SyntaxError != nil {
return problems
}

if c.limit > 0 {
for _, problem := range c.checkNode(ctx, expr.Query, c.limit, fmt.Sprintf("%s is the maximum allowed range query.", model.Duration(c.limit))) {
problems = append(problems, Problem{
Lines: expr.Value.Lines,
Reporter: c.Reporter(),
Text: problem.text,
Details: maybeComment(c.comment),
Severity: c.severity,
})
}
}

if c.prom == nil || len(problems) > 0 {
return problems
}

flags, err := c.prom.Flags(ctx)
if err != nil {
text, severity := textAndSeverityFromError(err, c.Reporter(), c.prom.Name(), Warning)
Expand Down Expand Up @@ -84,7 +106,7 @@ func (c RangeQueryCheck) Check(ctx context.Context, _ discovery.Path, rule parse
retention = time.Hour * 24 * 15
}

for _, problem := range c.checkNode(ctx, expr.Query, retention, flags.URI) {
for _, problem := range c.checkNode(ctx, expr.Query, retention, fmt.Sprintf("%s is configured to only keep %s of metrics history.", promText(c.prom.Name(), flags.URI), model.Duration(retention))) {
problems = append(problems, Problem{
Lines: expr.Value.Lines,
Reporter: c.Reporter(),
Expand All @@ -96,20 +118,20 @@ func (c RangeQueryCheck) Check(ctx context.Context, _ discovery.Path, rule parse
return problems
}

func (c RangeQueryCheck) checkNode(ctx context.Context, node *parser.PromQLNode, retention time.Duration, uri string) (problems []exprProblem) {
func (c RangeQueryCheck) checkNode(ctx context.Context, node *parser.PromQLNode, retention time.Duration, reason string) (problems []exprProblem) {
if n, ok := node.Expr.(*promParser.MatrixSelector); ok {
if n.Range > retention {
problems = append(problems, exprProblem{
expr: node.Expr.String(),
text: fmt.Sprintf("`%s` selector is trying to query Prometheus for %s worth of metrics, but %s is configured to only keep %s of metrics history.",
node.Expr, model.Duration(n.Range), promText(c.prom.Name(), uri), model.Duration(retention)),
text: fmt.Sprintf("`%s` selector is trying to query Prometheus for %s worth of metrics, but %s",
node.Expr, model.Duration(n.Range), reason),
severity: Warning,
})
}
}

for _, child := range node.Children {
problems = append(problems, c.checkNode(ctx, child, retention, uri)...)
problems = append(problems, c.checkNode(ctx, child, retention, reason)...)
}

return problems
Expand Down
34 changes: 33 additions & 1 deletion internal/checks/promql_range_query_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -3,14 +3,19 @@ package checks_test
import (
"fmt"
"testing"
"time"

"github.com/cloudflare/pint/internal/checks"
"github.com/cloudflare/pint/internal/parser"
"github.com/cloudflare/pint/internal/promapi"
)

func newRangeQueryCheck(prom *promapi.FailoverGroup) checks.RuleChecker {
return checks.NewRangeQueryCheck(prom)
return checks.NewRangeQueryCheck(prom, 0, "", checks.Fatal)
}

func newRangeQueryCheckWithLimit(prom *promapi.FailoverGroup) checks.RuleChecker {
return checks.NewRangeQueryCheck(prom, time.Hour*4, "some text", checks.Bug)
}

func retentionToLow(name, uri, metric, qr, retention string) string {
Expand Down Expand Up @@ -214,6 +219,33 @@ func TestRangeQueryCheck(t *testing.T) {
},
},
},
{
description: "limit / 3h",
content: "- record: foo\n expr: rate(foo[3h])\n",
checker: newRangeQueryCheckWithLimit,
prometheus: noProm,
problems: noProblems,
},
{
description: "limit / 5h",
content: "- record: foo\n expr: rate(foo[5h])\n",
checker: newRangeQueryCheckWithLimit,
prometheus: noProm,
problems: func(_ string) []checks.Problem {
return []checks.Problem{
{
Lines: parser.LineRange{
First: 2,
Last: 2,
},
Reporter: "promql/range_query",
Text: "`foo[5h]` selector is trying to query Prometheus for 5h worth of metrics, but 4h is the maximum allowed range query.",
Details: "Rule comment: some text",
Severity: checks.Bug,
},
}
},
},
}
runTests(t, testCases)
}
48 changes: 48 additions & 0 deletions internal/config/__snapshots__/config_test.snap
Original file line number Diff line number Diff line change
Expand Up @@ -2763,3 +2763,51 @@
]
}
---

[TestGetChecksForRule/custom_range_query - 1]
{
"ci": {
"baseBranch": "master",
"maxCommits": 20
},
"parser": {},
"checks": {
"enabled": [
"alerts/absent",
"alerts/annotation",
"alerts/count",
"alerts/external_labels",
"alerts/for",
"alerts/template",
"labels/conflict",
"promql/aggregate",
"alerts/comparison",
"promql/fragile",
"promql/range_query",
"promql/rate",
"promql/regexp",
"promql/syntax",
"promql/vector_matching",
"query/cost",
"promql/counter",
"promql/series",
"rule/dependency",
"rule/duplicate",
"rule/for",
"rule/name",
"rule/label",
"rule/link",
"rule/reject"
]
},
"owners": {},
"rules": [
{
"range_query": {
"max": "1h",
"severity": "bug"
}
}
]
}
---
2 changes: 1 addition & 1 deletion internal/config/config.go
Original file line number Diff line number Diff line change
Expand Up @@ -134,7 +134,7 @@ func (cfg *Config) GetChecksForRule(ctx context.Context, gen *PrometheusGenerato
})
allChecks = append(allChecks, checkMeta{
name: checks.RangeQueryCheckName,
check: checks.NewRangeQueryCheck(p),
check: checks.NewRangeQueryCheck(p, 0, "", checks.Warning),
tags: p.Tags(),
})
allChecks = append(allChecks, checkMeta{
Expand Down
Loading

0 comments on commit d642795

Please sign in to comment.