Add support for exclude_alerts flag in ListRules API #6011
Force-pushed from 909ccb9 to d4dc41a.
+1 to @rapphil.
And let's add a changelog entry.
@rajagopalanand this still needs the unit test update, right?
Will add tests.
Force-pushed from d4dc41a to 81601e9.
Force-pushed from 84bd9f0 to add416e.
@@ -135,11 +135,18 @@ func (a *API) PrometheusRules(w http.ResponseWriter, req *http.Request) {
		return
	}

	excludeAlerts, err := parseExcludeAlerts(req)
	if err != nil {
Do we need to include the error in the response? Can we align with the Prometheus API error response?
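For reference, here is a minimal sketch of what aligning with the Prometheus API could look like. It is not the code in this PR: `parseExcludeAlertsValue`, `badDataBody`, and `rulesHandler` are hypothetical names, and the only thing taken from Prometheus is its documented JSON error envelope (`status`, `errorType`, `error`).

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"strconv"
	"strings"
)

// apiError mirrors the JSON envelope the Prometheus HTTP API uses for errors.
type apiError struct {
	Status    string `json:"status"`
	ErrorType string `json:"errorType"`
	Error     string `json:"error"`
}

// parseExcludeAlertsValue interprets the raw exclude_alerts query value.
// An empty value means the flag was not set, so alerts are included.
func parseExcludeAlertsValue(raw string) (bool, error) {
	raw = strings.ToLower(strings.TrimSpace(raw))
	if raw == "" {
		return false, nil
	}
	exclude, err := strconv.ParseBool(raw)
	if err != nil {
		return false, fmt.Errorf("error converting exclude_alerts: %w", err)
	}
	return exclude, nil
}

// badDataBody renders an error as a Prometheus-style "bad_data" response body.
func badDataBody(err error) string {
	body, _ := json.Marshal(apiError{Status: "error", ErrorType: "bad_data", Error: err.Error()})
	return string(body)
}

// rulesHandler is a hypothetical stand-in for the real rules endpoint.
func rulesHandler(w http.ResponseWriter, req *http.Request) {
	exclude, err := parseExcludeAlertsValue(req.URL.Query().Get("exclude_alerts"))
	if err != nil {
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusBadRequest)
		fmt.Fprint(w, badDataBody(err))
		return
	}
	fmt.Fprintf(w, "exclude_alerts=%t\n", exclude)
}

func main() {
	// An invalid value yields a 400-style body in the Prometheus envelope.
	_, err := parseExcludeAlertsValue("maybe")
	fmt.Println(badDataBody(err))
}
```

Keeping the parsing separate from the HTTP plumbing makes the flag easy to test without building a request.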
Force-pushed from add416e to cbc6585.
Thanks!
I think the test is flaky. Can we fix it?
Force-pushed from ab03cc6 to 8a65ab1.
Previously, the Ruler tests did not check for firing alerts. With this new filter, the tests look for active alerts, which requires extra time for the Ruler to load the rules and evaluate the rule groups at least twice. In unit tests, there is no obvious way to know whether a Ruler has started evaluating rule groups. Integration tests call the metrics endpoint to find out whether a particular rule group has been evaluated. I could cast the manager interface to the concrete DefaultMultiTenantManager and access the user registries. For now, I added some extra sleep time to allow the Rulers to evaluate the rule groups.
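As an illustration of the alternative to a fixed sleep, the sketch below polls a condition until it holds or a deadline passes. `pollUntil` and `errTimeout` are hypothetical names, not helpers from the Cortex test suite, and the counter in `main` only stands in for a rule group's evaluation count.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTimeout is returned when the condition never became true in time.
var errTimeout = errors.New("condition not met before timeout")

// pollUntil calls cond every interval until it reports true, returns an
// error, or the timeout elapses. An error from cond stops polling at once.
func pollUntil(timeout, interval time.Duration, cond func() (bool, error)) error {
	deadline := time.Now().Add(timeout)
	for {
		ok, err := cond()
		if err != nil {
			return fmt.Errorf("condition failed: %w", err)
		}
		if ok {
			return nil
		}
		if time.Now().After(deadline) {
			return errTimeout
		}
		time.Sleep(interval)
	}
}

func main() {
	evaluations := 0 // stands in for a rule group's evaluation counter
	err := pollUntil(time.Second, 10*time.Millisecond, func() (bool, error) {
		evaluations++
		return evaluations >= 2, nil
	})
	fmt.Println("evaluations:", evaluations, "err:", err)
}
```

A test built this way finishes as soon as the condition holds and fails with a clear error otherwise, instead of always paying the full sleep.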
In addition, instance-64 zone-1 zone-2 instance-44
--- FAIL: TestRing_ShuffleShardWithLookback_CorrectnessWithFuzzy/num_instances_=_60,_num_zones_=_3,_stable_sharding_=_false (0.03s)
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x1bd7f83]
goroutine 69 [running]:
testing.tRunner.func1.2({0x1d9bfa0, 0x2d156f0})
/usr/local/Cellar/go/1.21.3/libexec/src/testing/testing.go:1545 +0x238
testing.tRunner.func1()
/usr/local/Cellar/go/1.21.3/libexec/src/testing/testing.go:1548 +0x397
panic({0x1d9bfa0?, 0x2d156f0?})
/usr/local/Cellar/go/1.21.3/libexec/src/runtime/panic.go:914 +0x21f
github.com/cortexproject/cortex/pkg/ring.(*MinimizeSpreadTokenGenerator).GenerateTokens(0xc00059ca90, 0xc0001ef160, {0xc0004f4090, 0xb}, {0xc0004f40a0, 0x6}, 0x80, 0x1)
/Users/anrajag/workplace/rajagopalanand-cortex/pkg/ring/token_generator.go:134 +0x883
github.com/cortexproject/cortex/pkg/ring.TestRing_ShuffleShardWithLookback_CorrectnessWithFuzzy.func1(0xc000503040)
/Users/anrajag/workplace/rajagopalanand-cortex/pkg/ring/ring_test.go:2497 +0xce6
testing.tRunner(0xc000503040, 0xc00070ec60)
/usr/local/Cellar/go/1.21.3/libexec/src/testing/testing.go:1595 +0xff
created by testing.(*T).Run in goroutine 11
/usr/local/Cellar/go/1.21.3/libexec/src/testing/testing.go:1648 +0x3ad
The line numbers in the seg fault above do not match because I added some log statements. Not sure if this is a problem with the test or if there is an edge case bug in
Force-pushed from 9103987 to 0c61cfb.
Force-pushed from 519ba00 to ecf2d95.
pkg/ruler/ruler_test.go
Outdated
ruler := rulerAddrMap[rID]
if tc.sharding {
	subRing := ruler.ring.ShuffleShard(user, tc.shuffleShardSize)
	owns, err := instanceOwnsRuleGroup(subRing, rgDesc, validation.DisabledRuleGroups{}, ruler.lifecycler.GetInstanceAddr(), false)
It feels like waitUntil should fail on this error instead of letting the test continue. There are no disabled rule groups, so any error here can only be a real one.
pkg/ruler/ruler_test.go
Outdated
}
if len(m) == 0 {
	close(doneCh)
	return
Having m without elements should be an exceptional case, no? Should this return an error instead of continuing silently?
pkg/ruler/ruler_test.go
Outdated
for metric := range scrapeCh {
	collected := &dto.Metric{}
	_ = metric.Write(collected)
	userLabel := ""
	ruleGroupName := ""
	for _, lp := range collected.GetLabel() {
		if lp.GetName() == "user" && lp.GetValue() == user {
			userLabel = lp.GetValue()
		}
		if lp.GetName() == "rule_group" {
			ruleGroupName = lp.GetValue()
		}
	}
	if userLabel != "" && ruleGroupName != "" {
		for k := range m {
			if strings.Contains(ruleGroupName, k) {
				m[k] = m[k] + collected.Counter.GetValue()
			}
		}
	}
	done := true
	for k := range m {
		if m[k] < 3 {
			done = false
			break
		}
	}
	if done {
		return
	}
}
You can use a loop label with continue to simplify the code a bit.
outerLoop:
for metric := range scrapeCh {
	collected := &dto.Metric{}
	_ = metric.Write(collected)
	userLabel := ""
	ruleGroupName := ""
	for _, lp := range collected.GetLabel() {
		if lp.GetName() == "user" && lp.GetValue() == user {
			userLabel = lp.GetValue()
		}
		if lp.GetName() == "rule_group" {
			ruleGroupName = lp.GetValue()
		}
	}
	if userLabel != "" && ruleGroupName != "" {
		for k := range m {
			if strings.Contains(ruleGroupName, k) {
				m[k] = m[k] + collected.Counter.GetValue()
			}
		}
	}
	for k := range m {
		if m[k] < 3 {
			continue outerLoop
		}
	}
}
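The labeled-continue idea can be tried in isolation with the self-contained sketch below. It assumes a simplified `sample` struct in place of `prometheus.Metric`, and `waitForEvaluations` is a hypothetical name. Note the explicit `return true` after the inner check: without it, the loop would keep reading samples even after every group has reached the threshold.

```go
package main

import (
	"fmt"
	"strings"
)

// sample is a simplified stand-in for a scraped counter metric.
type sample struct {
	user      string
	ruleGroup string
	value     float64
}

// waitForEvaluations consumes samples until every tracked rule group for the
// given user has accumulated at least minEvals. It reports false if the
// channel closes before that happens.
func waitForEvaluations(samples <-chan sample, user string, groups []string, minEvals float64) bool {
	totals := make(map[string]float64, len(groups))
	for _, g := range groups {
		totals[g] = 0
	}

outer:
	for s := range samples {
		if s.user == user {
			for g := range totals {
				if strings.Contains(s.ruleGroup, g) {
					totals[g] += s.value
				}
			}
		}
		for _, total := range totals {
			if total < minEvals {
				continue outer // a group is not ready yet; read the next sample
			}
		}
		return true // every group reached the threshold
	}
	return false // channel closed first
}

func main() {
	ch := make(chan sample, 3)
	ch <- sample{user: "user1", ruleGroup: "group-a", value: 2}
	ch <- sample{user: "user2", ruleGroup: "group-a", value: 5}
	ch <- sample{user: "user1", ruleGroup: "group-a", value: 1}
	close(ch)
	fmt.Println(waitForEvaluations(ch, "user1", []string{"group-a"}, 3))
}
```

The label removes the `done` flag and the extra `break`, at the cost of a jump that readers need to follow.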
Force-pushed from ecf2d95 to 16b5c0d.
pkg/ruler/ruler_test.go
Outdated
@@ -844,9 +899,90 @@ func TestGetRules(t *testing.T) {
				time.Sleep(100 * time.Millisecond)
			}

			scrape := func(rID string, scrapeCh chan prometheus.Metric, doneCh chan struct{}) {
If a unit test is hard to add, let's just add it to the integration tests?
Yes, I will try and add an integration test
It seems the testing is generating the most comments. I second the idea of having an integration test; that should settle the feature, IMHO.
Makefile
Outdated
@@ -236,7 +236,7 @@ shell:
 	bash

 configs-integration-test:
-	/bin/bash -c "go test -v -tags 'netgo integration' -timeout 30s ./pkg/configs/... ./pkg/ruler/..."
+	/bin/bash -c "go test -v -tags 'netgo integration' -timeout 60s ./pkg/configs/... ./pkg/ruler/..."
Never a good sign... but ok
As I mentioned in the earlier comment, the tests previously did not check for active alerts, so they could call GetRules immediately. Rules need to be evaluated for alerts to be present, and evaluating the rules takes extra time. The configs-integration-test target also runs the ruler tests, which take that extra time to evaluate the rules. Basically, it is a cascading effect. Even if we remove the sleep, which I did in the recent version, some of the tests will take longer while the alerts are populated.
I will look at this again later in the week and see if there is a way to better organize the test or create a new one. Even in that case, the tests would need to run longer for alerts to be populated.
It would be great to revisit this PR and get it merged.
Yes, will address it this week.
Force-pushed from 713b06b to 3771a51.
Added integration tests and removed the unit tests.
Force-pushed from 3771a51 to 172eea5.
Makefile
Outdated
@@ -10,7 +10,7 @@ VERSION=$(shell cat "./VERSION" 2> /dev/null)
 GOPROXY_VALUE=$(shell go env GOPROXY)

 # ARCHS
-ARCHS = amd64 arm64
+ARCHS = amd64 #arm64
do we need to put this back?
Yes. Done!
Signed-off-by: Anand Rajagopal <anrajag@amazon.com>
Force-pushed from 172eea5 to 0cd075a.
LGTM!
Which issue(s) this PR fixes:
Fixes #5989
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]