Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for exclude_alerts flag in ListRules API #6011

Merged
merged 1 commit into from
Jul 30, 2024

Conversation

rajagopalanand
Copy link
Contributor

@rajagopalanand rajagopalanand commented Jun 12, 2024

Which issue(s) this PR fixes:
Fixes #5989

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@rapphil
Copy link
Contributor

rapphil commented Jun 12, 2024

  • no unit tests?
  • The title of the PR is not helpful. It should indicate what the PR is doing instead of a link to a issue. I can find the link to the issue in the body of the PR if I needed.

Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to @rapphil.
And let's add a changelog entry.

@rajagopalanand rajagopalanand changed the title Fix for https://github.com/cortexproject/cortex/issues/5989 Add support for exclude_alerts flag in ListRules API Jun 12, 2024
@yeya24 yeya24 self-requested a review June 13, 2024 00:03
Copy link
Member

@friedrichg friedrichg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rajagopalanand still needs the unit test update right?

@rajagopalanand
Copy link
Contributor Author

rajagopalanand commented Jun 13, 2024

Will add tests

@@ -135,11 +135,18 @@ func (a *API) PrometheusRules(w http.ResponseWriter, req *http.Request) {
return
}

excludeAlerts, err := parseExcludeAlerts(req)
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to include the error in the response? Can we align with Prometheus API response?

Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@yeya24
Copy link
Contributor

yeya24 commented Jun 18, 2024

I think the test is flaky. Can we fix them?

--- FAIL: TestGetRules (10.23s)
    --- FAIL: TestGetRules/No_Sharding_with_Rule_Type_Filter (0.24s)
        ruler_test.go:362: 
            	Error Trace:	/__w/cortex/cortex/pkg/ruler/ruler_test.go:362
            	            				/__w/cortex/cortex/pkg/ruler/ruler_test.go:922
            	            				/__w/cortex/cortex/pkg/ruler/ruler_test.go:872
            	            				/__w/cortex/cortex/pkg/ruler/ruler_test.go:904
            	Error:      	"0" is not greater than or equal to "1"
            	Test:       	TestGetRules/No_Sharding_with_Rule_Type_Filter

@rajagopalanand rajagopalanand force-pushed the list-rules-bug-fix branch 3 times, most recently from ab03cc6 to 8a65ab1 Compare June 19, 2024 13:52
@rajagopalanand
Copy link
Contributor Author

rajagopalanand commented Jun 19, 2024

Previously the Ruler tests did not look for alerts firing. With this new filter, the tests look for active alerts and this requires extra time for the Ruler to load the rules and evaluate the rule groups at least twice. In unit tests, there is not an obvious way to know if a Ruler has started evaluating rule groups. Integration tests call the metrics endpoint to know if a particular rule group has been evaluated. I could cast the manager interface to the concrete DefaultMultiTenantManager and access user registries. For now, I added some extra sleep time to allow Rulers to evaluate the rule groups

@rajagopalanand
Copy link
Contributor Author

rajagopalanand commented Jun 19, 2024

In addition, TestRing_ShuffleShardWithLookback_CorrectnessWithFuzzy is failing every once in a while. Here is an example of a test failure. I ran the tests with some print statements locally and with enough # of runs, the error occurred on my local machine

instance-64 zone-1 zone-2 instance-44
    --- FAIL: TestRing_ShuffleShardWithLookback_CorrectnessWithFuzzy/num_instances_=_60,_num_zones_=_3,_stable_sharding_=_false (0.03s)
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x1bd7f83]

goroutine 69 [running]:
testing.tRunner.func1.2({0x1d9bfa0, 0x2d156f0})
	/usr/local/Cellar/go/1.21.3/libexec/src/testing/testing.go:1545 +0x238
testing.tRunner.func1()
	/usr/local/Cellar/go/1.21.3/libexec/src/testing/testing.go:1548 +0x397
panic({0x1d9bfa0?, 0x2d156f0?})
	/usr/local/Cellar/go/1.21.3/libexec/src/runtime/panic.go:914 +0x21f
github.com/cortexproject/cortex/pkg/ring.(*MinimizeSpreadTokenGenerator).GenerateTokens(0xc00059ca90, 0xc0001ef160, {0xc0004f4090, 0xb}, {0xc0004f40a0, 0x6}, 0x80, 0x1)
	/Users/anrajag/workplace/rajagopalanand-cortex/pkg/ring/token_generator.go:134 +0x883
github.com/cortexproject/cortex/pkg/ring.TestRing_ShuffleShardWithLookback_CorrectnessWithFuzzy.func1(0xc000503040)
	/Users/anrajag/workplace/rajagopalanand-cortex/pkg/ring/ring_test.go:2497 +0xce6
testing.tRunner(0xc000503040, 0xc00070ec60)
	/usr/local/Cellar/go/1.21.3/libexec/src/testing/testing.go:1595 +0xff
created by testing.(*T).Run in goroutine 11
	/usr/local/Cellar/go/1.21.3/libexec/src/testing/testing.go:1648 +0x3ad

The above seg fault lines #s do not match because I added some log statements. Not sure if this a problem with the test or if there is an edge case bug in MinimizeSpreadTokenGenerator.GenerateTokens

@rajagopalanand rajagopalanand force-pushed the list-rules-bug-fix branch 7 times, most recently from 9103987 to 0c61cfb Compare June 20, 2024 03:31
@rajagopalanand rajagopalanand force-pushed the list-rules-bug-fix branch 3 times, most recently from 519ba00 to ecf2d95 Compare June 20, 2024 18:54
ruler := rulerAddrMap[rID]
if tc.sharding {
subRing := ruler.ring.ShuffleShard(user, tc.shuffleShardSize)
owns, err := instanceOwnsRuleGroup(subRing, rgDesc, validation.DisabledRuleGroups{}, ruler.lifecycler.GetInstanceAddr(), false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it feels that during error the waitUntil should also error instead of continuing the test. There are no disabled rule groups so you can only get a real error.

}
if len(m) == 0 {
close(doneCh)
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be an exceptional case to have m without elements no? should this error instead of just continue silently?

Comment on lines 944 to 975
for metric := range scrapeCh {
collected := &dto.Metric{}
_ = metric.Write(collected)
userLabel := ""
ruleGroupName := ""
for _, lp := range collected.GetLabel() {
if lp.GetName() == "user" && lp.GetValue() == user {
userLabel = lp.GetValue()
}
if lp.GetName() == "rule_group" {
ruleGroupName = lp.GetValue()
}
}
if userLabel != "" && ruleGroupName != "" {
for k := range m {
if strings.Contains(ruleGroupName, k) {
m[k] = m[k] + collected.Counter.GetValue()
}
}
}
done := true
for k := range m {
if m[k] < 3 {
done = false
break
}
}
if done {
return
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use labels to simplify the code a bit.

Suggested change
for metric := range scrapeCh {
collected := &dto.Metric{}
_ = metric.Write(collected)
userLabel := ""
ruleGroupName := ""
for _, lp := range collected.GetLabel() {
if lp.GetName() == "user" && lp.GetValue() == user {
userLabel = lp.GetValue()
}
if lp.GetName() == "rule_group" {
ruleGroupName = lp.GetValue()
}
}
if userLabel != "" && ruleGroupName != "" {
for k := range m {
if strings.Contains(ruleGroupName, k) {
m[k] = m[k] + collected.Counter.GetValue()
}
}
}
done := true
for k := range m {
if m[k] < 3 {
done = false
break
}
}
if done {
return
}
}
outerLoop:
for metric := range scrapeCh {
collected := &dto.Metric{}
_ = metric.Write(collected)
userLabel := ""
ruleGroupName := ""
for _, lp := range collected.GetLabel() {
if lp.GetName() == "user" && lp.GetValue() == user {
userLabel = lp.GetValue()
}
if lp.GetName() == "rule_group" {
ruleGroupName = lp.GetValue()
}
}
if userLabel != "" && ruleGroupName != "" {
for k := range m {
if strings.Contains(ruleGroupName, k) {
m[k] = m[k] + collected.Counter.GetValue()
}
}
}
for k := range m {
if m[k] < 3 {
continue outerLoop
}
}
}

@@ -844,9 +899,90 @@ func TestGetRules(t *testing.T) {
time.Sleep(100 * time.Millisecond)
}

scrape := func(rID string, scrapeCh chan prometheus.Metric, doneCh chan struct{}) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If unit test is hard to add let's just add it to the integration test?

Copy link
Contributor Author

@rajagopalanand rajagopalanand Jun 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will try and add an integration test

@alanprot alanprot requested a review from rapphil June 26, 2024 23:07
Copy link
Member

@friedrichg friedrichg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the testing is generating the most comments. I second the idea of having an integration test, that should settle the feature IMHO

Makefile Outdated
@@ -236,7 +236,7 @@ shell:
bash

configs-integration-test:
/bin/bash -c "go test -v -tags 'netgo integration' -timeout 30s ./pkg/configs/... ./pkg/ruler/..."
/bin/bash -c "go test -v -tags 'netgo integration' -timeout 60s ./pkg/configs/... ./pkg/ruler/..."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never a good sign... but ok

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like I mentioned in the earlier comment, the tests did not check for active alerts and so the tests can immediately call GetRules. Rules need to be evaluated for alerts to be present and it takes extra time to evaluate the rules. configs-integration-test are also calling the ruler tests which takes extra time to evaluate the rules. Basically it is a cascading effect. Even if we remove the sleep, which I did in the recent version, some of the tests are going to take longer time for alerts to be populated

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will look at this again later in the week and see if there is a way to better organize the test or create a new test but even in this case, the tests would need to run longer for alerts to be populated

@yeya24
Copy link
Contributor

yeya24 commented Jul 16, 2024

It would be great to revisit this PR and get it merged

@rajagopalanand
Copy link
Contributor Author

Yes, will address it this week

@rajagopalanand rajagopalanand force-pushed the list-rules-bug-fix branch 2 times, most recently from 713b06b to 3771a51 Compare July 30, 2024 15:51
@rajagopalanand
Copy link
Contributor Author

Added integ tests and removed unit tests

Makefile Outdated
@@ -10,7 +10,7 @@ VERSION=$(shell cat "./VERSION" 2> /dev/null)
GOPROXY_VALUE=$(shell go env GOPROXY)

# ARCHS
ARCHS = amd64 arm64
ARCHS = amd64 #arm64
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to put this back?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Done!

Signed-off-by: Anand Rajagopal <anrajag@amazon.com>
@alanprot
Copy link
Member

LGTM!

@yeya24 yeya24 merged commit 8ce3642 into cortexproject:master Jul 30, 2024
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

List rules API not honoring exclude_alerts flag
6 participants