
Add cortex_ruler_rule_groups_in_store metric #5869

Merged

Conversation

Contributor

@emanlodovice emanlodovice commented Apr 19, 2024

What this PR does:
Add a new per-tenant metric called `cortex_ruler_rule_groups_in_store`, which reports the number of rule groups each tenant has in the rule store (S3, for example). The metric provides an accurate count of rule groups that is not affected by ruler health or resharding.
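
For illustration, a minimal sketch of how such a per-tenant gauge can be maintained with client_golang. The type and method names follow the ones discussed in the review below (RuleGroupMetrics, UpdateRuleGroupsInStore), but the body is an assumption, not the merged implementation:

```go
package ruler

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// RuleGroupMetrics holds the per-tenant gauge. The field set here is a sketch.
type RuleGroupMetrics struct {
	RuleGroupsInStore *prometheus.GaugeVec
	tenants           map[string]struct{} // tenants exported so far, kept for cleanup
}

func NewRuleGroupMetrics(reg prometheus.Registerer) *RuleGroupMetrics {
	return &RuleGroupMetrics{
		RuleGroupsInStore: promauto.With(reg).NewGaugeVec(prometheus.GaugeOpts{
			Name: "cortex_ruler_rule_groups_in_store",
			Help: "The number of rule groups a tenant has in the rule store.",
		}, []string{"user"}),
		tenants: map[string]struct{}{},
	}
}

// UpdateRuleGroupsInStore sets the gauge from a tenant -> rule group count map
// (built by listing the rule store) and deletes series for tenants that no
// longer have any rule groups, so stale series do not linger.
func (m *RuleGroupMetrics) UpdateRuleGroupsInStore(counts map[string]int) {
	for tenant := range m.tenants {
		if _, ok := counts[tenant]; !ok {
			m.RuleGroupsInStore.DeleteLabelValues(tenant)
			delete(m.tenants, tenant)
		}
	}
	for tenant, n := range counts {
		m.tenants[tenant] = struct{}{}
		m.RuleGroupsInStore.WithLabelValues(tenant).Set(float64(n))
	}
}
```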

Which issue(s) this PR fixes:
Fixes #5866

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@emanlodovice emanlodovice force-pushed the per-tenant-rules-group-count branch 2 times, most recently from 5dc489a to 80f2933 Compare April 19, 2024 03:31
@pull-request-size pull-request-size bot added size/L and removed size/M labels Apr 19, 2024
@emanlodovice emanlodovice marked this pull request as ready for review April 19, 2024 03:53
@emanlodovice emanlodovice force-pushed the per-tenant-rules-group-count branch 2 times, most recently from e533fae to 840cd27 Compare April 19, 2024 18:48
@pull-request-size pull-request-size bot added size/M and removed size/L labels Apr 19, 2024
@@ -271,3 +273,42 @@ func (m *RuleEvalMetrics) deletePerUserMetrics(userID string) {
		m.RulerQuerySeconds.DeleteLabelValues(userID)
	}
}

type RuleGroupMetrics struct {
	mtx sync.Mutex
Contributor

do we need to control access to this resource with a mutex? I checked and we only update the metrics when we sync rules, which happens sequentially.

Contributor Author

We need to because we have unit tests for UpdateRuleGroupsInStore. Without this mutex the test will fail when run with -race

Contributor

What tests are failing specifically because of race conditions?

TestRuleGroupMetrics creates all its resources inside the test function, unless I'm missing something and Prometheus has some package-level variables under the hood.

Contributor Author

One test that fails is TestRuler_rules_limit. The problem is that some tests manually call syncRules to make sure rules are loaded (https://github.com/cortexproject/cortex/blob/master/pkg/ruler/ruler_test.go#L247-L255), and this call can overlap with the syncRules call made by the running ruler service.

Contributor

@rapphil rapphil Apr 24, 2024

It feels like tests should not be sharing resources if they run in parallel. How hard would it be to fix the tests?
Alternatively, maybe you could add an interface and swap the implementation of RuleGroupMetrics at runtime during tests, wrapping it with a version that uses a mutex.

Contributor Author

@emanlodovice emanlodovice Apr 25, 2024

It feels like tests should not be sharing resources if they run in parallel. How hard would it be to fix the tests?

These tests are not sharing resources. Starting the ruler calls syncRules, and the test itself also calls syncRules, which causes the double invocation. Honestly, I don't understand why we need to shy away from this mutex at all. We know it doesn't hurt at runtime because syncRules doesn't really get invoked concurrently. I prefer adding this simple mutex to complicating the code. This mutex is similar to https://github.com/cortexproject/cortex/blob/master/pkg/ruler/manager.go#L127-L128, since both are only called during syncRules.

Contributor

@rapphil rapphil Apr 25, 2024

Honestly, I don't understand why we need to shy away from this mutex at all. We know it doesn't hurt at runtime because syncRules doesn't really get invoked concurrently.

Because throwing a mutex into the application code just to make a test pass is a weak reason, IMHO. It is not about performance; it is that this mutex is essentially not serving its purpose (i.e., it is useless in the application) and will mislead people reading the code.

I can think of two options for solving this:

  1. Turn RuleGroupMetrics into an interface. The implementation can be injected into the ruler by adding a parameter to this function. That would let you pass a thread-safe implementation that wraps the real RuleGroupMetrics here, returning the thread-safe RuleGroupMetrics from testSetup like other entities that are generated only for tests. Test coverage stays the same while the mutex effectively moves from application code to test code (see the sketch after this list).
  2. Add a method to the ruler that lets us tell that rules were loaded at least once, so the test can block and wait on it (similar to this method). With this approach we remove the race condition entirely.

I prefer option 2, but I feel it is outside the scope of this PR, so option 1 is OK for this PR.
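
For clarity, a rough sketch of what option 1 could look like; the interface and wrapper names here are hypothetical, and this is not the approach that was ultimately merged:

```go
package ruler

import "sync"

// ruleGroupMetrics is a hypothetical interface the ruler could depend on
// instead of the concrete RuleGroupMetrics type.
type ruleGroupMetrics interface {
	UpdateRuleGroupsInStore(counts map[string]int)
}

// lockedRuleGroupMetrics is a test-only wrapper that serializes calls, so a
// test whose syncRules invocation can overlap with the ruler's own sync loop
// passes under -race without putting a mutex in the production type.
type lockedRuleGroupMetrics struct {
	mtx     sync.Mutex
	wrapped ruleGroupMetrics
}

func (l *lockedRuleGroupMetrics) UpdateRuleGroupsInStore(counts map[string]int) {
	l.mtx.Lock()
	defer l.mtx.Unlock()
	l.wrapped.UpdateRuleGroupsInStore(counts)
}
```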

Contributor Author

I think that's what tests are for: to detect which parts can break under concurrency. The test proves that this bit of code can break when there is concurrency. That concurrency might not happen during normal execution of the process, but the fact remains that the code is susceptible to it. I don't like the idea of having an interface and passing a wrapped version in tests, because then what is being tested is not the same as what is used at runtime. If someone changes the code to actually call syncRules in parallel, the test will not fail. I can try removing the manual call to syncRules and replacing it with a wait mechanism.
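
One possible shape for that wait mechanism, sketched as a generic polling helper; the helper and the condition it polls are assumptions, not the test change that was actually made:

```go
package ruler

import (
	"testing"
	"time"
)

// waitForRuleSync polls a caller-supplied condition until it reports that the
// ruler has loaded its rules, instead of the test calling syncRules directly.
func waitForRuleSync(t *testing.T, timeout time.Duration, loaded func() bool) {
	t.Helper()
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if loaded() {
			return
		}
		time.Sleep(100 * time.Millisecond)
	}
	t.Fatal("ruler did not finish syncing rules before the deadline")
}
```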

Contributor Author

Updated the test and removed the mtx. PTAL @rapphil, thanks.

Contributor

I think tests should test whether the behaviour of the code is what is expected and whether the assumptions we make about that behaviour hold. If something is not intended to be thread-safe, there is no point in testing that.

Having said that, thanks for fixing the tests :) If one day we need to make syncRules thread-safe, we should add tests that explicitly cover that.

@emanlodovice emanlodovice force-pushed the per-tenant-rules-group-count branch 2 times, most recently from b4af784 to 13f9d66 Compare April 25, 2024 03:28
@pull-request-size pull-request-size bot added size/L and removed size/M labels Apr 25, 2024
@emanlodovice emanlodovice force-pushed the per-tenant-rules-group-count branch from 13f9d66 to a9a8d5c Compare April 25, 2024 03:34
@@ -731,6 +740,8 @@ func (r *Ruler) listRulesShuffleSharding(ctx context.Context) (map[string]rulesp
	mu := sync.Mutex{}
	owned := map[string]rulespb.RuleGroupList{}
	backedUp := map[string]rulespb.RuleGroupList{}
	gLock := sync.Mutex{}
Contributor

@rapphil rapphil Apr 25, 2024

By now you must have realized that I don't like mutexes 😅 (it's not that I don't like them, I just try to minimize their use and prefer lock-free approaches whenever possible).

I think you can create a slice of maps of size concurrency, give each goroutine its own dedicated map, and merge all the maps from the slice after the wait group completes. The same approach could be used with owned and backedUp to get rid of the other mutex.
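
A sketch of the lock-free pattern being suggested here: each worker writes only into its own map, and the maps are merged once the WaitGroup completes. The function and its arguments are illustrative, not the PR's code:

```go
package main

import (
	"fmt"
	"sync"
)

// countRuleGroups fans work out over `concurrency` goroutines, giving each
// worker a dedicated result map so no mutex is needed while they run.
// loadCount stands in for listing a user's rule groups from the store.
func countRuleGroups(users []string, concurrency int, loadCount func(string) int) map[string]int {
	results := make([]map[string]int, concurrency)
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		results[i] = map[string]int{}
		wg.Add(1)
		go func(worker int) {
			defer wg.Done()
			// Worker handles users[worker], users[worker+concurrency], ...
			for j := worker; j < len(users); j += concurrency {
				results[worker][users[j]] = loadCount(users[j])
			}
		}(i)
	}
	wg.Wait()

	// Merge after all workers finish; each map was written by exactly one
	// goroutine, so the merge itself is single-threaded and race-free.
	merged := map[string]int{}
	for _, m := range results {
		for user, n := range m {
			merged[user] = n
		}
	}
	return merged
}

func main() {
	counts := countRuleGroups([]string{"tenant-a", "tenant-b", "tenant-c"}, 2,
		func(string) int { return 1 })
	fmt.Println(counts)
}
```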

Contributor Author

@emanlodovice emanlodovice Apr 25, 2024

I think the code would be much more complex with that approach. The use of a mutex here seems acceptable considering the small scope of the lock. @yeya24, maybe you can share your opinion on this matter.

Contributor

Not a blocker for me, just a nice-to-have, so I will approve.

Contributor

I am fine with the lock here, as setting the map entry is a very lightweight operation.

@emanlodovice emanlodovice force-pushed the per-tenant-rules-group-count branch 3 times, most recently from e0e006f to 613f75d Compare May 13, 2024 07:23
@emanlodovice emanlodovice requested a review from rapphil May 13, 2024 22:05
@emanlodovice
Contributor Author

@yeya24 please take a look. Thanks

@rajagopalanand
Contributor

@rapphil @yeya24 Can this be merged?


CHANGELOG.md Outdated
@@ -8,6 +8,7 @@
* [ENHANCEMENT] Distributor/Querier: Clean stale per-ingester metrics after ingester restarts. #5930
* [ENHANCEMENT] Distributor/Ring: Allow disabling detailed ring metrics by ring member. #5931
* [ENHANCEMENT] KV: Etcd Added etcd.ping-without-stream-allowed parameter to disable/enable PermitWithoutStream #5933
* [ENHANCEMENT] Ruler: Add new ruler metric `cortex_ruler_rule_groups_in_store`. #5869
Contributor

Can we add more details to the changelog about what this metric helps with?

@danielblando
Contributor

We are in the process of releasing 1.18. Please rebase the PR and move the changelog entry to master.

@emanlodovice emanlodovice force-pushed the per-tenant-rules-group-count branch from 613f75d to 8c25deb Compare August 16, 2024 19:07
@emanlodovice emanlodovice requested a review from yeya24 August 16, 2024 21:42
Contributor

@yeya24 yeya24 left a comment

Can we rebase on master and put the changelog entry under master/unreleased? Then we should be good to go.

Signed-off-by: Emmanuel Lodovice <lodovice@amazon.com>
@emanlodovice emanlodovice force-pushed the per-tenant-rules-group-count branch from 8c25deb to 6ea9748 Compare August 20, 2024 17:13
@yeya24 yeya24 merged commit bdf677e into cortexproject:master Aug 20, 2024
16 checks passed
danielblando pushed a commit to danielblando/cortex that referenced this pull request Sep 3, 2024

Successfully merging this pull request may close these issues.

Better mechanism to detect impact in terms of the number of rule groups when rulers become unhealthy.
5 participants