
Add cortex_ruler_rule_groups_in_store metric #5869

Merged

Conversation

Contributor

@emanlodovice emanlodovice commented Apr 19, 2024

What this PR does:
Add a new per-tenant metric called `cortex_ruler_rule_groups_in_store`, which reports the number of rule groups each tenant has in the rule store (S3, for example). The metric provides an accurate count of rule groups that is not affected by ruler health or resharding.
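
For illustration, a minimal sketch of how such a per-tenant gauge can be maintained with client_golang. The type and method names follow the ones discussed in the review below (RuleGroupMetrics, UpdateRuleGroupsInStore), but the body is an assumption, not the merged implementation:

```go
package ruler

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// RuleGroupMetrics holds the per-tenant gauge. The field set here is a sketch.
type RuleGroupMetrics struct {
	RuleGroupsInStore *prometheus.GaugeVec
	tenants           map[string]struct{} // tenants exported so far, kept for cleanup
}

func NewRuleGroupMetrics(reg prometheus.Registerer) *RuleGroupMetrics {
	return &RuleGroupMetrics{
		RuleGroupsInStore: promauto.With(reg).NewGaugeVec(prometheus.GaugeOpts{
			Name: "cortex_ruler_rule_groups_in_store",
			Help: "The number of rule groups a tenant has in the rule store.",
		}, []string{"user"}),
		tenants: map[string]struct{}{},
	}
}

// UpdateRuleGroupsInStore sets the gauge from a tenant -> rule group count map
// (built by listing the rule store) and deletes series for tenants that no
// longer have any rule groups, so stale series do not linger.
func (m *RuleGroupMetrics) UpdateRuleGroupsInStore(counts map[string]int) {
	for tenant := range m.tenants {
		if _, ok := counts[tenant]; !ok {
			m.RuleGroupsInStore.DeleteLabelValues(tenant)
			delete(m.tenants, tenant)
		}
	}
	for tenant, n := range counts {
		m.tenants[tenant] = struct{}{}
		m.RuleGroupsInStore.WithLabelValues(tenant).Set(float64(n))
	}
}
```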

Which issue(s) this PR fixes:
Fixes #5866

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@emanlodovice emanlodovice force-pushed the per-tenant-rules-group-count branch 2 times, most recently from 5dc489a to 80f2933 Compare April 19, 2024 03:31
@pull-request-size pull-request-size bot added size/L and removed size/M labels Apr 19, 2024
@emanlodovice emanlodovice marked this pull request as ready for review April 19, 2024 03:53
@emanlodovice emanlodovice force-pushed the per-tenant-rules-group-count branch 2 times, most recently from e533fae to 840cd27 Compare April 19, 2024 18:48
@pull-request-size pull-request-size bot added size/M and removed size/L labels Apr 19, 2024
@@ -271,3 +273,42 @@ func (m *RuleEvalMetrics) deletePerUserMetrics(userID string) {
		m.RulerQuerySeconds.DeleteLabelValues(userID)
	}
}

type RuleGroupMetrics struct {
	mtx sync.Mutex
Contributor

do we need to control access to this resource with a mutex? I checked and we only update the metrics when we sync rules, which happens sequentially.

Contributor Author

We need to because we have unit tests for UpdateRuleGroupsInStore. Without this mutex the test will fail when run with -race

Contributor

What tests are failing specifically because of race conditions?

TestRuleGroupMetrics creates all its resources inside the test function, unless I'm missing something and Prometheus has some package-level variables under the hood.

Contributor Author

One test that fails is TestRuler_rules_limit. The problem is that some tests manually call syncRules to make sure rules are loaded (https://github.com/cortexproject/cortex/blob/master/pkg/ruler/ruler_test.go#L247-L255), and this call can overlap with the syncRules call made by the running ruler service.

Contributor

@rapphil rapphil Apr 24, 2024

It feels like tests should not be sharing resources if they run in parallel. How hard would it be to fix the tests?
Alternatively, maybe you could add an interface and swap the implementation of RuleGroupMetrics at runtime during tests, wrapping it with a version that uses a mutex.

Contributor Author

@emanlodovice emanlodovice Apr 25, 2024

It feels like tests should not be sharing resources if they run in parallel. How hard would it be to fix the tests?

These tests are not sharing resources. Starting the ruler calls syncRules, and the test itself also calls syncRules, which causes the double invocation. Honestly, I don't understand why we need to shy away from this mutex at all. We know it doesn't hurt at runtime because syncRules doesn't really get invoked concurrently. I prefer adding this simple mutex to complicating the code. This mutex is similar to https://github.com/cortexproject/cortex/blob/master/pkg/ruler/manager.go#L127-L128, since both are only called during syncRules.

Contributor

@rapphil rapphil Apr 25, 2024

Honestly, I don't understand why we need to shy away from this mutex at all. We know it doesn't hurt at runtime because syncRules doesn't really get invoked concurrently.

Because throwing a mutex into the application code just to make a test pass is a weak reason, IMHO. It is not about performance; it is that this mutex is essentially not serving its purpose (i.e., it is useless in the application) and will mislead people reading the code.

I can think of two options for solving this:

  1. Turn RuleGroupMetrics into an interface. The implementation can be injected into the ruler by adding a parameter to this function. That would let you pass a thread-safe implementation that wraps the real RuleGroupMetrics here, returning the thread-safe RuleGroupMetrics from testSetup like other entities that are generated only for tests. Test coverage stays the same while the mutex effectively moves from application code to test code (see the sketch after this list).
  2. Add a method to the ruler that lets us tell that rules were loaded at least once, so the test can block and wait on it (similar to this method). With this approach we remove the race condition entirely.

I prefer option 2, but I feel it is outside the scope of this PR, so option 1 is OK for this PR.
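
For clarity, a rough sketch of what option 1 could look like; the interface and wrapper names here are hypothetical, and this is not the approach that was ultimately merged:

```go
package ruler

import "sync"

// ruleGroupMetrics is a hypothetical interface the ruler could depend on
// instead of the concrete RuleGroupMetrics type.
type ruleGroupMetrics interface {
	UpdateRuleGroupsInStore(counts map[string]int)
}

// lockedRuleGroupMetrics is a test-only wrapper that serializes calls, so a
// test whose syncRules invocation can overlap with the ruler's own sync loop
// passes under -race without putting a mutex in the production type.
type lockedRuleGroupMetrics struct {
	mtx     sync.Mutex
	wrapped ruleGroupMetrics
}

func (l *lockedRuleGroupMetrics) UpdateRuleGroupsInStore(counts map[string]int) {
	l.mtx.Lock()
	defer l.mtx.Unlock()
	l.wrapped.UpdateRuleGroupsInStore(counts)
}
```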

Contributor Author

I think that's what tests are for: to detect which parts can break under concurrency. The test proves that this bit of code can break when there is concurrency. That concurrency might not happen during normal execution of the process, but the fact remains that the code is susceptible to it. I don't like the idea of having an interface and passing a wrapped version in tests, because then what is being tested is not the same as what is used at runtime. If someone changes the code to actually call syncRules in parallel, the test will not fail. I can try removing the manual call to syncRules and replacing it with a wait mechanism.
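
One possible shape for that wait mechanism, sketched as a generic polling helper; the helper and the condition it polls are assumptions, not the test change that was actually made:

```go
package ruler

import (
	"testing"
	"time"
)

// waitForRuleSync polls a caller-supplied condition until it reports that the
// ruler has loaded its rules, instead of the test calling syncRules directly.
func waitForRuleSync(t *testing.T, timeout time.Duration, loaded func() bool) {
	t.Helper()
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if loaded() {
			return
		}
		time.Sleep(100 * time.Millisecond)
	}
	t.Fatal("ruler did not finish syncing rules before the deadline")
}
```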

Contributor Author

Updated the test and removed the mtx. PTAL @rapphil, thanks.

Contributor

I think tests should test whether the behaviour of the code is what is expected and whether the assumptions we make about that behaviour hold. If something is not intended to be thread-safe, there is no point in testing that.

Having said that, thanks for fixing the tests :) If one day we need to make syncRules thread-safe, we should add tests that explicitly cover that.

@emanlodovice emanlodovice force-pushed the per-tenant-rules-group-count branch 2 times, most recently from b4af784 to 13f9d66 Compare April 25, 2024 03:28
@pull-request-size pull-request-size bot added size/L and removed size/M labels Apr 25, 2024
@emanlodovice emanlodovice force-pushed the per-tenant-rules-group-count branch from 13f9d66 to a9a8d5c Compare April 25, 2024 03:34
@@ -731,6 +740,8 @@ func (r *Ruler) listRulesShuffleSharding(ctx context.Context) (map[string]rulesp
	mu := sync.Mutex{}
	owned := map[string]rulespb.RuleGroupList{}
	backedUp := map[string]rulespb.RuleGroupList{}
	gLock := sync.Mutex{}
Contributor

@rapphil rapphil Apr 25, 2024

By now you must have realized that I don't like mutexes 😅 (it's not that I don't like them, I just try to minimize their use and prefer lock-free approaches whenever possible).

I think you can create a slice of maps of size concurrency, give each goroutine its own dedicated map, and merge all the maps from the slice after the wait group completes. The same approach could be used with owned and backedUp to get rid of the other mutex.
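
A sketch of the lock-free pattern being suggested here: each worker writes only into its own map, and the maps are merged once the WaitGroup completes. The function and its arguments are illustrative, not the PR's code:

```go
package main

import (
	"fmt"
	"sync"
)

// countRuleGroups fans work out over `concurrency` goroutines, giving each
// worker a dedicated result map so no mutex is needed while they run.
// loadCount stands in for listing a user's rule groups from the store.
func countRuleGroups(users []string, concurrency int, loadCount func(string) int) map[string]int {
	results := make([]map[string]int, concurrency)
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		results[i] = map[string]int{}
		wg.Add(1)
		go func(worker int) {
			defer wg.Done()
			// Worker handles users[worker], users[worker+concurrency], ...
			for j := worker; j < len(users); j += concurrency {
				results[worker][users[j]] = loadCount(users[j])
			}
		}(i)
	}
	wg.Wait()

	// Merge after all workers finish; each map was written by exactly one
	// goroutine, so the merge itself is single-threaded and race-free.
	merged := map[string]int{}
	for _, m := range results {
		for user, n := range m {
			merged[user] = n
		}
	}
	return merged
}

func main() {
	counts := countRuleGroups([]string{"tenant-a", "tenant-b", "tenant-c"}, 2,
		func(string) int { return 1 })
	fmt.Println(counts)
}
```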

Contributor Author

@emanlodovice emanlodovice Apr 25, 2024

I think the code would be much more complex with that approach. The use of a mutex here seems acceptable considering the small scope of the lock. @yeya24, maybe you can share your opinion on this matter.

Contributor

Not a blocker for me, just a nice-to-have, so I will approve.

Contributor

I am fine with the lock here, as setting the map entry is a very lightweight operation.

@emanlodovice emanlodovice force-pushed the per-tenant-rules-group-count branch 3 times, most recently from e0e006f to 613f75d Compare May 13, 2024 07:23
@emanlodovice emanlodovice requested a review from rapphil May 13, 2024 22:05
@emanlodovice
Contributor Author

@yeya24 please take a look. Thanks

@rajagopalanand
Contributor

@rapphil @yeya24 Can this be merged?


CHANGELOG.md Outdated
@@ -8,6 +8,7 @@
* [ENHANCEMENT] Distributor/Querier: Clean stale per-ingester metrics after ingester restarts. #5930
* [ENHANCEMENT] Distributor/Ring: Allow disabling detailed ring metrics by ring member. #5931
* [ENHANCEMENT] KV: Etcd Added etcd.ping-without-stream-allowed parameter to disable/enable PermitWithoutStream #5933
* [ENHANCEMENT] Ruler: Add new ruler metric `cortex_ruler_rule_groups_in_store`. #5869
Contributor

Can we add more details to the changelog about what this metric helps with?

@danielblando
Contributor

We are in the process of releasing 1.18. Please rebase the PR and move the changelog entry to master.

@emanlodovice emanlodovice force-pushed the per-tenant-rules-group-count branch from 613f75d to 8c25deb Compare August 16, 2024 19:07
@emanlodovice emanlodovice requested a review from yeya24 August 16, 2024 21:42
Contributor

@yeya24 yeya24 left a comment

Can we rebase on master and put the changelog entry under master/unreleased? Then we should be good to go.

Signed-off-by: Emmanuel Lodovice <lodovice@amazon.com>
@emanlodovice emanlodovice force-pushed the per-tenant-rules-group-count branch from 8c25deb to 6ea9748 Compare August 20, 2024 17:13
@yeya24 yeya24 merged commit bdf677e into cortexproject:master Aug 20, 2024
16 checks passed
danielblando pushed a commit to danielblando/cortex that referenced this pull request Sep 3, 2024

Successfully merging this pull request may close these issues.

Better mechanism to detect impact in terms of the number of rule groups when rulers become unhealthy.
5 participants