-
Notifications
You must be signed in to change notification settings - Fork 812
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add cortex_ruler_rule_groups_in_store metric #5869
Add cortex_ruler_rule_groups_in_store metric #5869
Conversation
5dc489a
to
80f2933
Compare
e533fae
to
840cd27
Compare
pkg/ruler/manager_metrics.go
Outdated
@@ -271,3 +273,42 @@ func (m *RuleEvalMetrics) deletePerUserMetrics(userID string) { | |||
m.RulerQuerySeconds.DeleteLabelValues(userID) | |||
} | |||
} | |||
|
|||
type RuleGroupMetrics struct { | |||
mtx sync.Mutex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to control access to this resource with a mutex? I checked and we only update the metrics when we sync rules, which happens sequentially.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to because we have unit tests for UpdateRuleGroupsInStore
. Without this mutex the test will fail when run with -race
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what tests are failing specifically because of race conditions?
TestRuleGroupMetrics
is creating all resources in the test function, unless I'm missing something and prometheus has some package level variables under the hood.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One test that fails is TestRuler_rules_limit
. The problem is that there are tests that manually calls syncRules to make sure rules are loaded https://github.com/cortexproject/cortex/blob/master/pkg/ruler/ruler_test.go#L247-L255 and this could overlap with the syncRules call when running the ruler service.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels that tests should not be sharing resources if they are running in parallel. How hard would be to fix the tests?
Alternatively, maybe you can add an interface and change the implementation of RuleGroupMetrics in runtime during tests by wrapping it with a version that uses mutex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels that tests should not be sharing resources if they are running in parallel. How hard would be to fix the tests?
these tests are not sharing resources. Because starting a the ruler calls syncRules
and the test itself calls syncRules
is causing the double invocation. Honestly I don't understand why we need to shy away from this mutex at all. We know it doesn't hurt during run time because syncRules don't really get invoked concurrently. I prefer adding this simple mutex than complicate the code. This mutex is similar to https://github.com/cortexproject/cortex/blob/master/pkg/ruler/manager.go#L127-L128 since both of which are called only during syncRules.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly I don't understand why we need to shy away from this mutex at all. We know it doesn't hurt during run time because syncRules don't really get invoked concurrently.
because throwing a mutex in the application code just to make a test pass is a weak reason IMHO. It is not for performance, but by the fact that this mutex is essentially not serving its purpose (i.e.: it is useless in the application) and will mislead people reading this code.
I can think of two options of solving this:
- Transform
RuleGroupMetrics
into a interface. the implementation can be injected into the ruler by adding a parameter in this function. This will allow you to pass a thread safe implementation that is a wrapper to the realRuleGroupMetrics
here by returning the thread safeRuleGroupMetrics
fromtestSetup
, like other entities that are generated only for tests.... This will make the coverage of the tests the same while essentially moving the mutex from application code to test code. - Add a method to the ruler that will allow us to identify that rules were loaded at least once and we can block and wait in the code (similar to this method). With this approach we will remove any other conditions.
I prefer option 2 but I fell that this is outside of the scope of this PR, so I think option 1 is ok for this pr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's what tests are for. To detect which parts can breaks with concurrency. The tests proves that this bit of code can break when there is concurrency. That concurrency might not happen in normal execution of the process but the fact remains that it is susceptible to it. I don't like the idea of having an interface and passing a wrapped version in tests because then what is being tested is not the same as what is being used at run time. If someone makes a change to actually call syncRules in parallel the test will not fail. I can try removing the manual call to syncRules and replace it with a wait mechanism
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated the test and removed the mtx, PTAL @rapphil , thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think tests should test if the behaviour of the code is what is expected and if the assumptions that we make about its behaviour remain constant. If something is not intended to be thread safe, there is no point in testing that.
Having said that, thanks for fixing the tests :) if one day we need to make syncRules
thread safe, we should add tests to explicitly test that.
b4af784
to
13f9d66
Compare
13f9d66
to
a9a8d5c
Compare
@@ -731,6 +740,8 @@ func (r *Ruler) listRulesShuffleSharding(ctx context.Context) (map[string]rulesp | |||
mu := sync.Mutex{} | |||
owned := map[string]rulespb.RuleGroupList{} | |||
backedUp := map[string]rulespb.RuleGroupList{} | |||
gLock := sync.Mutex{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
by now you must realize that I don't like mutexes 😅 (not that I don't like, but I try to minimize the use and prefer lock free whenever possible).
I think you can create a slice of maps of size concurrency
and give one dedicated map per goroutine and merge all maps from the slice after the wait group. The same approach could be used with owned
and backedUp
to get rid of the other mutex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the code will be much more complex in that approach. The use of mutex here seems to be acceptable considering the small scope of the lock. Maybe @yeya24 you can share your opinion on this matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a blocker for me, but a nice to have, therefore I will approve.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am fine with the lock here as setting the map is a very lightweight operation
e0e006f
to
613f75d
Compare
@yeya24 please take a look. Thanks |
@@ -731,6 +740,8 @@ func (r *Ruler) listRulesShuffleSharding(ctx context.Context) (map[string]rulesp | |||
mu := sync.Mutex{} | |||
owned := map[string]rulespb.RuleGroupList{} | |||
backedUp := map[string]rulespb.RuleGroupList{} | |||
gLock := sync.Mutex{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am fine with the lock here as setting the map is a very lightweight operation
CHANGELOG.md
Outdated
@@ -8,6 +8,7 @@ | |||
* [ENHANCEMENT] Distributor/Querier: Clean stale per-ingester metrics after ingester restarts. #5930 | |||
* [ENHANCEMENT] Distributor/Ring: Allow disabling detailed ring metrics by ring member. #5931 | |||
* [ENHANCEMENT] KV: Etcd Added etcd.ping-without-stream-allowed parameter to disable/enable PermitWithoutStream #5933 | |||
* [ENHANCEMENT] Ruler: Add new ruler metric `cortex_ruler_rule_groups_in_store`. #5869 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add more details to the changelog about what does this metric help?
We are in the process of releasing 1.18 |
613f75d
to
8c25deb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rebase master and put the changelog entry to master/unreleased
? Then we should be good to go
Signed-off-by: Emmanuel Lodovice <lodovice@amazon.com>
8c25deb
to
6ea9748
Compare
What this PR does:
Add a new per tenant metric called
cortex_ruler_rule_groups_in_store
which contains the number of rule groups each tenant has in store (s3 for example). The metric provides an accurate count of rule groups that is not affected by the health of the rulers or resharding.Which issue(s) this PR fixes:
Fixes #5866
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]