Improve GetRules performance #5805

Merged: 2 commits, Apr 1, 2024

@rajagopalanand rajagopalanand commented Mar 6, 2024

What this PR does:

Currently, the manager's SyncRuleGroups and GetRules methods share the same lock. If SyncRuleGroups becomes slow, GetRules has to wait a long time to acquire the lock.

SyncRuleGroups can become slow when rule groups containing slow-running rules are updated, because a rule group waits for the rule it is currently evaluating to finish before it stops.

This PR fixes this problem by:

  1. Using userManagerMtx for its intended purpose, which is to protect the userManagers map
  2. Converting userManagerMtx to an RWMutex so that the exclusive lock is acquired only when a new manager is created or when cleanup occurs at the end of SyncRuleGroups. This reduces both the duration and the scope of the exclusive lock
  3. Introducing syncRuleMtx to ensure SyncRuleGroups executes sequentially
  4. Caching the rule groups before the Prometheus rules/manager.Update is called. Each Prometheus rule manager can still be responsible for a large number of rule groups, and if many of those rule groups have slow-running rules, ListRules will be slow. The cached rule groups are removed as soon as the update completes
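
The four points above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the `manager` and `multiTenantManager` types, the `ruleCache` and `ruleCacheMtx` fields, and the use of plain strings for rule groups are all assumptions made for the example, and it assumes the cache holds a snapshot of the manager's groups taken just before the update.

```go
package main

import (
	"fmt"
	"sync"
)

// manager is a hypothetical stand-in for a per-tenant Prometheus rules manager.
type manager struct {
	mu     sync.Mutex
	groups []string
}

// Update replaces the loaded groups. In the real system this call can block
// for a long time while currently evaluating rules finish.
func (m *manager) Update(groups []string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.groups = groups
}

// RuleGroups returns a copy of the currently loaded groups.
func (m *manager) RuleGroups() []string {
	m.mu.Lock()
	defer m.mu.Unlock()
	return append([]string(nil), m.groups...)
}

type multiTenantManager struct {
	syncRuleMtx    sync.Mutex   // serializes SyncRuleGroups calls
	userManagerMtx sync.RWMutex // protects only the userManagers map
	userManagers   map[string]*manager
	ruleCacheMtx   sync.RWMutex        // hypothetical name
	ruleCache      map[string][]string // snapshot served while an update is in flight
}

func newMultiTenantManager() *multiTenantManager {
	return &multiTenantManager{
		userManagers: map[string]*manager{},
		ruleCache:    map[string][]string{},
	}
}

func (r *multiTenantManager) SyncRuleGroups(rules map[string][]string) {
	r.syncRuleMtx.Lock()
	defer r.syncRuleMtx.Unlock()

	for user, groups := range rules {
		r.userManagerMtx.RLock()
		m, ok := r.userManagers[user]
		r.userManagerMtx.RUnlock()
		if !ok {
			// Exclusive lock is held only while the map is written.
			m = &manager{}
			r.userManagerMtx.Lock()
			r.userManagers[user] = m
			r.userManagerMtx.Unlock()
		}

		// Snapshot the groups so readers do not have to wait for Update.
		r.ruleCacheMtx.Lock()
		r.ruleCache[user] = m.RuleGroups()
		r.ruleCacheMtx.Unlock()

		m.Update(groups) // potentially slow; no map lock is held here

		r.ruleCacheMtx.Lock()
		delete(r.ruleCache, user)
		r.ruleCacheMtx.Unlock()
	}

	// Cleanup: drop managers for tenants that no longer have rules.
	r.userManagerMtx.Lock()
	defer r.userManagerMtx.Unlock()
	for user := range r.userManagers {
		if _, ok := rules[user]; !ok {
			delete(r.userManagers, user)
		}
	}
}

// GetRules prefers the cached snapshot and otherwise takes only a read lock.
func (r *multiTenantManager) GetRules(user string) []string {
	r.ruleCacheMtx.RLock()
	cached, ok := r.ruleCache[user]
	r.ruleCacheMtx.RUnlock()
	if ok {
		return cached
	}
	r.userManagerMtx.RLock()
	m, ok := r.userManagers[user]
	r.userManagerMtx.RUnlock()
	if !ok {
		return nil
	}
	return m.RuleGroups()
}

func main() {
	r := newMultiTenantManager()
	r.SyncRuleGroups(map[string][]string{"tenant-a": {"group-1"}})
	fmt.Println(r.GetRules("tenant-a"))
}
```

Because only SyncRuleGroups creates managers and it is serialized by syncRuleMtx, the check-then-create sequence in this sketch does not need to hold the write lock across both steps.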

Which issue(s) this PR fixes:
Fixes #5745

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@pull-request-size pull-request-size bot added size/L and removed size/M labels Mar 8, 2024
@rajagopalanand rajagopalanand force-pushed the lock-fix branch 2 times, most recently from b77ed02 to 28bb394 Compare March 29, 2024 16:00
@rajagopalanand rajagopalanand marked this pull request as ready for review March 29, 2024 16:51
@@ -184,6 +198,29 @@ func (r *DefaultMultiTenantManager) syncRulesToManager(ctx context.Context, user
}
}

func (r *DefaultMultiTenantManager) getRulesManager(user string, ctx context.Context) (RulesManager, bool) {

Can this method do only one thing, getting the manager from the map, rather than also creating a new one?
Creating a manager and getting it up and running should be a separate function.

pkg/ruler/manager.go (resolved)
yeya24 commented Mar 30, 2024

Can you please update the PR description and mention what this PR is trying to do?

@rajagopalanand rajagopalanand changed the title Restructure mutex such that manager is not holding one mutex for the … Improve GetRules performance Mar 31, 2024
…entirety of sync rules

Signed-off-by: Anand Rajagopal <anrajag@amazon.com>

@emanlodovice emanlodovice left a comment

lgtm

CHANGELOG.md (outdated, resolved)
Signed-off-by: Anand Rajagopal <anrajag@amazon.com>

@yeya24 yeya24 left a comment

Thanks!

@yeya24 yeya24 merged commit 20377e0 into cortexproject:master Apr 1, 2024
16 checks passed
alanprot pushed a commit to alanprot/cortex that referenced this pull request Apr 2, 2024
…entirety of sync rules (cortexproject#5805)

Signed-off-by: Anand Rajagopal <anrajag@amazon.com>

Successfully merging this pull request may close these issues.

Slow running rules from one tenant can cause PrometheusRules API to timeout for all tenants
3 participants