Ruler: protect overrides map with mutex when accessing tenant configs #11601

Merged on Jan 8, 2024 (3 commits)

Conversation

dannykopping
Contributor

What this PR does / why we need it:
A ruler handling many hundreds of rules can provoke a situation where the WAL appender reads & modifies tenant configs concurrently in an unsafe way; this PR protects that with a mutex.

Which issue(s) this PR fixes:
Fixes #11569

Special notes for your reviewer:
This doesn't need to be locked in a fine-grained way because this isn't on the hot path.

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • CHANGELOG.md updated
    • If the change is worth mentioning in the release notes, add add-to-release-notes label
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • For Helm chart changes bump the Helm chart version in production/helm/loki/Chart.yaml and update production/helm/loki/CHANGELOG.md and production/helm/loki/README.md. Example PR
  • If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

This doesn't need to be locked in a fine-grained way because this isn't on the hot path

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
@dannykopping dannykopping requested a review from a team as a code owner January 8, 2024 08:11
Contributor

github-actions bot commented Jan 8, 2024

Trivy scan found the following vulnerabilities:

  • HIGH, Target: docker.io/grafana/loki:main-3373cc5 (alpine 3.18.4), Type: alpine openssl: Incorrect cipher key and IV length processing in libcrypto3 v3.1.3-r0. Fixed in v3.1.4-r0
  • HIGH, Target: docker.io/grafana/loki:main-3373cc5 (alpine 3.18.4), Type: alpine openssl: Incorrect cipher key and IV length processing in libssl3 v3.1.3-r0. Fixed in v3.1.4-r0

To see more details on these vulnerabilities, and how/where to fix them, please run docker build -t grafana/loki:main-3373cc5 -f cmd/loki/Dockerfile . and then trivy i grafana/loki:main-3373cc5 on your branch. If these were not introduced by your PR, please consider fixing them in a subsequent PR. Thanks!

Contributor

@chaudum chaudum left a comment

LGTM

I think we need to add a changelog entry.

@dannykopping
Contributor Author

> LGTM
>
> I think we need to add a changelog entry.

Considered it but I don't really see the value in tracking these minor bugfixes? I can easily be convinced, though.

@chaudum
Contributor

chaudum commented Jan 8, 2024

> > LGTM
> > I think we need to add a changelog entry.
>
> Considered it but I don't really see the value in tracking these minor bugfixes? I can easily be convinced, though.

IMO, any bugfix should get a changelog entry and should be backported. People who face this issue should be able to see that it's been fixed in a certain release.

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
@dannykopping
Contributor Author

> > > LGTM
> > > I think we need to add a changelog entry.
> >
> > Considered it but I don't really see the value in tracking these minor bugfixes? I can easily be convinced, though.
>
> IMO, any bugfix should get a changelog entry and should be backported. People who face this issue should be able to see that it's been fixed in a certain release.

Fair point about the backport, didn't consider that; that naturally requires the CHANGELOG entry
Added one in 4973875, thanks

Co-authored-by: Christian Haudum <christian.haudum@gmail.com>
@dannykopping dannykopping enabled auto-merge (squash) January 8, 2024 09:14
@dannykopping dannykopping merged commit cd3cf62 into grafana:main Jan 8, 2024
7 checks passed
grafanabot pushed a commit that referenced this pull request Jan 8, 2024
…#11601)

**What this PR does / why we need it**:
A ruler handling many hundreds of rules can provoke a situation where
the WAL appender reads & modifies tenant configs concurrently in an
unsafe way; this PR protects that with a mutex.

**Which issue(s) this PR fixes**:
Fixes #11569

(cherry picked from commit cd3cf62)
@dannykopping dannykopping deleted the dannykopping/fix-ruler-race branch January 8, 2024 12:33
dannykopping pushed a commit that referenced this pull request Jan 8, 2024
dannykopping pushed a commit that referenced this pull request Jan 8, 2024
dannykopping pushed a commit that referenced this pull request Jan 9, 2024
Expands on #11601

**What this PR does / why we need it**:
Turns out the previous tests didn't expose all possible causes for data
races (another one occurs at
https://github.com/grafana/loki/blob/5a55158cc751465846383bc758aa0c169363b292/pkg/ruler/registry.go#L204).
Moving the mutex to the calling function adds more safety.

**Which issue(s) this PR fixes**:
Fixes #11569

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
grafanabot pushed a commit that referenced this pull request Jan 19, 2024
Expands on #11601

**What this PR does / why we need it**:
Turns out the previous tests didn't expose all possible causes for data
races (another one occurs at
https://github.com/grafana/loki/blob/5a55158cc751465846383bc758aa0c169363b292/pkg/ruler/registry.go#L204).
Moving the mutex to the calling function adds more safety.

**Which issue(s) this PR fixes**:
Fixes #11569

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
(cherry picked from commit 61a4205)
dannykopping pushed a commit that referenced this pull request Jan 19, 2024
#11714)

Backport 61a4205 from #11612

---

Expands on #11601

**What this PR does / why we need it**:
Turns out the previous tests didn't expose all possible causes for data
races (another one occurs at
https://github.com/grafana/loki/blob/5a55158cc751465846383bc758aa0c169363b292/pkg/ruler/registry.go#L204).
Moving the mutex to the calling function adds more safety.

**Which issue(s) this PR fixes**:
Fixes #11569

Co-authored-by: Danny Kopping <danny.kopping@grafana.com>
dannykopping pushed a commit that referenced this pull request Jan 19, 2024
#11715)

Backport 61a4205 from #11612

---

Expands on #11601

**What this PR does / why we need it**:
Turns out the previous tests didn't expose all possible causes for data
races (another one occurs at
https://github.com/grafana/loki/blob/5a55158cc751465846383bc758aa0c169363b292/pkg/ruler/registry.go#L204).
Moving the mutex to the calling function adds more safety.

**Which issue(s) this PR fixes**:
Fixes #11569

Co-authored-by: Danny Kopping <danny.kopping@grafana.com>
rhnasc pushed a commit to inloco/loki that referenced this pull request Apr 12, 2024
…grafana#11601)

**What this PR does / why we need it**:
A ruler handling many hundreds of rules can provoke a situation where
the WAL appender reads & modifies tenant configs concurrently in an
unsafe way; this PR protects that with a mutex.

**Which issue(s) this PR fixes**:
Fixes grafana#11569
rhnasc pushed a commit to inloco/loki that referenced this pull request Apr 12, 2024
Expands on grafana#11601

**What this PR does / why we need it**:
Turns out the previous tests didn't expose all possible causes for data
races (another one occurs at
https://github.com/grafana/loki/blob/5a55158cc751465846383bc758aa0c169363b292/pkg/ruler/registry.go#L204).
Moving the mutex to the calling function adds more safety.

**Which issue(s) this PR fixes**:
Fixes grafana#11569

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Development

Successfully merging this pull request may close these issues.

Loki Ruler: panic with 'fatal error: concurrent map read and map write'
2 participants