
Loki Ruler: panic with 'fatal error: concurrent map read and map write' #11569

Closed
els0r opened this issue Jan 3, 2024 · 5 comments · Fixed by #11601, #11612 or #13676

Comments

@els0r

els0r commented Jan 3, 2024

Describe the bug
After running for about 15 minutes, the Loki ruler suddenly panics with

fatal error: concurrent map read and map write

We evaluate around 600 rules every minute. The crashes occur regularly: we have been seeing frequent container restarts since 2023-12-25 12:00 CET.

The stack trace suggests that this has to do with WAL handling.

To Reproduce
Steps to reproduce the behavior:

  1. Run Loki
  2. Evaluate 600+ rules with the ruler every minute
  3. Wait 10-15 minutes
  4. Observe logs of ruler with kubectl -n loki logs -f ruler-<num>
  5. The last thing shown should be a 1K+ line stack trace (see the head of it below)

Expected behavior
Concurrency-safe map access; the ruler keeps running.

Environment:

  • Kubernetes (AKS), Loki microservices deployed, ruler talks to query-frontend for rule evaluation
  • Helm
  • Loki Version: 2.9.3

Screenshots, Promtail config, or terminal output

Panic (please let me know if you require the full stack trace)

goroutine 1287 [running]:
reflect.mapaccess_faststr(0x2219be0?, 0x18?, {0xc001dee280?, 0xc002028b40?})
	/usr/local/go/src/runtime/map.go:1360 +0x18
reflect.Value.MapIndex({0x2320d20?, 0xc003ea7770?, 0xcc04291ee70?}, {0x2219be0, 0xc004cea690, 0x98})
	/usr/local/go/src/reflect/value.go:1754 +0xbb
github.com/imdario/mergo.deepMerge({0x2320d20?, 0xc003ea7770?, 0xcc04291ed60?}, {0x2320d20?, 0xc003ea7a30?, 0xc003ecbf80?}, 0x0?, 0x1, 0xc0094d6648)
	/src/loki/vendor/github.com/imdario/mergo/merge.go:125 +0x17f4
github.com/imdario/mergo.deepMerge({0x26ad420?, 0xc003ea7760?, 0x26ad420?}, {0x26ad420?, 0xc003ea7a20?, 0x0?}, 0x0?, 0x0, 0xc0094d6648)
	/src/loki/vendor/github.com/imdario/mergo/merge.go:94 +0x12aa
github.com/imdario/mergo.merge({0x2384160, 0xc003ea7760}, {0x26ad420, 0xc003ea7a20}, {0xc003ed5180, 0x1, 0xc002e060d0?})
	/src/loki/vendor/github.com/imdario/mergo/merge.go:395 +0x2cb
github.com/imdario/mergo.Merge(...)
	/src/loki/vendor/github.com/imdario/mergo/merge.go:319
github.com/grafana/loki/pkg/ruler.(*walRegistry).getTenantRemoteWriteConfig(0xc000a85c00, {0xc0021581b0, 0x15}, {0x0, 0xc000b96db0, 0x1, 0x2540be400})
	/src/loki/pkg/ruler/registry.go:321 +0x7d2
github.com/grafana/loki/pkg/ruler.(*walRegistry).Appender(0xc000a85c00, {0x2ea99e0, 0xc006298780})
	/src/loki/pkg/ruler/registry.go:130 +0x9e
github.com/prometheus/prometheus/rules.(*Group).Eval.func1({0x2ea99e0, 0xc005a78cf0}, 0xc001db6e00, {0x1?, 0x1?, 0x0?}, 0xc003ed5c50, 0x0, {0x2ec38c0, 0xc003f41140})
	/src/loki/vendor/github.com/prometheus/prometheus/rules/manager.go:680 +0x562
github.com/prometheus/prometheus/rules.(*Group).Eval(0xc001db6e00, {0x2ea99e0, 0xc005a78cf0}, {0x2e75730?, 0x4395e20?, 0x0?})
	/src/loki/vendor/github.com/prometheus/prometheus/rules/manager.go:758 +0xbf
github.com/prometheus/prometheus/rules.DefaultEvalIterationFunc({0x2ea99e0, 0xc005a78cf0}, 0xc001db6e00, {0xdf8475800?, 0x2321a40?, 0x0?})
	/src/loki/vendor/github.com/prometheus/prometheus/rules/manager.go:460 +0x13a
github.com/prometheus/prometheus/rules.(*Group).run(0xc001db6e00, {0x2ea99e0, 0xc00214f2f0})
	/src/loki/vendor/github.com/prometheus/prometheus/rules/manager.go:444 +0x84d
github.com/prometheus/prometheus/rules.(*Manager).Update.func1(0x0?)
	/src/loki/vendor/github.com/prometheus/prometheus/rules/manager.go:1064 +0xa6
created by github.com/prometheus/prometheus/rules.(*Manager).Update in goroutine 547
	/src/loki/vendor/github.com/prometheus/prometheus/rules/manager.go:1054 +0x37d
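
For context, the fatal error above comes from the Go runtime's built-in check for unsynchronized map access; it is not a recoverable panic. Below is a minimal toy reproduction of the failure class. This is not Loki code, just an illustration of what the runtime reports when something like mergo reads a tenant config map while another goroutine writes it:

```go
// Toy reproduction of "fatal error: concurrent map read and map write".
// Not Loki code; it only shows the failure class the Go runtime reports
// when a map is read and written from different goroutines without locking.
package main

func main() {
	overrides := map[string]string{"tenant-a": "remote-write-url"}

	// Writer goroutine: keeps mutating the shared map.
	go func() {
		for {
			overrides["tenant-a"] = "changed"
		}
	}()

	// Reader: an unsynchronized read eventually trips the runtime check,
	// and the whole process dies with the fatal error seen above.
	for {
		_ = overrides["tenant-a"]
	}
}
```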

Restart pattern from kubectl -n loki get pods:

ruler-ruler-0                            3/3     Running   624 (2m42s ago)   12d
ruler-ruler-1                            3/3     Running   809 (10m ago)     12d

Please let me know if there's more information required to look at this in detail.

Thanks and happy new year folks! 🥳

@dannykopping
Contributor

@els0r thanks a lot for this detailed bug report!
We'll look into it asap

dannykopping pushed a commit that referenced this issue Jan 8, 2024
…#11601)

**What this PR does / why we need it**:
A ruler handling many hundreds of rules can provoke a situation where
the WAL appender reads & modifies tenant configs concurrently in an
unsafe way; this PR protects that with a mutex.

**Which issue(s) this PR fixes**:
Fixes #11569
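
For readers following along, here is a minimal sketch of the kind of guard the commit above describes, using hypothetical names (registry, tenantConfigs) rather than the actual fields in pkg/ruler/registry.go:

```go
// Sketch only: protect per-tenant config reads/merges with a mutex and hand
// callers a copy, so nothing outside the lock touches the shared maps.
// Names are illustrative, not the real walRegistry implementation.
package ruler

import "sync"

type remoteWriteConfig struct {
	Headers map[string]string
}

type registry struct {
	mu            sync.Mutex
	tenantConfigs map[string]remoteWriteConfig
}

func (r *registry) getTenantConfig(tenant string) remoteWriteConfig {
	r.mu.Lock()
	defer r.mu.Unlock()

	cfg := r.tenantConfigs[tenant]

	// Deep-copy the nested map before returning it, so a later merge by the
	// caller cannot race with a concurrent update of the shared map.
	headers := make(map[string]string, len(cfg.Headers))
	for k, v := range cfg.Headers {
		headers[k] = v
	}
	cfg.Headers = headers
	return cfg
}

func (r *registry) setTenantConfig(tenant string, cfg remoteWriteConfig) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.tenantConfigs[tenant] = cfg
}
```
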
grafanabot pushed a commit that referenced this issue Jan 8, 2024
…#11601)

**What this PR does / why we need it**:
A ruler handling many hundreds of rules can provoke a situation where
the WAL appender reads & modifies tenant configs concurrently in an
unsafe way; this PR protects that with a mutex.

**Which issue(s) this PR fixes**:
Fixes #11569

(cherry picked from commit cd3cf62)
grafanabot pushed a commit that referenced this issue Jan 8, 2024
…#11601)

**What this PR does / why we need it**:
A ruler handling many hundreds of rules can provoke a situation where
the WAL appender reads & modifies tenant configs concurrently in an
unsafe way; this PR protects that with a mutex.

**Which issue(s) this PR fixes**:
Fixes #11569

(cherry picked from commit cd3cf62)
@dannykopping
Contributor

@els0r would you be willing to try a nightly build to see if your problem is addressed?
I believe my test was replicating your production issue, but it's always good to confirm.

@els0r
Author

els0r commented Jan 8, 2024

@dannykopping: thanks a lot for the quick turnaround. My colleague @verejoel is in the process of deploying your fix in our prod environment. We'll be back with feedback within the next few hours.

@els0r
Author

els0r commented Jan 8, 2024

We've been running pretty stably so far:

ruler-ruler-0                            3/3     Running   0               45m
ruler-ruler-1                            3/3     Running   0               44m
ruler-ruler-2                            3/3     Running   0               44m
ruler-ruler-3                            3/3     Running   0               43m
ruler-ruler-4                            3/3     Running   0               43m

@dannykopping
Contributor

Super 👍 thanks for testing it out.
Any issues with remote-write? These overrides are retrieved when appending to the WAL, which is then polled and the samples remote-written to the Prometheus source; it's possible this could be delayed slightly by the locking introduced.

dannykopping pushed a commit that referenced this issue Jan 9, 2024
Expands on #11601

**What this PR does / why we need it**:
Turns out the previous tests didn't expose all possible causes for data
races (another one occurs at
https://github.com/grafana/loki/blob/5a55158cc751465846383bc758aa0c169363b292/pkg/ruler/registry.go#L204).
Moving the mutex to the calling function adds more safety.

**Which issue(s) this PR fixes**:
Fixes #11569

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
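
As a rough illustration of the follow-up described in this commit (again with hypothetical names, not the actual Appender/registry code): instead of locking only inside one helper, the calling function holds the lock across every access to the shared per-tenant state, so a sibling access on the same path cannot reintroduce the race.

```go
// Sketch of locking at the caller rather than inside a single helper.
// Hypothetical names; not the actual pkg/ruler code.
package ruler

import "sync"

type walRegistrySketch struct {
	mu        sync.Mutex
	overrides map[string]map[string]string // tenant -> remote-write overrides
}

// configForAppender does everything that touches the shared map under one
// lock: lookup, lazy init, and merge. Helpers called from here stay lock-free
// and rely on the caller holding r.mu, which is what moving the mutex up buys.
func (r *walRegistrySketch) configForAppender(tenant string) map[string]string {
	r.mu.Lock()
	defer r.mu.Unlock()

	o, ok := r.overrides[tenant]
	if !ok {
		o = map[string]string{}
		r.overrides[tenant] = o
	}

	// Merge defaults with overrides and return a copy, so nothing that still
	// points at shared state escapes the critical section.
	merged := map[string]string{"url": "http://example-prometheus/api/v1/write"}
	for k, v := range o {
		merged[k] = v
	}
	return merged
}
```
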
grafanabot pushed a commit that referenced this issue Jan 19, 2024
Expands on #11601

**What this PR does / why we need it**:
Turns out the previous tests didn't expose all possible causes for data
races (another one occurs at
https://github.com/grafana/loki/blob/5a55158cc751465846383bc758aa0c169363b292/pkg/ruler/registry.go#L204).
Moving the mutex to the calling function adds more safety.

**Which issue(s) this PR fixes**:
Fixes #11569

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
(cherry picked from commit 61a4205)
grafanabot pushed a commit that referenced this issue Jan 19, 2024
Expands on #11601

**What this PR does / why we need it**:
Turns out the previous tests didn't expose all possible causes for data
races (another one occurs at
https://github.com/grafana/loki/blob/5a55158cc751465846383bc758aa0c169363b292/pkg/ruler/registry.go#L204).
Moving the mutex to the calling function adds more safety.

**Which issue(s) this PR fixes**:
Fixes #11569

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
(cherry picked from commit 61a4205)
dannykopping pushed a commit that referenced this issue Jan 19, 2024
#11714)

Backport 61a4205 from #11612

---

Expands on #11601

**What this PR does / why we need it**:
Turns out the previous tests didn't expose all possible causes for data
races (another one occurs at
https://github.com/grafana/loki/blob/5a55158cc751465846383bc758aa0c169363b292/pkg/ruler/registry.go#L204).
Moving the mutex to the calling function adds more safety.

**Which issue(s) this PR fixes**:
Fixes #11569

Co-authored-by: Danny Kopping <danny.kopping@grafana.com>
dannykopping pushed a commit that referenced this issue Jan 19, 2024
#11715)

Backport 61a4205 from #11612

---

Expands on #11601

**What this PR does / why we need it**:
Turns out the previous tests didn't expose all possible causes for data
races (another one occurs at
https://github.com/grafana/loki/blob/5a55158cc751465846383bc758aa0c169363b292/pkg/ruler/registry.go#L204).
Moving the mutex to the calling function adds more safety.

**Which issue(s) this PR fixes**:
Fixes #11569

Co-authored-by: Danny Kopping <danny.kopping@grafana.com>
rhnasc pushed a commit to inloco/loki that referenced this issue Apr 12, 2024
…grafana#11601)

**What this PR does / why we need it**:
A ruler handling many hundreds of rules can provoke a situation where
the WAL appender reads & modifies tenant configs concurrently in an
unsafe way; this PR protects that with a mutex.

**Which issue(s) this PR fixes**:
Fixes grafana#11569
rhnasc pushed a commit to inloco/loki that referenced this issue Apr 12, 2024
Expands on grafana#11601

**What this PR does / why we need it**:
Turns out the previous tests didn't expose all possible causes for data
races (another one occurs at
https://github.com/grafana/loki/blob/5a55158cc751465846383bc758aa0c169363b292/pkg/ruler/registry.go#L204).
Moving the mutex to the calling function adds more safety.

**Which issue(s) this PR fixes**:
Fixes grafana#11569

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>