
Fix/lifecycler/stuck in leaving #1921

Merged

Conversation

thorfour
Contributor

@thorfour thorfour commented Dec 17, 2019

What this PR does: I've observed ingesters that were evicted and then rescheduled. They were unable to shut down gracefully and left their state in the ring as LEAVING. On reschedule, the ingester found its existing ring entry in the LEAVING state, set itself to LEAVING as well, and was then unable to join the cluster.

This allows an ingester that finds itself in the LEAVING state with tokens to mark itself as ACTIVE.
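For context, a minimal Go sketch of that recovery path (simplified types and a hypothetical helper name, not the actual pkg/ring/lifecycler.go diff): on startup, if the lifecycler finds its own ring entry still in LEAVING with a full set of tokens, it reclaims the tokens and flips the entry back to ACTIVE so it can rejoin.

```go
package main

import "fmt"

type IngesterState int

const (
	ACTIVE IngesterState = iota
	LEAVING
)

// IngesterDesc is a simplified stand-in for the ring's per-ingester entry.
type IngesterDesc struct {
	State  IngesterState
	Tokens []uint32
}

// recoverStaleEntry (hypothetical helper) decides what to write back into the
// ring when an ingester restarts and finds a leftover entry under its own ID.
func recoverStaleEntry(desc IngesterDesc, expectedTokens int) IngesterDesc {
	// The previous incarnation set LEAVING but was killed before it could
	// unregister. If the entry still owns its tokens, rejoin as ACTIVE
	// instead of staying stuck in LEAVING.
	if desc.State == LEAVING && len(desc.Tokens) == expectedTokens {
		desc.State = ACTIVE
	}
	return desc
}

func main() {
	stale := IngesterDesc{State: LEAVING, Tokens: make([]uint32, 128)}
	fmt.Println(recoverStaleEntry(stale, 128).State == ACTIVE) // true
}
```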

Which issue(s) this PR fixes:
N/A

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@thorfour thorfour force-pushed the fix/lifecycler/stuck-in-leaving branch from fc72395 to 44e4d7e on December 17, 2019 at 22:12
Contributor

@codesome codesome left a comment

Were you able to find out why the ingester was stuck in the LEAVING state? I found myself in the same situation when working with the WAL, and it turned out that I had the Kubernetes termination grace period set very low (30s). Hence, before the ingester could exit gracefully it was killed, leaving the ring entry in the LEAVING state.

Also, this takes care of issue #1904 (created when I faced the exact same issue).

Review threads on pkg/ring/lifecycler.go (resolved; one outdated)
Contributor

@pracucci pracucci left a comment

I see the problem you're trying to fix. The change sounds reasonable to me, but I'm not familiar enough with all the ring's edge cases. @pstibrany what's your take on this?

Contributor

@pstibrany pstibrany left a comment

Overall, this fixes the problem when ingesters reuse their names (e.g. StatefulSets).

Review thread on pkg/ring/lifecycler.go (resolved)
@thorfour thorfour force-pushed the fix/lifecycler/stuck-in-leaving branch from 44e4d7e to a4d5af3 on December 18, 2019 at 14:53
Thor added 3 commits December 18, 2019 09:00
Signed-off-by: Thor <thansen@digitalocean.com>
Signed-off-by: Thor <thansen@digitalocean.com>
Signed-off-by: Thor <thansen@digitalocean.com>
@thorfour thorfour force-pushed the fix/lifecycler/stuck-in-leaving branch from 2c77f3c to 4a1d45f on December 18, 2019 at 15:00
@khaines
Contributor

khaines commented Dec 18, 2019

This will certainly help the statefulset ingester setup when recovering from an unclean shutdown. Thank you!

@gouthamve gouthamve merged commit 4c91fac into cortexproject:master Jan 3, 2020
@thorfour thorfour deleted the fix/lifecycler/stuck-in-leaving branch January 3, 2020 15:15
@canghai118
Contributor

canghai118 commented Jan 6, 2020

If the ingester is deployed as a Kubernetes Deployment, then every time it is recreated by Kubernetes it registers itself with a different hostname and address. The zombie ingester stays in the ring forever, and eventually clients trying to read or write report too many failed ingesters.

If the ingester is killed by the OOM killer, it never starts the shutdown process and leaves its state as ACTIVE. I just encountered this problem on the ruler (#1956).

By the way, after a successful transfer or flush the ingester sleeps for -ingester.final-sleep and only then unregisters itself from the ring. The default value of -ingester.final-sleep is 30 seconds, and the default termination grace period for a Kubernetes pod is also 30 seconds, so trouble is likely if both parameters are left at their defaults.
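A rough back-of-the-envelope check of that timing (the 30-second values are the defaults mentioned above; the flush duration is hypothetical):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	finalSleep := 30 * time.Second             // -ingester.final-sleep default
	terminationGracePeriod := 30 * time.Second // Kubernetes pod default
	flushOrTransfer := 10 * time.Second        // hypothetical hand-off time

	// A graceful shutdown needs the flush/transfer time plus the final sleep.
	needed := flushOrTransfer + finalSleep
	if needed > terminationGracePeriod {
		fmt.Printf("shutdown needs %v but only %v is granted: the pod is "+
			"killed before it can unregister from the ring\n",
			needed, terminationGracePeriod)
	}
}
```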

The same problem happens on the ruler too.

Maybe we need a way to clean up dead ingesters from the ring.

@thorfour @khaines @pstibrany

@thorfour
Contributor Author

thorfour commented Jan 6, 2020

@canghai118 I'm not sure there's a safe way to determine if an ingester is a "zombie" or if it's just crash looping and will come back eventually. I'd worry that trying to perform cleanups on what we believe to be a zombie ingester might cause more problems than it solves.
