
Fix/lifecycler/stuck in leaving #1921

Merged

Conversation

thorfour
Contributor

@thorfour thorfour commented Dec 17, 2019

What this PR does: I've observed ingesters that were evicted and then rescheduled. They were unable to shut down gracefully and left their state in the ring as LEAVING. On reschedule, the ingester found its existing ring entry in the LEAVING state, set itself to LEAVING as well, and was then unable to join the cluster.

This allows an ingester that finds itself in the LEAVING state with tokens to mark itself as ACTIVE.
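For context, a minimal Go sketch of that recovery path (simplified types and a hypothetical helper name, not the actual pkg/ring/lifecycler.go diff): on startup, if the lifecycler finds its own ring entry still in LEAVING with a full set of tokens, it reclaims the tokens and flips the entry back to ACTIVE so it can rejoin.

```go
package main

import "fmt"

type IngesterState int

const (
	ACTIVE IngesterState = iota
	LEAVING
)

// IngesterDesc is a simplified stand-in for the ring's per-ingester entry.
type IngesterDesc struct {
	State  IngesterState
	Tokens []uint32
}

// recoverStaleEntry (hypothetical helper) decides what to write back into the
// ring when an ingester restarts and finds a leftover entry under its own ID.
func recoverStaleEntry(desc IngesterDesc, expectedTokens int) IngesterDesc {
	// The previous incarnation set LEAVING but was killed before it could
	// unregister. If the entry still owns its tokens, rejoin as ACTIVE
	// instead of staying stuck in LEAVING.
	if desc.State == LEAVING && len(desc.Tokens) == expectedTokens {
		desc.State = ACTIVE
	}
	return desc
}

func main() {
	stale := IngesterDesc{State: LEAVING, Tokens: make([]uint32, 128)}
	fmt.Println(recoverStaleEntry(stale, 128).State == ACTIVE) // true
}
```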

Which issue(s) this PR fixes:
N/A

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@thorfour thorfour force-pushed the fix/lifecycler/stuck-in-leaving branch from fc72395 to 44e4d7e on December 17, 2019 at 22:12
Contributor

@codesome codesome left a comment

Were you able to find out why the ingester was stuck in the LEAVING state? I found myself in the same situation when working with the WAL, and it turned out that I had the Kubernetes termination grace period set very low (30s). Hence, before the ingester could exit gracefully it was killed, leaving the ring entry in the LEAVING state.

Also, this takes care of issue #1904 (created when I faced the exact same issue).

Review threads on pkg/ring/lifecycler.go (resolved; one outdated)
Contributor

@pracucci pracucci left a comment

I see the problem you're trying to fix. The change sounds reasonable to me, but I'm not familiar enough with all the ring's edge cases. @pstibrany what's your take on this?

Contributor

@pstibrany pstibrany left a comment

Overall, this fixes the problem when ingesters reuse their names (e.g. StatefulSets).

Review thread on pkg/ring/lifecycler.go (resolved)
@thorfour thorfour force-pushed the fix/lifecycler/stuck-in-leaving branch from 44e4d7e to a4d5af3 on December 18, 2019 at 14:53
Thor added 3 commits December 18, 2019 09:00
Signed-off-by: Thor <thansen@digitalocean.com>
Signed-off-by: Thor <thansen@digitalocean.com>
Signed-off-by: Thor <thansen@digitalocean.com>
@thorfour thorfour force-pushed the fix/lifecycler/stuck-in-leaving branch from 2c77f3c to 4a1d45f on December 18, 2019 at 15:00
@khaines
Contributor

khaines commented Dec 18, 2019

This will certainly help the statefulset ingester setup when recovering from an unclean shutdown. Thank you!

@gouthamve gouthamve merged commit 4c91fac into cortexproject:master Jan 3, 2020
@thorfour thorfour deleted the fix/lifecycler/stuck-in-leaving branch January 3, 2020 15:15
@canghai118
Contributor

canghai118 commented Jan 6, 2020

If the ingester is deployed as a Kubernetes Deployment, then every time it is recreated by Kubernetes it registers itself with a different hostname and address. The zombie ingester stays in the ring forever, and eventually clients trying to read or write report too many failed ingesters.

If the ingester is killed by the OOM killer, it never starts the shutdown process and leaves its state as ACTIVE. I just encountered this problem on the ruler (#1956).

By the way, after a successful transfer or flush the ingester sleeps for -ingester.final-sleep and only then unregisters itself from the ring. The default value of -ingester.final-sleep is 30 seconds, and the default termination grace period for a Kubernetes pod is also 30 seconds, so trouble is likely if both parameters are left at their defaults.
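A rough back-of-the-envelope check of that timing (the 30-second values are the defaults mentioned above; the flush duration is hypothetical):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	finalSleep := 30 * time.Second             // -ingester.final-sleep default
	terminationGracePeriod := 30 * time.Second // Kubernetes pod default
	flushOrTransfer := 10 * time.Second        // hypothetical hand-off time

	// A graceful shutdown needs the flush/transfer time plus the final sleep.
	needed := flushOrTransfer + finalSleep
	if needed > terminationGracePeriod {
		fmt.Printf("shutdown needs %v but only %v is granted: the pod is "+
			"killed before it can unregister from the ring\n",
			needed, terminationGracePeriod)
	}
}
```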

The same problem happens on the ruler too.

Maybe we need a way to clean up dead ingesters from the ring.

@thorfour @khaines @pstibrany

@thorfour
Contributor Author

thorfour commented Jan 6, 2020

@canghai118 I'm not sure there's a safe way to determine if an ingester is a "zombie" or if it's just crash looping and will come back eventually. I'd worry that trying to perform cleanups on what we believe to be a zombie ingester might cause more problems than it solves.
