
Distributor Warning: removing ingester failing healthcheck #3028

Closed
zhuyanxi opened this issue Aug 13, 2020 · 12 comments

@zhuyanxi

Here's the situation:

When an ingester pod is evicted, the distributor logs a warning:

level=warn ts=2020-08-13T06:29:31.836520675Z caller=pool.go:182 msg="removing ingester failing healthcheck" addr=10.42.7.94:9095 reason="rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.42.7.94:9095: connect: connection refused\""

I think this is because the ingester pod has already been evicted, so the IP address no longer exists. And if the number of evicted ingesters is greater than half of all ingesters, the distributor logs an error:

level=error ts=2020-08-13T06:44:59.834662076Z caller=pool.go:161 msg="error removing stale clients" err="too many failed ingesters"

So what can I do to deal with this problem?

Thanks.

@pracucci
Contributor

When an ingester shuts down gracefully, it removes itself from the ring and the issue you describe shouldn't happen. However, if the ingester pod is not cleanly shut down (e.g. process crash, node failure, ...), the ingester will not be removed from the ring and you're expected to address it manually (if it happens to more than 1 ingester, you may have data loss). By "manually address it" I mean opening the /ring web page on the ingester and manually clicking "Forget" for the affected ingester.
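
For reference, a rough sketch of what that manual "Forget" amounts to, assuming the /ring page accepts a POST whose forget form field names the instance to remove (the endpoint path, field name, address and instance ID below are assumptions; verify them against your Cortex version):

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/url"
)

func main() {
	// Assumption: the ring page is served at /ring and its "Forget" button
	// submits a form field named "forget" containing the instance ID.
	ringURL := "http://distributor.cortex.svc:8080/ring" // hypothetical address
	instanceID := "ingester-3"                           // hypothetical instance ID

	resp, err := http.PostForm(ringURL, url.Values{"forget": {instanceID}})
	if err != nil {
		log.Fatalf("forget request failed: %v", err)
	}
	defer resp.Body.Close()
	fmt.Println("forget response status:", resp.Status)
}
```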

I understand this is not ideal from an operability point of view, and we may reconsider / revisit it.

Getting back to your issue, I have the feeling that when the pod is evicted, SIGTERM is not sent to the Cortex process and no clean shutdown happens. Could you double check that, please?

@zhuyanxi
Author

zhuyanxi commented Aug 17, 2020

@pracucci Thank you very much.
I did some tests these days, and just as you said: if an ingester shuts down gracefully, the issue does not happen; if I manually halt an ingester pod (to simulate an ungraceful shutdown, e.g. node failure or application crash), the ingester is not removed from the ring and I need to manually "Forget" it on the /ring web page.

So I've decided to create a tiny service to monitor the /ring web page; the service will trigger the "Forget" action whenever it finds an "Unhealthy" ingester that wasn't removed correctly.

I also wonder whether it would be possible to add this small feature to Cortex itself; for example, Cortex could provide an API (e.g. /api/v1/ingester/forget) for users to "Forget" an ingester.
And I found that Cortex already provides an API called /shutdown; is this the API to shut down the ingester gracefully?
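
For anyone considering the same workaround, here is a rough sketch of such a watcher, assuming the ring state is scraped from the /ring page HTML and that unhealthy instances can be spotted there; the page layout, the "Unhealthy" marker, the instance naming and the forget form field are assumptions based on this thread, not a stable API:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"net/url"
	"regexp"
	"time"
)

// Hypothetical watcher: scrape the /ring page and forget unhealthy ingesters.
const ringURL = "http://distributor.cortex.svc:8080/ring" // hypothetical address

// Very rough: pull instance IDs out of rows that mention "Unhealthy".
var unhealthyRow = regexp.MustCompile(`(ingester-[\w-]+)[^\n]*Unhealthy`)

func forgetInstance(id string) {
	// Same assumed "forget" form field as the ring page's Forget button.
	resp, err := http.PostForm(ringURL, url.Values{"forget": {id}})
	if err != nil {
		log.Printf("forget %s failed: %v", id, err)
		return
	}
	resp.Body.Close()
	log.Printf("forgot %s, status %s", id, resp.Status)
}

func main() {
	for range time.Tick(1 * time.Minute) {
		resp, err := http.Get(ringURL)
		if err != nil {
			log.Printf("fetching ring page failed: %v", err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		for _, m := range unhealthyRow.FindAllStringSubmatch(string(body), -1) {
			forgetInstance(m[1])
		}
	}
}
```

In practice the watcher should probably only forget instances that have been unhealthy for longer than some grace period, so an ingester that is merely restarting isn't dropped from the ring.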

@stale

stale bot commented Oct 18, 2020

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Oct 18, 2020
stale bot closed this as completed Nov 2, 2020
@jcmoraisjr

Hi @pracucci, what about the case where I click "Forget" and 1-2 seconds later the ingesters are back again?

(screenshots attached: cortex1, cortex2, cortex3)

@pracucci
Contributor

what about if I click "forget" and 1-2 seconds later the ingesters are there again?

If an ingester is running (healthy), it will keep adding itself back to the ring whenever it can't find an entry for itself. This means that if an ingester is running and you "Forget" it, it will automatically be re-added a few seconds later. If the ingester is "unhealthy", it's expected not to be running (e.g. process crashed, node unresponsive, ...), and a manual forget is required to remove that ingester from the ring.
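
To illustrate why forgetting a running ingester doesn't stick, here is a toy sketch of the re-register behaviour described above (this is not the actual Cortex lifecycler code, just the idea):

```go
// Toy illustration only; the real Cortex lifecycler is more involved.
package main

import (
	"log"
	"sync"
	"time"
)

// ring is a stand-in for the KV-backed hash ring.
type ring struct {
	mu      sync.Mutex
	entries map[string]time.Time // instance ID -> last heartbeat
}

func (r *ring) heartbeat(id string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.entries[id]; !ok {
		// The entry was forgotten: a running, healthy ingester simply
		// re-registers itself on its next heartbeat.
		log.Printf("%s not in ring, re-adding itself", id)
	}
	r.entries[id] = time.Now()
}

func (r *ring) forget(id string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.entries, id)
}

func main() {
	r := &ring{entries: map[string]time.Time{}}
	r.heartbeat("ingester-1") // initial registration
	r.forget("ingester-1")    // operator clicks "Forget"
	time.Sleep(time.Second)   // next heartbeat tick
	r.heartbeat("ingester-1") // the healthy ingester reappears
}
```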

@jcmoraisjr

jcmoraisjr commented Nov 17, 2020

Thanks @pracucci for the update. As far as I can tell I had only those three ACTIVE ingester pods running, nothing more. I haven't looked into the Cortex internals yet, but can you confirm that the only way an instance can reappear after a "Forget" click is by re-adding itself to the ring? I'll try to reproduce this again and file another issue with more details. Btw, this is Cortex 1.4.0.

@pstibrany
Contributor

Where do you store the ring? Consul? Etcd? Memberlist?

@jcmoraisjr

Sorry, didn't mention that. Memberlist.

@pstibrany
Contributor

Thanks. I suspect there is some bug related to forgetting when using memberlist, but we haven't yet been able to pinpoint it :(

@pracucci reopened this Dec 3, 2020
@pstibrany
Contributor

Thanks. I suspect there is some bug related to forgetting when using memberlist, but we haven't yet been able to pinpoint it :(

I think #3603 will fix this issue.

@stale

stale bot commented Mar 15, 2021

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Mar 15, 2021
@pstibrany
Contributor

If you still see this, another setting that may help is changing -memberlist.left-ingesters-timeout to a higher value. It defaults to 5 minutes; you can set it to e.g. 30 minutes. It tells components to keep ingester tombstones in the ring for longer. 5 minutes was chosen as a value during which we can reasonably expect a tombstone to propagate to the entire memberlist cluster, but if the cluster is large and propagation isn't complete, some nodes will still keep the old (alive) ingester entries and will eventually propagate them back once the tombstones are removed -- and the forgotten entry will reappear. Increasing -memberlist.left-ingesters-timeout prevents this.
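
For example (the flag is the one named above; the 30m value is just an illustration):

```
-memberlist.left-ingesters-timeout=30m
```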

stale bot removed the stale label Mar 15, 2021