
Distributor Warning: removing ingester failing healthcheck #3028

Closed
zhuyanxi opened this issue Aug 13, 2020 · 12 comments

@zhuyanxi

Here's the situation:

When an ingester pod is evicted, the distributor logs a warning:

level=warn ts=2020-08-13T06:29:31.836520675Z caller=pool.go:182 msg="removing ingester failing healthcheck" addr=10.42.7.94:9095 reason="rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.42.7.94:9095: connect: connection refused\""

I think this is because the ingester pod has already been evicted, so the IP address no longer exists. And if the number of evicted ingesters is greater than half of all ingesters, the distributor logs an error:

level=error ts=2020-08-13T06:44:59.834662076Z caller=pool.go:161 msg="error removing stale clients" err="too many failed ingesters"

So what can I do to deal with this problem?

Thanks.

@pracucci
Contributor

When an ingester shuts down gracefully, it removes itself from the ring and the issue you describe shouldn't happen. However, if the ingester pod is not cleanly shut down (e.g. process crash, node failure, ...), the ingester will not be removed from the ring and you're expected to address it manually (if it happens to more than 1 ingester, you may have data loss). By "manually address it" I mean opening the /ring web page on the ingester and manually clicking "Forget" for the affected ingester.
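
For reference, a rough sketch of what that manual "Forget" amounts to, assuming the /ring page accepts a POST whose forget form field names the instance to remove (the endpoint path, field name, address and instance ID below are assumptions; verify them against your Cortex version):

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/url"
)

func main() {
	// Assumption: the ring page is served at /ring and its "Forget" button
	// submits a form field named "forget" containing the instance ID.
	ringURL := "http://distributor.cortex.svc:8080/ring" // hypothetical address
	instanceID := "ingester-3"                           // hypothetical instance ID

	resp, err := http.PostForm(ringURL, url.Values{"forget": {instanceID}})
	if err != nil {
		log.Fatalf("forget request failed: %v", err)
	}
	defer resp.Body.Close()
	fmt.Println("forget response status:", resp.Status)
}
```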

I understand this is not ideal from an operability point of view, and we may reconsider / revisit it.

Getting back to your issue, I have the feeling that when the pod is evicted, SIGTERM is not sent to the Cortex process and no clean shutdown happens. Could you double check that, please?

@zhuyanxi
Author

zhuyanxi commented Aug 17, 2020

@pracucci Thank you very much.
I did some tests these days, and just as you said: if an ingester shuts down gracefully, the issue does not happen; if I manually halt an ingester pod (to simulate an ungraceful shutdown, e.g. node failure or application crash), the ingester is not removed from the ring and I need to manually "Forget" it on the /ring web page.

So I've decided to create a tiny service to monitor the /ring web page; the service will trigger the "Forget" action whenever it finds an "Unhealthy" ingester that wasn't removed correctly.

I also wonder whether it would be possible to add this small feature to Cortex itself; for example, Cortex could provide an API (e.g. /api/v1/ingester/forget) for users to "Forget" an ingester.
And I found that Cortex already provides an API called /shutdown; is this the API to shut down the ingester gracefully?
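
For anyone considering the same workaround, here is a rough sketch of such a watcher, assuming the ring state is scraped from the /ring page HTML and that unhealthy instances can be spotted there; the page layout, the "Unhealthy" marker, the instance naming and the forget form field are assumptions based on this thread, not a stable API:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"net/url"
	"regexp"
	"time"
)

// Hypothetical watcher: scrape the /ring page and forget unhealthy ingesters.
const ringURL = "http://distributor.cortex.svc:8080/ring" // hypothetical address

// Very rough: pull instance IDs out of rows that mention "Unhealthy".
var unhealthyRow = regexp.MustCompile(`(ingester-[\w-]+)[^\n]*Unhealthy`)

func forgetInstance(id string) {
	// Same assumed "forget" form field as the ring page's Forget button.
	resp, err := http.PostForm(ringURL, url.Values{"forget": {id}})
	if err != nil {
		log.Printf("forget %s failed: %v", id, err)
		return
	}
	resp.Body.Close()
	log.Printf("forgot %s, status %s", id, resp.Status)
}

func main() {
	for range time.Tick(1 * time.Minute) {
		resp, err := http.Get(ringURL)
		if err != nil {
			log.Printf("fetching ring page failed: %v", err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		for _, m := range unhealthyRow.FindAllStringSubmatch(string(body), -1) {
			forgetInstance(m[1])
		}
	}
}
```

In practice the watcher should probably only forget instances that have been unhealthy for longer than some grace period, so an ingester that is merely restarting isn't dropped from the ring.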

@stale

stale bot commented Oct 18, 2020

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Oct 18, 2020
stale bot closed this as completed Nov 2, 2020
@jcmoraisjr

Hi @pracucci, what about the case where I click "Forget" and 1-2 seconds later the ingesters are back again?

(screenshots attached: cortex1, cortex2, cortex3)

@pracucci
Contributor

what about if I click "forget" and 1-2 seconds later the ingesters are there again?

If an ingester is running (healthy), it will keep adding itself back to the ring whenever it can't find an entry for itself. This means that if an ingester is running and you "Forget" it, it will automatically be re-added a few seconds later. If the ingester is "unhealthy", it's expected not to be running (e.g. process crashed, node unresponsive, ...), and a manual forget is required to remove that ingester from the ring.
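
To illustrate why forgetting a running ingester doesn't stick, here is a toy sketch of the re-register behaviour described above (this is not the actual Cortex lifecycler code, just the idea):

```go
// Toy illustration only; the real Cortex lifecycler is more involved.
package main

import (
	"log"
	"sync"
	"time"
)

// ring is a stand-in for the KV-backed hash ring.
type ring struct {
	mu      sync.Mutex
	entries map[string]time.Time // instance ID -> last heartbeat
}

func (r *ring) heartbeat(id string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.entries[id]; !ok {
		// The entry was forgotten: a running, healthy ingester simply
		// re-registers itself on its next heartbeat.
		log.Printf("%s not in ring, re-adding itself", id)
	}
	r.entries[id] = time.Now()
}

func (r *ring) forget(id string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.entries, id)
}

func main() {
	r := &ring{entries: map[string]time.Time{}}
	r.heartbeat("ingester-1") // initial registration
	r.forget("ingester-1")    // operator clicks "Forget"
	time.Sleep(time.Second)   // next heartbeat tick
	r.heartbeat("ingester-1") // the healthy ingester reappears
}
```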

@jcmoraisjr

jcmoraisjr commented Nov 17, 2020

Thanks @pracucci for the update. As far as I can tell I had only those three ACTIVE ingester pods running, nothing more. I haven't looked into the Cortex internals yet, but can you confirm that the only way an instance can reappear after a "Forget" click is by re-adding itself to the ring? I'll try to reproduce this again and file another issue with more details. Btw, this is Cortex 1.4.0.

@pstibrany
Contributor

Where do you store the ring? Consul? Etcd? Memberlist?

@jcmoraisjr

Sorry, didn't mention that. Memberlist.

@pstibrany
Contributor

Thanks. I suspect there is some bug related to forgetting when using memberlist, but we haven't yet been able to pinpoint it :(

@pracucci reopened this Dec 3, 2020
@pstibrany
Contributor

Thanks. I suspect there is some bug related to forgetting when using memberlist, but we haven't yet been able to pinpoint it :(

I think #3603 will fix this issue.

@stale

stale bot commented Mar 15, 2021

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Mar 15, 2021
@pstibrany
Contributor

If you still see this, another setting that may help is changing -memberlist.left-ingesters-timeout to a higher value. It defaults to 5 minutes; you can set it to e.g. 30 minutes. It tells components to keep ingester tombstones in the ring for longer. 5 minutes was chosen as a value during which we can reasonably expect a tombstone to propagate to the entire memberlist cluster, but if the cluster is large and propagation isn't complete, some nodes will still keep the old (alive) ingester entries and will eventually propagate them back once the tombstones are removed -- and the forgotten entry will reappear. Increasing -memberlist.left-ingesters-timeout prevents this.
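
For example (the flag is the one named above; the 30m value is just an illustration):

```
-memberlist.left-ingesters-timeout=30m
```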

stale bot removed the stale label Mar 15, 2021