Mysterious flush of underutilised chunks 1hr after ingester rollout #467

Closed
awh opened this issue Jun 19, 2017 · 14 comments · Fixed by #841

Comments

@awh
Contributor

awh commented Jun 19, 2017

On the 15th of June:
screenshot from 2017-06-19 17-23-20

On the 19th of June:
screenshot from 2017-06-19 17-23-33

Both of these happened approximately one hour after an ingester upgrade in which chunks were successfully transferred from terminating ingesters. Chunk max idle is 1h, possibly related?

CC @tomwilkie any thoughts?
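
For context, a minimal sketch of the idle-flush rule being suspected here (illustrative only; the constant and function names are assumptions, not the actual Cortex code): a chunk that has received no samples for the configured max idle period (1h here) is flushed even if it is nowhere near full, which would line up with a flush roughly one hour after appends to it stop.

```go
package main

import (
	"fmt"
	"time"
)

// maxChunkIdle mirrors the 1h "chunk max idle" setting mentioned above
// (illustrative name, not the real Cortex config field).
const maxChunkIdle = time.Hour

// shouldFlushIdle sketches the rule: if nothing has been appended to a
// chunk for maxChunkIdle, it becomes eligible for flushing regardless of
// how full (utilised) it is.
func shouldFlushIdle(lastAppend, now time.Time) bool {
	return now.Sub(lastAppend) > maxChunkIdle
}

func main() {
	lastAppend := time.Now().Add(-65 * time.Minute)      // appends stopped ~1h5m ago
	fmt.Println(shouldFlushIdle(lastAppend, time.Now())) // true: flushed underutilised
}
```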

@tomwilkie
Contributor

tomwilkie commented Jun 19, 2017 via email

@awh
Contributor Author

awh commented Jun 19, 2017

Oh - could those be series from metrics scraped from the ingesters which have now terminated?

@awh
Contributor Author

awh commented Jun 20, 2017

Querying for {job="cortex/ingester"} returns only 1406 time series.

@tomwilkie
Contributor

tomwilkie commented Jun 20, 2017

I assume this was part of a rolling upgrade of all of cortex? What does count({job=~"cortex/.*"}) return?

@awh
Contributor Author

awh commented Jun 20, 2017

9862, and no: in both cases it was just the ingester that was rolled out (in the first instance we had upgraded all of cortex but delayed the ingester while we understood #458, so it was eventually done in isolation; in the second instance we only needed to roll the ingester to deploy #460).

@awh
Contributor Author

awh commented Jun 20, 2017

It's almost like the state reconstructed from transferred chunks in the new ingesters is not identical to that in the old, and so they are making different decisions about what should be flushed...

@awh
Contributor Author

awh commented Jun 20, 2017

This is also happening on rolling reboots.

@leth
Contributor

leth commented Jun 20, 2017

As @awh said, we just had a single ingester replaced due to a node reboot, and the queue peaked at 600K.

@leth leth closed this as completed Jun 30, 2017
@leth leth reopened this Jun 30, 2017
@leth
Contributor

leth commented Jun 30, 2017

Oops

@leth
Contributor

leth commented Aug 3, 2017

I had a brainwave last night: during ingester handover the old and new ingesters are no longer in ACTIVE state, so the distributor will pick an extra ingester which was not selected before the handover started.
After the handover has finished, the distributor returns to selecting the ingesters covering the original part of the ring (i.e. the new ingester), and the extra ingester is no longer selected.
After $idleChunkTime the extra ingester flushes these now-idle chunks.
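
A toy sketch of that effect, assuming a much simplified ring (this is not the real Cortex ring/distributor code; names and the selection rule are assumptions): while the handover pair is LEAVING/JOINING, the walk round the ring selects an extra ACTIVE ingester that is never picked in steady state, and once the handover completes that extra ingester stops receiving samples for those series.

```go
package main

import "fmt"

type State int

const (
	ACTIVE State = iota
	LEAVING
	JOINING
)

type Ingester struct {
	Name  string
	State State
}

// pick walks the ring from a starting token and returns the ingester names a
// write would go to. Ingesters that are not ACTIVE do not count towards the
// replication factor, so during a handover the walk reaches further round the
// ring and selects an ingester it would not normally touch.
func pick(ring []Ingester, start, replication int) []string {
	var selected []string
	for i := 0; len(selected) < replication && i < len(ring); i++ {
		ing := ring[(start+i)%len(ring)]
		if ing.State != ACTIVE {
			continue // skipped: the handover pair is LEAVING/JOINING
		}
		selected = append(selected, ing.Name)
	}
	return selected
}

func main() {
	steady := []Ingester{
		{"ingester-0", ACTIVE}, {"ingester-1", ACTIVE}, {"ingester-2", ACTIVE},
	}
	handover := []Ingester{
		{"ingester-0", LEAVING}, {"ingester-0-new", JOINING},
		{"ingester-1", ACTIVE}, {"ingester-2", ACTIVE},
	}

	fmt.Println(pick(steady, 0, 2))   // [ingester-0 ingester-1]
	fmt.Println(pick(handover, 0, 2)) // [ingester-1 ingester-2] <- ingester-2 is the extra one
}
```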

@leth
Contributor

leth commented Aug 3, 2017

To prevent the ingester picker from picking a temporary node (and allow N concurrent upgrades), we'd need to ensure replication - quorum - N > 0, and allow the picker to select fewer nodes during a handover (quorum < X < replication).
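
Roughly, that constraint could look like the following sketch (hypothetical helper, not a proposed Cortex API): with replication factor R, write quorum Q and N concurrent handovers, skipping the temporary nodes instead of extending the selection is only safe while R - Q - N > 0, so at least a quorum of ACTIVE ingesters is still chosen even though fewer than R nodes are selected.

```go
package main

import "fmt"

// canSkipTemporaryNodes sketches the condition described above: the picker
// may leave the temporary (non-ACTIVE) nodes out of the selection, rather
// than pulling in extra ones, only while replication - quorum - handovers > 0.
func canSkipTemporaryNodes(replication, quorum, handovers int) bool {
	return replication-quorum-handovers > 0
}

func main() {
	fmt.Println(canSkipTemporaryNodes(3, 2, 1)) // false: R=3, Q=2 cannot absorb even one handover this way
	fmt.Println(canSkipTemporaryNodes(5, 3, 1)) // true: one handover can be absorbed without an extra node
}
```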

@tomwilkie
Contributor

tomwilkie commented Aug 3, 2017 via email

@awh
Contributor Author

awh commented Aug 3, 2017

Aye, good spot 👏

@stale

stale bot commented Feb 3, 2020

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
