
WIP: Send samples to joining ingester during handover #788

Closed · @leth wants to merge 7 commits from the clean-handover branch

Conversation

@leth (Contributor) commented Apr 9, 2018

Avoids a "Mysterious Flush" (#467) after ingester handover by adding transfer destination information to the ring.

While an ingester handover is underway, the distributor will skip the leaving ingester and pick an extra ingester to maintain the same replication count for incoming samples.

This temporary ingester is never involved in the handover, and will receive samples only for the duration of the handover.
After the handover, since no more samples are being received, the temporary ingester will eventually (by default, 5m later) mark those chunks as idle and flush them.
During a rollout this results in every ingester flushing a shedload of chunks.

This change attempts to prevent that by marking a transfer as underway, sending new samples to the joining ingester, and batching those new samples to be appended to the chunk store after the transfer is complete.

Fixes #467
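
A minimal sketch of the queue-then-replay idea, in Go (the names `joiningSampleQueue` and `userSamples` match the diff below; the `Sample` type, the lock, and the method shape are my assumptions, not code from this PR):

```go
package ingester

import "sync"

// Sample stands in for the Cortex client proto sample type (assumed).
type Sample struct {
	TimestampMs int64
	Value       float64
}

// userSamples groups queued samples by tenant, as in the diff below.
type userSamples struct {
	userID  string
	samples []Sample
}

// Ingester holds a queue of samples received while the transfer is
// still streaming chunks in.
type Ingester struct {
	joiningLock        sync.Mutex
	transferUnderway   bool
	joiningSampleQueue []userSamples
}

// queueSample buffers a sample during the handover instead of appending
// it immediately, so it cannot land in a chunk that is still in flight.
func (i *Ingester) queueSample(userID string, s Sample) {
	i.joiningLock.Lock()
	defer i.joiningLock.Unlock()
	i.joiningSampleQueue = append(i.joiningSampleQueue,
		userSamples{userID: userID, samples: []Sample{s}})
}
```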

@leth (Author) commented Apr 9, 2018

I'm considering adding a new ingester state PREPARING_TO_LEAVE to cover the rendezvous and token marking.

@leth changed the title from "WIP: Send samples to joining ingester during handover" to "Send samples to joining ingester during handover" on Apr 9, 2018
@leth force-pushed the clean-handover branch 2 times, most recently from 7b6dd99 to ab22fa5 on April 9, 2018
```go
userSamples := &i.joiningSampleQueue[j]
userCtx := user.InjectOrgID(stream.Context(), userSamples.userID)
for k := range userSamples.samples {
	i.append(userCtx, &userSamples.samples[k])
```
Reviewer (Contributor):

Returned error is ignored.

@leth (Author):

Should I just log the error?
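
For example, logging and continuing, so one bad sample doesn't abort the whole drain (a sketch; Cortex has its own logger, the standard library `log` package is used here for brevity):

```go
if err := i.append(userCtx, &userSamples.samples[k]); err != nil {
	// Replay is best-effort: log and keep draining the queue.
	log.Printf("failed to replay queued sample for user %s: %v",
		userSamples.userID, err)
}
```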

```go
userID, err := user.ExtractOrgID(ctx)
if err != nil {
	// TODO not sure what to do here
	return &client.WriteResponse{}, nil
```
Reviewer (Contributor):

It shouldn't happen (the request will be rejected without a user ID), but we should at least return nil, err if it does.
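
That suggestion, sketched against the hunk above:

```go
userID, err := user.ExtractOrgID(ctx)
if err != nil {
	// Should not happen (requests without an org ID are rejected
	// upstream), but propagate the error rather than silently succeed.
	return nil, err
}
```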

```go
		i.append(userCtx, &userSamples.samples[k])
	}
}
i.joiningSampleQueue = []userSamples{}
```
Reviewer (Contributor):

Probably want to put this whole loop into a separate function and utilise a defer to do the unlock?
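
Roughly what that refactor might look like (the function name and lock field are assumptions based on the diff; assumes the standard library `log` package and the weaveworks/common `user` package are imported):

```go
// flushJoiningSampleQueue replays the queued samples and clears the
// queue. As its own function, defer covers the unlock on every return
// path, including panics.
func (i *Ingester) flushJoiningSampleQueue(ctx context.Context) {
	i.joiningLock.Lock()
	defer i.joiningLock.Unlock()

	for j := range i.joiningSampleQueue {
		userSamples := &i.joiningSampleQueue[j]
		userCtx := user.InjectOrgID(ctx, userSamples.userID)
		for k := range userSamples.samples {
			if err := i.append(userCtx, &userSamples.samples[k]); err != nil {
				log.Printf("failed to replay queued sample: %v", err)
			}
		}
	}
	i.joiningSampleQueue = nil
}
```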

```diff
@@ -18,6 +18,7 @@ message IngesterDesc {
 message TokenDesc {
   uint32 token = 1;
   string ingester = 2;
+  string nextIngester = 3;
```
Reviewer (Contributor):

Is this extra state redundant? Could this be detected uniquely by a token being owned by a "PENDING" ingester?

@leth (Author):

While the transfer is underway, the tokens are still owned by the leaving ingester. The ingester receiving the chunks is in the JOINING state.

```diff
@@ -26,4 +27,5 @@ enum IngesterState {

   PENDING = 2;
   JOINING = 3;
+  PREPARING_TO_LEAVE = 4;
```
Reviewer (Contributor):

Probably want to start adding comments to explain these states and their transitions - how is PREPARING_TO_LEAVE different from LEAVING?

@leth (Author):

Good idea!

We send new samples to PREPARING_TO_LEAVE ingesters but not LEAVING ingesters.
(We sort out the token.NextIngester pointer while in PREPARING_TO_LEAVE)
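
A hedged sketch of that distinction in the write path (PENDING, JOINING, and PREPARING_TO_LEAVE values match the proto diff above; ACTIVE and LEAVING values, and the function itself, are assumptions):

```go
type IngesterState int32

const (
	ACTIVE             IngesterState = 0 // assumed
	LEAVING            IngesterState = 1 // assumed
	PENDING            IngesterState = 2
	JOINING            IngesterState = 3
	PREPARING_TO_LEAVE IngesterState = 4
)

// shouldReceiveWrites reports whether an ingester should still receive
// new samples for the tokens it owns: PREPARING_TO_LEAVE still accepts
// them (the token handover is only being set up), LEAVING does not.
func shouldReceiveWrites(state IngesterState) bool {
	switch state {
	case ACTIVE, PREPARING_TO_LEAVE:
		return true
	default: // PENDING, JOINING, LEAVING
		return false
	}
}
```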

@tomwilkie (Contributor):

Thanks @leth! Is there a short description of how this is supposed to behave or what this is trying to achieve?

I'm a bit concerned about synchronisation around state changes - can we guarantee that none of the samples added to the joiningSampleQueue on the new ingester will already have been appended to a chunk we later receive? I'm not sure we can, as ring state changes will be delivered asynchronously to distributors.

Will holding the pending sample lock like this when flushing all the samples prevent writes from succeeding to the ingester for a potentially long time? Should they fail instead, or should we be able to add to the queue whilst draining it?

Generally, I think trying to improve the current bootstrapping process could be considered polishing a turd (not that I'm against that - incremental improvements and all). I'm in favour of replacing this whole process with something more akin to other DHTs, where a joining node picks N fresh tokens and incrementally transfers data from the replicas of the ranges they intersect, and leaving nodes do the opposite; this has the added benefit of allowing elastic scaling. WDYT?

@leth (Author) commented Apr 10, 2018

> Thanks @leth! Is there a short description of how this is supposed to behave or what this is trying to achieve?

Sorry, I've updated the PR description to cover that in more detail

> I'm a bit concerned about synchronisation around state changes - can we guarantee that none of the samples added to the joiningSampleQueue on the new ingester will already have been appended to a chunk we later receive? I'm not sure we can, as ring state changes will be delivered asynchronously to distributors.

I think ring state changes should be delivered in order, as a lot of this interaction is pretty synchronous; see the outline below (and the rough code sketch after the list).
I think the only thing I'm unsure about here is whether the gRPC transfer stream is synchronous.

  • leaving ingester state ACTIVE -> PREPARING_TO_LEAVE
  • leaving ingester calls TransferChunks
  • joining ingester state PENDING -> JOINING
  • leaving ingester sends first item down the stream
  • joining ingester marks all tokens with nextIngester
  • leaving ingester state PREPARING_TO_LEAVE -> LEAVING
  • stream send complete
  • unsynchronised:
    • joining ingester
      • claims all tokens
      • state JOINING -> ACTIVE
    • leaving ingester flushes and exits
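
A compressed sketch of the leaving side of that sequence (all names, and the exact synchronisation points, are illustrative assumptions rather than code from this PR):

```go
// transferOut sketches the leaving ingester's half of the handover.
func (i *Ingester) transferOut(ctx context.Context) error {
	// Step 1: stop being a normal ACTIVE ingester, but keep accepting
	// new samples while the rendezvous is set up.
	i.setState(PREPARING_TO_LEAVE)

	// Step 2: open the transfer stream to the joining ingester.
	stream, err := i.joiningClient.TransferChunks(ctx)
	if err != nil {
		return err
	}

	// Steps 3-6: sending the first item is the rendezvous - on seeing
	// it the joiner goes PENDING -> JOINING and marks our tokens with
	// nextIngester, after which we can move to LEAVING ourselves.
	sentFirst := false
	for _, ts := range i.allSeriesChunks() {
		if err := stream.Send(ts); err != nil {
			return err
		}
		if !sentFirst {
			i.setState(LEAVING)
			sentFirst = true
		}
	}

	// Stream complete: the joiner claims the tokens and goes ACTIVE,
	// while we flush whatever remains and exit.
	_, err = stream.CloseAndRecv()
	return err
}
```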

> Will holding the pending sample lock like this when flushing all the samples prevent writes from succeeding to the ingester for a potentially long time? Should they fail instead, or should we be able to add to the queue whilst draining it?

I don't think we hold the lock for a flush, just while we empty the slice into the userStates - it could take some time, but I doubt it'd take that long. I couldn't think of a better way of doing it though; any suggestions?
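
One common way to shorten the critical section (a hedged suggestion, not something from this PR): swap the queue out under the lock, then replay the detached slice without holding it.

```go
// Take ownership of the queue under the lock, then release the lock
// before the (potentially slow) replay, so new writes can keep queueing.
i.joiningLock.Lock()
queued := i.joiningSampleQueue
i.joiningSampleQueue = nil
i.joiningLock.Unlock()

for j := range queued {
	userSamples := &queued[j]
	userCtx := user.InjectOrgID(ctx, userSamples.userID)
	for k := range userSamples.samples {
		if err := i.append(userCtx, &userSamples.samples[k]); err != nil {
			log.Printf("failed to replay queued sample: %v", err)
		}
	}
}
```

The wrinkle is that samples arriving mid-drain would then race with the replay, so per-series ordering between queued and newly appended samples would need checking.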

> Generally, I think trying to improve the current bootstrapping process could be considered polishing a turd (not that I'm against that - incremental improvements and all). I'm in favour of replacing this whole process with something more akin to other DHTs, where a joining node picks N fresh tokens and incrementally transfers data from the replicas of the ranges they intersect, and leaving nodes do the opposite; this has the added benefit of allowing elastic scaling. WDYT?

It certainly sounds better for scaling, but it also sounds like quite a significant change.
It'd be good to have a solid testing harness for that from the outset.

Also, it will need to do something similar to this change to avoid the same mysterious flush issue.

@leth (Author) commented Apr 11, 2018

So I took this for a test drive in our dev cluster today. It didn't crash and burn, but it also didn't fix the flush issue.
Either it doesn't work, or there is another issue in the handover which causes the chunks to be flushed after ~5m.

@leth (Author) commented Apr 11, 2018

I tried some analysis of what happened in Kibana, but didn't reach any conclusions.

@bboreham changed the title from "Send samples to joining ingester during handover" to "WIP: Send samples to joining ingester during handover" on May 10, 2018
@tomwilkie (Contributor):

Development on this seems to have stalled; any plans to revisit, or should we close?

@leth (Author) commented Jun 18, 2018

Thanks for the reminder. I'm unlikely to find time to work on this soon, so will close this PR.

@leth closed this on Jun 18, 2018
@tomwilkie deleted the clean-handover branch on November 19, 2018