
introduce gc delay if rob enabled #781

Merged: 2 commits merged into master from gcDelay, Dec 14, 2017

Conversation

@replay (Contributor) commented Dec 11, 2017

This should fix #776.

I still need to reproduce the problem locally in order to test the fix.

@shanson7 (Collaborator) commented Dec 11, 2017

This won't completely fix the issue: on the first GC there may be no chunks with data, because the rob could be holding all the datapoints. This line will likely need to change to account for data living in the rob.

@replay force-pushed the gcDelay branch 4 times, most recently from fad1f1c to e111fd2, December 12, 2017 07:18
@replay (Contributor, Author) commented Dec 12, 2017

Good point. You're right that changing the line you pointed out would fix it.
But now I'm thinking the cleaner solution is to make sure .lastWrite gets set to the correct value on Add(), so we don't need to keep track of this gcDelay at all. Otherwise, if we want to reuse the .lastWrite property for other things, we'll have to keep adding these "exceptions" for the rob case.
So I think this is better because it avoids exceptions, and it makes sense to update .lastWrite as soon as a datapoint has been accepted into the rob, since from that moment it is readable/queryable by the user: e111fd2
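For context, a minimal sketch of what "update .lastWrite on Add()" could look like. The names a.rob, a.add, and a.lastWrite match the GC() snippet later in this thread, but the overall shape, the return value of rob.Add(), and deriving lastWrite from wall-clock time (import "time") are assumptions, not the actual diff:

	// Add ingests one point. With the reorder buffer enabled, lastWrite is
	// bumped as soon as the rob accepts the point, since from that moment
	// the point is readable/queryable; points that age out of the rob are
	// written into chunks via the internal add().
	func (a *AggMetric) Add(ts uint32, val float64) {
		a.Lock()
		defer a.Unlock()
		if a.rob != nil {
			aged := a.rob.Add(ts, val) // hypothetical: returns points pushed out of the reorder window
			a.lastWrite = uint32(time.Now().Unix())
			for _, p := range aged {
				a.add(p.Ts, p.Val)
			}
			return
		}
		a.add(ts, val)
		a.lastWrite = uint32(time.Now().Unix())
	}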

@woodsaj (Member) commented Dec 12, 2017

@replay The AggMetric.GC() needs to be updated to handle the ReorderBuffer. The GC() call is designed to force chunks to be persisted when users stop sending data (chunks are normally only persisted when a point in the next chunk is received). Currently, there will be reorderWindow points that are not yet in a chunk. When GC() closes the current chunk and persists it, those points will be lost.

I think adding this to GC() would work

func (a *AggMetric) GC(chunkMinTs, metricMinTs uint32) bool {
	a.Lock()
	defer a.Unlock()

	// if the reorderBuffer is enabled and we have not received a datapoint
	// in a while, then flush the reorder buffer.
	if a.rob != nil && a.lastWrite < chunkMinTs {
		tmpLastWrite := a.lastWrite
		pts := a.rob.Flush()
		for _, p := range pts {
			a.add(p.Ts, p.Val)
		}
		// adding points will cause our lastWrite to be updated,
		// but we want to keep the old value
		a.lastWrite = tmpLastWrite
	}
	// ... rest of GC() as before

@shanson7 (Collaborator) commented

FWIW, @woodsaj's solution above is almost exactly what I put in my local branch. I added a Flush method and called it during GC.

It's worth noting that there could be data in the rob that isn't ready to flush, so len(chunks) can still be 0.

@replay (Contributor, Author) commented Dec 12, 2017

@woodsaj thanks, I'll use that. Just to be clear: the change I already made, updating .lastWrite once a datapoint gets accepted into the rob, would still be necessary to avoid unnecessary GCs, right?

Review thread on the new ReorderBuffer.Flush() in the diff:

func (rob *ReorderBuffer) Flush() []schema.Point {
	res := rob.Get()
@shanson7 (Collaborator):

This will remove datapoints that might still need to stay in the rob for reordering. In my mind, Flush should only remove the datapoints that have aged out of the rob.

@woodsaj (Member):

Because we are now updating lastWrite when adding to the rob, flush() will always want to get all points, as they will all be older than chunkMaxStale. It is important that metrics are not delayed from being persisted for more than chunkMaxStale: if the metrics don't get written and the MT instance restarts, the metrics could be lost, depending on the kafka retention.
To prevent loss, the kafka retention needs to be at least (chunk-max-stale + gc-interval).
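For example, with a hypothetical chunk-max-stale of 6h and gc-interval of 1h, a point could sit unpersisted for up to roughly 7h after being received (it first has to go stale, then wait for the next GC run), so kafka would need at least 7h of retention for it to still be replayable after a restart.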

@shanson7 (Collaborator):

"Because we are now updating lastWrite when adding to the rob, flush() will always want to get all points as they will all be older than chunkMaxStale"

I don't think this is true. gc-interval and chunk-max-stale are completely independent, and Flush() is going to be called unconditionally on every GC cycle. That means you could write data to the rob just 1 second before a GC cycle kicks off; that data is not yet ready to Flush() and belongs to the currentChunk.

If the rob were only Flush()ed AFTER checking the lastWriteTime, then I believe you would be correct.
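(With hypothetical numbers, assuming chunkMinTs = now - chunk-max-stale: if GC runs at t=1000 with chunk-max-stale=600, then chunkMinTs=400. A point written at t=999 makes lastWrite=999, which is not < 400, so the guarded flush in the GC() snippet above is skipped and the point stays in the rob for reordering.)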

@shanson7 (Collaborator), Dec 12, 2017:

Ah, I see. I missed the check of the lastWriteTime.

@shanson7 (Collaborator) commented

An AggMetric GC has 2 functions:

  1. Remove AggMetrics that haven't been written to in metric-max-stale time, to free up memory
  2. Persist chunks that haven't been persisted in chunk-max-stale

There is a sneaky extra way for an AggMetric to get GC'd (like function 1): when it contains no chunks at all. In my particular case, the gc-interval setting lined up with the occasional publishing of the timeseries, so the AggMetric was getting GC'd before the rob received the second datapoint that would have flushed it and created chunk 1. That caused the AggMetric to be recreated on the next point with a brand-new rob and still 0 chunks, and it was then GC'd again. Rinse and repeat.

Flushing the rob in the GC call seems like the right way to go, but Flush should only extract the data that needs to age out of the rob, not ALL of it (flushing everything would actually break the reordering ability in this edge case). Additionally, the len(chunks) == 0 check should become len(chunks) == 0 && !rob.HasData(). This protects against a case like:

  1. New data comes in, adding a few datapoints to the rob
  2. GC kicks off, but the data in the rob shouldn't be flushed, since it is within the rob window (and therefore still eligible for reordering)
  3. There are STILL no chunks, but we have data in the rob, so this AggMetric is safe.

@woodsaj (Member) commented Dec 12, 2017

  1. New data comes in, adding a few datapoints to the rob
  2. GC kicks off, but the data in the rob shouldn't be flushed, since it is within the rob window (and therefore still eligible for reordering)
  3. There are STILL no chunks, but we have data in the rob, so this AggMetric is safe.

With the changes already added to update lastWrite when points are added to the rob, this scenario is not possible.

All that will happen is:

  1. new data comes in and gets added to the rob; lastWrite is updated every time a point is received.
  2. GC runs; only if lastWrite is older than chunk-max-stale will the rob be flushed. This will create 1 or more chunks.
  3. if lastWrite is newer than chunk-max-stale, then nothing further is done.
  4. if lastWrite is older than chunk-max-stale (the rob was flushed), the unfinished chunk will be closed off and persisted.

This is the desired behaviour. If the last point received (lastWrite) is older than chunk-max-stale, then every point in the rob is also older than chunk-max-stale (no buffered point can have been received after the last write), so they all need to be persisted.

@shanson7 (Collaborator) commented

Actually, what will happen with the current changes and the scenario I described is:

  1. new data comes in and gets added to the rob; lastWrite is updated every time a point is received.
  2. GC runs, and the rob isn't flushed since the data was just recently written and a.lastWrite is newer than the chunk-max-stale cutoff (no chunks are created).
  3. len(chunks) == 0, so this AggMetric is marked as eligible for GC.

@woodsaj (Member) commented Dec 12, 2017

@shanson7 you are right, I didn't notice the return true if len(a.chunks) == 0.

So we should just change that to if len(a.chunks) == 0 && !a.rob.HasData()
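A hedged sketch of both pieces; the slice representation of the rob's storage and the nil guard are assumptions layered on the snippets in this thread, not the actual diff:

	// HasData reports whether any point is currently buffered in the rob.
	// Sketch only: assumes the buffer is a slice of schema.Point where an
	// unused slot has Ts == 0.
	func (rob *ReorderBuffer) HasData() bool {
		for _, p := range rob.buf {
			if p.Ts != 0 {
				return true
			}
		}
		return false
	}

	// in AggMetric.GC(), replacing the bare len(a.chunks) == 0 check
	// (nil guard added for when the rob is disabled):
	if len(a.chunks) == 0 && (a.rob == nil || !a.rob.HasData()) {
		return true // nothing in chunks, nothing buffered: safe to drop
	}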

@shanson7 (Collaborator) commented

Haha, this is why having multiple reviewers is great. You caught the case I missed, and I caught the other one. I think that should do the trick.

@replay force-pushed the gcDelay branch 2 times, most recently from 2ff3e3b to c082a4c, December 13, 2017 07:08
@replay (Contributor, Author) commented Dec 13, 2017

Great discussion @woodsaj and @shanson7 :)
I updated the commit: 4c026eb

@replay force-pushed the gcDelay branch 2 times, most recently from e50e76b to 4c026eb, December 13, 2017 07:12
@Dieterbe (Contributor) commented

Can anyone paraphrase all these back-and-forths into a final conclusion of what exactly the problems and the solutions are? Thanks!

@shanson7 (Collaborator) commented

There are 2 basic issues:

  1. If a given series stops getting data for a long time, the chunk can be persisted/GC'd in AggMetric::GC() while data is still sitting in the rob. This causes data loss.
  2. If a given AggMetric has data in the rob but no created chunks (either the reorder window hasn't elapsed, or data stopped flowing as in issue 1), it will be GC'd. This also causes data loss.

The solution to issue 1:

  1. Update lastWriteTime when adding datapoints to the rob (previously this happened only when adding to the chunk)
  2. In the GC, if the metric is stale, take the data from the rob, write it to the chunk, and reset the rob (see the sketch below)

The solution to issue 2:

  1. Change the len(chunks) == 0 check to len(chunks) == 0 && !rob.HasData() when determining whether the AggMetric is eligible to be GC'd.
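To make "take the data from the rob and reset it" concrete, a sketch of what Flush() could look like. The diff only shows that it wraps Get(), so the reset step and the buf representation here are assumptions:

	// Flush drains all buffered points (so the caller can write them into
	// chunks) and resets the buffer. It should only be called once lastWrite
	// is older than the chunk-max-stale cutoff; by then every buffered
	// point has aged past that cutoff as well.
	func (rob *ReorderBuffer) Flush() []schema.Point {
		res := rob.Get() // Get returns the buffered points in order
		// assumed representation: zeroing the slice empties the buffer
		rob.buf = make([]schema.Point, len(rob.buf))
		return res
	}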

The diff continues:

	return res
}

func (rob *ReorderBuffer) HasData() bool {
@Dieterbe (Contributor), Dec 13, 2017:

Super minor comment, but "empty" seems a bit more common than "hasdata" in Go source code; plus it allows us to write positive checks instead of negations, which I think are slightly easier to read.

@replay (Contributor, Author):

k, I'll update & then merge

buf := NewReorderBuffer(10, 1)

-	if buf.HasData() != false {
-		t.Fatalf("Expected HasData() to be false")
+	if buf.IsEmpty() != true {
@Dieterbe (Contributor):

just evaluate buf.IsEmpty directly in all these cases instead of doing comparisons to booleans

@replay (Contributor, Author):

k, updated again

-	if buf.HasData() != false {
-		t.Fatalf("Expected HasData() to be false")
+	if !buf.IsEmpty() {
+		t.Fatalf("Expected IsEmpty() to be false")
@Dieterbe (Contributor):

expected it to be true :p the other messages need updating too :)

@replay (Contributor, Author):

updated

@replay merged commit cdc737f into master on Dec 14, 2017
@Dieterbe deleted the gcDelay branch on December 15, 2017 19:48
@Dieterbe mentioned this pull request on Mar 15, 2018
Successfully merging this pull request may close: reorderBuffer question

4 participants