introduce gc delay if rob enabled #781
Conversation
This won't completely fix the issue because for the first GC, there may be no chunks with data if the …
(force-pushed from fad1f1c to e111fd2)
Good point. You're right that by changing the line you've pointed out I could fix it.
@replay The AggMetric.GC() needs to be updated to handle the ReorderBuffer. The GC() call is designed to force chunks to be persisted when users stop sending data (chunks are normally only persisted when a point in the next chunk is received). Currently, there will be reorderWindow points that are not yet in a chunk. When GC() closes the current chunk and persists it, those points will be lost. I think adding this to GC() would work: …
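The snippet that followed the comment above is truncated. As a rough illustration only, here is a minimal, self-contained sketch of the idea; `Point`, `ReorderBuffer`, and `AggMetric` below are simplified stand-ins, not the real mdata types, and it assumes `Flush()` hands the buffered points back so `GC()` can move them into the current chunk before persisting:

```go
package main

import "fmt"

// Point is a stand-in for schema.Point: one timestamped value.
type Point struct {
	Ts  uint32
	Val float64
}

// ReorderBuffer is a stand-in with just enough state for the sketch.
type ReorderBuffer struct {
	buf []Point
}

// Flush drains the buffered points so the caller can move them into a chunk.
func (rob *ReorderBuffer) Flush() []Point {
	res := rob.buf
	rob.buf = nil
	return res
}

// AggMetric is a stand-in; the real type also holds chunks, lastWrite, etc.
type AggMetric struct {
	rob *ReorderBuffer
}

// add stands in for AggMetric's internal method that writes a point into
// the current chunk.
func (a *AggMetric) add(ts uint32, val float64) {
	fmt.Printf("moving point ts=%d val=%g into current chunk\n", ts, val)
}

// GC sketches the fix: before force-persisting and closing the current
// chunk, drain any points still sitting in the reorder buffer so they
// are not lost.
func (a *AggMetric) GC() {
	if a.rob != nil {
		for _, p := range a.rob.Flush() {
			a.add(p.Ts, p.Val)
		}
	}
	// ... then persist/close the current chunk as before ...
}

func main() {
	a := &AggMetric{rob: &ReorderBuffer{buf: []Point{{Ts: 10, Val: 1.5}}}}
	a.GC()
}
```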
FWIW, @woodsaj's solution above is almost exactly what I stuck in my local branch. I added a flush method and called that during GC. It's worth noting that there could be data in the …
@woodsaj thx, I'll use that. Just to be clear, the change I already made, to update …
mdata/reorder_buffer.go

```go
}

func (rob *ReorderBuffer) Flush() []schema.Point {
	res := rob.Get()
```
This will remove datapoints that might still need to stay in the rob for reordering. In my mind, Flush should only remove the datapoints that have aged out of the rob.
Because we are now updating lastWrite when adding to the rob, flush() will always want to get all points, as they will all be older than chunkMaxStale. It is important that metrics are not delayed from being persisted for more than chunkMaxStale. If the metrics don't get written and the MT instance restarts, the metrics could be lost, depending on the Kafka retention.
To prevent loss, the Kafka retention needs to be at least (chunk-max-stale + gc-interval).
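To make that concrete with made-up numbers (not metrictank defaults): if chunk-max-stale is 6h and gc-interval is 1h, a point received just before the sender goes quiet can sit unpersisted for up to 6h before it counts as stale, plus up to another 1h until the next GC pass notices, so a Kafka retention shorter than 7h could drop it before a restarted instance can replay it.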
> Because we are now updating lastWrite when adding to the rob, flush() will always want to get all points as they will all be older than chunkMaxStale

I don't think this is true. The gc-interval and chunk-max-stale are completely independent, and Flush() is going to be called unconditionally for every GC cycle. This means you could write data to the rob just 1 second before a GC cycle happens to kick off. That data is not yet ready to Flush() and is part of the currentChunk.

If the rob were only Flush()d AFTER checking the lastWriteTime, then I believe you would be correct.
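For concreteness, a small sketch of the gating that resolves this; the names (`shouldFlush`, `chunkMaxStale`) are illustrative stand-ins, not the actual metrictank identifiers:

```go
package main

import "fmt"

// shouldFlush applies the same staleness check GC uses for chunks: the rob
// is only drained once the metric's last write is older than chunkMaxStale.
// All arguments are unix timestamps / seconds; the names are stand-ins.
func shouldFlush(now, lastWrite, chunkMaxStale uint32) bool {
	return lastWrite < now-chunkMaxStale
}

func main() {
	// data written 1 second before GC kicks off: stays in the rob
	fmt.Println(shouldFlush(1000, 999, 600)) // false
	// last write older than chunkMaxStale: flush everything in the rob
	fmt.Println(shouldFlush(1000, 300, 600)) // true
}
```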
Ah, I see. I missed the check of the lastWriteTime.
There are 2 reasons an AggMetric gets GCed; GC() has 2 functions: …

There is a sneaky extra way to GC an AggMetric: flushing the …
With the changes already added to update lastWrite when points are added to the rob, this scenario is not possible. All that will happen is: …

This is the desired behaviour. If the last point received (lastWrite) is older than chunk-max-stale, it means every point in the rob is older than chunk-max-stale, so they all need to be persisted.
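For example (illustrative numbers): with chunk-max-stale at 1h and lastWrite at 10:00, a GC pass at 11:30 knows that the newest point in the rob arrived no later than 10:00, so every buffered point is older than chunk-max-stale and can safely be flushed and persisted.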
Actually, what will happen with the current changes and the description I posted is: …
@shanson7 you are right. I didn't notice the return true. So we should just change that to …
Haha, this is why having multiple reviewers is great. You caught the case where I missed something, and I did the other one. I think that should do the trick.
(force-pushed from 2ff3e3b to c082a4c)
(force-pushed from e50e76b to 4c026eb)
can anyone paraphrase all these back-and-forths into a final conclusion of what exactly the problems are and what the solutions are? thanks!
There are 2 basic issues:

1. When GC() force-persists and closes the current chunk, any points still sitting in the reorder buffer are not in any chunk yet, so they would be silently dropped.
2. lastWrite was only updated when points landed in a chunk, so a metric whose recent points are all still in the rob can look stale; in particular, on the first GC there may be no chunks with data at all.

The solution to issue (1): give the ReorderBuffer a Flush() method and have GC() drain the buffered points into the current chunk (only after the lastWrite staleness check) before persisting it.

The solution to issue (2): update lastWrite whenever points are added to the rob, not only when they reach a chunk.
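A hedged sketch of solution (2); the field and method names below are stand-ins for the real mdata code, assuming points pass through the rob before reaching a chunk:

```go
package main

import "time"

// Stand-in types; the real AggMetric and ReorderBuffer live in mdata.
type ReorderBuffer struct{}

type AggMetric struct {
	lastWrite uint32
	rob       *ReorderBuffer
}

// Add sketches solution (2): bump lastWrite as soon as a point enters the
// rob, so GC() doesn't mistake a rob-only metric for a stale one.
func (a *AggMetric) Add(ts uint32, val float64) {
	a.lastWrite = uint32(time.Now().Unix())
	if a.rob != nil {
		// points go into the reorder buffer first and reach a chunk
		// only once they age out of the rob
		return
	}
	// without a rob, the point goes straight into the current chunk
}

func main() {
	m := &AggMetric{rob: &ReorderBuffer{}}
	m.Add(42, 1.0)
}
```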
mdata/reorder_buffer.go
Outdated

```go
	return res
}

func (rob *ReorderBuffer) HasData() bool {
```
k, I'll update & then merge
mdata/reorder_buffer_test.go
Outdated

```diff
 	buf := NewReorderBuffer(10, 1)

-	if buf.HasData() != false {
-		t.Fatalf("Expected HasData() to be false")
+	if buf.IsEmpty() != true {
```
just evaluate buf.IsEmpty directly in all these cases instead of doing comparisons to booleans
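For instance, assuming the package's NewReorderBuffer constructor as shown in the diff, the empty-buffer case collapses to:

```go
func TestReorderBufferIsEmpty(t *testing.T) {
	buf := NewReorderBuffer(10, 1)
	// evaluate the boolean directly instead of comparing it to true/false
	if !buf.IsEmpty() {
		t.Fatalf("Expected a fresh reorder buffer to be empty")
	}
}
```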
k, updated again
mdata/reorder_buffer_test.go
Outdated

```diff
-	if buf.HasData() != false {
-		t.Fatalf("Expected HasData() to be false")
+	if !buf.IsEmpty() {
+		t.Fatalf("Expected IsEmpty() to be false")
```
expected it to be true :p the other messages need updating too :)
updated
This should fix #776.
Still need to reproduce the problem locally in order to be able to test that fix.