This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

Push chunks of hot metrics into cache when they get completed by AggMetric #461

Merged

merged 4 commits on Jan 23, 2017

Conversation

replay
Contributor

replay commented Jan 10, 2017

  • When AggMetric completes a chunk, the chunk now gets passed into a callback that refers to Cache.CacheIfHot. This happens regardless of whether the node is the cluster primary (rough sketch below).
  • The chunk cache gets a new exported method Cache.CacheIfHot, which takes an IterGen that should be cached only if its metric is hot. In this context "hot" means that the previous chunk of that metric is currently cached.
  • Unit tests for everything that changed.
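Rough sketch of that flow (illustrative only, not the actual metrictank code; IterGen here is a local stand-in for chunk.IterGen and the callback signature is assumed):

// stand-in for chunk.IterGen, just enough for the sketch
type IterGen struct {
	B  []byte // compressed chunk data
	T0 uint32 // chunk start timestamp
}

func (ig IterGen) Ts() uint32 { return ig.T0 }

// the callback handed to AggMetric; it refers to Cache.CacheIfHot
type CacheIfHotCb func(metric string, prev uint32, itergen IterGen)

type AggMetric struct {
	key        string
	cacheIfHot CacheIfHotCb
}

// invoked whenever the current chunk fills up and gets sealed
func (m *AggMetric) chunkComplete(prevTs uint32, ig IterGen) {
	// always invoke the cache callback, primary or not; persisting
	// to the store (primary only) is a separate code path
	m.cacheIfHot(m.key, prevTs, ig)
}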

replay changed the title Push chunks of hot metrics into cache when they get completed in AggMetric [WIP] Push chunks of hot metrics into cache when they get completed in AggMetric Jan 10, 2017
replay changed the title [WIP] Push chunks of hot metrics into cache when they get completed in AggMetric Push chunks of hot metrics into cache when they get completed by AggMetric Jan 10, 2017
testMetricPersistOptionalPrimary(t, true)
}

func TestMetricPersistNotBeingPrimary(t *testing.T) {
Contributor

can just call this Secondary

@@ -6,6 +6,7 @@ import (

type Cache interface {
Add(string, uint32, chunk.IterGen)
CacheIfHot(string, uint32, *chunk.IterGen)
Contributor

why a pointer to IterGen? this type is tiny and we can just copy it by value. semantically, by-value also makes more sense to me; a pointer implies we may make changes to the itergen.

Contributor Author

You're right, that's tiny for a struct, but I'd guess it's still significantly bigger than a pointer, no? I can change it to pass by value anyway; it shouldn't really make a big difference.

Contributor

it's ok that the type is larger than the pointer. the problem with pointers is more about the dereferencing overhead and cache misses causing stalls (I don't have evidence to back this up, more like a gut feeling)
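for what it's worth, a quick micro-benchmark sketch to check that gut feeling (itergen here is a local stand-in for chunk.IterGen, not the real type; run with go test -bench .):

package main

import "testing"

// small struct comparable in size to chunk.IterGen: a slice header plus ints
type itergen struct {
	b    []byte
	ts   uint32
	span uint32
}

//go:noinline
func byValue(ig itergen) uint32 { return ig.ts }

//go:noinline
func byPointer(ig *itergen) uint32 { return ig.ts }

func BenchmarkByValue(b *testing.B) {
	ig := itergen{ts: 1000}
	for i := 0; i < b.N; i++ {
		byValue(ig)
	}
}

func BenchmarkByPointer(b *testing.B) {
	ig := itergen{ts: 1000}
	for i := 0; i < b.N; i++ {
		byPointer(&ig)
	}
}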

ts := uint32(1000)
for i := uint32(0); i < chunkAddCount; i++ {
agg.Add(ts, 1)
ts = ts + chunkSpan
Contributor

can do ts += chunkSpan

timeout <- true
}

for i := uint32(0); i < chunkAddCount-1; i++ {
Contributor

how about we create a mockcache, similar to devnullstore, maybe devnullcache? it could take care of all the counting and the timeouts.
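something along these lines perhaps (one possible shape, names illustrative; the counting plus a channel lets tests wait instead of sleeping; needs the sync import and takes IterGen by value, per the other review thread):

type MockCache struct {
	sync.Mutex
	AddCount        int
	CacheIfHotCount int
	// tests receive on this instead of racing against a timer
	CacheIfHotChan chan struct{}
}

func NewMockCache() *MockCache {
	return &MockCache{CacheIfHotChan: make(chan struct{}, 1000)}
}

func (mc *MockCache) Add(metric string, prev uint32, itergen chunk.IterGen) {
	mc.Lock()
	defer mc.Unlock()
	mc.AddCount++
}

func (mc *MockCache) CacheIfHot(metric string, prev uint32, itergen chunk.IterGen) {
	mc.Lock()
	mc.CacheIfHotCount++
	mc.Unlock()
	mc.CacheIfHotChan <- struct{}{}
}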

for i := uint32(0); i < chunkAddCount-1; i++ {
go oneSecTimeout()
select {
case <-timeout:
Contributor

can just use https://golang.org/pkg/time/#After here instead of oneSecTimeout
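i.e. roughly this (sketch; needs the time import, and chunkNotify stands in for whatever channel the test waits on, e.g. the mock cache's):

for i := uint32(0); i < chunkAddCount-1; i++ {
	select {
	case <-chunkNotify:
		// chunk i was pushed into the cache in time
	case <-time.After(time.Second):
		t.Fatalf("timed out waiting for chunk %d", i)
	}
}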

}

// if the previous chunk is not cached we consider the metric not hot enough to cache this chunk
if met.lastTs() < itergen.Ts() {
Contributor

this only works reliably for span-aware chunks, right? (which would be OK) but maybe worth mentioning.
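to spell that out (illustrative, not the actual cache code): for span-aware chunks the cache can compute the exact end of the newest cached chunk, which is what makes the comparison safe:

// first timestamp after a span-aware chunk
func lastTs(chunkT0, chunkSpan uint32) uint32 {
	return chunkT0 + chunkSpan
}

// the completed chunk directly follows the newest cached one when
// lastTs(prev.T0, prev.Span) >= itergen.Ts(); if the span is unknown
// this boundary can't be computed, and the lastTs() < itergen.Ts()
// check may wrongly treat an adjacent chunk as non-adjacent.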

@@ -52,6 +52,126 @@ func getConnectedChunks(metric string) *CCache {
return cc
}

// test AddIfHot method without passing a previous timestamp on a hot metric
func TestAddIfHotWithoutPrevTsOnHotMetric(t *testing.T) {
Contributor

these tests look sweet. I like how you test all 4 scenarios.

@Dieterbe
Contributor

Dieterbe commented Jan 11, 2017

my main problem with this approach is the CacheCb type.
(btw if we go forward with this, can you document the CacheCb type, and the variables that hold one should probably be called cacheCb or something instead of cache to avoid confusion)

instead of passing the CacheCb method around into AggMetric, AggMetrics, Aggregator, etc, why not just a reference to the actual cache?

if the idea is to make it explicit that those things will only invoke a subset of cache's functionality, we should consider creating an extra interface like so:

type HotPushCache interface {
	CacheIfHot(string, uint32, *chunk.IterGen)
}

any cache that implements our Cache interface - in particular our CCache - also implements the above interface, so it can be used.

this would be similar to the go standard library, where you may pass around things like os.File - which implements io.Reader, io.Writer, io.Closer and a bunch more - but the functions that execute reads only declare the interface with the subset of the functions they need (e.g. https://golang.org/pkg/io/#ReadFull )
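applied here, that could look like this (signatures assumed for illustration, not final):

// AggMetric declares only the subset it needs...
type HotPushCache interface {
	CacheIfHot(metric string, prev uint32, itergen chunk.IterGen)
}

type AggMetric struct {
	key         string
	cachePusher HotPushCache
}

func NewAggMetric(cachePusher HotPushCache, key string) *AggMetric {
	return &AggMetric{key: key, cachePusher: cachePusher}
}

// ...while callers pass the concrete *CCache, which satisfies the full
// Cache interface and therefore also HotPushCache:
//
//	m := NewAggMetric(ccache, "some.metric")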

@replay
Contributor Author

replay commented Jan 11, 2017

Actually I think in the current setup this change decreases the accuracy of the LRU eviction mechanism, because every time a chunk gets added to the cache it gets added to the front of the LRU.

Imagine this case:

  • The chunk cache size is tight
  • There is a small number of metrics that get queried all the time over some long time range, e.g. by a job like an alert check
  • Somebody browses around on grafana and looks at all kinds of metrics
  • The next time the alert check job submits its query, all the chunks it needs get moved to the front of the LRU
  • But what if the chunkspan wrapped around just a second ago? Then it's possible that the pushing mechanism added a lot of chunks because somebody was browsing around on Grafana a while ago, even though they aren't anymore, and this could lead to evictions on the metric that's actually queried all the time by the alert job

The underlying problem is that when these chunks get pushed into the cache, they get pushed to the front of the LRU, as if they had just been touched. So I think maybe they shouldn't be pushed to the front of the LRU, but to the end (rough sketch below).
That way, if somebody does actually query them, they'll be moved to the front on the next hit anyway. If nobody queries them, they'll be the first to get evicted the next time the cache size gets tight.
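Rough sketch of the two insertion policies on top of container/list (front = most recently used, back = next to be evicted; illustrative, not the actual cache accounting code):

import "container/list"

type LRU struct {
	l     *list.List
	items map[string]*list.Element
}

// touch: what a regular Add or cache hit does, move the entry to the front
func (lru *LRU) touch(key string) {
	if e, ok := lru.items[key]; ok {
		lru.l.MoveToFront(e)
		return
	}
	lru.items[key] = lru.l.PushFront(key)
}

// addCold: the proposal for pushed chunks, insert at the back so they are
// the first eviction candidates until a real query touches them
func (lru *LRU) addCold(key string) {
	if _, ok := lru.items[key]; ok {
		return
	}
	lru.items[key] = lru.l.PushBack(key)
}

// evict drops the least recently used entry
func (lru *LRU) evict() (string, bool) {
	e := lru.l.Back()
	if e == nil {
		return "", false
	}
	lru.l.Remove(e)
	key := e.Value.(string)
	delete(lru.items, key)
	return key, true
}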

What do you think about that?

@Dieterbe
Contributor

let me know if i misunderstand.

  • this is only a problem if the cache size is so tight that, due to the chunkspan wrap-around, the bunch of newly added chunks collectively fills up the cache and causes the chunks from the alert to be evicted.
  • it will cause cache misses on the chunks used by the alerting query, but once it has run, the next time they'll be hits again.

IMHO this is acceptable / pretty normal behavior. In the normal case, if you run your cache very close to max size, it happens that as you view "a lot of metrics in set B" and all those chunks get auto-added to the cache, chunks of set A can get evicted, even though you'd like A to remain cached.
It's just a little harsher in this scenario: there's basically a "storage amplification" which could be anywhere between 2x (worst case) and almost no amplification (best case), depending on how long the list of chunks was that received an extra one due to the AddIfHot.

Your proposal sounds good though. I think the cache as-is works well enough, but if it's easy to implement, go for it; it should give an improvement in those rare cases.
are you sure it won't negatively affect the current scenarios / assumptions? like how we can evict an entire chain at once (e.g. a chunk and all older chunks): could there be problems with that, or with the chunks having different "last touched" times?

@replay
Contributor Author

replay commented Jan 11, 2017

Yeah, that's right.
Basically, by pushing them to the back of the LRU we'd just know for sure that they'll get dropped first whenever space gets tight, unless they do get queried at some point; so they can be considered low priority until they get a hit.
Currently it's already possible for chunks to get different "last touched" times, if the repeatedly queried range is a moving window that moves forward and the older chunks get left behind and fall out of the window; that's fine.

At the moment the rules are simple:

  • If a cached metric is down to 0 chunks, the metric will get dropped.
  • When a chunk gets dropped, all the chronologically older chunks of that metric in the cache (independent of their LRU position) get dropped as well (rough sketch below).
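Rule 2 in sketch form (illustrative; assumes the cached chunks of a metric are kept sorted by start ts):

// evict the chunk starting at ts plus everything chronologically older
func evictChunkAndOlder(sortedTs []uint32, ts uint32) []uint32 {
	for i, cts := range sortedTs {
		if cts > ts {
			return sortedTs[i:] // keep only strictly newer chunks
		}
	}
	// everything was <= ts, so the metric is down to 0 chunks and,
	// per rule 1, gets dropped entirely
	return nil
}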

The second rule has the side effect that if AddIfHot pushes a chunk to the back of the LRU and the metric doesn't get queried before the next time something needs to be evicted, this chunk plus all the chronologically older ones will get evicted together. That means the whole metric gets dropped, just because the chunk at the back of the LRU never got a hit. This is probably fine in most scenarios where the window is moving, because it basically cleans out the metrics that don't get hits.
The only scenario where it could lead to some weirdness is when the same absolute range keeps getting queried: the last chunk of the queried range gets finished, a new chunk gets added and pushed by AddIfHot, and if evictions now get triggered, the whole metric gets dropped out of the cache due to the second rule. But I'm kind of assuming that most people query moving ranges.

That stuff gets complicated to explain

@Dieterbe
Contributor

Dieterbe commented Jan 11, 2017

The only scenario where it could lead to some weirdness is when the same absolute range keeps getting queried: the last chunk of the queried range gets finished, a new chunk gets added and pushed by AddIfHot, and if evictions now get triggered, the whole metric gets dropped out of the cache due to the second rule.

this alone is a strong enough reason to just leave it as is, I think.
you first found a scenario where the current AddIfHot mechanism is suboptimal (but it should happen rarely), and now you found a scenario where the proposed change is also suboptimal (and should also happen rarely).

I would say let's just keep it simple and as-is. I think the eviction caused by AddIfHot (the first problem) is easy to reason about and kind of expected when you run a cache that close to its RAM limits.
the problem you described now is weirder because of the LRU insertion trickery, and less obvious. so I'd rather avoid that problem and keep things consistent

@replay
Contributor Author

replay commented Jan 17, 2017

@Dieterbe yeah I think you're right. I'd better keep it simple and leave it as it is until (if ever) what I described becomes a problem.

@replay
Contributor Author

replay commented Jan 17, 2017

@Dieterbe regarding your comment about the CacheCb callback: the main reason I did it that way was just to gain maximum flexibility in passing it around in aggmetrics and also in unit tests. I could also pass the cache.Cache interface around; it would just require a little more code in all the unit tests where I'm mocking that callback. on the other hand, I guess that prevents getting into some callback hell.

@replay
Contributor Author

replay commented Jan 17, 2017

@Dieterbe I changed the callback into a separate interface called cache.CachePusher and added a mock cache called cache.MockCache.

@replay
Contributor Author

replay commented Jan 19, 2017

@Dieterbe if you get a chance could you look at that again please? otherwise it will turn into rebase hell

@Dieterbe
Contributor

LGTM. see if you like anything from #487 and if so cherry-pick or merge it.
there will probably be some merge conflicts with master due to #477 being merged but I can help with those

replay and others added 2 commits January 22, 2017 11:50
- does aggmetric call the cache push callback every time a metric gets
  evicted from the ring buffer?
- does aggmetric add the chunk into the store if the node is primary?
- makes the devnullStore count how many times its Add() got called
- does aggmetric not add the chunk into the store if the node is not primary?