This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

Push chunks of hot metrics into cache when they get completed by AggMetric #461

Merged

merged 4 commits on Jan 23, 2017

Conversation

replay
Contributor

replay commented Jan 10, 2017

  • When AggMetric completes a chunk, the chunk now gets passed into a callback that refers to Cache.CacheIfHot. This happens regardless of whether the node is the cluster primary (rough sketch below).
  • The chunk cache gets a new exported method Cache.CacheIfHot, which takes an IterGen that should be cached only if its metric is hot. In this context "hot" means that the previous chunk of that metric is currently cached.
  • Unit tests for everything that changed.
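Rough sketch of that flow (illustrative only, not the actual metrictank code; IterGen here is a local stand-in for chunk.IterGen and the callback signature is assumed):

// stand-in for chunk.IterGen, just enough for the sketch
type IterGen struct {
	B  []byte // compressed chunk data
	T0 uint32 // chunk start timestamp
}

func (ig IterGen) Ts() uint32 { return ig.T0 }

// the callback handed to AggMetric; it refers to Cache.CacheIfHot
type CacheIfHotCb func(metric string, prev uint32, itergen IterGen)

type AggMetric struct {
	key        string
	cacheIfHot CacheIfHotCb
}

// invoked whenever the current chunk fills up and gets sealed
func (m *AggMetric) chunkComplete(prevTs uint32, ig IterGen) {
	// always invoke the cache callback, primary or not; persisting
	// to the store (primary only) is a separate code path
	m.cacheIfHot(m.key, prevTs, ig)
}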

replay changed the title Push chunks of hot metrics into cache when they get completed in AggMetric [WIP] Push chunks of hot metrics into cache when they get completed in AggMetric Jan 10, 2017
replay changed the title [WIP] Push chunks of hot metrics into cache when they get completed in AggMetric Push chunks of hot metrics into cache when they get completed by AggMetric Jan 10, 2017
testMetricPersistOptionalPrimary(t, true)
}

func TestMetricPersistNotBeingPrimary(t *testing.T) {
Contributor

can just call this Secondary

@@ -6,6 +6,7 @@ import (

type Cache interface {
Add(string, uint32, chunk.IterGen)
CacheIfHot(string, uint32, *chunk.IterGen)
Contributor

why a pointer to IterGen? this type is tiny and we can just copy it by value. semantically, by-value also makes more sense to me; a pointer implies we may make changes to the itergen.

Contributor Author

You're right, that's tiny for a struct, but I'd guess it's still significantly bigger than a pointer, no? I can change it to pass by value anyway; it shouldn't really make a big difference.

Contributor

it's ok that the type is larger than the pointer. the problem with pointers is more about the dereferencing overhead and cache misses causing stalls (I don't have evidence to back this up, more like a gut feeling)
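for what it's worth, a quick micro-benchmark sketch to check that gut feeling (itergen here is a local stand-in for chunk.IterGen, not the real type; run with go test -bench .):

package main

import "testing"

// small struct comparable in size to chunk.IterGen: a slice header plus ints
type itergen struct {
	b    []byte
	ts   uint32
	span uint32
}

//go:noinline
func byValue(ig itergen) uint32 { return ig.ts }

//go:noinline
func byPointer(ig *itergen) uint32 { return ig.ts }

func BenchmarkByValue(b *testing.B) {
	ig := itergen{ts: 1000}
	for i := 0; i < b.N; i++ {
		byValue(ig)
	}
}

func BenchmarkByPointer(b *testing.B) {
	ig := itergen{ts: 1000}
	for i := 0; i < b.N; i++ {
		byPointer(&ig)
	}
}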

ts := uint32(1000)
for i := uint32(0); i < chunkAddCount; i++ {
agg.Add(ts, 1)
ts = ts + chunkSpan
Contributor

can do ts += chunkSpan

timeout <- true
}

for i := uint32(0); i < chunkAddCount-1; i++ {
Contributor

how about we create a mockcache, similar to devnullstore, maybe devnullcache? it could take care of all the counting and the timeouts.
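something along these lines perhaps (one possible shape, names illustrative; the counting plus a channel lets tests wait instead of sleeping; needs the sync import and takes IterGen by value, per the other review thread):

type MockCache struct {
	sync.Mutex
	AddCount        int
	CacheIfHotCount int
	// tests receive on this instead of racing against a timer
	CacheIfHotChan chan struct{}
}

func NewMockCache() *MockCache {
	return &MockCache{CacheIfHotChan: make(chan struct{}, 1000)}
}

func (mc *MockCache) Add(metric string, prev uint32, itergen chunk.IterGen) {
	mc.Lock()
	defer mc.Unlock()
	mc.AddCount++
}

func (mc *MockCache) CacheIfHot(metric string, prev uint32, itergen chunk.IterGen) {
	mc.Lock()
	mc.CacheIfHotCount++
	mc.Unlock()
	mc.CacheIfHotChan <- struct{}{}
}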

for i := uint32(0); i < chunkAddCount-1; i++ {
go oneSecTimeout()
select {
case <-timeout:
Contributor

can just use https://golang.org/pkg/time/#After here instead of oneSecTimeout
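i.e. roughly this (sketch; needs the time import, and chunkNotify stands in for whatever channel the test waits on, e.g. the mock cache's):

for i := uint32(0); i < chunkAddCount-1; i++ {
	select {
	case <-chunkNotify:
		// chunk i was pushed into the cache in time
	case <-time.After(time.Second):
		t.Fatalf("timed out waiting for chunk %d", i)
	}
}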

}

// if the previous chunk is not cached we consider the metric not hot enough to cache this chunk
if met.lastTs() < itergen.Ts() {
Contributor

this only works reliably for span-aware chunks, right? (which would be OK) but maybe worth mentioning.
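to spell that out (illustrative, not the actual cache code): for span-aware chunks the cache can compute the exact end of the newest cached chunk, which is what makes the comparison safe:

// first timestamp after a span-aware chunk
func lastTs(chunkT0, chunkSpan uint32) uint32 {
	return chunkT0 + chunkSpan
}

// the completed chunk directly follows the newest cached one when
// lastTs(prev.T0, prev.Span) >= itergen.Ts(); if the span is unknown
// this boundary can't be computed, and the lastTs() < itergen.Ts()
// check may wrongly treat an adjacent chunk as non-adjacent.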

@@ -52,6 +52,126 @@ func getConnectedChunks(metric string) *CCache {
return cc
}

// test AddIfHot method without passing a previous timestamp on a hot metric
func TestAddIfHotWithoutPrevTsOnHotMetric(t *testing.T) {
Contributor

these tests look sweet. I like how you test all 4 scenarios.

@Dieterbe
Contributor

Dieterbe commented Jan 11, 2017

my main problem with this approach is the CacheCb type.
(btw if we go forward with this, can you document the CacheCb type, and the variables that hold one should probably be called cacheCb or something instead of cache to avoid confusion)

instead of passing the CacheCb method around into AggMetric, AggMetrics, Aggregator, etc, why not just a reference to the actual cache?

if the idea is to make it explicit that those things will only invoke a subset of cache's functionality, we should consider creating an extra interface like so:

type HotPushCache interface {
	CacheIfHot(string, uint32, *chunk.IterGen)
}

any cache that implements our Cache interface - in particular our CCache - also implements the above interface, so it can be used.

this would be similar to the go standard library, where you may pass around things like os.File - which implements io.Reader, io.Writer, io.Closer and a bunch more - but the functions that execute reads only declare the interface with the subset of the functions they need (e.g. https://golang.org/pkg/io/#ReadFull )
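applied here, that could look like this (signatures assumed for illustration, not final):

// AggMetric declares only the subset it needs...
type HotPushCache interface {
	CacheIfHot(metric string, prev uint32, itergen chunk.IterGen)
}

type AggMetric struct {
	key         string
	cachePusher HotPushCache
}

func NewAggMetric(cachePusher HotPushCache, key string) *AggMetric {
	return &AggMetric{key: key, cachePusher: cachePusher}
}

// ...while callers pass the concrete *CCache, which satisfies the full
// Cache interface and therefore also HotPushCache:
//
//	m := NewAggMetric(ccache, "some.metric")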

@replay
Contributor Author

replay commented Jan 11, 2017

Actually I think in the current setup this change decreases the accuracy of the LRU eviction mechanism, because every time a chunk gets added to the cache it gets added to the front of the LRU.

Imagine this case:

  • The chunk cache size is tight
  • There is a small number of metrics that get queried all the time over some long time range, e.g. by a job like an alert check
  • Somebody browses around on grafana and looks at all kinds of metrics
  • The next time the alert check job submits its query, all the chunks it needs get moved to the front of the LRU
  • But what if the chunkspan wrapped around just a second ago? Then it's possible that the pushing mechanism added a lot of chunks because somebody was browsing around on Grafana a while ago, even though they aren't anymore, and this could lead to evictions on the metric that's actually queried all the time by the alert job

The underlying problem is that when these chunks get pushed into the cache, they get pushed to the front of the LRU, as if they had just been touched. So I think maybe they shouldn't be pushed to the front of the LRU, but to the end (rough sketch below).
That way, if somebody does actually query them, they'll be moved to the front on the next hit anyway. If nobody queries them, they'll be the first to get evicted the next time the cache size gets tight.
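Rough sketch of the two insertion policies on top of container/list (front = most recently used, back = next to be evicted; illustrative, not the actual cache accounting code):

import "container/list"

type LRU struct {
	l     *list.List
	items map[string]*list.Element
}

// touch: what a regular Add or cache hit does, move the entry to the front
func (lru *LRU) touch(key string) {
	if e, ok := lru.items[key]; ok {
		lru.l.MoveToFront(e)
		return
	}
	lru.items[key] = lru.l.PushFront(key)
}

// addCold: the proposal for pushed chunks, insert at the back so they are
// the first eviction candidates until a real query touches them
func (lru *LRU) addCold(key string) {
	if _, ok := lru.items[key]; ok {
		return
	}
	lru.items[key] = lru.l.PushBack(key)
}

// evict drops the least recently used entry
func (lru *LRU) evict() (string, bool) {
	e := lru.l.Back()
	if e == nil {
		return "", false
	}
	lru.l.Remove(e)
	key := e.Value.(string)
	delete(lru.items, key)
	return key, true
}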

What do you think about that?

@Dieterbe
Contributor

let me know if i misunderstand.

  • this is only a problem if the cache size is so tight that, due to the chunkspan wrap-around, the bunch of newly added chunks collectively fills up the cache and causes the chunks from the alert to be evicted.
  • it will cause cache misses on the chunks used by the alerting query, but once it has run, the next time they'll be hits again.

IMHO this is acceptable / pretty normal behavior. In the normal case, if you run your cache very close to max size, it happens that as you view "a lot of metrics in set B" and all those chunks get auto-added to the cache, chunks of set A can get evicted, even though you'd like A to remain cached.
It's just a little harsher in this scenario: there's basically a "storage amplification" which could be anywhere between 2x (worst case) and almost no amplification (best case), depending on how long the list of chunks was that received an extra one due to the AddIfHot.

Your proposal sounds good though. I think the cache as-is works well enough, but if it's easy to implement, go for it; it should give an improvement in those rare cases.
are you sure it won't negatively affect the current scenarios / assumptions? like how we can evict an entire chain at once (e.g. a chunk and all older chunks): could there be problems with that, or with the chunks having different "last touched" times?

@replay
Contributor Author

replay commented Jan 11, 2017

Yeah, that's right.
Basically, by pushing them to the back of the LRU we'd just know for sure that they'll get dropped first whenever space gets tight, unless they do get queried at some point; so they can be considered low priority until they get a hit.
Currently it's already possible for chunks to get different "last touched" times, if the repeatedly queried range is a moving window that moves forward and the older chunks get left behind and fall out of the window; that's fine.

At the moment the rules are simple:

  • If a cached metric is down to 0 chunks, the metric will get dropped.
  • When a chunk gets dropped, all the chronologically older chunks of that metric in the cache (independent of their LRU position) get dropped as well (rough sketch below).
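Rule 2 in sketch form (illustrative; assumes the cached chunks of a metric are kept sorted by start ts):

// evict the chunk starting at ts plus everything chronologically older
func evictChunkAndOlder(sortedTs []uint32, ts uint32) []uint32 {
	for i, cts := range sortedTs {
		if cts > ts {
			return sortedTs[i:] // keep only strictly newer chunks
		}
	}
	// everything was <= ts, so the metric is down to 0 chunks and,
	// per rule 1, gets dropped entirely
	return nil
}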

The second rule has the side effect that if AddIfHot pushes a chunk to the back of the LRU and the metric doesn't get queried before the next time something needs to be evicted, this chunk plus all the chronologically older ones will get evicted together. That means the whole metric gets dropped, just because the chunk at the back of the LRU never got a hit. This is probably fine in most scenarios where the window is moving, because it basically cleans out the metrics that don't get hits.
The only scenario where it could lead to some weirdness is when the same absolute range keeps getting queried: the last chunk of the queried range gets finished, a new chunk gets added and pushed by AddIfHot, and if evictions now get triggered, the whole metric gets dropped out of the cache due to the second rule. But I'm kind of assuming that most people query moving ranges.

That stuff gets complicated to explain

@Dieterbe
Contributor

Dieterbe commented Jan 11, 2017

The only scenario where it could lead to some weirdness is when the same absolute range keeps getting queried: the last chunk of the queried range gets finished, a new chunk gets added and pushed by AddIfHot, and if evictions now get triggered, the whole metric gets dropped out of the cache due to the second rule.

this alone is a strong enough reason to just leave it as is, I think.
you first found a scenario where the current AddIfHot mechanism is suboptimal (but it should happen rarely), and now you found a scenario where the proposed change is also suboptimal (and should also happen rarely).

I would say let's just keep it simple and as-is. I think the eviction caused by AddIfHot (the first problem) is easy to reason about and kind of expected when you run a cache that close to its RAM limits.
the problem you described now is weirder because of the LRU insertion trickery, and less obvious. so I'd rather avoid that problem and keep things consistent

@replay
Contributor Author

replay commented Jan 17, 2017

@Dieterbe yeah I think you're right. I'd better keep it simple and leave it as it is until (if ever) what I described becomes a problem.

@replay
Contributor Author

replay commented Jan 17, 2017

@Dieterbe regarding your comment about the CacheCb callback: the main reason I did it that way was just to gain maximum flexibility in passing it around in aggmetrics and also in unit tests. I could also pass the cache.Cache interface around; it would just require a little more code in all the unit tests where I'm mocking that callback. on the other hand, I guess that prevents getting into some callback hell.

@replay
Contributor Author

replay commented Jan 17, 2017

@Dieterbe I changed the callback into a separate interface called cache.CachePusher and added a mock cache called cache.MockCache.

@replay
Contributor Author

replay commented Jan 19, 2017

@Dieterbe if you get a chance could you look at that again please? otherwise it will turn into rebase hell

@Dieterbe
Contributor

LGTM. see if you like anything from #487 and if so cherry-pick or merge it.
there will probably be some merge conflicts with master due to #477 being merged but I can help with those

replay and others added 2 commits January 22, 2017 11:50
- does aggmetric call the cache push callback every time a metric gets
  evicted from the ring buffer?
- does aggmetric add the chunk into the store if the node is primary?
- makes the devnullStore count how many times its Add() got called
- does aggmetric not add the chunk into the store if the node is not primary?