This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

Clear cache api #555

Merged
merged 24 commits into from
Jan 4, 2018

Conversation

replay
Contributor

@replay replay commented Feb 27, 2017

Adds an API endpoint to clear the cache by metric name, including all the chunks of associated aggregations.

#547

@woodsaj
Member

woodsaj commented Feb 27, 2017

Does this api endpoint need to broadcast the request to all nodes in the cluster? I feel that it should, or at least should have a flag to tell it to.

@replay
Contributor Author

replay commented Mar 6, 2017

Interesting idea. Why do you think that's necessary? Is it because the cluster also forwards requests between nodes when the requested data is known to be on another node, so the cache clearing behavior should be consistent with that? That makes sense.

@woodsaj
Member

woodsaj commented Mar 15, 2017

If I'm running a multinode cluster and I want to delete the cache for all series that match "some.metric.*", the series that match that pattern could be spread across many nodes.

Having the delete request propagate through the whole cluster would be a requirement for exposing this feature to users. They typically have no visibility into the cluster topology and only have a single gateway address.

@Dieterbe
Contributor

+1 to support propagation, but we should make it optional. I suspect cache clearing will often be done by an operator who wants to clear only 1 instance, to diagnose or try to fix a problem, or who wants to clear different instances at different times instead of all at once.

@replay
Contributor Author

replay commented Sep 8, 2017

@Dieterbe @woodsaj took a while, but this would be ready to review again

@Dieterbe
Contributor

Dieterbe commented Sep 8, 2017

it seems to make sense to be able to delete the entire cache. currently it can only be done for series matching the pattern, but there is no pattern to match all series (correct me if i'm wrong), so maybe allow for pattern "" or "*" and delete the entire cache in that case?

api/ccache.go Outdated
}
}
}
response.Write(ctx, response.NewJson(200, res, ""))
Contributor

if the peers reported a bunch of errors, or we could not connect to some peers, we should probably not report 200 OK

@Dieterbe
Contributor

Dieterbe commented Sep 8, 2017

seems like something went wrong with the rebase. it's showing a bunch of changes that seem unrelated to this PR, like moving cluster stuff around. can you amend the commit so that the diff is only what is supposed to change

@replay
Contributor Author

replay commented Sep 11, 2017

@Dieterbe I think you're right that it would make sense to add a way to clear the whole cache, but I think using "*" is kind of risky because that's supposed to have a specific meaning which is different. So I'd rather use something like "**"

@Dieterbe
Contributor

** ? why not just support empty pattern then?

@replay
Contributor Author

replay commented Sep 11, 2017

@Dieterbe because i think that "" could easily be specified by accident, that's all :)

@replay
Contributor Author

replay commented Sep 20, 2017

@Dieterbe is there something more that i need to do to make this mergeable? (apart from rebasing once again if the context cancelling gets merged)

@replay replay force-pushed the clear_cache_api branch 3 times, most recently from 66de456 to 0aac486 December 18, 2017 09:37
@replay
Contributor Author

replay commented Dec 18, 2017

@Dieterbe good news, because I'm sure you're looking for more PRs to review, this one would be ready again^^
It also hides nodes behind an interface in order to make them mockable. That was necessary because I wanted to add a test for the propagation of the cache deletes.
It also makes the chunk cache aware of which archives belong to which raw metric. That's necessary because when we delete the metric xyz from the cache we also want to delete all the related archives, like xyz_600_sum.
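
As an illustration of that last point, here is a minimal Go sketch of a cache that indexes its entries by raw metric, so deleting xyz also removes xyz_600_sum. The `Cache` type, its fields, and the counting convention (one series per raw metric, one archive per removed cache key, raw key included) are assumptions of this sketch and not the PR's actual data structures.

```go
package main

import "fmt"

// Cache is a hypothetical chunk cache that remembers which cache keys
// (raw series and rollup archives) belong to which raw metric.
type Cache struct {
	chunks map[string][]uint32            // cache key -> chunk timestamps
	byRaw  map[string]map[string]struct{} // raw metric -> its cache keys
}

func NewCache() *Cache {
	return &Cache{
		chunks: map[string][]uint32{},
		byRaw:  map[string]map[string]struct{}{},
	}
}

// Add stores a chunk timestamp under key and links key to its raw metric.
func (c *Cache) Add(rawMetric, key string, ts uint32) {
	c.chunks[key] = append(c.chunks[key], ts)
	if c.byRaw[rawMetric] == nil {
		c.byRaw[rawMetric] = map[string]struct{}{}
	}
	c.byRaw[rawMetric][key] = struct{}{}
}

// DelMetric removes every cache key linked to rawMetric. It returns 1 series
// if the metric was cached, plus the number of cache keys removed (the raw
// key counts too; this convention is an assumption of the sketch).
func (c *Cache) DelMetric(rawMetric string) (series, archives int) {
	keys, ok := c.byRaw[rawMetric]
	if !ok {
		return 0, 0
	}
	for key := range keys {
		delete(c.chunks, key)
	}
	delete(c.byRaw, rawMetric)
	return 1, len(keys)
}

func main() {
	c := NewCache()
	c.Add("xyz", "xyz", 600)
	c.Add("xyz", "xyz_600_sum", 600)
	c.Add("abc", "abc", 600)
	series, archives := c.DelMetric("xyz")
	fmt.Println(series, archives, len(c.chunks))
}
```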

func (cd CCacheDelete) TraceDebug(span opentracing.Span) {
}

type CCacheDeleteResp struct {
Contributor

the number of errors is interesting, but much more useful I think would be the first error encountered (for each node)

@replay
Contributor Author

replay commented Jan 3, 2018

@Dieterbe this would be ready to review again


if respParsed.DeletedSeries != delSeries || respParsed.DeletedArchives != delArchives {
	t.Fatalf("Expected %d series and %d archives to get deleted, but got %d and %d", delSeries, delArchives, respParsed.DeletedSeries, respParsed.DeletedArchives)
}
Contributor

i'm confused about why we tell the cache to return these fake, made up numbers and then check them, while we also check the real numbers above. can this be simplified ? for example have the mockcache return that deleted series == deleted metric keys, and deleted archives is maybe pinned to 3x the number of deleted keys? does this idea make sense? if not, add a comment somewhere that explains this please
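
For reference, the idea floated in this comment could be sketched in Go roughly as follows. The `mockCache` type and its `archivesPerKey` field are hypothetical and not the mock used in the PR's tests; the sketch only shows a mock whose reported numbers follow from the delete calls it actually received.

```go
package main

import "fmt"

// mockCache is a hypothetical test double. It records every key it was asked
// to delete and reports counts derived from those calls: 1 series per key and
// a fixed number of archives per key.
type mockCache struct {
	archivesPerKey int
	delKeys        []string
}

// DelMetric records the key and returns counts consistent with one deletion.
func (m *mockCache) DelMetric(key string) (series, archives int) {
	m.delKeys = append(m.delKeys, key)
	return 1, m.archivesPerKey
}

func main() {
	m := &mockCache{archivesPerKey: 3}
	totalSeries, totalArchives := 0, 0
	for _, key := range []string{"some.metric.a", "some.metric.b"} {
		s, a := m.DelMetric(key)
		totalSeries += s
		totalArchives += a
	}
	fmt.Println(len(m.delKeys), totalSeries, totalArchives)
}
```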

Contributor Author

This is only testing the request handler, not the cache. It's testing one request with propagation disabled (TestMetricDelete) and one with propagation enabled (TestMetricDeletePropagation). I'm not sure how creating additional rules like archives = 3 * series would simplify things, that would rather just make it more complicated because in order to understand the test a reader would then first need to be aware of this rule.
Maybe I should rename the tests to TestMetricDeleteRequestHandlerWithoutPropagation() and TestMetricDeleteRequestHandlerWithPropagation()?

Contributor

it's confusing because the mock cache claims to have deleted a number of series and archives that has nothing to do with the delete request we actually issued on it. the numbers don't seem to make sense. we assert that it only received 1 metric key delete, then how could any cache have deleted 3 series if there was only 1 key? i know it's a mock and not a real cache, but the numbers should make more sense I think.

Contributor

that would clear things up if the DelMetric method took patterns.
but according to the argument names of the interface function and its implementations
(and the docs for CCache.DelMetric) it only takes 1 metric key, not a pattern.

Contributor Author

i found a bug while looking at that just now: b168558

return c.thisNode()
}

func (c *MemberlistManager) thisNode() HTTPNode {
Contributor

why do we need this private function if we already had the public one that did the same?

Contributor Author

right, that's not necessary anymore, was only necessary in some previous version 👍

@@ -19,6 +19,8 @@ type CCacheMetric struct {

// the list of chunk time stamps in ascending order
keys []uint32

RawMetric string
Contributor

@Dieterbe Dieterbe Jan 3, 2018

as i have mentioned before, our codebase is very GC-heavy. we need to pay attention to how many pointers we keep active, and a string has a pointer. instead of keeping the entire RawMetric string for each CCacheMetric, why not just keep 1 uint8 which holds the suffix length, i.e. the number of bytes to cut off the end of the key to get the raw key? it'll go easier on GC and save us memory as well

Contributor

(we could also store the len of the raw string in a uint8 but then we're capped to a key len of 255 which some metrics will want to exceed. but a suffix can't be longer than 255 chars, we know that for sure)

@Dieterbe
Contributor

Dieterbe commented Jan 3, 2018

@replay another round of comments and a few small commits. please have a look :)

@Dieterbe Dieterbe merged commit a0a00c4 into master Jan 4, 2018
@Dieterbe Dieterbe mentioned this pull request Apr 11, 2018

3 participants