This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

support multiple raw intervals per storage schema. #588

Merged
merged 2 commits into from
Apr 5, 2017
Merged

Conversation

woodsaj
Member

@woodsaj woodsaj commented Apr 4, 2017

  • closes "match retention policies based on metric interval" #579
  • expand storage schemas into all permutations of the retentions.
    eg, a schema with
    retentions=1s:1d,1min:7d,10min:30day
    becomes 3 schemas:
    1 - retentions=1s:1d,1min:7d,10min:30day
    2 - retentions=1min:7d,10min:30day
    3 - retentions=10min:30day
  • when calling schemas.Match() pass in the series name and interval.
    We find the schema with a matching pattern and then find
    the sub-schema with the best retention fit. The best fit is when
    the metric interval is >= the rawInterval and less than the interval
    of the next rollup.
    Using our above retention policy, the following matches would occur
    interval=1s: 1s:1d,1min:7d,10min:30day
    interval=10s: 1s:1d,1min:7d,10min:30day
    interval=60s: 1min:7d,10min:30day
    interval=300s: 1min:7d,10min:30day
    interval=3600: 10min:30day
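The expansion and best-fit rules described above can be sketched in Go. This is a minimal illustration, not metrictank's actual conf package: `Retention`, `expand`, and `bestFit` are simplified stand-ins for the real types and for `Schemas.Match`.

```go
package main

import "fmt"

// Retention is a simplified stand-in for a conf retention:
// only the raw interval (SecondsPerPoint) matters for matching.
type Retention struct {
	SecondsPerPoint int
	TTL             int // retention length in seconds
}

// expand returns every "tail" of the retention list: the full list,
// then the list starting at the 2nd retention, and so on. This mirrors
// how one schema becomes N sub-schemas in this PR.
func expand(rets []Retention) [][]Retention {
	out := make([][]Retention, 0, len(rets))
	for i := range rets {
		out = append(out, rets[i:])
	}
	return out
}

// bestFit picks the sub-schema whose raw interval best matches the
// metric interval: the largest SecondsPerPoint that is <= interval,
// falling back to the first sub-schema for very small intervals.
func bestFit(subs [][]Retention, interval int) []Retention {
	best := subs[0]
	for _, s := range subs {
		if s[0].SecondsPerPoint <= interval {
			best = s
		} else {
			break
		}
	}
	return best
}

func main() {
	// retentions=1s:1d,1min:7d,10min:30day
	rets := []Retention{{1, 86400}, {60, 7 * 86400}, {600, 30 * 86400}}
	subs := expand(rets)
	for _, iv := range []int{1, 10, 60, 300, 3600} {
		fmt.Printf("interval=%d -> raw=%ds\n", iv, bestFit(subs, iv)[0].SecondsPerPoint)
	}
}
```

Running this reproduces the matches listed in the description: intervals 1s and 10s land on the 1s sub-schema, 60s and 300s on the 1min one, and 3600s on the 10min one.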

@woodsaj woodsaj requested review from Dieterbe and replay April 4, 2017 10:12
// Aggregations holds the aggregation definitions
type Aggregations struct {
Data []Aggregation
DefaultAggregation Aggregation
Contributor

the point of the default is that it applies irrespective of what your Aggregation definitions are, so why move it from global to a member?

Member Author
@woodsaj Apr 4, 2017

DefaultAggregation is only modified by unit tests. Keeping it as part of the Aggregations{} instance makes it easier for unit tests to modify the defaults for testing purposes.

e.g. calling mdata.SetSingleAgg(met ...conf.Method) won't break other unit tests that may be expecting DefaultAggregation to be set to the defaults.

Contributor

But both SetSingleAgg and SetSingleSchema modify global state. Wouldn't that global state stay just as "dirty" as it was before? Your goal is noble, but I think it can only be fulfilled by passing the Schemas and Aggregations around instead of having them as globals in the mdata package.

Member Author

SetSingleAgg and SetSingleSchema only modify state within the "mdata" package. Allowing them to modify global variables in other packages is a recipe for disaster.

conf/schemas.go Outdated
// Schemas contains schema settings
type Schemas struct {
raw []Schema
index []*Schema
Contributor

why is raw a slice of values and index a slice of pointers?

Member Author

At one stage Match was performing a range over the index. Ranging over a slice copies the values, so copying only the reference is much faster than copying the whole Schema struct.

But it doesn't look like this is needed anymore.

conf/schemas.go Outdated
}

// Match returns the correct schema setting for the given metric
// it can always find a valid setting, because there's a default catch all
// also returns the index of the setting, to efficiently reference it
func (s Schemas) Match(metric string) (uint16, Schema) {
for i, schema := range s {
func (s Schemas) Match(metric string, interval int) (uint16, Schema) {
Contributor

I find this new function hard to understand.
One reason, I think, is that it's a mashup of two mental models:
A) all our schemas/retentions as one long list
B) a list of entries (based on pattern), where each entry has a list of retention lists (due to the new "permutation" stuff), i.e. a nested structure.

This function has a few places where the logic works in the domain of B (e.g. taking the first retention of a given pattern, then working with len(schema.Retentions), or iterating that list, which actually represents items further down in the index list). I drafted some refactoring, but couldn't come up with anything better. I think a nested, B-style model would be nicer for our index data structure, but then we would need two identifiers to represent a schema, which is to be avoided.

Contributor
@replay Apr 4, 2017

I agree that it's kind of hard to wrap one's head around this concept, but it's pretty cool that in the end pattern/retention can be looked up with only one index.
Maybe it would be good to have a comment that illustrates how that works. I imagine the s.index slice as a partitioned list of retentions where each partition is a pattern:

|----------------------------------------------------------------------|
| pattern1           | pattern2    | pattern3                           |
|----------------------------------------------------------------------|
| ret1 | ret2 | ret3 | ret4 | ret5 | ret6 | ret7 | ret8 | ret9 | ret10 |
|----------------------------------------------------------------------|

1) if metric matches patternX -> find the best retention for interval within patternX
2) otherwise -> skip len(patternX.retentions) and back to 1)
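The two-step lookup sketched above can be written out against a flattened index. This is a hypothetical illustration, not metrictank's actual matchString logic: the entry type, its nSubs field, and the use of path.Match as the pattern matcher are all assumptions made for the sketch.

```go
package main

import (
	"fmt"
	"path"
)

// entry is a hypothetical flattened view of one pattern's sub-schemas:
// the pattern plus how many retention "tails" it contributed to the index.
type entry struct {
	pattern string
	nSubs   int
}

// match walks the flat index: if the metric matches a pattern, the
// answer lies inside that pattern's block of nSubs consecutive slots;
// otherwise skip the whole block and keep going. pickOffset stands in
// for the interval-based best-fit step within the matched partition.
func match(entries []entry, metric string, pickOffset func(e entry) int) uint16 {
	idx := 0
	for _, e := range entries {
		if ok, _ := path.Match(e.pattern, metric); ok {
			return uint16(idx + pickOffset(e))
		}
		idx += e.nSubs // skip this pattern's partition
	}
	return uint16(idx) // caller would fall back to the default schema
}

func main() {
	entries := []entry{
		{"stats.*", 3},  // occupies flat slots 0..2
		{"carbon.*", 2}, // occupies flat slots 3..4
	}
	first := func(e entry) int { return 0 }
	fmt.Println(match(entries, "carbon.agents.foo", first)) // lands in the carbon.* block
}
```

The payoff, as noted in the thread, is that a single uint16 identifies both the pattern and the chosen retention tail.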

Contributor

To be clear, I'm not saying we should change the structure; I also really like the single index. I just think it's important to make this logic as simple as we can.

Member Author

Yes, this function is complex. It took a long time to come to this solution that only requires a single call to matchString() but still allows a single schema ID to be passed around as a reference.

I think @replay's illustration will go a long way toward clarifying what is going on, but I would change it to:

[image]

conf/schemas.go Outdated
if interval < ret.SecondsPerPoint {
// if there are no retentions with SecondsPerPoint >= interval
// then we need to use the first retention. Otherwise, the retention
// we want to use is the previous one.
Contributor

This is confusing: we branch if interval < SecondsPerPoint, but the comment talks about SecondsPerPoint >= interval, which means interval <= SecondsPerPoint.

Member Author

This is just basic logic. if a < b is true, then b >= a is false.

ie these are the same thing
if interval < ret.SecondsPerPoint
if !(ret.SecondsPerPoint >= interval)

Contributor
@Dieterbe Apr 5, 2017

if a < b is true, then b >= a is false.

no, these expressions are in fact equivalent in all but 1 case. e.g. if a is 1 and b is 2 then both are true.

ie these are the same thing

no they are not:

	interval := 10
	SecondsPerPoint := 20
	fmt.Println(interval < SecondsPerPoint)
	fmt.Println(!(SecondsPerPoint >= interval))

this prints:

true
false

conf/schemas.go Outdated
}
// no retentions found with SecondsPerPoint > interval. So lets just use the retention
// with the largest secondsPerPoint.
pos := len(schema.Retentions) - 1 + i
Contributor

Is it just me, or would i + len(schema.Retentions) - 1 be easier to understand? I think of it like "i is the index where we're at, so that's the base; then from there jump to the last retention", so this order makes more sense to me.

Member Author

It is just you. This order makes more sense to me.


// TTLs returns a slice of all TTL's seen amongst all archives of all schemas
func (schemas Schemas) TTLs() []uint32 {
ttls := make(map[uint32]struct{})
Contributor
@replay Apr 4, 2017

Couldn't this nested loop and the following loop all be replaced by one loop if schemas.index is used as input instead?
Also, the generated value could be cached and reused, since afaik it shouldn't change once MT is up.

Contributor

It could; this code predates the "permutation" stuff. There's no need for caching, since this function rarely gets called and is not in any performance-critical path.

Contributor

K, true. only called once in fact
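The map-as-set idiom from the snippet (ttls := make(map[uint32]struct{})) can be shown end to end. This is a minimal sketch with assumed types, not the real Schemas.TTLs: each index entry stands for one sub-schema's retention list, and overlapping tails produce the duplicate TTLs the set collapses.

```go
package main

import (
	"fmt"
	"sort"
)

// Retention here only carries the field TTLs cares about, plus the
// raw interval for context.
type Retention struct {
	SecondsPerPoint int
	TTL             uint32
}

// uniqueTTLs collects the distinct TTLs across a flat index of
// retention lists, using a map as a set so duplicates from the
// overlapping sub-schema tails collapse to one entry each.
func uniqueTTLs(index [][]Retention) []uint32 {
	seen := make(map[uint32]struct{})
	for _, rets := range index {
		for _, r := range rets {
			seen[r.TTL] = struct{}{}
		}
	}
	out := make([]uint32, 0, len(seen))
	for ttl := range seen {
		out = append(out, ttl)
	}
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}

func main() {
	index := [][]Retention{
		{{1, 86400}, {60, 604800}},
		{{60, 604800}, {600, 2592000}}, // overlapping tail: duplicate TTLs
	}
	fmt.Println(uniqueTTLs(index)) // duplicates collapse to three values
}
```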


// MaxChunkSpan returns the largest chunkspan seen amongst all archives of all schemas
func (schemas Schemas) MaxChunkSpan() uint32 {
max := uint32(0)
Contributor
@replay Apr 4, 2017

Same here... Only one loop would be necessary by iterating over schemas.index.

max := schemas.MaxChunkSpan()
So(max, ShouldEqual, 60*60*6)
})
}
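The MaxChunkSpan scan discussed above follows the same shape: a single pass keeping a running maximum. A sketch with an assumed ChunkSpan field, not the real implementation:

```go
package main

import "fmt"

// Retention carries only the field MaxChunkSpan inspects.
type Retention struct {
	ChunkSpan uint32 // chunk length in seconds
}

// maxChunkSpan scans every retention of every indexed schema and
// keeps the largest chunkspan, mirroring the max := uint32(0) pattern.
func maxChunkSpan(index [][]Retention) uint32 {
	max := uint32(0)
	for _, rets := range index {
		for _, r := range rets {
			if r.ChunkSpan > max {
				max = r.ChunkSpan
			}
		}
	}
	return max
}

func main() {
	index := [][]Retention{
		{{600}, {3600}},
		{{3600}, {60 * 60 * 6}},
	}
	fmt.Println(maxChunkSpan(index)) // the 6h chunkspan wins, as in the unit test
}
```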
Contributor
@Dieterbe Apr 4, 2017

thanks for adding unit tests for my code :-D 👍

So(schema.Name, ShouldEqual, "a")
So(schema.Retentions[0].SecondsPerPoint, ShouldEqual, 10)
})
Convey("When metric has 1s raw interval", func() {
Contributor

Any particular reason behind the ordering 10s -> 30s -> 1s? I would do them in ascending order.

@woodsaj woodsaj force-pushed the issue579 branch 2 times, most recently from 5f50d02 to bf1bbee Compare April 5, 2017 06:57
@Dieterbe
Contributor

Dieterbe commented Apr 5, 2017

I think we shouldn't use the word "permutation" here. A permutation is something else (e.g. it implies reordering) and it's confusing.
I can't find a perfectly fitting word, but "subsequence" is at least closer.

@Dieterbe
Contributor

Dieterbe commented Apr 5, 2017

Isn't the whole point of createmissing to add the aggmetrics structure when this instance has no knowledge of them? So that a new instance can start up, process the metricpersist backlog, and know how far metrics (that it hasn't seen yet, but will soon) have been saved. See e03d2fd and #485.
By ignoring messages if they're not in the index, doesn't that defeat the whole point?

@woodsaj
Member Author

woodsaj commented Apr 5, 2017

No. The index is always loaded into memory before anything else. The only time a savedChunk message can be received with no entry in the index is if the metric has been deleted from the index, in which case we don't need to worry about the savedChunk anymore.

@woodsaj
Member Author

woodsaj commented Apr 5, 2017

Actually, there is a second reason: the savedChunk message may be for a metric that is on a different partition from the partitions being handled by this instance, in which case we also want to ignore the message.

  • NotifierKafka is partition aware, so instances only receive savedChunk messages for metrics in the partitions they are processing. But NotifierNSQ just broadcasts savedChunk messages to all nodes.

woodsaj and others added 2 commits April 5, 2017 18:51
- closes #579
- expand storage schemas into all permutations of the retentions.
eg, a schema with
        retentions=1s:1d,1min:7d,10min:30day
 becomes 3 schemas:
        1 - retentions=1s:1d,1min:7d,10min:30day
        2 - retentions=1min:7d,10min:30day
        3 - retentions=10min:30day
- when calling schemas.Match() pass in the series name and interval.
  We find the schema with a matching pattern and then find
  the sub-schema with the best retention fit. The best fit is when
  the metric interval is >= the rawInterval and less than the interval
  of the next rollup.
  Using our above retention policy, the following matches would occur
  interval=1s:  1s:1d,1min:7d,10min:30day
  interval=10s: 1s:1d,1min:7d,10min:30day
  interval=60s: 1min:7d,10min:30day
  interval=300s: 1min:7d,10min:30day
  interval=3600: 10min:30day
* deprecate PersistMessage and remove parsing of that format.
  - NSQ uses the PersistMessageBatchV1 format since
    72d2f6e (jan 06 2016)
  - Kafka uses it since it was introduced
    02a8a92 (jul 18 2016)

* simplify notifier handling code; just make it a utility function

* lookup metricDefs from the index to get schemaId and aggId. Looking up a
  key in a map is way faster than comparing the name against the schema
  patterns.

* as we are looking up the def from the index, we don't need to include Name
  and Interval in the savedChunk message anymore.

* remove CreateMissing flag. Creating missing metrics is now always performed,
  if the metric is in the index.
@Dieterbe
Contributor

Dieterbe commented Apr 5, 2017

OK, makes sense. For the record, I just added CreateMissing to notifierNSQ because it seemed like the right thing to do; I didn't put a whole lot of thought into it.

Also, a nice property of the current index code is that while metricdef updates (i.e. existing defs) are subject to update-interval, new metrics will have archive.LastSave=0 and trigger an immediate save to Cassandra.
This means that even when a metric is relatively new (e.g. much newer than the index update-interval, but old enough to trigger a chunk save) and a new node comes online and gets a savedChunk for it, we'll have done our best to update the Cassandra index ASAP, and the chance of a legitimate miss in the index is so low that an extra chunk save wouldn't matter.

Successfully merging this pull request may close these issues.

match retention policies based on metric interval
3 participants