
Feature aggregate refactor #771

Merged: 7 commits merged into grafana:master on Nov 28, 2017

Conversation

shanson7 (Collaborator)

This is a preliminary change that makes follow up features easier.

This change combines the similar basic aggregation functions into a single aggregate function. This can be extended to support the new aggregate function and groupByTags.


func getCrossSeriesAggFunc(c string) crossSeriesAggFunc {
	switch c {
	case "avg", "average":
Contributor

isn't it always average, never avg?

Contributor

actually, can't we do away with this entire function? perhaps we should assign the crossSeriesAggFunc as a member of FuncAggregate.

Collaborator (Author)

I use this function for groupByTags in a follow-up branch, so it will still be needed there.

expr/funcs.go Outdated
"avg": {NewAvgSeries, true},
"averageSeries": {NewAvgSeries, true},
"avg": {NewAggregateConstructor("average"), true},
"averageSeries": {NewAggregateConstructor("average"), true},
Contributor

seems a bit weird to set these strings

  1. at runtime, would rather do it at compile time
  2. in a way that each instance gets a copy instead of just hardcoded in one place

IOW I'm thinking we could have different structs for each function with their own Name method or something.

Collaborator (Author)

That's totally doable. I doubt it will have any real performance impact on this transient object, but it would help catch typos at compile time when new functions are added or changed.

shanson7 (Collaborator, Author) commented Nov 27, 2017

So, I implemented something, but now I'm not as sure about the benefit. I created a struct that holds a function and a name. I cannot use a set of structs, or else we move back to the interfaces. So now the name and the struct can get out of sync (bad copy/paste, etc.). Can you give me more insight into what you would like?

Collaborator (Author)

So, at compile time, a non-existent function can be caught. At test time, typos / a mismatch between name and function can be caught.

At startup (specifically, funcs::init()) the function and name would be assigned (no lookup needed).

Is this acceptable?
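
A hedged sketch of what that could look like (simplified types; names like namedAggFunc and aggFuncs are illustrative, not the actual metrictank registration code): each entry pairs the canonical name with the function at init time, so a non-existent function is a compile error and a test can check that the map key and the Name field agree.

package main

import "fmt"

// crossSeriesAggFunc is a simplified stand-in for the real signature.
type crossSeriesAggFunc func(in [][]float64, out *[]float64)

// namedAggFunc pairs a cross-series function with its canonical name, so the
// name never has to be looked up (or mistyped) at query time.
type namedAggFunc struct {
	Name string
	Fn   crossSeriesAggFunc
}

var aggFuncs map[string]namedAggFunc

func init() {
	// Assigned once at startup: referencing a function that does not exist
	// fails to compile, and a test can assert key == Name for every entry.
	aggFuncs = map[string]namedAggFunc{
		"average": {Name: "average", Fn: crossSeriesAvg},
		"sum":     {Name: "sum", Fn: crossSeriesSum},
	}
}

func crossSeriesAvg(in [][]float64, out *[]float64) { /* ... */ }
func crossSeriesSum(in [][]float64, out *[]float64) { /* ... */ }

func main() {
	for name, f := range aggFuncs {
		fmt.Println(name, name == f.Name)
	}
}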

	point.Val = math.NaN()
} else {
	point.Val = sum
}
Contributor

comparing what was in func_sumseries.go with this, i see 2 differences.

  1. instead of nan bool, you use num int
  2. instead of working with point throughout you create sum first, then put it in point?

why? what's the benefit of this? (in general I would advise keeping the logic the same unless there's a benefit; it makes reviewing easier)

Collaborator (Author)

These were each slightly modified versions of each other, based on averageSeries. I would prefer these functions to all look like each other rather than like the original xxxSeries implementations (since those were independent and these are co-located). For sum I can switch to using a bool.

Contributor

sounds good
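
For reference, a minimal stand-alone sketch (plain float64 slices instead of the real series type; sumAt is an illustrative name) of the sum logic using a nan bool, as agreed above:

package main

import (
	"fmt"
	"math"
)

// sumAt sums the i-th value of every input series, skipping NaN, and returns
// NaN only if every series is NaN at that index (the nan-bool style).
func sumAt(in [][]float64, i int) float64 {
	nan := true
	sum := 0.0
	for j := range in {
		if v := in[j][i]; !math.IsNaN(v) {
			sum += v
			nan = false
		}
	}
	if nan {
		return math.NaN()
	}
	return sum
}

func main() {
	in := [][]float64{{1, math.NaN()}, {2, math.NaN()}}
	fmt.Println(sumAt(in, 0), sumAt(in, 1)) // 3 NaN
}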

max := math.NaN()
for j := 0; j < len(in); j++ {
	p := in[j].Datapoints[i].Val
	if !math.IsNaN(p) && (math.IsNaN(max) || p > max) {
Contributor

i see you replaced math.Max with a custom clause (which compiles to a jump, which could be expensive in case the branch is mispredicted). in master we use math.Max which i'm not sure exactly how it works, but i assume it's just 1 native instruction and hence likely faster.

Collaborator (Author)

I can change that back to math.Max

shanson7 (Collaborator, Author) commented Nov 28, 2017

Shockingly, math.Min is MUCH slower:

bin/benchcmp using_math.txt not_using_math.txt
benchmark                                         old ns/op     new ns/op     delta
BenchmarkSeriesAggregateMin10k_100NoNulls-8       14308915      3435607       -75.99%
BenchmarkSeriesAggregateMin10k_100WithNulls-8     14437599      3477521       -75.91%

benchmark                                         old MB/s     new MB/s     speedup
BenchmarkSeriesAggregateMin10k_100NoNulls-8       838.64       3492.83      4.16x
BenchmarkSeriesAggregateMin10k_100WithNulls-8     831.16       3450.73      4.15x

It was 4x faster to put the logic in the if statement.
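
For context, a self-contained sketch of the two variants being compared (plain float64 slices, not the real types): a NaN-skipping minimum via math.Min with a +Inf start, versus the inline comparison that came out roughly 4x faster in the numbers above.

package main

import (
	"fmt"
	"math"
)

// minUsingMath starts at +Inf so math.Min never sees the accumulator as NaN;
// NaN inputs are skipped explicitly.
func minUsingMath(vals []float64) float64 {
	min := math.Inf(1)
	ok := false
	for _, v := range vals {
		if math.IsNaN(v) {
			continue
		}
		min = math.Min(min, v)
		ok = true
	}
	if !ok {
		return math.NaN()
	}
	return min
}

// minInline keeps the comparison in the if clause, as the PR does.
func minInline(vals []float64) float64 {
	min := math.NaN()
	for _, v := range vals {
		if !math.IsNaN(v) && (math.IsNaN(min) || v < min) {
			min = v
		}
	}
	return min
}

func main() {
	vals := []float64{3, math.NaN(), 1, 2}
	fmt.Println(minUsingMath(vals), minInline(vals)) // 1 1
}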

Dieterbe (Contributor) commented Nov 27, 2017

note that crossSeriesCnt != graphite's countSeries (http://graphite.readthedocs.io/en/latest/functions.html#graphite.render.functions.countSeries). you don't claim it is, so that's fine, just a warning. i suspect you might introduce it via http://graphite.readthedocs.io/en/latest/functions.html#graphite.render.functions.aggregate, which is ok i think (despite graphite not having it).

one that i'm more sceptical of is crossSeriesLst. if a user uses this, they should probably just have filtered down better. note that graphite also doesn't support aggregate with last. it could also be a bit confusing, as "last" in this context has nothing to do with time order, but rather with lexical ordering. but i'm ok with that too, if you want to add it. although for this function, you'd be better off, I think, iterating in reverse order and breaking when you find the first non-NaN point.
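
A small sketch of that reverse-iteration idea (simplified to float64 slices; lastAt is an illustrative name, not a function in this PR): walk the series from last to first and stop at the first non-NaN value.

package main

import (
	"fmt"
	"math"
)

func lastAt(in [][]float64, i int) float64 {
	for j := len(in) - 1; j >= 0; j-- {
		if v := in[j][i]; !math.IsNaN(v) {
			return v // break on the first non-NaN point found from the end
		}
	}
	return math.NaN()
}

func main() {
	in := [][]float64{{1, 1}, {2, math.NaN()}, {math.NaN(), math.NaN()}}
	fmt.Println(lastAt(in, 0), lastAt(in, 1)) // 2 1
}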

func crossSeriesMin(in []models.Series, out *[]schema.Point) {
	for i := 0; i < len(in[0].Datapoints); i++ {
		nan := true
		min := math.NaN()
Contributor

start with inf, since math.Min() returns NaN if one of the inputs is NaN (every time)

shanson7 (Collaborator, Author)

I think I'll remove Cnt/Lst for now. I took them as-is from batch/aggregator.go and just embedded them in a second loop. They would be easy enough to add back in later.

Cnt is actually implementing something I'd done by composing sum + isNonNull, which I find interesting.
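
For illustration, a tiny sketch of that composition (simplified types, not the actual code): isNonNull maps set values to 1 and nulls to 0, and a sum of the transformed series gives the count.

package main

import (
	"fmt"
	"math"
)

// isNonNull maps each value to 1 if it is set, 0 if it is NaN.
func isNonNull(vals []float64) []float64 {
	out := make([]float64, len(vals))
	for i, v := range vals {
		if !math.IsNaN(v) {
			out[i] = 1
		}
	}
	return out
}

// countAt counts non-null values across series at index i by summing the
// isNonNull-transformed series.
func countAt(in [][]float64, i int) float64 {
	sum := 0.0
	for _, s := range in {
		sum += isNonNull(s)[i]
	}
	return sum
}

func main() {
	in := [][]float64{{1, math.NaN()}, {2, 3}}
	fmt.Println(countAt(in, 0), countAt(in, 1)) // 2 1
}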

shanson7 (Collaborator, Author)

I removed Cnt and Lst and fixed Min. I added test cases and benchmarks as well.

Dieterbe (Contributor)

looks like you accidentally reverted to the non-math.Max and non-math.Min approaches.
that stuff doesn't belong in the "Improve performance, fix bug in min" commit anyway; please take it out of that commit.

shanson7 (Collaborator, Author)

That was the "Improve performance" part: math.Max/math.Min was considerably slower.

shanson7 (Collaborator, Author)

See #771 (comment) re: performance difference. I can split the commit if you want.

Dieterbe (Contributor)

aha sweet. missed your comment. ok so that's good then.
we could clarify the difference between the first benchmarks, like BenchmarkAggregate10k_1NoNulls and benchmarkAggregate, versus BenchmarkSeriesAggregateAvg10k_100NoNulls and benchmarkSeriesAggregate.

the difference between them is that the former runs the graphite processing chain (incl reading inputs, generating output structure with proper name and attributes, dealing with the slicepool, etc) through a fake "average", whereas the latter runs the selected cross series aggregation function.

i don't know, maybe it's too obvious to warrant comments. and i'm not sure what we would rename to.

@Dieterbe Dieterbe merged commit e1ab774 into grafana:master Nov 28, 2017
@Aergonus Aergonus deleted the feature_aggregate_refactor branch January 22, 2018 14:49