This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

implement nudging similar to graphite. #647

Merged
merged 4 commits into master from runtime-consolidation-nudging on Jun 27, 2017

Conversation


@Dieterbe Dieterbe commented Jun 1, 2017

seems to work fine, but still have to make unit tests

@Dieterbe Dieterbe force-pushed the runtime-consolidation-nudging branch 6 times, most recently from aa05b72 to 50ce641 Compare June 1, 2017 15:56
@Dieterbe Dieterbe requested review from woodsaj and DanCech and removed request for DanCech and woodsaj June 1, 2017 15:56

Dieterbe commented Jun 1, 2017

wanna do some more work on this first.

@Dieterbe Dieterbe force-pushed the runtime-consolidation-nudging branch from 50ce641 to 7ddec47 Compare June 2, 2017 14:36
* move nudging-based consolidation to consolidation package
  this will make it easier to unit test, as opposed to having it in the
  plan.
* since our aggregation returns points that are clean multiples of the
  new interval, and we don't want to move information into the past
  (which would be lying: e.g. a spike of data should move to a later
  timestamp, not an earlier one), we need to adjust the nudging logic
  to treat the point after a clean interval as the first one of any
  bucket (see the sketch below).
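(For illustration, a minimal Go sketch, not code from this PR: which bucket each raw point lands in, assuming a 10s raw interval consolidated to 60s; the round-up arithmetic is my reading of the rule above.)

package main

import "fmt"

func main() {
    // assumed for illustration: raw points every 10s, consolidated to 60s
    pre, post := uint32(10), uint32(60)
    for ts := uint32(50); ts <= 130; ts += pre {
        // a clean multiple of post is the last point of its bucket,
        // so a point's bucket ends at its ts rounded up to the next clean multiple
        bucketEnd := ((ts + post - 1) / post) * post
        fmt.Println(ts, "->", bucketEnd)
    }
    // output: 50 and 60 map to bucket 60; 70 through 120 map to bucket 120;
    // 130 maps to 180. i.e. the point right after a clean multiple starts a new bucket
}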
@Dieterbe Dieterbe force-pushed the runtime-consolidation-nudging branch from 7ddec47 to ffc134c Compare June 2, 2017 14:41

Dieterbe commented Jun 2, 2017

at this point, i'm pretty happy with it. it is simple and ensures MaxDataPoints is always honored (especially important when mdp=1), unlike the other approach i explored and described in the comments but decided not to pursue because it was too complex.

for a demo see https://vimeo.com/220016610

i start fakemetrics with secondly points, and then visualize the points with a few MDP settings.
you can see that there's never any jumping of data, except at the end, when a runtime-consolidated point is being computed more accurately as more data for it comes in (and also some jumping when the scale is adjusted to accommodate the points, which is normal).
at the front you can see that sometimes there is a gap in the data that is slightly longer than the space between points (as opposed to showing an aggregated point). in the future maybe we can get more advanced: take runtime consolidation into account when fetching, fetch all the data needed to go into the first point, etc. but for now I think this will do. also, the size of the gap is exaggerated here because of the low MDP values / large space between points.

@Dieterbe Dieterbe requested review from woodsaj and DanCech June 2, 2017 15:04
@Dieterbe

ping @DanCech @woodsaj review plz

// move start forward until it maps to the first point of an aggregation bucket.
// since a clean multiple of the new postAggInterval is the last point to go into a bucket,
// we want the point that comes preAggInterval after it.
remainder := (start - preAggInterval) % postAggInterval
Contributor

I'm not entirely clear why we subtract preAggInterval here.

Contributor Author

I have pushed a commit which should explain this stuff (and some of the other stuff) better. let me know what you think / if that clears it up.
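(To spell out the subtraction with a minimal sketch: assumed values preAggInterval=10, postAggInterval=60; the adjustment of start after computing remainder is my assumption about the surrounding code, as only the remainder line appears in the excerpt above.)

package main

import "fmt"

func main() {
    preAggInterval := uint32(10)  // raw point spacing
    postAggInterval := uint32(60) // spacing after runtime consolidation
    start := uint32(85)

    // points 70, 80, ..., 120 form the bucket stamped ts=120, so its first
    // point is 70 = 60 + preAggInterval. subtracting preAggInterval before
    // taking the modulo makes exactly the first points of buckets yield 0.
    remainder := (start - preAggInterval) % postAggInterval // (85-10) % 60 = 15
    if remainder != 0 {
        // assumed adjustment: advance start to the first point of the next bucket
        start += postAggInterval - remainder // 85 + 45 = 130
    }
    fmt.Println(start) // 130, which is 120 + preAggInterval: a first point
}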


DanCech commented Jun 21, 2017

After spending some time going over this and the graphite implementation with @replay, I think it makes sense; just that one question above. I'm going to dig deeper on the graphite side and figure out what the story is with its approach:

nudge = secondsPerPoint + (series.start % series.step) - (series.start % secondsPerPoint)

while it seems like it should be:

nudge = secondsPerPoint - ((series.start - 1) % secondsPerPoint) - 1

(the -1 is so that when the start time falls on the post-agg interval boundary it doesn't get nudged across a full interval)
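(A quick numeric check of the two formulas, my own sketch with assumed values step=10 and secondsPerPoint=60, not code from graphite or this PR:)

package main

import "fmt"

func main() {
    step, secondsPerPoint := 10, 60
    for _, start := range []int{120, 130} {
        graphite := secondsPerPoint + (start % step) - (start % secondsPerPoint)
        proposed := secondsPerPoint - ((start - 1) % secondsPerPoint) - 1
        fmt.Println(start, graphite, proposed)
    }
    // start=120 (on the boundary): graphite nudges 60 (a full interval), proposed nudges 0
    // start=130: both nudge 50
}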

// with our nudging approach, we strip all the points that do not form a full aggregation bucket.
// IOW the desired first point must be the first point of a bucket.
// in the example above, it means we strip the first 2 points and start at ts=25 (which will go into ts=40).

Contributor

Interesting. This actually seems to be different from the way graphite does things (based on my reading of the code), where the timestamp of the aggregated points is the start of the interval rather than the end.

@Dieterbe Dieterbe Jun 22, 2017

it is. it's also how we've implemented runtime consolidation in MT since forever.
it's based on a guiding principle i came up with, but which afaik is not often discussed or cared about in the wider graphite ecosystem, and as a result is implemented inconsistently across tools (e.g. statsd also postmarks, like MT). see https://grafana.com/blog/2016/03/03/25-graphite-grafana-and-statsd-gotchas/#timestamps for details.
basically my point is that data should never move into the past (which breaks the space-time continuum: observations cannot precede the event they observe), but people can expect data to be delayed (and move into the future instead, which is just a delay and doesn't break the laws of physics). e.g. visually, a spike should never appear on a chart at an earlier time than when it actually happened, and people should expect that with consolidation, data will move a bit to the right. that way, whenever you look at a point in time (whether aggregated or not), you always know it happened at some point since the previous timestamp.

perhaps this is just one of my quixotic quirks? is there a flaw in my reasoning?
especially now that we care more about stricter compatibility with graphite, do we want to undo this and do things the graphite way? though i would rather keep the postmarking, i'm open to discussion and/or change.
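(To make the postmarking concrete, a minimal sketch, not MT's actual code: each bucket is stamped with its end timestamp, so consolidated data only ever moves later in time, never earlier.)

package main

import "fmt"

type point struct {
    val float64
    ts  uint32
}

// consolidateSum sums sorted input points into buckets of width postAgg,
// stamping each bucket with its end timestamp (postmarking)
func consolidateSum(in []point, postAgg uint32) []point {
    var out []point
    for _, p := range in {
        // round up to the next clean multiple of postAgg: the bucket's end
        end := ((p.ts + postAgg - 1) / postAgg) * postAgg
        if len(out) > 0 && out[len(out)-1].ts == end {
            out[len(out)-1].val += p.val
        } else {
            out = append(out, point{p.val, end})
        }
    }
    return out
}

func main() {
    // raw points every 10s, with a spike at ts=30
    in := []point{{1, 10}, {1, 20}, {5, 30}, {1, 40}, {1, 50}, {1, 60}}
    fmt.Println(consolidateSum(in, 40))
    // [{8 40} {2 80}]: the spike at ts=30 shows up at ts=40 (later), never earlier
}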

Contributor

Not at all, I think your reasoning makes sense and it's also how things like movingAverage work in graphite, so I'm seriously considering a PR to change graphite to work according to your logic. Will need to chew on that a little.

Contributor

I agree that it would make more sense to change Graphite than to change the current MT implementation (although I wouldn't argue based on the theory of relativity).

@Dieterbe

can i get an approve on this?

@replay replay left a comment

LGTM

@Dieterbe Dieterbe merged commit e4870e0 into master Jun 27, 2017
@Dieterbe Dieterbe deleted the runtime-consolidation-nudging branch September 18, 2018 09:07