
Fix first/last aggregation functions #5208

Merged · 5 commits merged into master on Feb 19, 2024
Conversation

@lutter (Collaborator) commented Feb 14, 2024

The first/last aggregation functions were not working as intended. This PR fixes that; unfortunately, to achieve it we need to require that the ids of time series data points be in insertion order, and we have to jump through a number of hoops to define arg_min/arg_max functions.
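As an illustration of the semantics (this is a hedged sketch, not the actual graph-node code or SQL from this PR; all names and types here are hypothetical), `first`/`last` over insertion-ordered ids amounts to an arg_min/arg_max over the id column:

```rust
// Hypothetical sketch: `first`/`last` as arg_min/arg_max over ids that are
// assumed to be monotonically increasing in insertion order.
#[derive(Debug, Clone, PartialEq)]
struct DataPoint {
    id: u64,    // assumed to reflect insertion order
    value: i64,
}

/// `first`: the value of the point with the smallest id (arg_min over id).
fn first(points: &[DataPoint]) -> Option<i64> {
    points.iter().min_by_key(|p| p.id).map(|p| p.value)
}

/// `last`: the value of the point with the largest id (arg_max over id).
fn last(points: &[DataPoint]) -> Option<i64> {
    points.iter().max_by_key(|p| p.id).map(|p| p.value)
}

fn main() {
    let points = vec![
        DataPoint { id: 3, value: 30 },
        DataPoint { id: 1, value: 10 },
        DataPoint { id: 2, value: 20 },
    ];
    println!("{:?} {:?}", first(&points), last(&points)); // Some(10) Some(30)
}
```

This is exactly why insertion-ordered ids matter: without that ordering guarantee, arg_min/arg_max over the id would not correspond to the first/last written value.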

Comment on lines +1042 to +1045
// We only need to pay attention to the first bucket; if
// there are more buckets, there's nothing to rollup for
// them as the next changes we wrote are for `block_time`,
// and we'll catch that on the next iteration of the loop.
Member

I don't understand how we catch the other buckets in the next iteration; won't we miss all the buckets between the first bucket and `block_time`?

Collaborator Author

My thinking here is this: assume we get passed block_times = [b1, b2, b3, ..] where b1 and b2 are far apart. At some point in the iteration we call rollup.interval.buckets(b1, b2), which produces timestamps [t1, t2, ..]. Since b1 and b2 are far apart, we have something like t1 <= b1 < t2 < t3 < t4 < t5 <= b2, but we know that there are no writes between b1 and b2; if there were, block_times would contain some block time between b1 and b2. So we only need to do a rollup for the bucket t1 <= b1 < t2. After that, we set last_rollup = b2 and repeat the loop, which rolls up the bucket t5 <= b2 < t6. So there's no need to worry about the buckets starting at t2, t3, and t4.

Maybe there's an error in my logic, but I think that's correct (and tests haven't shown any missed rollups).
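The reasoning above can be simulated with a small sketch (again hypothetical names, not the actual graph-node code): with an interval of 10 and block times 12 and 57, the bucket starts in between are 10, 20, 30, 40, 50; since no writes fall between the two block times, only the bucket containing the earlier time holds data to roll up, and the bucket containing the later time is handled on the next iteration.

```rust
// Illustrative sketch of the rollup-skipping argument. Names are
// hypothetical; this is not the actual code from the PR.
fn bucket_start(time: u64, interval: u64) -> u64 {
    time - time % interval
}

/// Roll up the bucket of each block time once a later block time falls in
/// a different bucket. Empty buckets between two block times are skipped,
/// because a write in them would itself appear as a block time.
fn buckets_to_roll_up(block_times: &[u64], interval: u64) -> Vec<u64> {
    let mut rolled = Vec::new();
    for pair in block_times.windows(2) {
        let (prev, next) = (pair[0], pair[1]);
        if bucket_start(prev, interval) != bucket_start(next, interval) {
            rolled.push(bucket_start(prev, interval));
        }
    }
    rolled
}

fn main() {
    // b1 = 12 and b2 = 57 are far apart: the empty buckets 20, 30, 40 are
    // skipped; bucket 50 (containing b2) is rolled up on the next
    // iteration, once block time 63 arrives in a later bucket.
    let rolled = buckets_to_roll_up(&[12, 57, 63], 10);
    println!("{:?}", rolled); // [10, 50]
}
```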

Collaborator Author

If you find the above convincing, I should probably add it to the comment.

Member

Yes, it's convincing; thanks for the clarification.

For subgraphs with big gaps between writes, we tried to do rollups for times where we already know that we don't have changes.

The two names are aliases, but we will use them in aggregations; to make the notation more consistent, we'll use int4 and int8.
@lutter lutter merged commit ccb992e into master Feb 19, 2024
7 checks passed
@lutter lutter deleted the lutter/agg-minmax branch February 19, 2024 19:42
@lutter lutter restored the lutter/agg-minmax branch February 19, 2024 20:16
@lutter lutter deleted the lutter/agg-minmax branch February 19, 2024 20:17