Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aggregation filter using JOIN #4009

Merged
merged 12 commits into from
Feb 15, 2024
Merged

Aggregation filter using JOIN #4009

merged 12 commits into from
Feb 15, 2024

Conversation

egor-ryashin
Copy link
Contributor

No description provided.

@egor-ryashin egor-ryashin marked this pull request as ready for review February 12, 2024 18:44
Comment on lines 420 to 425
joinConditions := make([]string, 0, len(q.Dimensions))
selfJoinCols := make([]string, 0, len(q.Dimensions)+1)
selfJoinTableAlias := tempName("self_join")
nonNullValue := tempName("non_null")
var args []any
var subSelectArgs []any
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid this takes the code to a barely readable level and will make future changes too hard and error-prone. I don't have a great proposal, but somehow this filter case needs to be isolated better in the code (and not interleaved with the existing implementation with various if statements throughout) and also have an appropriate docstring explaining the case being handled.

I wonder if it can either be refactored to a single if statement at the end of the buildMetricsAggregationSQL function, or otherwise we may need to have a dedicated buildMetricsAggregationSingleMeasureFilterSQL function that gets invoked instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

args collection prevents from refactoring it to a single if

Comment on lines 617 to 630
SELECT %[1]s FROM (
SELECT %[10]s FROM %[2]s %[3]s %[4]s %[5]s %[6]s
) %[2]s
LEFT JOIN (
SELECT %[10]s FROM %[2]s %[3]s %[9]s %[5]s %[6]s
) %[7]s
ON (%[8]s) %[13]s %[11]s OFFSET %[12]d
`,
strings.Join(selfJoinCols, ", "), // 1
safeName(mv.Table), // 2
strings.Join(unnestClauses, ""), // 3
whereClause, // 4
groupClause, // 5
havingClause, // 6
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't the HAVING clause need to be applied after the left join?

)

args = append(args, subSelectArgs...)
fmt.Println("sql ", sql, args)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leftover print statement

Copy link
Contributor

@begelundmuller begelundmuller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks cleaner. Have you also validated that it works on Druid?

This JOIN mirrors functionality of SELECT d1, d2, d3, m1 FILTER (WHERE d4 = 'Safari') FROM t WHERE... GROUP BY d1, d2, d3
bacause FILTER cannot be applied for arbitrary measure, ie sum(a)/1000
*/
if len(q.Measures) == 1 && q.Measures[0].Filter != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should return an error if more than 1 measure has a filter (since that's not supported) or if it is used with pivot_on

Comment on lines 613 to 619
SELECT %[1]s FROM (
SELECT %[10]s FROM %[2]s %[3]s %[4]s %[5]s %[6]s
) %[2]s
LEFT JOIN (
SELECT %[10]s FROM %[2]s %[3]s %[9]s %[5]s %[6]s
) %[7]s
ON (%[8]s) %[13]s %[11]s OFFSET %[12]d
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be missing something, but the having clause seems to be applied in the sub-selects, but shouldn’t it be outside, after the join? For example, if you have having measure1 < 10, some dimension values might be excluded from the first sub-selects which are included in the second sub-select.

selfJoinTableAlias, // 7
strings.Join(joinConditions, " AND "), // 8
measureWhereClause, // 9
strings.Join(selectCols, ", "), // 10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If using the same selectCols in both sub-selects, then won't it be computing the measure twice, one without the measure filter and the second with the measure filter? The non-filtered value would not be used, so should be cheaper to not compute it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is still unaddressed

Copy link
Contributor Author

@egor-ryashin egor-ryashin Feb 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it will be much more performant than "SELECT DISTINCT" and it requires additional algorithm branching

Comment on lines +245 to +248
splits := strings.Split(alias.Name, ".")
if len(splits) > 1 {
return safeName(splits[0]) + "." + safeName(splits[1]), true
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't safe since the name can contain dots

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other way I need to refactor it to pass information that it's a complex name

Comment on lines +587 to +600
/*
Example:
SELECT t.d1, t.d2, t.d3, t2.m1 (SELECT d1, d2, d3, m1 FROM t WHERE ... GROUP BY d1, d2, d3 HAVING m1 > 10 ) t LEFT JOIN (
SELECT d1, d2, d3, m1 FROM t WHERE ... AND (d4 = 'Safari') GROUP BY d1, d2, d3 HAVING m1 > 10
) t2 ON (COALESCE(t.d1, 'val') = COALESCE(t2.d1, 'val') and COALESCE(t.d2, 'val') = COALESCE(t2.d2, 'val') and ...)
WHERE t2.m1 > 10
ORDER BY ...
LIMIT 100
OFFSET 0

This JOIN mirrors functionality of SELECT d1, d2, d3, m1 FILTER (WHERE d4 = 'Safari') FROM t WHERE... GROUP BY d1, d2, d3
bacause FILTER cannot be applied for arbitrary measure, ie sum(a)/1000
*/
if filterCount == 1 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we pull this case into a separate function? The level of nesting and complexity here makes it hard to follow

Copy link
Contributor Author

@egor-ryashin egor-ryashin Feb 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it will be a long signature function

selfJoinTableAlias, // 7
strings.Join(joinConditions, " AND "), // 8
measureWhereClause, // 9
strings.Join(selectCols, ", "), // 10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is still unaddressed

Comment on lines 654 to 658
SELECT %[1]s FROM (
SELECT %[10]s FROM %[2]s %[3]s %[4]s %[5]s %[6]s
) %[2]s
LEFT JOIN (
SELECT %[10]s FROM %[2]s %[3]s %[9]s %[5]s %[6]s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

%[6]s represents the having clause – does it make sense to have it in the subqueries when it will be enforced after the left join?

At least in the first sub-query, we should be able to completely remove calculation of the measure (see other comment), so having it there seems weird.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assuming Druid won't push down the WHERE, the subselect generates more rows for joining

@begelundmuller begelundmuller merged commit 18caa05 into main Feb 15, 2024
4 checks passed
@begelundmuller begelundmuller deleted the agg-filter-with-join branch February 15, 2024 15:59
mindspank pushed a commit that referenced this pull request Feb 23, 2024
…4009)

* aggregation filter with join

* aggregation filter with join

* aggregation filter with join

* aggregation filter with join

* aggregation filter with join

* aggregation filter with join

* aggregation filter with join

* aggregation filter with join

* aggregation filter with join

* aggregation filter with join

* aggregation filter with join

* aggregation filter with join

---------

Co-authored-by: Egor Ryashin <egor.ryashin@rilldata.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants