-
Notifications
You must be signed in to change notification settings - Fork 117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Aggregation filter using JOIN #4009
Conversation
joinConditions := make([]string, 0, len(q.Dimensions)) | ||
selfJoinCols := make([]string, 0, len(q.Dimensions)+1) | ||
selfJoinTableAlias := tempName("self_join") | ||
nonNullValue := tempName("non_null") | ||
var args []any | ||
var subSelectArgs []any |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid this takes the code to a barely readable level and will make future changes too hard and error-prone. I don't have a great proposal, but somehow this filter case needs to be isolated better in the code (and not interleaved with the existing implementation with various if statements throughout) and also have an appropriate docstring explaining the case being handled.
I wonder if it can either be refactored to a single if
statement at the end of the buildMetricsAggregationSQL
function, or otherwise we may need to have a dedicated buildMetricsAggregationSingleMeasureFilterSQL
function that gets invoked instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
args
collection prevents from refactoring it to a single if
SELECT %[1]s FROM ( | ||
SELECT %[10]s FROM %[2]s %[3]s %[4]s %[5]s %[6]s | ||
) %[2]s | ||
LEFT JOIN ( | ||
SELECT %[10]s FROM %[2]s %[3]s %[9]s %[5]s %[6]s | ||
) %[7]s | ||
ON (%[8]s) %[13]s %[11]s OFFSET %[12]d | ||
`, | ||
strings.Join(selfJoinCols, ", "), // 1 | ||
safeName(mv.Table), // 2 | ||
strings.Join(unnestClauses, ""), // 3 | ||
whereClause, // 4 | ||
groupClause, // 5 | ||
havingClause, // 6 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't the HAVING
clause need to be applied after the left join?
) | ||
|
||
args = append(args, subSelectArgs...) | ||
fmt.Println("sql ", sql, args) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leftover print statement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks cleaner. Have you also validated that it works on Druid?
This JOIN mirrors functionality of SELECT d1, d2, d3, m1 FILTER (WHERE d4 = 'Safari') FROM t WHERE... GROUP BY d1, d2, d3 | ||
bacause FILTER cannot be applied for arbitrary measure, ie sum(a)/1000 | ||
*/ | ||
if len(q.Measures) == 1 && q.Measures[0].Filter != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should return an error if more than 1 measure has a filter (since that's not supported) or if it is used with pivot_on
SELECT %[1]s FROM ( | ||
SELECT %[10]s FROM %[2]s %[3]s %[4]s %[5]s %[6]s | ||
) %[2]s | ||
LEFT JOIN ( | ||
SELECT %[10]s FROM %[2]s %[3]s %[9]s %[5]s %[6]s | ||
) %[7]s | ||
ON (%[8]s) %[13]s %[11]s OFFSET %[12]d |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be missing something, but the having
clause seems to be applied in the sub-selects, but shouldn’t it be outside, after the join? For example, if you have having measure1 < 10
, some dimension values might be excluded from the first sub-selects which are included in the second sub-select.
selfJoinTableAlias, // 7 | ||
strings.Join(joinConditions, " AND "), // 8 | ||
measureWhereClause, // 9 | ||
strings.Join(selectCols, ", "), // 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If using the same selectCols
in both sub-selects, then won't it be computing the measure twice, one without the measure filter and the second with the measure filter? The non-filtered value would not be used, so should be cheaper to not compute it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is still unaddressed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure it will be much more performant than "SELECT DISTINCT" and it requires additional algorithm branching
splits := strings.Split(alias.Name, ".") | ||
if len(splits) > 1 { | ||
return safeName(splits[0]) + "." + safeName(splits[1]), true | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't safe since the name can contain dots
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other way I need to refactor it to pass information that it's a complex name
/* | ||
Example: | ||
SELECT t.d1, t.d2, t.d3, t2.m1 (SELECT d1, d2, d3, m1 FROM t WHERE ... GROUP BY d1, d2, d3 HAVING m1 > 10 ) t LEFT JOIN ( | ||
SELECT d1, d2, d3, m1 FROM t WHERE ... AND (d4 = 'Safari') GROUP BY d1, d2, d3 HAVING m1 > 10 | ||
) t2 ON (COALESCE(t.d1, 'val') = COALESCE(t2.d1, 'val') and COALESCE(t.d2, 'val') = COALESCE(t2.d2, 'val') and ...) | ||
WHERE t2.m1 > 10 | ||
ORDER BY ... | ||
LIMIT 100 | ||
OFFSET 0 | ||
|
||
This JOIN mirrors functionality of SELECT d1, d2, d3, m1 FILTER (WHERE d4 = 'Safari') FROM t WHERE... GROUP BY d1, d2, d3 | ||
bacause FILTER cannot be applied for arbitrary measure, ie sum(a)/1000 | ||
*/ | ||
if filterCount == 1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we pull this case into a separate function? The level of nesting and complexity here makes it hard to follow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it will be a long signature function
selfJoinTableAlias, // 7 | ||
strings.Join(joinConditions, " AND "), // 8 | ||
measureWhereClause, // 9 | ||
strings.Join(selectCols, ", "), // 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is still unaddressed
SELECT %[1]s FROM ( | ||
SELECT %[10]s FROM %[2]s %[3]s %[4]s %[5]s %[6]s | ||
) %[2]s | ||
LEFT JOIN ( | ||
SELECT %[10]s FROM %[2]s %[3]s %[9]s %[5]s %[6]s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
%[6]s
represents the having clause – does it make sense to have it in the subqueries when it will be enforced after the left join?
At least in the first sub-query, we should be able to completely remove calculation of the measure (see other comment), so having it there seems weird.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assuming Druid won't push down the WHERE, the subselect generates more rows for joining
…4009) * aggregation filter with join * aggregation filter with join * aggregation filter with join * aggregation filter with join * aggregation filter with join * aggregation filter with join * aggregation filter with join * aggregation filter with join * aggregation filter with join * aggregation filter with join * aggregation filter with join * aggregation filter with join --------- Co-authored-by: Egor Ryashin <egor.ryashin@rilldata.com>
No description provided.