Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Projection Expression - Input Field Inconsistencies during Projection #10088

Merged
merged 8 commits into from
Apr 22, 2024

Conversation

berkaysynnada
Copy link
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

During projection, 3 cases found where the input field names do not match with the projection expressions:

  1. COUNT(1) expressions appear as COUNT(*) in the input field.
  2. Aggregate function names do not include ORDER BY clause in their field name, but the projections expect them with ORDER BY clause.
  3. While we are reverting first-last value functions, we forgot to convert also ORDER BY clause.

These are not bugs since the projection does not check the names during execution. However, when writing some optimizer rules, projections may be added by looking at the input schema. That inconsistencies cause some errors then.

What changes are included in this PR?

  1. COUNT() expressions are renamed as the way that they appear in the projection.
  2. ScalarFunction, WindowFunction and AggregateFunction's names contain the ORDER BY clause now.
  3. First-Last value functions' order are also reverted now.
  4. While creating the ProjectionMapping, we check the naming of expression and fields.

Are these changes tested?

One test added for the COUNT case, existing tests with new versions cover the other cases.

Are there any user-facing changes?

@github-actions github-actions bot added physical-expr Physical Expressions core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Apr 15, 2024
@@ -67,6 +67,9 @@ impl ProjectionMapping {
// Conceptually, `source_expr` and `expression` should be the same.
let idx = col.index();
let matching_input_field = input_schema.field(idx);
if col.name() != matching_input_field.name() {
return Err(DataFusionError::Internal(format!("Input field name {} does not match with the projection expression {}",matching_input_field.name(),col.name())));
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check does not exist before. Now, input field names and projection expressions must match.

Copy link
Contributor

@ozankabak ozankabak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ozankabak
Copy link
Contributor

@alamb, can you PTAL? This will start generating errors for plans that fail to treat names correctly, so it will have downstream effects

Copy link
Member

@jackwener jackwener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A great job to me

Comment on lines +47 to +51
02)--AggregateExec: mode=FinalPartitioned, gby=[a@0 as a], aggr=[NTH_VALUE(multiple_ordered_table.c,Int64(1)) ORDER BY [multiple_ordered_table.c ASC NULLS LAST]], ordering_mode=Sorted
03)----SortExec: expr=[a@0 ASC NULLS LAST]
04)------CoalesceBatchesExec: target_batch_size=8192
05)--------RepartitionExec: partitioning=Hash([a@0], 4), input_partitions=4
06)----------AggregateExec: mode=Partial, gby=[a@0 as a], aggr=[NTH_VALUE(multiple_ordered_table.c,Int64(1))], ordering_mode=Sorted
06)----------AggregateExec: mode=Partial, gby=[a@0 as a], aggr=[NTH_VALUE(multiple_ordered_table.c,Int64(1)) ORDER BY [multiple_ordered_table.c ASC NULLS LAST]], ordering_mode=Sorted
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -895,6 +891,31 @@ fn convert_to_sort_cols(
.collect::<Vec<_>>()
}

fn replace_order_by_clause(order_by: &mut String) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than doing string manipulation here, maybe we could call create_function_physical_name to just create the right name in the first place 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see -- create_function_physical_name doesn't have sufficient information (Exprs etc to do this)

I suppose the alternate is to remember the relevant parts of the expression, but that also seems brittle.

I can't think of anything better at the moment

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think erroring when there is a mismatch between physical schemas is a reasonable things to do (it is probably better to fail fast than some other harder to diagnose error later on)

@@ -895,6 +891,31 @@ fn convert_to_sort_cols(
.collect::<Vec<_>>()
}

fn replace_order_by_clause(order_by: &mut String) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see -- create_function_physical_name doesn't have sufficient information (Exprs etc to do this)

I suppose the alternate is to remember the relevant parts of the expression, but that also seems brittle.

I can't think of anything better at the moment

@alamb
Copy link
Contributor

alamb commented Apr 22, 2024

Thanks @berkaysynnada

@alamb alamb merged commit 0780438 into apache:main Apr 22, 2024
24 checks passed
ccciudatu pushed a commit to hstack/arrow-datafusion that referenced this pull request Apr 26, 2024
…apache#10088)

* agg fixes

* test updates

* fixing count mismatch

* Update aggregate_statistics.rs

* catch different names

* minor
ion-elgreco pushed a commit to delta-io/delta-rs that referenced this pull request Jun 11, 2024
# Description
Updates the arrow and datafusion dependencies to 52 and 39(-rc1)
respectively. This is necessary for updating pyo3.

While most changes with trivial, some required big rewrites. Namely, the
logic for the Updates operation had to be rewritten (and simplified) to
accommodate some new sanity checks inside datafusion:
(apache/datafusion#10088).

Depends on delta-kernel having its arrow and object-store version bumped
as well. This PR doesn't include any major changes for pyo3, I'll open a
separate PR depending on this PR.

# Related Issue(s)
<!---
For example:

- closes #106
--->

# Documentation

<!---
Share links to useful documentation
--->

---------

Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants