Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-44341][SQL][PYTHON] Define the computing logic through PartitionEvaluator API and use it in WindowExec and WindowInPandasExec #41939

Closed
wants to merge 5 commits into from

Conversation

beliefer
Copy link
Contributor

@beliefer beliefer commented Jul 11, 2023

What changes were proposed in this pull request?

WindowExec and WindowInPandasExec are updated to use the PartitionEvaluator API to do execution.

Why are the changes needed?

To define the computing logic and requires the caller side to explicitly list what needs to be serialized and sent to executors

Does this PR introduce any user-facing change?

'No'.
Just update the inner implementation.

How was this patch tested?

Add new test cases.

@beliefer
Copy link
Contributor Author

beliefer commented Jul 12, 2023

ping @cloud-fan @HyukjinKwon @zhengruifeng @viirya cc @vinodkc
The CI failure looks unrelated.

val windowExpression: Seq[NamedExpression],
val partitionSpec: Seq[Expression],
val orderSpec: Seq[SortOrder],
val conf: SQLConf,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we usually don't need to pass the SQLConf, as worker side can get it via SQLConf.get

@cloud-fan
Copy link
Contributor

We can probably skip testing it. Overall it's just a refactor and it's probably too much to run many tests twice. We can enable it by default later so that all tests cover this code path.

def windowExpression: Seq[NamedExpression]
def partitionSpec: Seq[Expression]
def orderSpec: Seq[SortOrder]
def conf: SQLConf
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems conf is always SQLConf.get in both implementations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I have used SQLConf.get in both implementations.

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored code looks okay to me.

@viirya
Copy link
Member

viirya commented Jul 14, 2023

As tests are skipped now, maybe you can enable it to test change in CI, and revert back current config before merging.

@beliefer
Copy link
Contributor Author

As tests are skipped now, maybe you can enable it to test change in CI, and revert back current config before merging.

In fact, CI passed in previous commit.

@viirya
Copy link
Member

viirya commented Jul 14, 2023

As tests are skipped now, maybe you can enable it to test change in CI, and revert back current config before merging.

In fact, CI passed in previous commit.

Oh okay. 👍

@beliefer
Copy link
Contributor Author

The CI failure is unrelated.

}

// Unwrap the expressions and factories from the map.
private val expressionsWithFrameIndex =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so all these private vals will be serialized to the executor side, right? I think it's better to define them in the evaluator class, not the factor, to reduce the data to be serialized.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @Hisoka-X can you also check your PRs (either merged or not)? We should avoid putting variables in the factor class, but define them in the evaluator class instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't come across a factory that needs to define member variables (except for parameters)😂

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 95dc829 Jul 18, 2023
@beliefer
Copy link
Contributor Author

@cloud-fan @viirya @Hisoka-X Thank you.

import org.apache.spark.sql.types.{CalendarIntervalType, DateType, DayTimeIntervalType, DecimalType, IntegerType, TimestampNTZType, TimestampType, YearMonthIntervalType}
import org.apache.spark.util.collection.Utils

trait WindowEvaluatorFactoryBase {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with hindsight, shall we move it to a new file? Its two sub-classes are not both in this file, so it's a bit weird to put it with one of the sub-class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Let's move it.

cloud-fan pushed a commit that referenced this pull request Jul 21, 2023
…torFactoryBase to a single file

### What changes were proposed in this pull request?
#41939 defined the computing logic through PartitionEvaluator API and use it in `WindowExec` and `WindowInPandasExec`.

According to the comment #41939 (comment), this PR want move the base trait `WindowEvaluatorFactoryBase` to a single file.

### Why are the changes needed?
Improve the code.

### Does this PR introduce _any_ user-facing change?
'No'.
Just update inner implementation.

### How was this patch tested?
N/A

Closes #42106 from beliefer/SPARK-44341_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
} else {
child.execute().mapPartitions { iter =>
val evaluator = evaluatorFactory.createEvaluator()
evaluator.eval(0, iter)
Copy link
Contributor

@vinodkc vinodkc Jul 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@beliefer, Can you please raise a follow-up PR to handle the partition index as this #42185

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the reminder. I will take a look!

cloud-fan pushed a commit that referenced this pull request Jul 31, 2023
… correctly for WindowGroupLimitExec,WindowExec and WindowInPandasExec

### What changes were proposed in this pull request?
This is a followup of #41899 and #41939, to set the partition index correctly even if it's not used for now. It also contains a few code cleanup.

### Why are the changes needed?
future-proof

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
existing tests

Closes #42208 from beliefer/SPARK-44340_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
ragnarok56 pushed a commit to ragnarok56/spark that referenced this pull request Mar 2, 2024
…onEvaluator API and use it in WindowExec and WindowInPandasExec

### What changes were proposed in this pull request?
`WindowExec` and `WindowInPandasExec` are updated to use the `PartitionEvaluator` API to do execution.

### Why are the changes needed?
To define the computing logic and requires the caller side to explicitly list what needs to be serialized and sent to executors

### Does this PR introduce _any_ user-facing change?
'No'.
Just update the inner implementation.

### How was this patch tested?
Add new test cases.

Closes apache#41939 from beliefer/SPARK-44341.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
ragnarok56 pushed a commit to ragnarok56/spark that referenced this pull request Mar 2, 2024
…torFactoryBase to a single file

### What changes were proposed in this pull request?
apache#41939 defined the computing logic through PartitionEvaluator API and use it in `WindowExec` and `WindowInPandasExec`.

According to the comment apache#41939 (comment), this PR want move the base trait `WindowEvaluatorFactoryBase` to a single file.

### Why are the changes needed?
Improve the code.

### Does this PR introduce _any_ user-facing change?
'No'.
Just update inner implementation.

### How was this patch tested?
N/A

Closes apache#42106 from beliefer/SPARK-44341_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
ragnarok56 pushed a commit to ragnarok56/spark that referenced this pull request Mar 2, 2024
… correctly for WindowGroupLimitExec,WindowExec and WindowInPandasExec

### What changes were proposed in this pull request?
This is a followup of apache#41899 and apache#41939, to set the partition index correctly even if it's not used for now. It also contains a few code cleanup.

### Why are the changes needed?
future-proof

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
existing tests

Closes apache#42208 from beliefer/SPARK-44340_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants