Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Coral-Incremental] Cost calculation for RelNode #516

Open
wants to merge 18 commits into
base: master
Choose a base branch
from

Conversation

yyy1000
Copy link

@yyy1000 yyy1000 commented Jul 9, 2024

What changes are proposed in this pull request, and why are they necessary?

The goal of this PR is to introduce cost calculation method for Coral Incremental module, that's given an Input RelNode, return its cost for {executing it and writing the result into output}. So, for two RelNodes, one is incremental format, and the other is batch format, we could choose one with less cost, which could have better performance when executing.

The main changes are:

  1. Rewrite the original Sql into all possible incremental format plans. (This PR only supports two tables join)
  2. Walk through the RelNode tree and get its cost, then we choose one with least cost, and convert it back to Sql format.

How was this patch tested?

Unit test

@yyy1000 yyy1000 marked this pull request as draft July 9, 2024 23:04
@yyy1000 yyy1000 changed the title Add prev format in table name for Join [WIP] [Coral-Incremental] Add prev format in table name for Join Jul 10, 2024
Comment on lines 61 to 72
public CostInfo getExecutionCost(RelNode rel) {
if (rel instanceof TableScan) {
return getExecutionCostTableScan((TableScan) rel);
} else if (rel instanceof LogicalJoin) {
return getExecutionCostJoin((LogicalJoin) rel);
} else if (rel instanceof LogicalUnion) {
return getExecutionCostUnion((LogicalUnion) rel);
} else if (rel instanceof LogicalProject) {
return getExecutionCostProject((LogicalProject) rel);
}
return new CostInfo(0.0, 0.0);
}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the core cost calculation framework.
It

  1. Walk through the whole RelNode tree and get the cost
  2. Keep the Row Count Information which will be used for IO cost

}
}

private Double estimateJoinSelectivity(List<JoinKey> joinKeys) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't write the logic now,
it will be from the join key, and from the statistic map info like distinct/row_count to get Selectivity

@yyy1000 yyy1000 changed the title [WIP] [Coral-Incremental] Add prev format in table name for Join [WIP] [Coral-Incremental] Cost calculation for RelNode Jul 17, 2024
@yyy1000 yyy1000 changed the title [WIP] [Coral-Incremental] Cost calculation for RelNode [WIP] [Coral-Incremental] Cost calculation for RelNode & Plan generation for two tables Jul 17, 2024
@yyy1000 yyy1000 marked this pull request as ready for review July 25, 2024 16:47
@yyy1000 yyy1000 changed the title [WIP] [Coral-Incremental] Cost calculation for RelNode & Plan generation for two tables [Coral-Incremental] Cost calculation for RelNode & Plan generation for two tables Jul 25, 2024
Copy link
Contributor

@KevinGe00 KevinGe00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

coral-incremental/src/test/resources/statistic.json Outdated Show resolved Hide resolved
private CostInfo getExecutionCostJoin(LogicalJoin join) {
RelNode left = join.getLeft();
RelNode right = join.getRight();
if (!(left instanceof TableScan) || !(right instanceof TableScan)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are expanding the cost estimator to handle joins for more than 2 tables?

@yyy1000 yyy1000 changed the title [Coral-Incremental] Cost calculation for RelNode & Plan generation for two tables [Coral-Incremental] Cost calculation for RelNode Jul 26, 2024
Copy link
Contributor

@KevinGe00 KevinGe00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing the previous feedback :)

@wmoustafa
Copy link
Contributor

wmoustafa commented Aug 2, 2024

Overall feedback is that I believe the PR sounds too specific and there is no general framework for cost estimation that is applicable to various operators by specializing some of its aspects. I think we need to iterate over this PR to make it more abstract with various implementations. This may be better discussed over a doc.

// The number of rows in the table
Double rowCount;
// The number of distinct values in each column
Map<String, Double> distinctCountByRow;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this work for nested columns? If yes, what is the spec of the key?

Does it work for complex types?

If the answer is no, we should be explicit about this at least in the Java doc.

Comment on lines 194 to 195
// The shuffle cost of a join is the maximum shuffle cost of its children because
// in modern distributed systems, the shuffle cost is dominated by the largest shuffle.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Terminology is not clear here. What is the shuffle cost of children? We might need to discuss this over a doc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants