Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-27108][SQL] Add parsed SQL plans for create, CTAS. #24029

Closed

Conversation

rdblue
Copy link
Contributor

@rdblue rdblue commented Mar 8, 2019

What changes were proposed in this pull request?

This moves parsing CREATE TABLE ... USING statements into catalyst. Catalyst produces logical plans with the parsed information and those plans are converted to v1 DataSource plans in DataSourceAnalysis.

This prepares for adding v2 create plans that should receive the information parsed from SQL without being translated to v1 plans first.

This also makes it possible to parse in catalyst instead of breaking the parser across the abstract AstBuilder in catalyst and SparkSqlParser in core.

For more information, see the mailing list thread.

How was this patch tested?

This uses existing tests to catch regressions. This introduces no behavior changes.

@SparkQA

This comment has been minimized.

@rdblue
Copy link
Contributor Author

rdblue commented Mar 8, 2019

@cloud-fan, this is needed to add the v2 create and CTAS plans. We can get a start while waiting for the catalog identifiers to be committed.

@rdblue
Copy link
Contributor Author

rdblue commented Mar 8, 2019

cc @jzhuge, @mccheah, @gatorsmile

@rdblue rdblue force-pushed the SPARK-27108-add-parsed-create-logical-plans branch from 3a77141 to 9bb101f Compare March 8, 2019 20:51
@SparkQA

This comment has been minimized.

@rdblue rdblue force-pushed the SPARK-27108-add-parsed-create-logical-plans branch from 9bb101f to 9e0913f Compare March 9, 2019 00:00
@SparkQA

This comment has been minimized.

* Parsed logical plans are located in Catalyst so that as much SQL parsing logic as possible is be
* kept in a [[org.apache.spark.sql.catalyst.parser.AbstractSqlParser]].
*/
private[sql] abstract class ParsedLogicalPlan extends LogicalPlan {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if this is useful hierarchy, but if yes we should document more clearly this should not survive analysis.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be useful to add a special check for this, rather than relying on resolved only.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 - @rdblue these should only be inputs to the analyzer, not outputs. Would be helpful to write specific JavaDoc on this.

Copy link
Contributor Author

@rdblue rdblue Mar 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is included above: "Parsed logical plans are not resolved because they must be converted to concrete logical plans."

Do you think that should be rephrased to be more clear?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defer to @rxin but I'm ok with merging with the current docs. We can rephrase in a follow-up if our contributors have trouble with this wording.

* }}}
* TODO: Remove this. It is used because CreateTempViewUsing is not a Catalyst plan.
* Either move CreateTempViewUsing into catalyst as a parsed logical plan, or remove it because
* it is deprecated.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just to double check, when did we deprecate CreateTempViewUsing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it happened in 5effc01, 3 years ago.

We can always create a parsed plan for CreateTempViewUsing so that it can move to catalyst as well, but I thought that we can do it later, and only need to if this isn't going to be removed in 3.0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, so we deprecated CREATE TABLE USING, but not CREATE TEMP VIEW USING

@@ -1860,4 +1861,193 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging
val structField = StructField(identifier.getText, typedVisit(dataType), nullable = true)
if (STRING == null) structField else structField.withComment(string(STRING))
}

/**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't review this file carefully, assuming it's just copy-paste code from SparkSqlAstBuilder

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is moving what is needed from SparkSqlAstBuilder. The only real changes are in the visitCreateTable method.

These rules should have already been in the abstract class. I'm not sure why they were in SparkSqlAstBuilder other than that was the easiest place to put them when they were added.

* converting that v1 metadata to the v2 equivalent, the sql [[CreateTable]] plan is produced by
* the parser and converted once into both implementations.
*
* Parsed logical plans are not resolved because they must be converted to concrete logical plans.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it really necessary to have this parent class just to set the resolved bit? I think we can just put override lazy val resolved = false in the new CreateTable and CreateTableAsSelect classes, with classdoc saying that these 2 classes will be replaced by what concrete plans during analysis.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value of this class is that it identifies the set of logical plans that correspond directly to what was parsed from SQL. When someone working on a plan sees ParsedLogicalPlan as an ancestor in Scaladoc, it signals what is explained here: that parser produces ParsedLogicalPlan nodes without translating what was parsed, then those plans get translated into real plans in the analyzer.

With that information, it is easy to see what changes need to be made. If the parsed plan doesn't include an option, then the parser and parsed plan needs to be updated. If it does include an option, then the analyzer and downstream plans need to be updated.

Also keep in mind that this is the first two subclasses of ParsedLogicalPlan. To implement v2 along-side v1, we are going to be adding more of them. So it is valuable that we don't need to remember to set resolved to false in every plan.

@rdblue
Copy link
Contributor Author

rdblue commented Mar 13, 2019

@cloud-fan, I've responded to the review comments and implemented fixes.

Also, I moved the new resolution rules into DataSourceResolution to fix the test failures. The new rules needed to be added to extendedResolutionRules instead of postHocResolutionRules because the post-hoc rules are only run once. This resolution needs to be done before the post-hoc rules defined in DataSourceAnalysis run.

At a minimum, these needed to be separated into different classes, but I think it is also more correct for resolution rules to run in the resolution batch so that other resolution rules can run on the plans produced by these rules. This is the same reason why we added the ResolveOutputRelation rule to the resolution batch.

I think that the resolution rules in DataSourceAnalysis should also move into the resolution batch, but that should be done as a separate change.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@rdblue
Copy link
Contributor Author

rdblue commented Mar 16, 2019

Retest this please.

@SparkQA
Copy link

SparkQA commented Mar 17, 2019

Test build #103578 has finished for PR 24029 at commit 651e5b5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 17, 2019

Test build #103579 has finished for PR 24029 at commit 7beff42.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon changed the title SPARK-27108: Add parsed SQL plans for create, CTAS. [SPARK-27108][SQL] Add parsed SQL plans for create, CTAS. Mar 18, 2019
@cloud-fan
Copy link
Contributor

retest this please

@cloud-fan
Copy link
Contributor

LGTM if tests pass

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

import org.apache.spark.sql.sources.v2.TableProvider
import org.apache.spark.sql.types.StructType

case class DataSourceResolution(conf: SQLConf) extends Rule[LogicalPlan] with CastSupport {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we call it DDLResolution? It's not very related to data source.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This resolves parsed plans to execution.datasources plans. This isn't just for DDL. We are starting with CreateTable and CreateTableAsSelect, but there will be more parsed plans that get converted to datasource plans in this rule. That's why I think DataSourceResolution is an appropriate name.

@rdblue rdblue force-pushed the SPARK-27108-add-parsed-create-logical-plans branch from 65c5cd5 to 5054334 Compare March 22, 2019 16:13
@rdblue
Copy link
Contributor Author

rdblue commented Mar 22, 2019

@mccheah, @cloud-fan, I've rebased on master, fixed a minor conflict, and added the Statement suffix to the new plans. I think this should be good to go once tests pass.

@SparkQA

This comment has been minimized.

@rdblue rdblue force-pushed the SPARK-27108-add-parsed-create-logical-plans branch from 5054334 to 6c9b9dc Compare March 22, 2019 16:20
@SparkQA
Copy link

SparkQA commented Mar 22, 2019

Test build #103831 has finished for PR 24029 at commit 6c9b9dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 34e3cc7 Mar 22, 2019
@rdblue
Copy link
Contributor Author

rdblue commented Mar 22, 2019

Thanks @cloud-fan! And thanks to all the reviewers also!

mccheah pushed a commit to palantir/spark that referenced this pull request May 15, 2019
This moves parsing `CREATE TABLE ... USING` statements into catalyst. Catalyst produces logical plans with the parsed information and those plans are converted to v1 `DataSource` plans in `DataSourceAnalysis`.

This prepares for adding v2 create plans that should receive the information parsed from SQL without being translated to v1 plans first.

This also makes it possible to parse in catalyst instead of breaking the parser across the abstract `AstBuilder` in catalyst and `SparkSqlParser` in core.

For more information, see the [mailing list thread](https://lists.apache.org/thread.html/54f4e1929ceb9a2b0cac7cb058000feb8de5d6c667b2e0950804c613%3Cdev.spark.apache.org%3E).

This uses existing tests to catch regressions. This introduces no behavior changes.

Closes apache#24029 from rdblue/SPARK-27108-add-parsed-create-logical-plans.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
mccheah pushed a commit to palantir/spark that referenced this pull request Jun 6, 2019
This moves parsing logic for `ALTER TABLE` into Catalyst and adds parsed logical plans for alter table changes that use multi-part identifiers. This PR is similar to SPARK-27108, PR apache#24029, that created parsed logical plans for create and CTAS.

* Create parsed logical plans
* Move parsing logic into Catalyst's AstBuilder
* Convert to DataSource plans in DataSourceResolution
* Parse `ALTER TABLE ... SET LOCATION ...` separately from the partition variant
* Parse `ALTER TABLE ... ALTER COLUMN ... [TYPE dataType] [COMMENT comment]` [as discussed on the dev list](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Syntax-for-table-DDL-td25197.html#a25270)
* Parse `ALTER TABLE ... RENAME COLUMN ... TO ...`
* Parse `ALTER TABLE ... DROP COLUMNS ...`

* Added new tests in Catalyst's `DDLParserSuite`
* Moved converted plan tests from SQL `DDLParserSuite` to `PlanResolutionSuite`
* Existing tests for regressions

Closes apache#24723 from rdblue/SPARK-27857-add-alter-table-statements-in-catalyst.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
emanuelebardelli pushed a commit to emanuelebardelli/spark that referenced this pull request Jun 15, 2019
## What changes were proposed in this pull request?

This moves parsing logic for `ALTER TABLE` into Catalyst and adds parsed logical plans for alter table changes that use multi-part identifiers. This PR is similar to SPARK-27108, PR apache#24029, that created parsed logical plans for create and CTAS.

* Create parsed logical plans
* Move parsing logic into Catalyst's AstBuilder
* Convert to DataSource plans in DataSourceResolution
* Parse `ALTER TABLE ... SET LOCATION ...` separately from the partition variant
* Parse `ALTER TABLE ... ALTER COLUMN ... [TYPE dataType] [COMMENT comment]` [as discussed on the dev list](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Syntax-for-table-DDL-td25197.html#a25270)
* Parse `ALTER TABLE ... RENAME COLUMN ... TO ...`
* Parse `ALTER TABLE ... DROP COLUMNS ...`

## How was this patch tested?

* Added new tests in Catalyst's `DDLParserSuite`
* Moved converted plan tests from SQL `DDLParserSuite` to `PlanResolutionSuite`
* Existing tests for regressions

Closes apache#24723 from rdblue/SPARK-27857-add-alter-table-statements-in-catalyst.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants