[SPARK-26946][SQL] Identifiers for multi-catalog #23848

Closed
jzhuge wants to merge 10 commits into master from jzhuge/SPARK-26946

Conversation

@jzhuge
Member

jzhuge commented Feb 21, 2019

What changes were proposed in this pull request?

  • Support N-part identifier in SQL
  • N-part identifier extractor in Analyzer

How was this patch tested?

  • A new unit test suite ResolveMultipartRelationSuite
  • CatalogLoadingSuite
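
For readers new to the feature, here is a minimal sketch of what the "N-part identifier extractor" amounts to in practice: splitting a parsed name such as Seq("cat", "db", "tbl") into an optional catalog and the remaining name parts. This is illustrative only, not the PR's code; CatalogPlugin below is a stub standing in for the v2 catalog plugin interface.

  // Illustrative sketch only; CatalogPlugin is a stub for the v2 plugin API.
  trait CatalogPlugin { def name: String }

  def resolveNameParts(
      parts: Seq[String],
      lookupCatalog: String => Option[CatalogPlugin]): (Option[CatalogPlugin], Seq[String]) =
    parts match {
      // The first part names a registered catalog: route the rest of the name to it.
      case head +: rest if rest.nonEmpty && lookupCatalog(head).isDefined =>
        (lookupCatalog(head), rest)
      // Otherwise the whole name resolves against the default/session catalog.
      case _ =>
        (None, parts)
    }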

@rdblue @cloud-fan @mccheah

@dongjoon-hyun
Member

dongjoon-hyun commented Feb 22, 2019

ok to test. Thank you, @jzhuge !

@jzhuge jzhuge changed the title [SPARK-26946][SQL][WIP] Identifiers for multi-catalog [SPARK-26946][SQL] Identifiers for multi-catalog Mar 8, 2019
@jzhuge
Member Author

jzhuge commented Mar 8, 2019

Looking at the build failure

@@ -63,6 +63,10 @@ singleTableIdentifier
: tableIdentifier EOF
;

singleMultiPartIdentifier
Contributor

This is a top-level parser entry used in ParserInterface. I don't think we need it now for catalog identifiers.

Member Author

True, only my test case uses it to parse a table name into a sequence. I will remove it.

Contributor

Won't we need this eventually for parsing names passed to saveAsTable? Why not add it now?

Member Author

When I start converting the SELECT, INSERT, and DROP code paths to support multi-catalog, this parse function will be needed, e.g.:

  // Parse a table reference into a multi-part identifier wrapped in an
  // unresolved plan node.
  override def visitTable(ctx: TableContext): LogicalPlan = withOrigin(ctx) {
    UnresolvedIdentifier(visitMultiPartIdentifier(ctx.multiPartIdentifier))
  }

  // Same for a named table, also applying any alias and TABLESAMPLE clause.
  override def visitTableName(ctx: TableNameContext): LogicalPlan = withOrigin(ctx) {
    val tableId = visitMultiPartIdentifier(ctx.multiPartIdentifier())
    val table = mayApplyAliasPlan(ctx.tableAlias, UnresolvedIdentifier(tableId))
    table.optionalMap(ctx.sample)(withSample)
  }
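
For illustration, the top-level entry under discussion lets callers split a dotted SQL name into its parts, roughly as below (a hedged sketch; the method name follows what the API eventually settled on):

  import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

  // Backticks quote parts that themselves contain dots.
  val parts: Seq[String] = CatalystSqlParser.parseMultipartIdentifier("cat.db.`my.table`")
  // parts == Seq("cat", "db", "my.table")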

@mccheah
Contributor

mccheah commented Mar 16, 2019

Sorry for breaking up my review into individual comments. I think this looks ok short of some style changes.

val conf = new SQLConf().copy(SQLConf.CASE_SENSITIVE -> caseSensitive)
new Analyzer(Some(lookupCatalog _), null, conf) {
  override val extendedResolutionRules =
    EliminateSubqueryAliases :: TestMultipartAnalysis(this) :: Nil
Contributor

This uses a lot of temporary classes to simulate future rules that match multi-part identifiers. I think I would rather include an update that adds new UnresolvedRelation nodes and uses them instead of test plan nodes, but I'd be interested to hear whether @cloud-fan agrees.

Member Author

OK either way. I have already converted SELECT/INSERT/DROP code paths to support multi-catalog in my private 2.3 branch. Pretty straightforward. Converting CREATE would be a lot easier with Ryan's PR 24029.

@rdblue
Contributor

rdblue commented Mar 18, 2019

@jzhuge, this looks really close to being ready to me!

@dilipbiswal
Contributor

retest this please

import org.apache.spark.annotation.Experimental;

/**
* An [[Identifier]] implementation.
Contributor

Minor: Looks like Scaladoc conventions used in Javadoc. This should be {@link Identifier}.

@rdblue
Contributor

rdblue commented Mar 19, 2019

+1

This looks good to me. @cloud-fan, do you have any more review comments?

* Identifies an object in a catalog.
*/
@Experimental
public interface Identifier {
Contributor

Shall we use a class directly? I don't see much value in using an interface here, as it has only one implementation.

Contributor

This allows us more flexibility than a single concrete class. Changing a class to an interface is not a binary compatible change, so using an interface is the right thing to do.

Contributor

Then I suggest we move the impl class to a private package like org.apache.spark.sql.catalyst. The static factory method should move to the impl class as well, since we only create identifiers inside Spark.

Contributor

The implementation class is package-private. If we were to move it to a different package, we would need to make it public for the factory method, which would increase its visibility, not decrease it.
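
For concreteness, the pattern being defended looks roughly like this (a hypothetical Scala rendering; the PR's actual Identifier and IdentifierImpl are Java classes in org.apache.spark.sql.catalog.v2):

  // Hypothetical Scala rendering of the interface-plus-hidden-impl pattern.
  trait Identifier {
    def namespace: Array[String] // levels above the object, e.g. Array("db")
    def name: String             // the object's own name
  }

  object Identifier {
    // The factory hides the concrete class, so callers bind only to the
    // interface and the impl can evolve without breaking binary compatibility.
    def of(namespace: Array[String], name: String): Identifier =
      IdentifierImpl(namespace, name)
  }

  // Package-private in the real code; left public here only to keep the
  // sketch self-contained.
  case class IdentifierImpl(namespace: Array[String], name: String) extends Identifier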

import org.apache.spark.sql.catalyst.TableIdentifier

@Experimental
trait LookupCatalog {
Contributor

Why is this a trait?

My understanding is that this PR adds the catalog object identifier class and the related parser support. I don't think we have a detailed design for how the analyzer looks up catalogs yet.

Contributor

This trait provides extractors, similar to a trait like PredicateHelper. These implement the resolution rules from the SPIP using a generic catalog lookup provided by the implementation.

This decouples the resolution rules from how the analyzer looks up catalogs and provides convenient extractors that implement those rules.
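
As a rough illustration of that extractor style (not the PR's exact code; it reuses the CatalogPlugin and Identifier sketches shown earlier):

  trait LookupCatalog {
    // Supplied by the implementor (e.g. the Analyzer); None disables
    // multi-catalog resolution entirely.
    def lookupCatalog: Option[String => CatalogPlugin]

    object CatalogObjectIdentifier {
      // Matches Seq("cat", "db", "tbl") as (Some(catalog), identifier), and
      // Seq("tbl") as (None, identifier) for the session catalog.
      def unapply(parts: Seq[String]): Option[(Option[CatalogPlugin], Identifier)] =
        lookupCatalog.map { lookup =>
          parts match {
            case Seq(name) =>
              (None, Identifier.of(Array.empty, name))
            case Seq(catalogName, tail @ _*) =>
              try {
                (Some(lookup(catalogName)), Identifier.of(tail.init.toArray, tail.last))
              } catch {
                // Stand-in for the "catalog not found" failure mode (hedged).
                case _: RuntimeException =>
                  (None, Identifier.of(parts.init.toArray, parts.last))
              }
          }
        }
    }
  }

A resolution rule can then pattern-match multi-part names, e.g. case CatalogObjectIdentifier(Some(catalog), ident) => ..., without knowing how catalogs are registered.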

Contributor

Then this should be an internal trait under a private package like org.apache.spark.sql.catalyst.

  • Create org.apache.spark.sql.catalog.v2.Identifier and IdentifierImpl.
  • Inherit CatalogIdentifier from v2.Identifier.
  • Encapsulate lookupCatalog and extractor into trait LookupCatalog.
  • SqlBase.g4: Replace MultiPart with Multipart.
  • Rename and simplify the unit test ResolveMultipartIdentifierSuite.
  • Add extractor LookupCatalog.AsTableIdentifier and a unit test.
  • Remove CatalogIdentifier.
  • Add comment for AsTableIdentifier to emphasize legacy support only.

@SparkQA

SparkQA commented Mar 21, 2019

Test build #103750 has finished for PR 23848 at commit 3bb4485.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mccheah
Contributor

mccheah left a comment

@cloud-fan, have we addressed all your comments, or did you have any other feedback you wanted to give? We would like to merge this soon to unblock other V2 work, particularly table catalogs.

Could we try to merge before EOD Pacific time today, or at the very latest before end of week?

For everyone else following, please feel free to leave any feedback you would like us to address before this goes in.


def this(catalog: SessionCatalog, conf: SQLConf) = {
  this(catalog, conf, conf.optimizerMaxIterations)
}

def this(lookupCatalog: Option[(String) => CatalogPlugin], catalog: SessionCatalog,
Contributor

Who will call this constructor? I feel we are adding too much code for future use only. Can we add them when they are needed? It would be good if this PR only added the identifier interface and impl class, and the related parser rules, which is pretty easy to justify.

Contributor

@cloud-fan, I think this commit is reasonably self-contained. Nit-picking about whether a constructor is added in this commit or the next isn't adding much value.

Keep in mind that we make commits self-contained to decrease conflicts and increase the rate at which we can review and accept patches. Is putting this in the next commit really worth the time it takes to change and test that change, if it means that this work is delayed another day?

@cloud-fan
Contributor

The parser part and the identifier interface/impl class LGTM. The catalog lookup part looks reasonable, but I'm not very confident without seeing the actual use case. To move things forward, I'm merging this. I may refactor this part after the table catalog gets in.

@cloud-fan
Contributor

thanks, merging to master!

@jzhuge
Member Author

jzhuge commented Mar 22, 2019

Thanks @cloud-fan !

@cloud-fan cloud-fan closed this in 80565ce Mar 22, 2019
@rdblue
Contributor

rdblue commented Mar 22, 2019

Thanks for merging, @cloud-fan, and thanks for working on this, @jzhuge!

mccheah pushed a commit to palantir/spark that referenced this pull request May 15, 2019
## What changes were proposed in this pull request?

- Support N-part identifier in SQL
- N-part identifier extractor in Analyzer

## How was this patch tested?

- A new unit test suite ResolveMultipartRelationSuite
- CatalogLoadingSuite

rdblue cloud-fan mccheah

Closes apache#23848 from jzhuge/SPARK-26946.

Authored-by: John Zhuge <jzhuge@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
rdblue pushed a commit to rdblue/spark that referenced this pull request May 19, 2019
jzhuge added a commit to jzhuge/spark that referenced this pull request Oct 15, 2019