[SPARK-26946][SQL] Identifiers for multi-catalog #23848

Closed
jzhuge wants to merge 10 commits into master from jzhuge/SPARK-26946

Conversation

@jzhuge
Member

jzhuge commented Feb 21, 2019

What changes were proposed in this pull request?

  • Support N-part identifier in SQL
  • N-part identifier extractor in Analyzer

How was this patch tested?

  • A new unit test suite ResolveMultipartRelationSuite
  • CatalogLoadingSuite
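
For readers new to the feature, here is a minimal sketch of what the "N-part identifier extractor" amounts to in practice: splitting a parsed name such as Seq("cat", "db", "tbl") into an optional catalog and the remaining name parts. This is illustrative only, not the PR's code; CatalogPlugin below is a stub standing in for the v2 catalog plugin interface.

  // Illustrative sketch only; CatalogPlugin is a stub for the v2 plugin API.
  trait CatalogPlugin { def name: String }

  def resolveNameParts(
      parts: Seq[String],
      lookupCatalog: String => Option[CatalogPlugin]): (Option[CatalogPlugin], Seq[String]) =
    parts match {
      // The first part names a registered catalog: route the rest of the name to it.
      case head +: rest if rest.nonEmpty && lookupCatalog(head).isDefined =>
        (lookupCatalog(head), rest)
      // Otherwise the whole name resolves against the default/session catalog.
      case _ =>
        (None, parts)
    }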

@rdblue @cloud-fan @mccheah

@dongjoon-hyun
Member

dongjoon-hyun commented Feb 22, 2019

ok to test. Thank you, @jzhuge !

@jzhuge jzhuge changed the title [SPARK-26946][SQL][WIP] Identifiers for multi-catalog [SPARK-26946][SQL] Identifiers for multi-catalog Mar 8, 2019
@jzhuge
Member Author

jzhuge commented Mar 8, 2019

Looking at the build failure

@@ -63,6 +63,10 @@ singleTableIdentifier
: tableIdentifier EOF
;

singleMultiPartIdentifier
Contributor

This is a top-level parser entry used in ParserInterface. I don't think we need it now for catalog identifiers.

Member Author

True, only my test case uses it to parse a table name into a sequence. I will remove it.

Contributor

Won't we need this eventually for parsing names passed to saveAsTable? Why not add it now?

Member Author

When I start converting the SELECT, INSERT, and DROP code paths to support multi-catalog, this parse function will be needed, e.g.:

  // Parse a table reference into a multi-part identifier wrapped in an
  // unresolved plan node.
  override def visitTable(ctx: TableContext): LogicalPlan = withOrigin(ctx) {
    UnresolvedIdentifier(visitMultiPartIdentifier(ctx.multiPartIdentifier))
  }

  // Same for a named table, also applying any alias and TABLESAMPLE clause.
  override def visitTableName(ctx: TableNameContext): LogicalPlan = withOrigin(ctx) {
    val tableId = visitMultiPartIdentifier(ctx.multiPartIdentifier())
    val table = mayApplyAliasPlan(ctx.tableAlias, UnresolvedIdentifier(tableId))
    table.optionalMap(ctx.sample)(withSample)
  }
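
For illustration, the top-level entry under discussion lets callers split a dotted SQL name into its parts, roughly as below (a hedged sketch; the method name follows what the API eventually settled on):

  import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

  // Backticks quote parts that themselves contain dots.
  val parts: Seq[String] = CatalystSqlParser.parseMultipartIdentifier("cat.db.`my.table`")
  // parts == Seq("cat", "db", "my.table")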

@mccheah
Contributor

mccheah commented Mar 16, 2019

Sorry for breaking up my review into individual comments. I think this looks ok short of some style changes.

val conf = new SQLConf().copy(SQLConf.CASE_SENSITIVE -> caseSensitive)
new Analyzer(Some(lookupCatalog _), null, conf) {
  override val extendedResolutionRules =
    EliminateSubqueryAliases :: TestMultipartAnalysis(this) :: Nil
Contributor

This uses a lot of temporary classes to simulate future rules that match multi-part identifiers. I think I would rather include an update that adds new UnresolvedRelation nodes and uses them instead of test plan nodes, but I'd be interested to hear whether @cloud-fan agrees.

Member Author

OK either way. I have already converted SELECT/INSERT/DROP code paths to support multi-catalog in my private 2.3 branch. Pretty straightforward. Converting CREATE would be a lot easier with Ryan's PR 24029.

@rdblue
Contributor

rdblue commented Mar 18, 2019

@jzhuge, this looks really close to being ready to me!

@dilipbiswal
Contributor

retest this please

import org.apache.spark.annotation.Experimental;

/**
* An [[Identifier]] implementation.
Contributor

Minor: Looks like Scaladoc conventions used in Javadoc. This should be {@link Identifier}.

@rdblue
Contributor

rdblue commented Mar 19, 2019

+1

This looks good to me. @cloud-fan, do you have any more review comments?

* Identifies an object in a catalog.
*/
@Experimental
public interface Identifier {
Contributor

Shall we use a class directly? I don't see much value in using an interface here, as it has only one implementation.

Contributor

This allows us more flexibility than a single concrete class. Changing a class to an interface is not a binary compatible change, so using an interface is the right thing to do.

Contributor

Then I suggest we move the impl class to a private package like org.apache.spark.sql.catalyst. The static factory method should move to the impl class as well, since we only create identifiers inside Spark.

Contributor

The implementation class is package-private. If we were to move it to a different package, we would need to make it public for the factory method, which would increase its visibility, not decrease it.
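
For concreteness, the pattern being defended looks roughly like this (a hypothetical Scala rendering; the PR's actual Identifier and IdentifierImpl are Java classes in org.apache.spark.sql.catalog.v2):

  // Hypothetical Scala rendering of the interface-plus-hidden-impl pattern.
  trait Identifier {
    def namespace: Array[String] // levels above the object, e.g. Array("db")
    def name: String             // the object's own name
  }

  object Identifier {
    // The factory hides the concrete class, so callers bind only to the
    // interface and the impl can evolve without breaking binary compatibility.
    def of(namespace: Array[String], name: String): Identifier =
      IdentifierImpl(namespace, name)
  }

  // Package-private in the real code; left public here only to keep the
  // sketch self-contained.
  case class IdentifierImpl(namespace: Array[String], name: String) extends Identifier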

import org.apache.spark.sql.catalyst.TableIdentifier

@Experimental
trait LookupCatalog {
Contributor

Why is this a trait?

My understanding is that this PR adds the catalog object identifier class and the related parser support. I don't think we have a detailed design for how the analyzer looks up catalogs yet.

Contributor

This trait provides extractors, similar to a trait like PredicateHelper. These implement the resolution rules from the SPIP using a generic catalog lookup provided by the implementation.

This decouples the resolution rules from how the analyzer looks up catalogs and provides convenient extractors that implement those rules.
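
As a rough illustration of that extractor style (not the PR's exact code; it reuses the CatalogPlugin and Identifier sketches shown earlier):

  trait LookupCatalog {
    // Supplied by the implementor (e.g. the Analyzer); None disables
    // multi-catalog resolution entirely.
    def lookupCatalog: Option[String => CatalogPlugin]

    object CatalogObjectIdentifier {
      // Matches Seq("cat", "db", "tbl") as (Some(catalog), identifier), and
      // Seq("tbl") as (None, identifier) for the session catalog.
      def unapply(parts: Seq[String]): Option[(Option[CatalogPlugin], Identifier)] =
        lookupCatalog.map { lookup =>
          parts match {
            case Seq(name) =>
              (None, Identifier.of(Array.empty, name))
            case Seq(catalogName, tail @ _*) =>
              try {
                (Some(lookup(catalogName)), Identifier.of(tail.init.toArray, tail.last))
              } catch {
                // Stand-in for the "catalog not found" failure mode (hedged).
                case _: RuntimeException =>
                  (None, Identifier.of(parts.init.toArray, parts.last))
              }
          }
        }
    }
  }

A resolution rule can then pattern-match multi-part names, e.g. case CatalogObjectIdentifier(Some(catalog), ident) => ..., without knowing how catalogs are registered.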

Contributor

Then this should be an internal trait under a private package like org.apache.spark.sql.catalyst.

  • Create org.apache.spark.sql.catalog.v2.Identifier and IdentifierImpl.
  • Inherit CatalogIdentifier from v2.Identifier.
  • Encapsulate lookupCatalog and extractor into trait LookupCatalog.
  • SqlBase.g4: Replace MultiPart with Multipart.
  • Rename and simplify the unit test ResolveMultipartIdentifierSuite.
  • Add extractor LookupCatalog.AsTableIdentifier and a unit test.
  • Remove CatalogIdentifier.
  • Add comment for AsTableIdentifier to emphasize legacy support only.

@SparkQA

SparkQA commented Mar 21, 2019

Test build #103750 has finished for PR 23848 at commit 3bb4485.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mccheah
Contributor

mccheah left a comment

@cloud-fan, have we addressed all your comments, or did you have any other feedback you wanted to give? We would like to merge this soon to unblock other V2 work, particularly table catalogs.

Could we try to merge before EOD Pacific time today, or at the very latest before end of week?

For everyone else following, please feel free to leave any feedback you would like us to address before this goes in.


def this(catalog: SessionCatalog, conf: SQLConf) = {
  this(catalog, conf, conf.optimizerMaxIterations)
}

def this(lookupCatalog: Option[(String) => CatalogPlugin], catalog: SessionCatalog,
Contributor

Who will call this constructor? I feel we are adding too much code for future use only. Can we add them when they are needed? It would be good if this PR only added the identifier interface and impl class, and the related parser rules, which is pretty easy to justify.

Contributor

@cloud-fan, I think this commit is reasonably self-contained. Nit-picking about whether a constructor is added in this commit or the next isn't adding much value.

Keep in mind that we make commits self-contained to decrease conflicts and increase the rate at which we can review and accept patches. Is putting this in the next commit really worth the time it takes to change and test that change, if it means that this work is delayed another day?

@cloud-fan
Contributor

The parser part and the identifier interface/impl class LGTM. The catalog lookup part looks reasonable, but I'm not very confident without seeing the actual use case. To move things forward, I'm merging this. I may refactor this part after the table catalog gets in.

@cloud-fan
Contributor

thanks, merging to master!

@jzhuge
Member Author

jzhuge commented Mar 22, 2019

Thanks @cloud-fan !

@cloud-fan cloud-fan closed this in 80565ce Mar 22, 2019
@rdblue
Contributor

rdblue commented Mar 22, 2019

Thanks for merging, @cloud-fan, and thanks for working on this, @jzhuge!

mccheah pushed a commit to palantir/spark that referenced this pull request May 15, 2019
## What changes were proposed in this pull request?

- Support N-part identifier in SQL
- N-part identifier extractor in Analyzer

## How was this patch tested?

- A new unit test suite ResolveMultipartRelationSuite
- CatalogLoadingSuite

rdblue cloud-fan mccheah

Closes apache#23848 from jzhuge/SPARK-26946.

Authored-by: John Zhuge <jzhuge@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
rdblue pushed a commit to rdblue/spark that referenced this pull request May 19, 2019
jzhuge added a commit to jzhuge/spark that referenced this pull request Oct 15, 2019