[SPARK-26946][SQL] Identifiers for multi-catalog #23848
Conversation
ok to test. Thank you, @jzhuge!
Looking at the build failure.
```
@@ -63,6 +63,10 @@ singleTableIdentifier
    : tableIdentifier EOF
    ;

singleMultiPartIdentifier
```
This is a top-level parser entry used in `ParserInterface`. I don't think we need it now for the catalog identifier.
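For reference, a top-level entry like this would surface as a parser method. A minimal sketch, assuming a name such as `parseMultipartIdentifier` (the method name and signature here are illustrative, not the merged API):

```scala
// Sketch only: a top-level parser entry point for multi-part identifiers.
// The method name is an assumption for illustration.
trait ParserInterface {
  /** Parse a name like "catalog.db.table" into Seq("catalog", "db", "table"). */
  def parseMultipartIdentifier(sqlText: String): Seq[String]
}
```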
True, only my test case uses it to parse a table name into a sequence. I will remove it.
Won't we need this eventually for parsing names passed to `saveAsTable`? Why not add it now?
When I start converting the SELECT, INSERT, and DROP code paths to support multi-catalog, this parse function is needed, e.g.:
```scala
override def visitTable(ctx: TableContext): LogicalPlan = withOrigin(ctx) {
  UnresolvedIdentifier(visitMultiPartIdentifier(ctx.multiPartIdentifier))
}

override def visitTableName(ctx: TableNameContext): LogicalPlan = withOrigin(ctx) {
  val tableId = visitMultiPartIdentifier(ctx.multiPartIdentifier())
  val table = mayApplyAliasPlan(ctx.tableAlias, UnresolvedIdentifier(tableId))
  table.optionalMap(ctx.sample)(withSample)
}
```
Sorry for breaking up my review into individual comments. I think this looks ok short of some style changes.
```scala
val conf = new SQLConf().copy(SQLConf.CASE_SENSITIVE -> caseSensitive)
new Analyzer(Some(lookupCatalog _), null, conf) {
  override val extendedResolutionRules =
    EliminateSubqueryAliases :: TestMultipartAnalysis(this) :: Nil
```
This uses a lot of temporary classes to simulate future rules that match multi-part identifiers. I think I would rather include an update that adds new `UnresolvedRelation` nodes and uses them instead of test plan nodes, but I'd be interested to hear whether @cloud-fan agrees.
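A minimal sketch of that suggestion, assuming an unresolved leaf node that carries the raw multi-part name (the node name here is hypothetical, not what eventually merged):

```scala
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.LeafNode

// Sketch only: an UnresolvedRelation-style node that keeps the multi-part
// identifier intact so a later resolution rule can match on it.
case class UnresolvedMultipartRelation(multipartIdentifier: Seq[String])
  extends LeafNode {
  override def output: Seq[Attribute] = Nil
  override lazy val resolved: Boolean = false
}
```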
OK either way. I have already converted SELECT/INSERT/DROP code paths to support multi-catalog in my private 2.3 branch. Pretty straightforward. Converting CREATE would be a lot easier with Ryan's PR 24029.
@jzhuge, this looks really close to being ready to me!
retest this please
```java
import org.apache.spark.annotation.Experimental;

/**
 * An [[Identifier]] implementation.
```
Minor: looks like a Scaladoc convention used in Javadoc. This should be `{@link Identifier}`.
+1 This looks good to me. @cloud-fan, do you have any more review comments?
```java
 * Identifies an object in a catalog.
 */
@Experimental
public interface Identifier {
```
Shall we use a class directly? I don't see much value in using an interface here, as it has only one implementation.
This gives us more flexibility than a single concrete class. Changing a class to an interface is not a binary-compatible change, so using an interface is the right thing to do.
Then I suggest we move the impl class to a private package like `org.apache.spark.sql.catalyst`. Also, the static method should be moved to the impl class as well, since we only create it inside Spark.
The implementation class is package-private. If we were to move it to a different package, we would need to make it public for the `of` factory method, which would increase its visibility, not decrease it.
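For context, a usage sketch of the API shape under discussion, where callers only ever see the `Identifier` interface and `IdentifierImpl` stays hidden behind the static factory (based on the names in this thread; details may differ from what merged):

```scala
import org.apache.spark.sql.catalog.v2.Identifier

// The of() factory returns the package-private IdentifierImpl, but callers
// only depend on the Identifier interface.
val ident: Identifier = Identifier.of(Array("prod", "db"), "events")
ident.namespace  // Array("prod", "db")
ident.name       // "events"
```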
```scala
import org.apache.spark.sql.catalyst.TableIdentifier

@Experimental
trait LookupCatalog {
```
Why is it a trait? My understanding is that this PR adds the class for the catalog object identifier and the related parser support. I don't think we have a detailed design for how the analyzer looks up catalogs yet.
This trait provides extractors, similar to a trait like `PredicateHelper`. These implement the resolution rules from the SPIP using a generic catalog lookup provided by the implementation. This decouples the resolution rules from how the analyzer looks up catalogs and provides convenient extractors that implement those rules.
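A condensed sketch of that design, using the names that appear in this PR (`lookupCatalog`, `CatalogObjectIdentifier`, `AsTableIdentifier`); the trait that actually merged may differ in detail:

```scala
import org.apache.spark.sql.catalog.v2.{CatalogPlugin, Identifier}
import org.apache.spark.sql.catalyst.TableIdentifier

trait LookupCatalog {
  // Supplied by the implementation (e.g. the Analyzer); None disables
  // multi-catalog resolution.
  def lookupCatalog: Option[(String) => CatalogPlugin] = None

  // Splits off a leading catalog name when the first part resolves to a
  // catalog; otherwise the whole name is an identifier in the default
  // catalog. Assumes parts is non-empty, which the parser guarantees.
  object CatalogObjectIdentifier {
    def unapply(parts: Seq[String]): Option[(Option[CatalogPlugin], Identifier)] =
      lookupCatalog.map { lookup =>
        parts match {
          case Seq(name) =>
            (None, Identifier.of(Array.empty[String], name))
          case Seq(catalogName, rest @ _*) =>
            try {
              (Some(lookup(catalogName)), Identifier.of(rest.init.toArray, rest.last))
            } catch {
              // A real implementation would catch only a catalog-not-found
              // error here, not every exception.
              case _: Exception =>
                (None, Identifier.of(parts.init.toArray, parts.last))
            }
        }
      }
  }

  // Legacy support only: names that still fit the session catalog's two-part
  // TableIdentifier. A fuller version would first rule out a leading catalog.
  object AsTableIdentifier {
    def unapply(parts: Seq[String]): Option[TableIdentifier] = parts match {
      case Seq(name)     => Some(TableIdentifier(name))
      case Seq(db, name) => Some(TableIdentifier(name, Some(db)))
      case _             => None
    }
  }
}
```

With these extractors, an analyzer rule can pattern-match parsed name parts without knowing how catalogs are loaded.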
Then this should be an internal trait under a private package like `org.apache.spark.sql.catalyst`.
Create `org.apache.spark.sql.catalog.v2.Identifier` and `IdentifierImpl`. Inherit `CatalogIdentifier` from `v2.Identifier`. Encapsulate `lookupCatalog` and the extractor into trait `LookupCatalog`. In `SqlBase.g4`, replace `MultiPart` with `Multipart`. Rename and simplify the unit test to `ResolveMultipartIdentifierSuite`.

Add extractor `LookupCatalog.AsTableIdentifier` and a unit test. Remove `CatalogIdentifier`.

Add a comment for `AsTableIdentifier` to emphasize legacy support only.
Test build #103750 has finished for PR 23848 at commit
@cloud-fan, have we addressed all your comments, or did you have any other feedback you wanted to give? We would like to merge this soon to unblock other v2 work, particularly table catalogs. Possibly try to merge before EOD Pacific time today, at the very latest before end of week?
For everyone else following, please feel free to leave any feedback you would like addressed before this goes in.
```scala
def this(catalog: SessionCatalog, conf: SQLConf) = {
  this(catalog, conf, conf.optimizerMaxIterations)
}

def this(lookupCatalog: Option[(String) => CatalogPlugin], catalog: SessionCatalog,
```
Who will call this constructor? I feel we are adding too much code for future use only. Can we add them when they are needed? It would be good if this PR only added the identifier interface and impl class, plus the related parser rules, which are pretty easy to justify.
@cloud-fan, I think this commit is reasonably self-contained. Nit-picking about whether a constructor is added in this commit or the next isn't adding much value.
Keep in mind that we make commits self-contained to decrease conflicts and increase the rate at which we can review and accept patches. Is putting this in the next commit really worth the time it takes to change and test that change, if it means that this work is delayed another day?
The parser part and the identifier interface/impl class LGTM. The catalog lookup part looks reasonable, but I'm not very confident without seeing the actual use case. To move things forward, I'm merging this. I may refactor this part after the table catalog gets in.
thanks, merging to master!
Thanks @cloud-fan!
Thanks for merging, @cloud-fan, and thanks for working on this, @jzhuge!
## What changes were proposed in this pull request?

- Support N-part identifier in SQL
- N-part identifier extractor in Analyzer

## How was this patch tested?

- A new unit test suite, ResolveMultipartRelationSuite
- CatalogLoadingSuite

@rblue @cloud-fan @mccheah

Closes apache#23848 from jzhuge/SPARK-26946.

Authored-by: John Zhuge <jzhuge@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>