-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-26982][SQL] Enhance describe framework to describe the output of a query. #23883
Conversation
Test build #102728 has finished for PR 23883 at commit
|
retest this please |
Test build #102731 has finished for PR 23883 at commit
|
* {{{ | ||
* DESCRIBE [EXTENDED|FORMATTED] table_name partitionSpec?; | ||
* DESCRIBE QUERY <query> | ||
* DESCRIBE <query> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DESCRIBE QUERY <query>
and DESCRIBE <query>
equal to DESCRIBE [QUERY] <query>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah... would change.
* {{{ | ||
* DESCRIBE [EXTENDED|FORMATTED] table_name partitionSpec?; | ||
* DESCRIBE QUERY <query> | ||
* DESCRIBE <query> | ||
* }}} | ||
*/ | ||
case class DescribeTableCommand( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As this is named as DescribeTableCommand
, shall we have individual command for describing query?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise, shall we rename it to DescribeTableOrQueryCommand
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya We could. Just that there are not that much changes. Lets see if others feel the same way. If so, i would change it. Renaming the command sounds good to me :-)
Thanks for the work! I have one question in adivance; any other database-like system supprting this syntax? Also, can you add a entry |
@maropu Thanks. I will update the entries. I got this idea from DB2 which supports this syntax. Here is the sample output for your perusal.
|
Thanks! This Describe command is pretty useful to SQL users! Without this command, our Spark users need to create a temp/persistent view first and then describe it. |
DescribeTableCommand( | ||
visitTableIdentifier(ctx.tableIdentifier), | ||
DescribeTableOrQueryCommand( | ||
Option(visitTableIdentifier(ctx.tableIdentifier)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
visitTableIdentifier
can return null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu with the change, DescribeTableOrQueryCommand takes a optional table identifier. Thats why i wrap it under an Option
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, I see.
DescribeTableOrQueryCommand(None, | ||
Option(visitQuery(ctx.query())), | ||
Map.empty[String, String], false) | ||
} |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
* {{{ | ||
* DESCRIBE [EXTENDED|FORMATTED] table_name partitionSpec?; | ||
* DESCRIBE [QUERY] <query> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: DESCRIBE [QUERY] query
? We need <>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If its under the bracket , it means optional isn't it ? Like in the previous example extended
and formatted
are optional ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, [QUERY]
means optional. You mean <query>
means optional, too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu Sorry .. i misinterpreted. Here , i am just trying to say that they have to specify a query. Its not optional. How should i say it ? Does the following look better ?
DESCRIBE [QUERY] query_statement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, yes.
throw new AnalysisException("Describe command is not supported for insert statements.") | ||
case u @ Union(children) if children.forall(_.isInstanceOf[InsertIntoTable]) => | ||
throw new AnalysisException("Describe command is not supported for insert statements.") | ||
case u: UnresolvedRelation => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we put the two matches above together?
case p if p.collectFirst { case _: InsertIntoTable | _: InsertIntoDir => true }.isDefined =>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu This is fantastic :-). Thanks.. Let me try it.
case u @ Union(children) if children.forall(_.isInstanceOf[InsertIntoTable]) => | ||
throw new AnalysisException("Describe command is not supported for insert statements.") | ||
case u: UnresolvedRelation => | ||
describeTable(sparkSession, u.tableIdentifier, partitionSpec, isExtended, result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test for this match in describe-query.sql
? Also, it might be better to add comments here about which query does this match cover.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu this should have been already covered in describe-table tests. Basically,
a simpledescribe table mytable
will go through it. Thats because its matched by queryPrimary
in sqlBase.g4. I will add comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some minor comments. Looks good except the comments.
@@ -784,7 +785,7 @@ ansiNonReserved | |||
| RESET | RESTRICT | REVOKE | RLIKE | ROLE | ROLES | ROLLBACK | ROLLUP | ROW | ROWS | SCHEMA | SEPARATED | SERDE | |||
| SERDEPROPERTIES | SET | SETS | SHOW | SKEWED | SORT | SORTED | START | STATISTICS | STORED | STRATIFY | STRUCT | |||
| TABLES | TABLESAMPLE | TBLPROPERTIES | TEMPORARY | TERMINATED | TOUCH | TRANSACTION | TRANSACTIONS | TRANSFORM | |||
| TRUE | TRUNCATE | UNARCHIVE | UNBOUNDED | UNCACHE | UNLOCK | UNSET | USE | VALUES | VIEW | WINDOW | |||
| TRUE | TRUNCATE | UNARCHIVE | UNBOUNDED | UNCACHE | UNLOCK | UNSET | USE | VALUES | VIEW | WINDOW | QUERY |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's order it with alphabet
* }}} | ||
*/ | ||
case class DescribeTableCommand( | ||
table: TableIdentifier, | ||
case class DescribeTableOrQueryCommand( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we create 2 commands?
Test build #102771 has finished for PR 23883 at commit
|
@@ -48,7 +48,7 @@ class TableIdentifierParserSuite extends SparkFunSuite { | |||
"false", "fetch", "float", "for", "grant", "group", "grouping", "import", "in", | |||
"insert", "int", "into", "is", "pivot", "lateral", "like", "local", "none", "null", | |||
"of", "order", "out", "outer", "partition", "percent", "procedure", "range", "reads", "revoke", | |||
"rollup", "row", "rows", "set", "smallint", "table", "timestamp", "to", "trigger", | |||
"rollup", "row", "rows", "set", "smallint", "table", "query", "timestamp", "to", "trigger", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
super nit: if with alphabet order, put it between "procedure" and "range" may be better :)
db20bfb
to
b6a13da
Compare
Test build #102804 has finished for PR 23883 at commit
|
@cloud-fan I have addressed your comments in my last checkin - fyi |
val result = new ArrayBuffer[Row] | ||
query match { | ||
case p if p.collectFirst { case _: InsertIntoTable | _: InsertIntoDir => true }.isDefined => | ||
throw new AnalysisException("Describe command is not supported for insert statements.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this behavior from DB2? sql("INSERT ...").printSchema
works and prints empty schema, shall we make the behavior consistent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan In DB2, we get an error when we describe on "insert"
db2 => describe insert into t1 values (1,1)
SQL0104N An unexpected token "insert" was found following "DESCRIBE".
Personally, i feel its better to let users know that this command is not meant for "insert" than returning an empty schema. But if you feel otherwise, i will change. Let me know please. You have a point about printSchema
though :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a strong preference here, but the DB2 does this in the parser, I'm wondering if we should do this in the parser as well.
instead of (DESC | DESCRIBE) QUERY? query
, how about (DESC | DESCRIBE) QUERY? querySpecification
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan the only thing is we may leave out a few statements if we restrict it to querySpecification like VALUES (...), WITH.. etc .. Is it okay if we create a parserRule like following ?
queryDescribe
: stmtType1
| stmtType2
;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan I tried to add a rule, but it actually gets complicated as the rules are intertwined. In order to do this, i have to create multiple parallel rules for describequery and then implement the visit method. So lets throw analysis exception for now ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then how about go with querySpecification
, to support only the SELECT ...
query at the beginning? we can support more in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan OK.. let me give that a try.
Test build #102815 has finished for PR 23883 at commit
|
Test build #102853 has finished for PR 23883 at commit
|
retest this please |
Test build #102858 has finished for PR 23883 at commit
|
Test build #102873 has finished for PR 23883 at commit
|
@@ -117,6 +117,11 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging | |||
} | |||
} | |||
|
|||
override def visitDescQuery(ctx: DescQueryContext): LogicalPlan = withOrigin(ctx) { | |||
plan(ctx.queryTerm). | |||
optionalMap(ctx.queryOrganization)(withQueryResultClauses) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
override def visitDescQuery(ctx: DescQueryContext): LogicalPlan = withOrigin(ctx) {
plan(ctx.queryTerm).optionalMap(ctx.queryOrganization)(withQueryResultClauses)
}
@@ -22,7 +22,7 @@ import java.sql.{Date, Timestamp} | |||
|
|||
import org.apache.spark.sql.Row | |||
import org.apache.spark.sql.catalyst.util.{DateFormatter, DateTimeUtils, TimestampFormatter} | |||
import org.apache.spark.sql.execution.command.{DescribeTableCommand, ExecutedCommandExec, ShowTablesCommand} | |||
import org.apache.spark.sql.execution.command._ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import org.apache.spark.sql.execution.command.{DescribeCommandBase, ExecutedCommandExec, ShowTablesCommand}
?
@@ -516,6 +505,35 @@ case class DescribeTableCommand( | |||
new MetadataBuilder().putString("comment", "comment of the column").build())() | |||
) | |||
|
|||
def describeSchema( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about this?
protected def describeSchema(schema: StructType, header: Boolean): Seq[Row] = {
val buffer = new ArrayBuffer[Row]
if (header) {
buffer += Row(s"# ${output.head.name}", output(1).name, output(2).name)
}
schema.foreach { column =>
buffer += Row(column.name, column.dataType.simpleString, column.getComment().orNull)
}
buffer
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu describeSchema has two callers. One from the describeTable code path and one from describeQuery. The other caller passes the row buffer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. Can you add protected
, anyway?
@@ -255,6 +256,10 @@ query | |||
: ctes? queryNoWith | |||
; | |||
|
|||
descQuery |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: queryToDesc
LGTM |
} | ||
} | ||
/** | ||
* Command can of the following form. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can of the
-> can be the
?
Otherwise, shall we reuse the previous description, Command that looks like
?
} | ||
} | ||
/** | ||
* Command can of the following form. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto.
|
||
|
||
-- Error cases. | ||
DESC WITH s AS (SELECT 'hello' as col1) SELECT * FROM s; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @dilipbiswal .
Is this the only exception case in SELECT
syntax? Do we have a plan to support this SELECT
query with WITH
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dongjoon-hyun Yeah.. Wenchen suggested that we start with simple selects and then improve on it. I am planning to look into CTEs next.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
/** | ||
* Command can of the following form. | ||
* {{{ | ||
* DESCRIBE [QUERY] query_statement |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
query_statement
-> select_statement
?
Also, if possible, let's add some comment below line 627 that we don't support CTE syntax yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dongjoon-hyun Now we support statements of the form :
- SELECT
- SELECTS wrapped under a set op (union/intersect etc)
- VALUES(....)
- TABLE
Thats why i didn't use select_statement.. Any other name you can think of ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got it. BTW, TABLE
and VALUES
doesn't look like query statement. What about enumerating them all clearly?
-- Test tables | ||
CREATE table desc_temp1 (key int COMMENT 'column_comment', val string) USING PARQUET; | ||
CREATE table desc_temp2 (key int, val string) USING PARQUET; | ||
CREATE table desc_temp3 (key int, val string) USING PARQUET; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we remove this 3rd CREATE table desc_temp3
? That is only used at multi-insert test case, FROM ... INSERT ... INSERT
. And, we can use like the following.
FROM desc_temp1 a
INSERT INTO desc_temp1 SELECT *
INSERT INTO desc_temp2 SELECT *;
DESC SELECT key, COUNT(*) as count FROM desc_temp1 group by key; | ||
DESC SELECT 10.00D as col1; | ||
DESC QUERY SELECT key FROM desc_temp1 UNION ALL select CAST(1 AS DOUBLE); | ||
DESC QUERY VALUES(1.00D, 'hello') as tab1(col1, col2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we add DESC QUERY FROM desc_temp1 a SELECT *
at line 13?
Test build #102898 has finished for PR 23883 at commit
|
-- cleanup | ||
DROP TABLE desc_temp1; | ||
DROP TABLE desc_temp2; | ||
DROP TABLE desc_temp3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to remove this line, too. And, could you add a new line at the end? GitHub shows a warning here.
struct<> | ||
-- !query 16 output | ||
org.apache.spark.sql.AnalysisException | ||
Table or view not found: desc_temp3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To remove this Table or view not found
exception, please don't forget to regenerate this~
Test build #102927 has finished for PR 23883 at commit
|
Test build #102930 has finished for PR 23883 at commit
|
* 1. SELECT statements | ||
* 2. SELECT statements inside set operators (UNION, INTERSECT etc) | ||
* 3. VALUES statement. | ||
* 4. TABLE statement. Example : TABLE table_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you mean we support DESCRIBE QUERY TABLE t
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nvm, just checked and we do support it...
thanks, merging to master! |
…of a query. Currently we can use `df.printSchema` to discover the schema information for a query. We should have a way to describe the output schema of a query using SQL interface. Example: DESCRIBE SELECT * FROM desc_table DESCRIBE QUERY SELECT * FROM desc_table ```SQL spark-sql> create table desc_table (c1 int comment 'c1-comment', c2 decimal comment 'c2-comment', c3 string); spark-sql> desc select * from desc_table; c1 int c1-comment c2 decimal(10,0) c2-comment c3 string NULL ``` Added a new test under SQLQueryTestSuite and SparkSqlParserSuite Closes apache#23883 from dilipbiswal/dkb_describe_query. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
Currently we can use
df.printSchema
to discover the schema information for a query. We should have a way to describe the output schema of a query using SQL interface.Example:
DESCRIBE SELECT * FROM desc_table
DESCRIBE QUERY SELECT * FROM desc_table
How was this patch tested?
Added a new test under SQLQueryTestSuite and SparkSqlParserSuite