
[SPARK-26982][SQL] Enhance describe framework to describe the output of a query. #23883

Closed · wants to merge 12 commits

Conversation

dilipbiswal (Contributor):

What changes were proposed in this pull request?

Currently we can use df.printSchema to discover the schema information for a query. We should have a way to describe the output schema of a query using SQL interface.

Example:

```sql
DESCRIBE SELECT * FROM desc_table
DESCRIBE QUERY SELECT * FROM desc_table

spark-sql> create table desc_table (c1 int comment 'c1-comment', c2 decimal comment 'c2-comment', c3 string);

spark-sql> desc select * from desc_table;
c1	int	        c1-comment
c2	decimal(10,0)	c2-comment
c3	string	        NULL
```

How was this patch tested?

Added a new test under SQLQueryTestSuite and SparkSqlParserSuite

@dilipbiswal dilipbiswal changed the title [SPARK-26982][SQL] Enhance describe frame work to describe the output of a query. [SPARK-26982][SQL] Enhance describe framework to describe the output of a query. Feb 25, 2019
@SparkQA commented Feb 25, 2019

Test build #102728 has finished for PR 23883 at commit c7e57e2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal (Contributor Author):

retest this please

@SparkQA commented Feb 25, 2019

Test build #102731 has finished for PR 23883 at commit c7e57e2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal (Contributor Author):

cc @cloud-fan @gatorsmile

* {{{
* DESCRIBE [EXTENDED|FORMATTED] table_name partitionSpec?;
* DESCRIBE QUERY <query>
* DESCRIBE <query>
Member:

DESCRIBE QUERY <query> and DESCRIBE <query> are equal to DESCRIBE [QUERY] <query>?

Contributor Author:

Yeah... would change.

* {{{
* DESCRIBE [EXTENDED|FORMATTED] table_name partitionSpec?;
* DESCRIBE QUERY <query>
* DESCRIBE <query>
* }}}
*/
case class DescribeTableCommand(
Member:

As this is named DescribeTableCommand, shall we have an individual command for describing a query?

Member:

Otherwise, shall we rename it to DescribeTableOrQueryCommand?

Contributor Author (@dilipbiswal, Feb 25, 2019):

@viirya We could. It's just that there are not that many changes. Let's see if others feel the same way; if so, I will change it. Renaming the command sounds good to me :-)

@maropu (Member) commented Feb 25, 2019

Thanks for the work! I have one question in advance; is any other database-like system supporting this syntax? Also, can you add an entry Query in https://github.com/apache/spark/blob/master/docs/sql-reserved-and-non-reserved-keywords.md and update the entries in TableIdentifierParserSuite?

@dilipbiswal (Contributor Author) commented Feb 25, 2019

@maropu Thanks. I will update the entries. I got this idea from DB2, which supports this syntax. Here is the sample output for your perusal.

db2 => describe select * from desc_tab;

 Column Information

 Number of columns: 2

 SQL type              Type length  Column name                     Name length
 --------------------  -----------  ------------------------------  -----------
 497   INTEGER                   4  C1                                        2
 449   VARCHAR                  10  C2                                        2


@gatorsmile (Member) commented Feb 25, 2019

Thanks! This Describe command is pretty useful to SQL users!

Without this command, our Spark users need to create a temp/persistent view first and then describe it.

DescribeTableCommand(
visitTableIdentifier(ctx.tableIdentifier),
DescribeTableOrQueryCommand(
Option(visitTableIdentifier(ctx.tableIdentifier)),
Member:

visitTableIdentifier can return null?

Contributor Author:

@maropu With the change, DescribeTableOrQueryCommand takes an optional table identifier. That's why I wrap it under an Option.

Member:

oh, I see.

DescribeTableOrQueryCommand(None,
Option(visitQuery(ctx.query())),
Map.empty[String, String], false)
}


* {{{
* DESCRIBE [EXTENDED|FORMATTED] table_name partitionSpec?;
* DESCRIBE [QUERY] <query>
Member:

nit: DESCRIBE [QUERY] query? We need <>?

Contributor Author:

If it's under the brackets, it means optional, isn't it? Like in the previous example, EXTENDED and FORMATTED are optional?

Member:

Yea, [QUERY] means optional. You mean <query> means optional, too?

Contributor Author (@dilipbiswal, Feb 26, 2019):

@maropu Sorry, I misinterpreted. Here I am just trying to say that they have to specify a query; it's not optional. How should I say it? Does the following look better?

DESCRIBE [QUERY] query_statement

Member:

yea, yes.

throw new AnalysisException("Describe command is not supported for insert statements.")
case u @ Union(children) if children.forall(_.isInstanceOf[InsertIntoTable]) =>
throw new AnalysisException("Describe command is not supported for insert statements.")
case u: UnresolvedRelation =>
Member:

Can we put the two matches above together?

        case p if p.collectFirst { case _: InsertIntoTable | _: InsertIntoDir => true }.isDefined =>

Contributor Author (@dilipbiswal, Feb 26, 2019):

@maropu This is fantastic :-). Thanks, let me try it.

case u @ Union(children) if children.forall(_.isInstanceOf[InsertIntoTable]) =>
throw new AnalysisException("Describe command is not supported for insert statements.")
case u: UnresolvedRelation =>
describeTable(sparkSession, u.tableIdentifier, partitionSpec, isExtended, result)
Member (@maropu, Feb 26, 2019):

Can you add a test for this match in describe-query.sql? Also, it might be better to add comments here about which query does this match cover.

Contributor Author:

@maropu This should already be covered by the describe-table tests. Basically, a simple describe table mytable will go through it. That's because it's matched by queryPrimary in sqlBase.g4. I will add comments.
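For instance (an illustrative sketch with a hypothetical table name), a plain table describe exercises this same branch, because the bare identifier also parses as a query:

```sql
-- Matched via queryPrimary, so it reaches the UnresolvedRelation case
-- of the describe command rather than a dedicated table-describe rule.
DESC mytable;
```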

@maropu (Member) left a review comment:

I left some minor comments. Looks good except for those.

@@ -784,7 +785,7 @@ ansiNonReserved
| RESET | RESTRICT | REVOKE | RLIKE | ROLE | ROLES | ROLLBACK | ROLLUP | ROW | ROWS | SCHEMA | SEPARATED | SERDE
| SERDEPROPERTIES | SET | SETS | SHOW | SKEWED | SORT | SORTED | START | STATISTICS | STORED | STRATIFY | STRUCT
| TABLES | TABLESAMPLE | TBLPROPERTIES | TEMPORARY | TERMINATED | TOUCH | TRANSACTION | TRANSACTIONS | TRANSFORM
| TRUE | TRUNCATE | UNARCHIVE | UNBOUNDED | UNCACHE | UNLOCK | UNSET | USE | VALUES | VIEW | WINDOW
| TRUE | TRUNCATE | UNARCHIVE | UNBOUNDED | UNCACHE | UNLOCK | UNSET | USE | VALUES | VIEW | WINDOW | QUERY
Contributor:

let's keep it in alphabetical order

* }}}
*/
case class DescribeTableCommand(
table: TableIdentifier,
case class DescribeTableOrQueryCommand(
Contributor:

shall we create 2 commands?

@SparkQA commented Feb 26, 2019

Test build #102771 has finished for PR 23883 at commit db20bfb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -48,7 +48,7 @@ class TableIdentifierParserSuite extends SparkFunSuite {
"false", "fetch", "float", "for", "grant", "group", "grouping", "import", "in",
"insert", "int", "into", "is", "pivot", "lateral", "like", "local", "none", "null",
"of", "order", "out", "outer", "partition", "percent", "procedure", "range", "reads", "revoke",
"rollup", "row", "rows", "set", "smallint", "table", "timestamp", "to", "trigger",
"rollup", "row", "rows", "set", "smallint", "table", "query", "timestamp", "to", "trigger",
Contributor:

super nit: to keep alphabetical order, putting it between "procedure" and "range" may be better :)

@SparkQA commented Feb 27, 2019

Test build #102804 has finished for PR 23883 at commit b6a13da.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal (Contributor Author):

@cloud-fan I have addressed your comments in my last check-in, FYI.

val result = new ArrayBuffer[Row]
query match {
case p if p.collectFirst { case _: InsertIntoTable | _: InsertIntoDir => true }.isDefined =>
throw new AnalysisException("Describe command is not supported for insert statements.")
Contributor:

is this behavior from DB2? sql("INSERT ...").printSchema works and prints empty schema, shall we make the behavior consistent?

Contributor Author (@dilipbiswal, Feb 27, 2019):

@cloud-fan In DB2, we get an error when we describe on "insert"

db2 => describe insert into t1 values (1,1)
SQL0104N  An unexpected token "insert" was found following "DESCRIBE".

Personally, I feel it's better to let users know that this command is not meant for "insert" than to return an empty schema. But if you feel otherwise, I will change it. Please let me know. You have a point about printSchema, though :-)

Contributor:

I don't have a strong preference here, but DB2 does this in the parser; I'm wondering if we should do this in the parser as well.

instead of (DESC | DESCRIBE) QUERY? query, how about (DESC | DESCRIBE) QUERY? querySpecification?

Contributor Author:

@cloud-fan The only thing is we may leave out a few statements, like VALUES (...), WITH ..., etc., if we restrict it to querySpecification. Is it okay if we create a parser rule like the following?

queryDescribe
: stmtType1
| stmtType2
;

Contributor:

SGTM

Contributor Author:

@cloud-fan I tried to add a rule, but it actually gets complicated as the rules are intertwined. In order to do this, I have to create multiple parallel rules for describe-query and then implement the visit methods. So let's throw an analysis exception for now?

Contributor:

Then how about going with querySpecification, to support only the SELECT ... query at the beginning? We can support more in the future.

Contributor Author:

@cloud-fan OK, let me give that a try.

@SparkQA commented Feb 27, 2019

Test build #102815 has finished for PR 23883 at commit e742ae9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class DescribeTableCommand(

@SparkQA commented Feb 28, 2019

Test build #102853 has finished for PR 23883 at commit cfbe8db.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class DescribeCommandBase extends RunnableCommand

@cloud-fan (Contributor):

retest this please

@SparkQA commented Feb 28, 2019

Test build #102858 has finished for PR 23883 at commit cfbe8db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class DescribeCommandBase extends RunnableCommand

@SparkQA commented Mar 1, 2019

Test build #102873 has finished for PR 23883 at commit c519fc7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -117,6 +117,11 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging
}
}

override def visitDescQuery(ctx: DescQueryContext): LogicalPlan = withOrigin(ctx) {
plan(ctx.queryTerm).
optionalMap(ctx.queryOrganization)(withQueryResultClauses)
Member:

nit:

  override def visitDescQuery(ctx: DescQueryContext): LogicalPlan = withOrigin(ctx) {
    plan(ctx.queryTerm).optionalMap(ctx.queryOrganization)(withQueryResultClauses)
  }

@@ -22,7 +22,7 @@ import java.sql.{Date, Timestamp}

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.util.{DateFormatter, DateTimeUtils, TimestampFormatter}
import org.apache.spark.sql.execution.command.{DescribeTableCommand, ExecutedCommandExec, ShowTablesCommand}
import org.apache.spark.sql.execution.command._
Member:

import org.apache.spark.sql.execution.command.{DescribeCommandBase, ExecutedCommandExec, ShowTablesCommand}?

@@ -516,6 +505,35 @@ case class DescribeTableCommand(
new MetadataBuilder().putString("comment", "comment of the column").build())()
)

def describeSchema(
Member:

How about this?


  protected def describeSchema(schema: StructType, header: Boolean): Seq[Row] = {
    val buffer = new ArrayBuffer[Row]
    if (header) {
      buffer += Row(s"# ${output.head.name}", output(1).name, output(2).name)
    }
    schema.foreach { column =>
      buffer += Row(column.name, column.dataType.simpleString, column.getComment().orNull)
    }
    buffer
  }

Contributor Author:

@maropu describeSchema has two callers: one from the describeTable code path and one from describeQuery. The other caller passes in the row buffer.

Member (@maropu, Mar 1, 2019):

Ah, I see. Can you add protected, anyway?

@@ -255,6 +256,10 @@ query
: ctes? queryNoWith
;

descQuery
Contributor:

nit: queryToDesc

@cloud-fan (Contributor):

LGTM

}
}
/**
* Command can of the following form.
Member:

can of the -> can be the?

Otherwise, shall we reuse the previous description, Command that looks like?

}
}
/**
* Command can of the following form.
Member:

ditto.



-- Error cases.
DESC WITH s AS (SELECT 'hello' as col1) SELECT * FROM s;
Member:

Hi, @dilipbiswal .
Is this the only exception case in SELECT syntax? Do we have a plan to support this SELECT query with WITH?

Contributor Author:

@dongjoon-hyun Yeah, Wenchen suggested that we start with simple SELECTs and then improve on it. I am planning to look into CTEs next.
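As a sketch (column value hypothetical), a CTE is rejected for now while the equivalent plain SELECT can be described directly:

```sql
-- Rejected for now: CTE syntax is not yet supported by DESC.
DESC WITH s AS (SELECT 'hello' AS col1) SELECT * FROM s;

-- Works today: describe the inner SELECT instead.
DESC SELECT 'hello' AS col1;
```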

Member:

Thanks!

/**
* Command can of the following form.
* {{{
* DESCRIBE [QUERY] query_statement
Member:

query_statement -> select_statement?

Also, if possible, let's add some comment below line 627 that we don't support CTE syntax yet.

Contributor Author:

@dongjoon-hyun Now we support statements of the following forms:

  1. SELECT
  2. SELECTs wrapped under a set operator (UNION/INTERSECT etc.)
  3. VALUES (...)
  4. TABLE table_name

That's why I didn't use select_statement. Any other name you can think of?

Member:

I got it. BTW, TABLE and VALUES don't look like query statements. What about enumerating them all clearly?
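For illustration (table and column names hypothetical), the four supported forms might look like:

```sql
DESC QUERY SELECT key, val FROM t;              -- 1. SELECT
DESC QUERY SELECT 1 UNION ALL SELECT 2;         -- 2. set operator over SELECTs
DESC QUERY VALUES (1, 'a') AS tab(c1, c2);      -- 3. VALUES
DESC QUERY TABLE t;                             -- 4. TABLE
```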

-- Test tables
CREATE table desc_temp1 (key int COMMENT 'column_comment', val string) USING PARQUET;
CREATE table desc_temp2 (key int, val string) USING PARQUET;
CREATE table desc_temp3 (key int, val string) USING PARQUET;
Member:

Shall we remove this 3rd CREATE table desc_temp3? That is only used in the multi-insert test case, FROM ... INSERT ... INSERT. And we can use something like the following.

FROM desc_temp1 a
INSERT INTO desc_temp1 SELECT *
INSERT INTO desc_temp2 SELECT *;

DESC SELECT key, COUNT(*) as count FROM desc_temp1 group by key;
DESC SELECT 10.00D as col1;
DESC QUERY SELECT key FROM desc_temp1 UNION ALL select CAST(1 AS DOUBLE);
DESC QUERY VALUES(1.00D, 'hello') as tab1(col1, col2);
Member (@dongjoon-hyun, Mar 1, 2019):

Shall we add DESC QUERY FROM desc_temp1 a SELECT * at line 13?

@SparkQA commented Mar 1, 2019

Test build #102898 has finished for PR 23883 at commit ee858a2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

-- cleanup
DROP TABLE desc_temp1;
DROP TABLE desc_temp2;
DROP TABLE desc_temp3;
Member:

We need to remove this line, too. And, could you add a new line at the end? GitHub shows a warning here.

struct<>
-- !query 16 output
org.apache.spark.sql.AnalysisException
Table or view not found: desc_temp3;
Member:

To remove this Table or view not found exception, please don't forget to regenerate this~

@SparkQA commented Mar 2, 2019

Test build #102927 has finished for PR 23883 at commit 7556506.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 2, 2019

Test build #102930 has finished for PR 23883 at commit 7d7d7d2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* 1. SELECT statements
* 2. SELECT statements inside set operators (UNION, INTERSECT etc)
* 3. VALUES statement.
* 4. TABLE statement. Example : TABLE table_name
Contributor:

do you mean we support DESCRIBE QUERY TABLE t?

Contributor:

nvm, just checked and we do support it...

@cloud-fan (Contributor):

thanks, merging to master!

@cloud-fan cloud-fan closed this in 4a486d6 Mar 2, 2019
mccheah pushed a commit to palantir/spark that referenced this pull request May 15, 2019

Closes apache#23883 from dilipbiswal/dkb_describe_query.

Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>