
[SPARK-31735][SQL] Include date/timestamp in the summary report #28554

Closed
wants to merge 2 commits into from

Conversation

Fokko
Contributor

@Fokko Fokko commented May 16, 2020

For example, date columns are currently missing from the summary output:

from datetime import datetime, timedelta, timezone
from pyspark.sql import types as T
from pyspark.sql import Row
from pyspark.sql import functions as F

START = datetime(2014, 1, 1, tzinfo=timezone.utc)

n_days = 22

date_range = [Row(date=(START + timedelta(days=n))) for n in range(0, n_days)]

schema = T.StructType([T.StructField(name="date", dataType=T.DateType(), nullable=False)])

rdd = spark.sparkContext.parallelize(date_range)

df = spark.createDataFrame(data=rdd, schema=schema)

df.agg(F.max("date")).show()

df.summary().show()
+-------+
|summary|
+-------+
|  count|
|   mean|
| stddev|
|    min|
|    25%|
|    50%|
|    75%|
|    max|
+-------+

Even something less common, such as arrays, is sortable:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.5
      /_/
         
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_172)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 

scala> val singersDF = Seq(
     |   ("beatles", "help|hey jude"),
     |   ("romeo", "eres mia")
     | ).toDF("name", "hit_songs")
singersDF: org.apache.spark.sql.DataFrame = [name: string, hit_songs: string]

scala> 

scala> val actualDF = singersDF.withColumn(
     |   "hit_songs",
     |   split(col("hit_songs"), "\\|")
     | )
actualDF: org.apache.spark.sql.DataFrame = [name: string, hit_songs: array<string>]

scala> actualDF.show()
+-------+----------------+
|   name|       hit_songs|
+-------+----------------+
|beatles|[help, hey jude]|
|  romeo|      [eres mia]|
+-------+----------------+


scala> df.max("hit_songs")
<console>:24: error: not found: value df
       df.max("hit_songs")
       ^

scala> actualDF.max("hit_songs")
<console>:26: error: value max is not a member of org.apache.spark.sql.DataFrame
       actualDF.max("hit_songs")
                ^

scala> actualDF.agg(min("hit_songs"), max("hit_songs"))
res3: org.apache.spark.sql.DataFrame = [min(hit_songs): array<string>, max(hit_songs): array<string>]

scala> actualDF.agg(min("hit_songs"), max("hit_songs")).show()
+--------------+----------------+                                               
|min(hit_songs)|  max(hit_songs)|
+--------------+----------------+
|    [eres mia]|[help, hey jude]|
+--------------+----------------+

Signed-off-by: Fokko Driesprong <fokko@apache.org>

What changes were proposed in this pull request?

Include date and timestamp columns when selecting the columns to compute summary statistics on, rather than filtering them out.

Why are the changes needed?

Something as simple as a DateType column does not show up in the summary output.

Does this PR introduce any user-facing change?

Possibly: the output of summary() will change for users whose DataFrames contain columns (such as dates) that are currently excluded from the summary.

How was this patch tested?

Existing tests.

@dongjoon-hyun
Member

ok to test

@@ -264,7 +264,6 @@ object StatFunctions extends Logging {
 }

 val selectedCols = ds.logicalPlan.output
-  .filter(a => a.dataType.isInstanceOf[NumericType] || a.dataType.isInstanceOf[StringType])
Member

@dongjoon-hyun dongjoon-hyun May 16, 2020

Can we keep this whitelist style instead of allowing all, @Fokko ? You can add the missing type here.

Contributor Author

Sure, I've just updated the list.

Member

Thank you, @Fokko . Could you update the PR title and description accordingly?

Contributor Author

I've updated both the PR and the commit; please let me know if this works for you.

Member

@dongjoon-hyun dongjoon-hyun May 17, 2020

Shall we remove "Even something less common, such as arrays, is sortable:" and its example from the PR description?

@SparkQA

SparkQA commented May 17, 2020

Test build #122735 has finished for PR 28554 at commit 4fcbc6d.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 17, 2020

Test build #122756 has finished for PR 28554 at commit 51439b7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

And, if you don't mind, could you add a UT for this to prevent a future regression?
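For illustration, a minimal sketch of what such a regression test could look like; the test name, the reliance on the suite's testImplicits/checkAnswer helpers, and the assumption that summary() renders the date values as "yyyy-MM-dd" strings are guesses, not part of this PR:

// Hypothetical test, e.g. for DataFrameSuite; assumes testImplicits._ and
// org.apache.spark.sql.Row are in scope as in the existing suites.
test("SPARK-31735: summary should include date columns") {
  import java.sql.Date
  val df = Seq(Date.valueOf("2014-01-01"), Date.valueOf("2014-01-22")).toDF("date")
  checkAnswer(
    df.summary("min", "max").select("summary", "date"),
    Seq(Row("min", "2014-01-01"), Row("max", "2014-01-22")))
}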

@Fokko Fokko changed the title from [SPARK-31735][CORE] Include all columns in the summary report to [SPARK-31735][CORE] Include date/timestamp in the summary report on May 17, 2020
@SparkQA

SparkQA commented May 18, 2020

Test build #122760 has finished for PR 28554 at commit b34ec2d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

-.filter(a => a.dataType.isInstanceOf[NumericType] || a.dataType.isInstanceOf[StringType])
+.filter(a => a.dataType.isInstanceOf[NumericType]
+  || a.dataType.isInstanceOf[StringType]
+  || a.dataType.isInstanceOf[DateType]
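
The diff snippet above is cut off after the DateType line; going by the PR title, the full whitelist presumably ends up looking roughly like the sketch below (the TimestampType line is inferred from the title rather than visible in the snippet):

val selectedCols = ds.logicalPlan.output
  .filter(a => a.dataType.isInstanceOf[NumericType]
    || a.dataType.isInstanceOf[StringType]
    || a.dataType.isInstanceOf[DateType]
    || a.dataType.isInstanceOf[TimestampType])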
Member

Yes, let's write a UT. Does it work, BTW? It looks like at least mean won't work for the date type here.

Contributor Author

I'm working on getting the test suite running on my machine, so I need some time. I don't think the mean will be the issue; it is just the element in the middle of the sorted collection. However, the stddev will be tricky; for the StringType this is just null.
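
As a quick spark-shell check (not part of the PR): for a string column today the stddev row is simply null, which is presumably the behaviour a date column would need to mirror:

import spark.implicits._

// In the current output, stddev of a non-numeric string column is shown as null
Seq("a", "b", "c").toDF("s").summary("stddev").show()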

Fokko added 2 commits June 7, 2020 09:19
Currently dates are missing from the export (same reproduction as in the PR description above); it would be nice to include these as well.

Signed-off-by: Fokko Driesprong <fokko@apache.org>
@Fokko
Contributor Author

Fokko commented Jun 7, 2020

I finally have some time to pick this up. Looks like there is some funky behavior: doing an average on a string just returns null, while doing it on a Date throws an exception:

MacBook-Pro-van-Fokko:spark fokkodriesprong$ spark-shell
20/06/07 09:51:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.1.113:4040
Spark context available as 'sc' (master = local[*], app id = local-1591516331348).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.5
      /_/
         
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_172)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import java.sql.Date
import java.sql.Date

scala> import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.DataFrame

scala> val person2: DataFrame = Seq(
     |     ("Bob", 16, 176, new Date(2020, 1, 1)),
     |     ("Alice", 32, 164, new Date(2020, 1, 5)),
     |     ("David", 60, 192, new Date(2020, 1, 19)),
     |     ("Amy", 24, 180, new Date(2020, 1, 25))).toDF("name", "age", "height", "birthday")
warning: there were four deprecation warnings; re-run with -deprecation for details
person2: org.apache.spark.sql.DataFrame = [name: string, age: int ... 2 more fields]

scala> person2.select("name").agg(avg('name)).show()
+---------+
|avg(name)|
+---------+
|     null|
+---------+


scala> person2.select("name").agg(avg('birthday)).show()
org.apache.spark.sql.AnalysisException: cannot resolve '`birthday`' given input columns: [name];;
'Aggregate [avg('birthday) AS avg(birthday)#38]
+- Project [name#9]
   +- Project [_1#4 AS name#9, _2#5 AS age#10, _3#6 AS height#11, _4#7 AS birthday#12]
      +- LocalRelation [_1#4, _2#5, _3#6, _4#7]

  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:111)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:108)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:280)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:280)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:279)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:277)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:277)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:328)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:326)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:277)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:277)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:277)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:328)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:326)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:277)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:277)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:277)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:328)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:326)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:277)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:93)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:93)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:105)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:105)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:104)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:116)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$2.apply(QueryPlan.scala:121)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:296)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:121)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:126)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:126)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:93)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:108)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:86)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:86)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:95)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:108)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:105)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
  at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:58)
  at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:56)
  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:78)
  at org.apache.spark.sql.RelationalGroupedDataset.toDF(RelationalGroupedDataset.scala:65)
  at org.apache.spark.sql.RelationalGroupedDataset.agg(RelationalGroupedDataset.scala:224)
  at org.apache.spark.sql.Dataset.agg(Dataset.scala:1804)
  ... 49 elided

@SparkQA

SparkQA commented Jun 7, 2020

Test build #123602 has finished for PR 28554 at commit 51c3df1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987 jiangxb1987 changed the title from [SPARK-31735][CORE] Include date/timestamp in the summary report to [SPARK-31735][SQL] Include date/timestamp in the summary report on Jun 8, 2020
@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Dec 27, 2020
@github-actions github-actions bot closed this Dec 28, 2020