
[SPARK-31735][SQL] Include date/timestamp in the summary report #28554

Closed
wants to merge 2 commits into from

Conversation

Fokko
Contributor

@Fokko Fokko commented May 16, 2020

For example, date columns are currently missing from the summary output:

from datetime import datetime, timedelta, timezone
from pyspark.sql import types as T
from pyspark.sql import Row
from pyspark.sql import functions as F

START = datetime(2014, 1, 1, tzinfo=timezone.utc)

n_days = 22

date_range = [Row(date=(START + timedelta(days=n))) for n in range(0, n_days)]

schema = T.StructType([T.StructField(name="date", dataType=T.DateType(), nullable=False)])

rdd = spark.sparkContext.parallelize(date_range)

df = spark.createDataFrame(data=rdd, schema=schema)

df.agg(F.max("date")).show()

df.summary().show()
+-------+
|summary|
+-------+
|  count|
|   mean|
| stddev|
|    min|
|    25%|
|    50%|
|    75%|
|    max|
+-------+

Even something less common, such as arrays, is sortable:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.5
      /_/
         
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_172)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 

scala> val singersDF = Seq(
     |   ("beatles", "help|hey jude"),
     |   ("romeo", "eres mia")
     | ).toDF("name", "hit_songs")
singersDF: org.apache.spark.sql.DataFrame = [name: string, hit_songs: string]

scala> 

scala> val actualDF = singersDF.withColumn(
     |   "hit_songs",
     |   split(col("hit_songs"), "\\|")
     | )
actualDF: org.apache.spark.sql.DataFrame = [name: string, hit_songs: array<string>]

scala> actualDF.show()
+-------+----------------+
|   name|       hit_songs|
+-------+----------------+
|beatles|[help, hey jude]|
|  romeo|      [eres mia]|
+-------+----------------+


scala> df.max("hit_songs")
<console>:24: error: not found: value df
       df.max("hit_songs")
       ^

scala> actualDF.max("hit_songs")
<console>:26: error: value max is not a member of org.apache.spark.sql.DataFrame
       actualDF.max("hit_songs")
                ^

scala> actualDF.agg(min("hit_songs"), max("hit_songs"))
res3: org.apache.spark.sql.DataFrame = [min(hit_songs): array<string>, max(hit_songs): array<string>]

scala> actualDF.agg(min("hit_songs"), max("hit_songs")).show()
+--------------+----------------+                                               
|min(hit_songs)|  max(hit_songs)|
+--------------+----------------+
|    [eres mia]|[help, hey jude]|
+--------------+----------------+

Signed-off-by: Fokko Driesprong <fokko@apache.org>

What changes were proposed in this pull request?

Include date and timestamp columns when selecting the columns to compute summary statistics on, rather than filtering them out.

Why are the changes needed?

Something as simple as a DateType column does not show up in the summary output.

Does this PR introduce any user-facing change?

Possibly: the output of summary() will change for users whose DataFrames contain columns (such as dates) that are currently excluded from the summary.

How was this patch tested?

Existing tests.

@dongjoon-hyun
Member

ok to test

@@ -264,7 +264,6 @@ object StatFunctions extends Logging {
 }

 val selectedCols = ds.logicalPlan.output
-  .filter(a => a.dataType.isInstanceOf[NumericType] || a.dataType.isInstanceOf[StringType])
Member

@dongjoon-hyun dongjoon-hyun May 16, 2020

Can we keep this whitelist style instead of allowing all, @Fokko ? You can add the missing type here.

Contributor Author

Sure, I've just updated the list.

Member

Thank you, @Fokko . Could you update the PR title and description accordingly?

Contributor Author

I've updated both the PR and the commit; please let me know if this works for you.

Member

@dongjoon-hyun dongjoon-hyun May 17, 2020

Shall we remove "Even something less common, such as arrays, is sortable:" and its example from the PR description?

@SparkQA

SparkQA commented May 17, 2020

Test build #122735 has finished for PR 28554 at commit 4fcbc6d.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 17, 2020

Test build #122756 has finished for PR 28554 at commit 51439b7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

And, if you don't mind, could you add a UT for this to prevent a future regression?
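For illustration, a minimal sketch of what such a regression test could look like; the test name, the reliance on the suite's testImplicits/checkAnswer helpers, and the assumption that summary() renders the date values as "yyyy-MM-dd" strings are guesses, not part of this PR:

// Hypothetical test, e.g. for DataFrameSuite; assumes testImplicits._ and
// org.apache.spark.sql.Row are in scope as in the existing suites.
test("SPARK-31735: summary should include date columns") {
  import java.sql.Date
  val df = Seq(Date.valueOf("2014-01-01"), Date.valueOf("2014-01-22")).toDF("date")
  checkAnswer(
    df.summary("min", "max").select("summary", "date"),
    Seq(Row("min", "2014-01-01"), Row("max", "2014-01-22")))
}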

@Fokko Fokko changed the title from [SPARK-31735][CORE] Include all columns in the summary report to [SPARK-31735][CORE] Include date/timestamp in the summary report on May 17, 2020
@SparkQA

SparkQA commented May 18, 2020

Test build #122760 has finished for PR 28554 at commit b34ec2d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

-.filter(a => a.dataType.isInstanceOf[NumericType] || a.dataType.isInstanceOf[StringType])
+.filter(a => a.dataType.isInstanceOf[NumericType]
+  || a.dataType.isInstanceOf[StringType]
+  || a.dataType.isInstanceOf[DateType]
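
The diff snippet above is cut off after the DateType line; going by the PR title, the full whitelist presumably ends up looking roughly like the sketch below (the TimestampType line is inferred from the title rather than visible in the snippet):

val selectedCols = ds.logicalPlan.output
  .filter(a => a.dataType.isInstanceOf[NumericType]
    || a.dataType.isInstanceOf[StringType]
    || a.dataType.isInstanceOf[DateType]
    || a.dataType.isInstanceOf[TimestampType])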
Member

Yes, let's write a UT. Does it work, BTW? It looks like at least mean won't work for the date type here.

Contributor Author

I'm working on getting the test suite running on my machine, so I need some time. I don't think the mean will be the issue; it is just the element in the middle of the sorted collection. However, the stddev will be tricky; for the StringType this is just null.
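
As a quick spark-shell check (not part of the PR): for a string column today the stddev row is simply null, which is presumably the behaviour a date column would need to mirror:

import spark.implicits._

// In the current output, stddev of a non-numeric string column is shown as null
Seq("a", "b", "c").toDF("s").summary("stddev").show()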

Fokko added 2 commits June 7, 2020 09:19
Currently dates are missing from the export (same reproduction as in the PR description above); it would be nice to include these as well.

Signed-off-by: Fokko Driesprong <fokko@apache.org>
@Fokko
Contributor Author

Fokko commented Jun 7, 2020

I finally have some time to pick this up. Looks like there is some funky behavior: doing an average on a string just returns null, while doing it on a Date throws an exception:

MacBook-Pro-van-Fokko:spark fokkodriesprong$ spark-shell
20/06/07 09:51:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.1.113:4040
Spark context available as 'sc' (master = local[*], app id = local-1591516331348).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.5
      /_/
         
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_172)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import java.sql.Date
import java.sql.Date

scala> import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.DataFrame

scala> val person2: DataFrame = Seq(
     |     ("Bob", 16, 176, new Date(2020, 1, 1)),
     |     ("Alice", 32, 164, new Date(2020, 1, 5)),
     |     ("David", 60, 192, new Date(2020, 1, 19)),
     |     ("Amy", 24, 180, new Date(2020, 1, 25))).toDF("name", "age", "height", "birthday")
warning: there were four deprecation warnings; re-run with -deprecation for details
person2: org.apache.spark.sql.DataFrame = [name: string, age: int ... 2 more fields]

scala> person2.select("name").agg(avg('name)).show()
+---------+
|avg(name)|
+---------+
|     null|
+---------+


scala> person2.select("name").agg(avg('birthday)).show()
org.apache.spark.sql.AnalysisException: cannot resolve '`birthday`' given input columns: [name];;
'Aggregate [avg('birthday) AS avg(birthday)#38]
+- Project [name#9]
   +- Project [_1#4 AS name#9, _2#5 AS age#10, _3#6 AS height#11, _4#7 AS birthday#12]
      +- LocalRelation [_1#4, _2#5, _3#6, _4#7]

  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:111)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:108)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:280)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:280)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:279)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:277)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:277)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:328)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:326)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:277)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:277)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:277)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:328)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:326)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:277)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:277)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:277)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:328)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:326)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:277)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:93)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:93)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:105)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:105)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:104)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:116)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$2.apply(QueryPlan.scala:121)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:296)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:121)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:126)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:126)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:93)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:108)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:86)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:86)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:95)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:108)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:105)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
  at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:58)
  at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:56)
  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:78)
  at org.apache.spark.sql.RelationalGroupedDataset.toDF(RelationalGroupedDataset.scala:65)
  at org.apache.spark.sql.RelationalGroupedDataset.agg(RelationalGroupedDataset.scala:224)
  at org.apache.spark.sql.Dataset.agg(Dataset.scala:1804)
  ... 49 elided

@SparkQA

SparkQA commented Jun 7, 2020

Test build #123602 has finished for PR 28554 at commit 51c3df1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987 jiangxb1987 changed the title from [SPARK-31735][CORE] Include date/timestamp in the summary report to [SPARK-31735][SQL] Include date/timestamp in the summary report on Jun 8, 2020
@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Dec 27, 2020
@github-actions github-actions bot closed this Dec 28, 2020