[SPARK-6117] [SQL] add describe function to DataFrame for summary statis... #5073

azagrebin · 2015-03-17T18:53:52Z

Please review my solution for SPARK-6117

AmplabJenkins · 2015-03-17T18:57:12Z

Can one of the admins verify this patch?

rxin · 2015-03-17T21:44:47Z

Thanks for submitting this. Can you try to simplify the implementation? In particular, I think you can build this without being so general. It would also be great to build on GroupedData.agg functions using expressions, rather than just strings.

rxin · 2015-03-17T21:45:14Z

(It would be great to do away with too many levels of nested functions, and the foldLeft there.)

azagrebin · 2015-03-18T00:28:37Z

@rxin Thanks for the comments. I have tried to simplify the implementation: I got rid of the nested functions and the foldLeft, and now use expressions to describe the statistics.

marmbrus · 2015-03-18T01:13:32Z

sql/core/src/test/scala/org/apache/spark/sql/TestData.scala

+    Row("mean",    null, null) ::
+    Row("stddev",  null, null) ::
+    Row("min",     null, null) ::
+    Row("max",     null, null) :: Nil


This TestData class is mostly a holdover from when it was much harder to define DataFrames inline with test cases. Now that you can just call .toDF on Seq[Tuple], I'd suggest we colocate the answers with the test case.

marmbrus · 2015-03-18T01:13:43Z

ok to test

SparkQA · 2015-03-18T01:18:13Z

Test build #28758 has started for PR 5073 at commit 6111f3c.

This patch merges cleanly.

rxin · 2015-03-18T01:29:12Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala

+
+    def aggCol(name: String = "") = s"'$name' as summary"
+    val statistics = List[(String, Expression => Expression)](
+      "count"  -> (expr => Count(expr)),


Do you mind getting rid of the vertical alignment (i.e. don't align the "->")?

SparkQA · 2015-03-18T02:36:17Z

Test build #28758 has finished for PR 5073 at commit 6111f3c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-03-18T02:36:20Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28758/
Test PASSed.

SparkQA · 2015-03-18T09:53:05Z

Test build #28790 has started for PR 5073 at commit f9056ac.

This patch merges cleanly.

azagrebin · 2015-03-18T09:54:52Z

I now run a single aggregation and split its result locally into the resulting DataFrame, supplemented with the schema and the statistic names. I have also created a nested version of the standard deviation expression (stddev), which might not be optimal but can be replaced later.
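As a rough illustration of the shape described in this comment, here is a Spark-free sketch in plain Scala. It is not the code from the patch: `DescribeSketch`, `aggregateAll`, and `describe` are hypothetical names, columns are modelled as plain `Seq[Double]`, and the stddev formula (square root of the mean of squares minus the squared mean) is only one plausible reading of "nested" here. The sketch assumes at least one column and that every column is non-empty.

```scala
// Hypothetical, Spark-free sketch; none of these names come from the patch.
object DescribeSketch {
  // Each statistic is a name paired with a function over one column's values.
  val statistics: Seq[(String, Seq[Double] => Double)] = Seq(
    ("count", (xs: Seq[Double]) => xs.size.toDouble),
    ("mean", (xs: Seq[Double]) => xs.sum / xs.size),
    // Assumed formula: population stddev built from nested means.
    ("stddev", (xs: Seq[Double]) => {
      val mean = xs.sum / xs.size
      math.sqrt(xs.map(x => x * x).sum / xs.size - mean * mean)
    }),
    ("min", (xs: Seq[Double]) => xs.min),
    ("max", (xs: Seq[Double]) => xs.max)
  )

  // Stand-in for the single aggregation: one flat result holding every
  // statistic for every column, ordered statistic by statistic.
  def aggregateAll(columns: Seq[Seq[Double]]): Seq[Double] =
    statistics.flatMap { case (_, f) => columns.map(f) }

  // Split the flat result locally into one row per statistic,
  // each row labelled with the statistic's name.
  def describe(columns: Seq[Seq[Double]]): Seq[(String, Seq[Double])] = {
    val flat = aggregateAll(columns)
    statistics.map(_._1).zip(flat.grouped(columns.size).toSeq)
  }
}
```

Calling `DescribeSketch.describe(Seq(Seq(1.0, 2.0, 3.0), Seq(10.0, 10.0)))` yields five labelled rows (count, mean, stddev, min, max), each holding one value per input column. Handling of empty or non-numeric input is left out of the sketch.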

AmplabJenkins · 2015-03-18T10:02:22Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28789/
Test FAILed.

SparkQA · 2015-03-18T11:14:10Z

Test build #28790 has finished for PR 5073 at commit f9056ac.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-03-18T11:14:13Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28790/
Test PASSed.

rxin · 2015-03-26T07:24:49Z

Thanks. I'm going to merge this, but I will fix some of the minor stuff in a separate PR.

rxin · 2015-03-26T07:56:06Z

FYI the update is here: #5201

…tis...

Please review my solution for SPARK-6117

Author: azagrebin <azagrebin@gmail.com>

Closes #5073 from azagrebin/SPARK-6117 and squashes the following commits:

f9056ac [azagrebin] [SPARK-6117] [SQL] create one aggregation and split it locally into resulting DF, colocate test data with test case
ddb3950 [azagrebin] [SPARK-6117] [SQL] simplify implementation, add test for DF without numeric columns
9daf31e [azagrebin] [SPARK-6117] [SQL] add describe function to DataFrame for summary statistics

(cherry picked from commit 5bbcd13)
Signed-off-by: Reynold Xin <rxin@databricks.com>

azagrebin force-pushed the SPARK-6117 branch from 6111f3c to ddb3950 on March 18, 2015 09:43

asfgit closed this in 5bbcd13 Mar 26, 2015

rxin mentioned this pull request Mar 26, 2015

[SPARK-6117] [SQL] Improvements to DataFrame.describe() #5201

Closed
