Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-6117] [SQL] add describe function to DataFrame for summary statis... #5073

Closed
wants to merge 3 commits into from

Conversation

azagrebin
Copy link
Contributor

Please review my solution for SPARK-6117

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@rxin
Copy link
Contributor

rxin commented Mar 17, 2015

Thanks for submitting this. Can you try to simplify the implementation? In particular, I think you can build this without being so general. It would also be great to build on GroupedData.agg functions using expressions, rather than just strings.

@rxin
Copy link
Contributor

rxin commented Mar 17, 2015

(It would be great to do away with too many levels of nested functions, and the foldLeft there.)

@azagrebin
Copy link
Contributor Author

@rxin Thanks for comments, I have tried to simplify, get rid of nested functions, foldLeft and use expressions to describe statistics.

Row("mean", null, null) ::
Row("stddev", null, null) ::
Row("min", null, null) ::
Row("max", null, null) :: Nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This TestData class is really mostly a hold over from when it was much harder to define dataframes inline with test cases. Now that you can just call .toDF on Seq[Tuple], I'd suggest we colocate the answers with the test case.

@marmbrus
Copy link
Contributor

ok to test

@SparkQA
Copy link

SparkQA commented Mar 18, 2015

Test build #28758 has started for PR 5073 at commit 6111f3c.

  • This patch merges cleanly.


def aggCol(name: String = "") = s"'$name' as summary"
val statistics = List[(String, Expression => Expression)](
"count" -> (expr => Count(expr)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you mind getting rid of the vertical alignment (i.e. don't align the "->")?

@SparkQA
Copy link

SparkQA commented Mar 18, 2015

Test build #28758 has finished for PR 5073 at commit 6111f3c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28758/
Test PASSed.

…esulting DF, colocate test data with test case
@SparkQA
Copy link

SparkQA commented Mar 18, 2015

Test build #28790 has started for PR 5073 at commit f9056ac.

  • This patch merges cleanly.

@azagrebin
Copy link
Contributor Author

I have done one aggregation, splitten it locally into resulting DataFrame supplemented with schema and statistic names. I have also created nested version of standard deviation expression (stddev) which might be not optimal but can be replaced later.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28789/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Mar 18, 2015

Test build #28790 has finished for PR 5073 at commit f9056ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28790/
Test PASSed.

@rxin
Copy link
Contributor

rxin commented Mar 26, 2015

Thanks. I'm going to merge this, but I will fix some of the minor stuff in a separate PR.

@asfgit asfgit closed this in 5bbcd13 Mar 26, 2015
@rxin
Copy link
Contributor

rxin commented Mar 26, 2015

FYI the update is here: #5201

asfgit pushed a commit that referenced this pull request Mar 26, 2015
…tis...

Please review my solution for SPARK-6117

Author: azagrebin <azagrebin@gmail.com>

Closes #5073 from azagrebin/SPARK-6117 and squashes the following commits:

f9056ac [azagrebin] [SPARK-6117] [SQL] create one aggregation and split it locally into resulting DF, colocate test data with test case
ddb3950 [azagrebin] [SPARK-6117] [SQL] simplify implementation, add test for DF without numeric columns
9daf31e [azagrebin] [SPARK-6117] [SQL] add describe function to DataFrame for summary statistics

(cherry picked from commit 5bbcd13)
Signed-off-by: Reynold Xin <rxin@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants