-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-6117] [SQL] add describe function to DataFrame for summary statis... #5073
Conversation
Can one of the admins verify this patch? |
Thanks for submitting this. Can you try to simplify the implementation? In particular, I think you can build this without being so general. It would also be great to build on GroupedData.agg functions using expressions, rather than just strings. |
(It would be great to do away with too many levels of nested functions, and the foldLeft there.) |
@rxin Thanks for comments, I have tried to simplify, get rid of nested functions, foldLeft and use expressions to describe statistics. |
Row("mean", null, null) :: | ||
Row("stddev", null, null) :: | ||
Row("min", null, null) :: | ||
Row("max", null, null) :: Nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This TestData
class is really mostly a hold over from when it was much harder to define dataframes inline with test cases. Now that you can just call .toDF on Seq[Tuple]
, I'd suggest we colocate the answers with the test case.
ok to test |
Test build #28758 has started for PR 5073 at commit
|
|
||
def aggCol(name: String = "") = s"'$name' as summary" | ||
val statistics = List[(String, Expression => Expression)]( | ||
"count" -> (expr => Count(expr)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you mind getting rid of the vertical alignment (i.e. don't align the "->")?
Test build #28758 has finished for PR 5073 at commit
|
Test PASSed. |
…esulting DF, colocate test data with test case
Test build #28790 has started for PR 5073 at commit
|
I have done one aggregation, splitten it locally into resulting DataFrame supplemented with schema and statistic names. I have also created nested version of standard deviation expression (stddev) which might be not optimal but can be replaced later. |
Test FAILed. |
Test build #28790 has finished for PR 5073 at commit
|
Test PASSed. |
Thanks. I'm going to merge this, but I will fix some of the minor stuff in a separate PR. |
FYI the update is here: #5201 |
…tis... Please review my solution for SPARK-6117 Author: azagrebin <azagrebin@gmail.com> Closes #5073 from azagrebin/SPARK-6117 and squashes the following commits: f9056ac [azagrebin] [SPARK-6117] [SQL] create one aggregation and split it locally into resulting DF, colocate test data with test case ddb3950 [azagrebin] [SPARK-6117] [SQL] simplify implementation, add test for DF without numeric columns 9daf31e [azagrebin] [SPARK-6117] [SQL] add describe function to DataFrame for summary statistics (cherry picked from commit 5bbcd13) Signed-off-by: Reynold Xin <rxin@databricks.com>
Please review my solution for SPARK-6117