-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-31735][SQL] Include date/timestamp in the summary report #28554
Conversation
ok to test |
@@ -264,7 +264,6 @@ object StatFunctions extends Logging { | |||
} | |||
|
|||
val selectedCols = ds.logicalPlan.output | |||
.filter(a => a.dataType.isInstanceOf[NumericType] || a.dataType.isInstanceOf[StringType]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we keep this whitelist style instead of allowing all, @Fokko ? You can add the missing type here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I've just updated the list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @Fokko . Could you update the PR title and description accordingly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated both the PR and commit, please let me know if this works for you
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we remove Even something less common such as arrays are sortable:
and its example in the PR description?
Test build #122735 has finished for PR 28554 at commit
|
Test build #122756 has finished for PR 28554 at commit
|
And, if you don't mind, could you add a UT for this to prevent a future regression? |
Test build #122760 has finished for PR 28554 at commit
|
.filter(a => a.dataType.isInstanceOf[NumericType] || a.dataType.isInstanceOf[StringType]) | ||
.filter(a => a.dataType.isInstanceOf[NumericType] | ||
|| a.dataType.isInstanceOf[StringType] | ||
|| a.dataType.isInstanceOf[DateType] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, let's write a UT. Does it work BTW? Looks at least mean
and date type won't work here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm working on getting the test suite running on my machine, so I need some time. I don't think that the mean
will be the issue, this is just the element in the middle of the sorted collection, however, the stddev
will be tricky. For the StringType this is just null
.
Currently dates are missing from the export: from datetime import datetime, timedelta, timezone from pyspark.sql import types as T from pyspark.sql import Row from pyspark.sql import functions as F START = datetime(2014, 1, 1, tzinfo=timezone.utc) n_days = 22 date_range = [Row(date=(START + timedelta(days=n))) for n in range(0, n_days)] schema = T.StructType([T.StructField(name="date", dataType=T.DateType(), nullable=False)]) rdd = spark.sparkContext.parallelize(date_range) df = spark.createDataFrame(data=rdd, schema=schema) df.agg(F.max("date")).show() df.summary().show() +-------+ |summary| +-------+ | count| | mean| | stddev| | min| | 25%| | 50%| | 75%| | max| +-------+ Would be nice to include these as well Signed-off-by: Fokko Driesprong <fokko@apache.org>
I finally have some to pick this up. Looks like there is some funky behavior. Doing an average on a string just return
|
Test build #123602 has finished for PR 28554 at commit
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
For example, dates are missing from the export:
Even something less common such as arrays are sortable:
Signed-off-by: Fokko Driesprong fokko@apache.org
What changes were proposed in this pull request?
Not filtering the columns to compute statistics on.
Why are the changes needed?
Something as simple as DateTypes are not showing up in the output.
Does this PR introduce any user-facing change?
Might be, as their output will change of the summary if they have columns that aren't part of the summary right now.
How was this patch tested?
Existing tests.