[SPARK-24605][SQL] size(null) returns null instead of -1 #21598
Conversation
Test build #92128 has finished for PR 21598 at commit
```
@@ -1314,6 +1314,13 @@ object SQLConf {
      "Other column values can be ignored during parsing even if they are malformed.")
    .booleanConf
    .createWithDefault(true)

  val LEGACY_SIZE_OF_NULL = buildConf("spark.sql.legacy.sizeOfNull")
```
Is `spark.sql.function.sizeOfNull` better? BTW, when major releases happen in Spark, can we remove these kinds of backward-compatibility options? (just a question)
Sounds more consistent to have the `spark.sql.function` prefix.
Do we plan to fix other things accordingly too?
Yes, we would like to change/improve external behavior under flags in the `spark.sql.legacy.*` namespace. All those flags should be removed in the next major release, 3.0. Please share your thoughts about it.
Removing is fine. I am good with having such a prefix, but I wonder what's changed since #21427 (comment). It sounds basically similar to what I suggested. Where did that discussion happen?
I've created https://issues.apache.org/jira/browse/SPARK-24625 to track it.
It's similar to #21427 (comment), but as I replied in that PR, having a version-specific config is overkill, while `legacy` is simpler and more explicit that it will be removed in the future.
That's basically the same except that the postfix includes a specific version, which was just a rough idea.
```
@@ -1314,6 +1314,13 @@ object SQLConf {
      "Other column values can be ignored during parsing even if they are malformed.")
    .booleanConf
    .createWithDefault(true)

  val LEGACY_SIZE_OF_NULL = buildConf("spark.sql.legacy.sizeOfNull")
    .internal()
```
`internal`? Since this is a user-facing option, shouldn't it not be `internal`?
```scala
}

test("map size function - legacy") {
  withSQLConf("spark.sql.legacy.sizeOfNull" -> "true") {
```
`SQLConf.LEGACY_SIZE_OF_NULL.key -> "true"`
```scala
} else {
  child.dataType match {
    case _: ArrayType => defineCodeGen(ctx, ev, c => s"($c).numElements()")
    case _: MapType => defineCodeGen(ctx, ev, c => s"($c).numElements()")
```
Can we fold both cases?
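The fold being suggested relies on Scala's pattern alternatives. A minimal standalone sketch (the type names below are simplified stand-ins, not Spark's actual Catalyst classes):

```scala
sealed trait DataType
case object ArrayType extends DataType
case object MapType extends DataType
case object StringType extends DataType

// Two cases with identical bodies can be folded into one using `|`:
def numElementsSupported(dt: DataType): Boolean = dt match {
  case ArrayType | MapType => true // folded: previously two separate cases
  case _                   => false
}
```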
Shall we update the migration guide too?
It seems we don't have any behaviour change in the current PR (IIUC). But is it okay to set this option to true by default? Based on the three-valued logic, …
```scala
def this(child: Expression) =
  this(
    child,
    legacySizeOfNull = SQLConf.get.getConf(SQLConf.LEGACY_SIZE_OF_NULL))
```
Since we can now access the conf on the executor side too, do we need these changes? Can't we just get this value as a `val`?
If it works now, I will try to read the config on the executor's side. I was just struggling with an issue in tests for another PR where SQL configs were not propagated to executors. For example:
Lines 52 to 54 in a40ffc6:

```scala
val serializer = new JavaSerializer(new SparkConf()).newInstance
val resolver = ResolveTimeZone(new SQLConf)
resolver.resolveTimeZones(serializer.deserialize(serializer.serialize(expression)))
```
Also, I see some other places where configs are read by passing them to constructors:

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala, line 534 in b8f27ae:

`forceNullableSchema = SQLConf.get.getConf(SQLConf.FROM_JSON_FORCE_NULLABLE_SCHEMA))`
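The pattern under discussion, capturing the config value in an auxiliary constructor so the expression carries it along, can be sketched generically. The `Conf` and `Size` classes below are hypothetical stand-ins, not the real SQLConf or Catalyst expression:

```scala
// Simplified stand-in for SQLConf: a map of string settings.
class Conf(settings: Map[String, String]) {
  def getBoolean(key: String, default: Boolean): Boolean =
    settings.get(key).fold(default)(_.toBoolean)
}

// The expression stores the resolved flag as a field, so its value is fixed
// at construction (driver) time and travels with the serialized expression,
// mirroring `legacySizeOfNull = SQLConf.get.getConf(SQLConf.LEGACY_SIZE_OF_NULL)`.
case class Size(legacySizeOfNull: Boolean) {
  def this(conf: Conf) =
    this(conf.getBoolean("spark.sql.legacy.sizeOfNull", default = true))
}
```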
It was made possible in #21376.
```scala
val LEGACY_SIZE_OF_NULL = buildConf("spark.sql.legacy.sizeOfNull")
  .internal()
  .doc("If it is set to true, size of null returns -1. This is legacy behavior of Hive. " +
```
What do you mean by `legacy behavior`? If Hive changed its behavior at a certain version, it would be good to state the version number explicitly.
I will change the sentence; it seems it is not clear. I just wanted to say that Spark inherited the behavior from Hive: when `size()` was implemented, Hive's `size(null)` returned `-1`. Most likely Hive still has this behavior at the moment.
Test build #92130 has finished for PR 21598 at commit
Re: #21598 (comment). I missed the default value. Shall we set it to …
@HyukjinKwon Yes, but changing the current behavior can potentially break existing users' applications. I am not sure we can do it before Spark v3.0. Correct me if I am wrong.
Test build #92142 has finished for PR 21598 at commit
We will add a configuration for it as a safeguard. I think we should incrementally fix things toward the more correct behaviour in later versions, really. If you meant that, in the JIRA and PR, this behaviour should be considered as an option (but it's not yet clear whether it's more correct), then it probably makes sense to leave it.
@HyukjinKwon I am sure the changes are right, but we would like to keep the current behavior up to release 3.0, in which we will remove the flag and the old implementation.
Sorry, can you elaborate on why it's special? The more correct behaviour should be kept in later versions, and we even have a configuration as a safeguard. Whom are you referring to by "we", BTW? Where did the discussion happen?
Actually, nothing special. I just don't want to break existing users' apps when upgrading Spark's minor releases.
Sorry, I made a typo. I definitely meant only me.
We should ship the right behaviour; the change can be avoided by setting the configuration. Let's set it to true if you believe this is the more correct behaviour. That's, at the very least, what I have been used to in Spark so far.
This is not a "bug" and there is no "right" behavior in APIs. It's been defined as -1 since the very beginning (when was it added?), so we can't just change the default value in a feature release.
Do we have other "legacy" configs that we haven't released and can change to match this prefix? It's pretty nice to have a single prefix for stuff like this.
@rxin Yes, we have; I think they are all listed in the 2.4 migration guide. I've created https://issues.apache.org/jira/browse/SPARK-24625 to track it.
Agreed. Once a particular interface and behavior is in our released public API, we effectively have a contract not to change that behavior. If we are going to provide another behavior before making a new major-number release (e.g. spark-3.0.0), then we have to provide a user configuration option to select that new behavior, and the default behavior, if a user doesn't change the configuration, must be the same as before. If there is a clear, severe bug (such as data loss or corruption), only then can we consider changing the public API before making a new major-number release -- but even then we are likely to either go immediately to a new major number or at least preserve the old, buggy behavior under a configuration option.
```scala
} else {
  child.dataType match {
    case _: ArrayType | _: MapType => defineCodeGen(ctx, ev, c => s"($c).numElements()")
    case other => throw new UnsupportedOperationException(
```
Could you reorganize the code? The current flow looks confusing: it appears as if the other types are not supported when `legacySizeOfNull` is false. However, the input types are already limited to ArrayType and MapType.
I am good with having the configuration, as that's basically what I suggested too. Removing the behaviour in 3.0.0 is fine. My only remaining question is the default value.
@gatorsmile, do you think this is specific to this PR and JIRA, or should we do the same for other changes from now on? If it's the latter, it should really be discussed in the dev mailing list in a separate thread.
@HyukjinKwon this is not new policy. It is what Apache Spark has guaranteed in its version numbering and public API since 1.0.0. It is not a matter of "from now on", but rather of whether committers have started allowing our standards to slip. It may well be time for a discussion of that, and of better tools to help guarantee that additions and changes to the public API are adequately discussed and reviewed, appropriate InterfaceStability annotations are applied, etc.
@markhamstra, Spark sometimes has behaviour changes for some bug fixes, or in a few other cases so far. At least, see the similar configurations added in the migration guide. It sounded like we are setting a hard limit here regardless of whether it's a bug or not. If the standards don't reflect the practice, that really should be discussed so they can be corrected or complied with. It is a matter of "from now on" to me, since it sounds a bit different from the practice so far, if I understood correctly.
IMHO we need clear decision rules for these kinds of behaviour changes in the contribution guide. In past migration guide descriptions, it seems we've already accepted some behaviour changes? e.g., …
All the behavior changes need very careful reviews and discussions. Whenever we decide to make a behavior change, we should document it in the migration guide and provide a conf to restore it back to the original behavior. Before the release, the whole community can review the changes again and decide whether any change should be reverted or adjusted. Based on my understanding, the decision is made case by case. For this specific case, we do not have a very strong reason to change the default value. Thus, we can keep it unchanged.
I concur.
Yes, but... this by itself makes the decision appear far too discretionary. Instead, in any PR where you are changing the published interface or behavior of part of Spark's public API, you should be highlighting the change for additional review and providing a really strong argument for why we cannot retain the prior interface and/or default behavior. It is simply not up to an individual committer to decide on their own discretion that the public API should be different than what it, in fact, is. Changing the public API is a big deal -- which is why most additions to the public API should, in my opinion, come in with an InterfaceStability annotation that will allow us to change them before a new major-number release. This doesn't apply to changes to internal APIs. Neither does it apply to bug fixes where Spark isn't actually doing what the public API says it is supposed to do -- although in cases where we expect that users have come to safely rely upon certain buggy behavior, we may choose to retain that buggy behavior under a new configuration setting.
I don't think it's too discretionary. We have a safeguard to control the behaviour, and Spark mentions it in the migration guide. In the case of changing a public interface that breaks binary or source compatibility, there should really be a strong argument, sure. For clarification, I don't think such a change is made usually. In this case, it changes a behaviour, even with a safeguard; that sounds pretty minor and isolated. I wonder why this suddenly popped up. As I said, if the standards don't reflect the practice, the standards should be corrected or the practice should be complied with. A committer's judgement is needed from time to time. We need more committers for a more proper review iteration, and usually we trust them. Let's roll it forward. If you prefer a more conservative distribution, it should be an option to consider using a maintenance release.
I strongly disagree. We should fix the buggy behavior; otherwise there's no point in having newer versions. If you strongly doubt it, please open a discussion in the mailing list and see if we reach agreement at some point.
It’s actually common software engineering practice to keep “buggy” semantics if a bug has been out there long enough and a lot of applications depend on the semantics. Here: https://en.wikipedia.org/wiki/Bug_compatibility
BTW for this one I actually think we should change it, in 3.0.
@rxin, I understand the concerns about having behaviour changes. I have discussed this with other committers and the community a lot. So, do you basically imply that we should keep the buggy behaviours, leave them as the default, and have configurations to change them? I was thinking this is a bit different from what I am used to, and it needs a discussion and agreement from the community.
I don't object to the default value here if it's specific to this JIRA and PR, for clarification, as I said above. If this is about general policy, to me it sounds like it needs a separate discussion.
```
# Conflicts:
#	sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala
```
LGTM
Test build #92354 has finished for PR 21598 at commit
Thanks, merging to master! @HyukjinKwon, the discussion is specific to this PR. We've changed a bunch of buggy behaviors in this release.
```scala
val LEGACY_SIZE_OF_NULL = buildConf("spark.sql.legacy.sizeOfNull")
  .doc("If it is set to true, size of null returns -1. This behavior was inherited from Hive. " +
    "The size function returns null for null input if the flag is disabled.")
```
Perhaps you should say this will be updated to false in Spark 3.0?
### What changes were proposed in this pull request?

Set the default value of the `spark.sql.legacy.sizeOfNull` config to `false`. That changes the behavior of the `size()` function for `NULL`: the function will return `NULL` for `NULL` instead of `-1`.

### Why are the changes needed?

There is agreement in PR #21598 (comment) to change the behavior in Spark 3.0.

### Does this PR introduce any user-facing change?

Yes. Before:

```sql
spark-sql> select size(NULL);
-1
```

After:

```sql
spark-sql> select size(NULL);
NULL
```

### How was this patch tested?

By the `check outputs of expression examples` test in `SQLQuerySuite`, which runs expression examples.

Closes #26051 from MaxGekk/sizeof-null-returns-null.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?

In the PR, I propose new behavior of `size(null)` under the config flag `spark.sql.legacy.sizeOfNull`. If the flag is disabled, the `size()` function returns `null` for `null` input. By default, `spark.sql.legacy.sizeOfNull` is enabled to keep backward compatibility with previous versions; in that case, `size(null)` returns `-1`.

### How was this patch tested?

Modified existing tests for the `size()` function to check the new behavior (`null`) and the old one (`-1`).
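The proposed semantics can be sketched as a plain function (a hypothetical model, not the actual Catalyst implementation), with `None` standing in for SQL `NULL`:

```scala
// Model of size() under the flag: Option[Seq[Any]] plays the role of a
// possibly-NULL array/map column value.
def size(collection: Option[Seq[Any]], legacySizeOfNull: Boolean): Option[Int] =
  collection match {
    case Some(c)                  => Some(c.length) // normal input: element count
    case None if legacySizeOfNull => Some(-1)       // legacy, Hive-inherited: size(null) = -1
    case None                     => None           // proposed: size(null) = null
  }
```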