[SPARK-24605][SQL] size(null) returns null instead of -1 #21598
Conversation
Test build #92128 has finished for PR 21598 at commit
```
@@ -1314,6 +1314,13 @@ object SQLConf {
      "Other column values can be ignored during parsing even if they are malformed.")
    .booleanConf
    .createWithDefault(true)

  val LEGACY_SIZE_OF_NULL = buildConf("spark.sql.legacy.sizeOfNull")
```
Is `spark.sql.function.sizeOfNull` better? BTW, when major releases happen in Spark, can we remove these kinds of backward-compatibility options? (just a question)
Sounds more consistent to have the `spark.sql.function` prefix.
Do we plan to fix other things accordingly too?
Yes, we would like to change/improve external behavior under flags in the `spark.sql.legacy.*` namespace. All those flags should be removed in the next major release, 3.0. Please share your thoughts about it.
Removing is fine. I am good with having such a prefix, but I wonder what's changed since #21427 (comment). It sounds basically similar to what I suggested. Where did that discussion happen?
I've created https://issues.apache.org/jira/browse/SPARK-24625 to track it.
It's similar to #21427 (comment), but as I replied in that PR, having a version-specific config is overkill, while `legacy` is simpler and more explicit that it will be removed in the future.
That's basically the same except that the postfix includes a specific version, which was just a rough idea.
```
@@ -1314,6 +1314,13 @@ object SQLConf {
      "Other column values can be ignored during parsing even if they are malformed.")
    .booleanConf
    .createWithDefault(true)

  val LEGACY_SIZE_OF_NULL = buildConf("spark.sql.legacy.sizeOfNull")
    .internal()
```
`internal`? Since this is a user-facing option, shouldn't it not be `internal`?
```scala
}

test("map size function - legacy") {
  withSQLConf("spark.sql.legacy.sizeOfNull" -> "true") {
```
`SQLConf.LEGACY_SIZE_OF_NULL.key -> "true"`
```scala
} else {
  child.dataType match {
    case _: ArrayType => defineCodeGen(ctx, ev, c => s"($c).numElements()")
    case _: MapType => defineCodeGen(ctx, ev, c => s"($c).numElements()")
```
Can we fold both cases?
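The fold being suggested relies on Scala's pattern alternatives. A minimal standalone sketch (the type names below are simplified stand-ins, not Spark's actual Catalyst classes):

```scala
sealed trait DataType
case object ArrayType extends DataType
case object MapType extends DataType
case object StringType extends DataType

// Two cases with identical bodies can be folded into one using `|`:
def numElementsSupported(dt: DataType): Boolean = dt match {
  case ArrayType | MapType => true // folded: previously two separate cases
  case _                   => false
}
```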
Shall we update the migration guide too?
It seems we don't have any behaviour change in the current PR (IIUC). But is it okay to set this option to true by default? Based on the three-valued logic, …
```scala
def this(child: Expression) =
  this(
    child,
    legacySizeOfNull = SQLConf.get.getConf(SQLConf.LEGACY_SIZE_OF_NULL))
```
Since we can now access the conf on the executor side too, do we need these changes? Can't we just get this value as a `val`?
If it works now, I will try to read the config on the executor's side. I was just struggling with an issue in tests for another PR where SQL configs were not propagated to executors. For example:
Lines 52 to 54 in a40ffc6:

```scala
val serializer = new JavaSerializer(new SparkConf()).newInstance
val resolver = ResolveTimeZone(new SQLConf)
resolver.resolveTimeZones(serializer.deserialize(serializer.serialize(expression)))
```
Also, I see some other places where configs are read by passing them to constructors:

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala, line 534 in b8f27ae:

`forceNullableSchema = SQLConf.get.getConf(SQLConf.FROM_JSON_FORCE_NULLABLE_SCHEMA))`
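The pattern under discussion, capturing the config value in an auxiliary constructor so the expression carries it along, can be sketched generically. The `Conf` and `Size` classes below are hypothetical stand-ins, not the real SQLConf or Catalyst expression:

```scala
// Simplified stand-in for SQLConf: a map of string settings.
class Conf(settings: Map[String, String]) {
  def getBoolean(key: String, default: Boolean): Boolean =
    settings.get(key).fold(default)(_.toBoolean)
}

// The expression stores the resolved flag as a field, so its value is fixed
// at construction (driver) time and travels with the serialized expression,
// mirroring `legacySizeOfNull = SQLConf.get.getConf(SQLConf.LEGACY_SIZE_OF_NULL)`.
case class Size(legacySizeOfNull: Boolean) {
  def this(conf: Conf) =
    this(conf.getBoolean("spark.sql.legacy.sizeOfNull", default = true))
}
```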
It was made possible in #21376.
```scala
val LEGACY_SIZE_OF_NULL = buildConf("spark.sql.legacy.sizeOfNull")
  .internal()
  .doc("If it is set to true, size of null returns -1. This is legacy behavior of Hive. " +
```
What do you mean by `legacy behavior`? If Hive changed its behavior at a certain version, it would be good to state the version number explicitly.
I will change the sentence; it seems it is not clear. I just wanted to say that Spark inherited the behavior from Hive: when `size()` was implemented, Hive's `size(null)` returned `-1`. Most likely Hive still has this behavior at the moment.
Test build #92130 has finished for PR 21598 at commit
Re: #21598 (comment). I missed the default value. Shall we set it to …
@HyukjinKwon Yes, but changing the current behavior can potentially break existing users' applications. I am not sure we can do it before Spark v3.0. Correct me if I am wrong.
Test build #92142 has finished for PR 21598 at commit
We will add a configuration for it as a safeguard. I think we should incrementally fix things toward the more correct behaviour in later versions, really. If you meant that, in the JIRA and PR, this behaviour should be considered as an option (but it's not yet clear whether it's more correct), then it probably makes sense to leave it.
@HyukjinKwon I am sure the changes are right, but we would like to keep the current behavior up to release 3.0, in which we will remove the flag and the old implementation.
Sorry, can you elaborate on why it's special? The more correct behaviour should be kept in later versions, and we even have a configuration as a safeguard. Whom are you referring to by "we", BTW? Where did the discussion happen?
Actually, nothing special. I just don't want to break existing users' apps when upgrading Spark's minor releases.
Sorry, I made a typo. I definitely meant only me.
We should ship the right behaviour; the change can be avoided by setting the configuration. Let's set it to true if you believe this is the more correct behaviour. That's, at the very least, what I have been used to in Spark so far.
This is not a "bug" and there is no "right" behavior in APIs. It's been defined as -1 since the very beginning (when was it added?), so we can't just change the default value in a feature release.
Do we have other "legacy" configs that we haven't released and can change to match this prefix? It's pretty nice to have a single prefix for stuff like this.
@rxin Yes, we have; I think they are all listed in the 2.4 migration guide. I've created https://issues.apache.org/jira/browse/SPARK-24625 to track it.
Agreed. Once a particular interface and behavior is in our released public API, we effectively have a contract not to change that behavior. If we are going to provide another behavior before making a new major-number release (e.g. spark-3.0.0), then we have to provide a user configuration option to select that new behavior, and the default behavior, if a user doesn't change the configuration, must be the same as before. If there is a clear, severe bug (such as data loss or corruption), only then can we consider changing the public API before making a new major-number release -- but even then we are likely to either go immediately to a new major number or at least preserve the old, buggy behavior under a configuration option.
```scala
} else {
  child.dataType match {
    case _: ArrayType | _: MapType => defineCodeGen(ctx, ev, c => s"($c).numElements()")
    case other => throw new UnsupportedOperationException(
```
Could you reorganize the code? The current flow looks confusing: it appears as if the other types are not supported when `legacySizeOfNull` is false. However, the input types are already limited to ArrayType and MapType.
I am good with having the configuration, as that's basically what I suggested too. Removing the behaviour in 3.0.0 is fine. My only remaining question is the default value.
@gatorsmile, do you think this is specific to this PR and JIRA, or should we do the same for other changes from now on? If it's the latter, it should really be discussed in the dev mailing list in a separate thread.
@HyukjinKwon this is not new policy. It is what Apache Spark has guaranteed in its version numbering and public API since 1.0.0. It is not a matter of "from now on", but rather of whether committers have started allowing our standards to slip. It may well be time for a discussion of that, and of better tools to help guarantee that additions and changes to the public API are adequately discussed and reviewed, appropriate InterfaceStability annotations are applied, etc.
@markhamstra, Spark sometimes has behaviour changes for some bug fixes, or in a few other cases so far. At least, see the similar configurations added in the migration guide. It sounded like we are setting a hard limit here regardless of whether it's a bug or not. If the standards don't reflect the practice, that really should be discussed so they can be corrected or complied with. It is a matter of "from now on" to me, since it sounds a bit different from the practice so far, if I understood correctly.
IMHO we need clear decision rules for these kinds of behaviour changes in the contribution guide. In past migration guide descriptions, it seems we've already accepted some behaviour changes? e.g., …
All the behavior changes need very careful reviews and discussions. Whenever we decide to make a behavior change, we should document it in the migration guide and provide a conf to restore it back to the original behavior. Before the release, the whole community can review the changes again and decide whether any change should be reverted or adjusted. Based on my understanding, the decision is made case by case. For this specific case, we do not have a very strong reason to change the default value. Thus, we can keep it unchanged.
I concur.
Yes, but... this by itself makes the decision appear far too discretionary. Instead, in any PR where you are changing the published interface or behavior of part of Spark's public API, you should be highlighting the change for additional review and providing a really strong argument for why we cannot retain the prior interface and/or default behavior. It is simply not up to an individual committer to decide on their own discretion that the public API should be different than what it, in fact, is. Changing the public API is a big deal -- which is why most additions to the public API should, in my opinion, come in with an InterfaceStability annotation that will allow us to change them before a new major-number release. This doesn't apply to changes to internal APIs. Neither does it apply to bug fixes where Spark isn't actually doing what the public API says it is supposed to do -- although in cases where we expect that users have come to safely rely upon certain buggy behavior, we may choose to retain that buggy behavior under a new configuration setting.
I don't think it's too discretionary. We have a safeguard to control the behaviour, and Spark mentions it in the migration guide. In the case of changing a public interface that breaks binary or source compatibility, there should really be a strong argument, sure. For clarification, I don't think such a change is made usually. In this case, it changes a behaviour, even with a safeguard; that sounds pretty minor and isolated. I wonder why this suddenly popped up. As I said, if the standards don't reflect the practice, the standards should be corrected or the practice should be complied with. A committer's judgement is needed from time to time. We need more committers for a more proper review iteration, and usually we trust them. Let's roll it forward. If you prefer a more conservative distribution, it should be an option to consider using a maintenance release.
I strongly disagree. We should fix the buggy behavior; otherwise there's no point in having newer versions. If you strongly doubt it, please open a discussion in the mailing list and see if we reach agreement at some point.
It’s actually common software engineering practice to keep “buggy” semantics if a bug has been out there long enough and a lot of applications depend on the semantics. Here: https://en.wikipedia.org/wiki/Bug_compatibility
BTW for this one I actually think we should change it, in 3.0.
@rxin, I understand the concerns about having behaviour changes. I have discussed this with other committers and the community a lot. So, do you basically imply that we should keep the buggy behaviours, leave them as the default, and have configurations to change them? I was thinking this is a bit different from what I am used to, and it needs a discussion and agreement from the community.
I don't object to the default value here if it's specific to this JIRA and PR, for clarification, as I said above. If this is about general policy, to me it sounds like it needs a separate discussion.
```
# Conflicts:
#	sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala
```
LGTM
Test build #92354 has finished for PR 21598 at commit
Thanks, merging to master! @HyukjinKwon, the discussion is specific to this PR. We've changed a bunch of buggy behaviors in this release.
```scala
val LEGACY_SIZE_OF_NULL = buildConf("spark.sql.legacy.sizeOfNull")
  .doc("If it is set to true, size of null returns -1. This behavior was inherited from Hive. " +
    "The size function returns null for null input if the flag is disabled.")
```
Perhaps you should say this will be updated to false in Spark 3.0?
### What changes were proposed in this pull request?

Set the default value of the `spark.sql.legacy.sizeOfNull` config to `false`. That changes the behavior of the `size()` function for `NULL`: the function will return `NULL` for `NULL` instead of `-1`.

### Why are the changes needed?

There is agreement in PR #21598 (comment) to change the behavior in Spark 3.0.

### Does this PR introduce any user-facing change?

Yes. Before:

```sql
spark-sql> select size(NULL);
-1
```

After:

```sql
spark-sql> select size(NULL);
NULL
```

### How was this patch tested?

By the `check outputs of expression examples` test in `SQLQuerySuite`, which runs expression examples.

Closes #26051 from MaxGekk/sizeof-null-returns-null.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?

In the PR, I propose new behavior of `size(null)` under the config flag `spark.sql.legacy.sizeOfNull`. If the flag is disabled, the `size()` function returns `null` for `null` input. By default, `spark.sql.legacy.sizeOfNull` is enabled to keep backward compatibility with previous versions; in that case, `size(null)` returns `-1`.

### How was this patch tested?

Modified existing tests for the `size()` function to check the new behavior (`null`) and the old one (`-1`).
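The proposed semantics can be sketched as a plain function (a hypothetical model, not the actual Catalyst implementation), with `None` standing in for SQL `NULL`:

```scala
// Model of size() under the flag: Option[Seq[Any]] plays the role of a
// possibly-NULL array/map column value.
def size(collection: Option[Seq[Any]], legacySizeOfNull: Boolean): Option[Int] =
  collection match {
    case Some(c)                  => Some(c.length) // normal input: element count
    case None if legacySizeOfNull => Some(-1)       // legacy, Hive-inherited: size(null) = -1
    case None                     => None           // proposed: size(null) = null
  }
```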