
[SPARK-24605][SQL] size(null) returns null instead of -1 #21598

Closed
wants to merge 12 commits

Conversation

@MaxGekk (Member) commented Jun 20, 2018

What changes were proposed in this pull request?

In this PR, I propose new behavior for size(null) under the config flag spark.sql.legacy.sizeOfNull. If the flag is disabled, the size() function returns null for null input. By default, spark.sql.legacy.sizeOfNull is enabled to keep backward compatibility with previous versions; in that case, size(null) returns -1.
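
To make the two modes concrete, here is a minimal sketch (assuming a running SparkSession named `spark`; the outputs match the before/after examples quoted later in this thread):

```scala
// Legacy mode (the default proposed here): size(NULL) yields -1.
spark.sql("SET spark.sql.legacy.sizeOfNull=true")
spark.sql("SELECT size(NULL)").show()   // -1

// New mode: size(NULL) yields NULL.
spark.sql("SET spark.sql.legacy.sizeOfNull=false")
spark.sql("SELECT size(NULL)").show()   // null
```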

How was this patch tested?

Modified existing tests for the size() function to check the new behavior (null) and the old one (-1).

@SparkQA commented Jun 20, 2018

Test build #92128 has finished for PR 21598 at commit e18568f.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1314,6 +1314,13 @@ object SQLConf {
"Other column values can be ignored during parsing even if they are malformed.")
.booleanConf
.createWithDefault(true)

val LEGACY_SIZE_OF_NULL = buildConf("spark.sql.legacy.sizeOfNull")
Member

Is spark.sql.function.sizeOfNull better? BTW, when a major release happens in Spark, can we remove these kinds of back-compatibility options? (just a question)

Member

Sounds more consistent to have the spark.sql.function prefix.

Member Author

We would like to reserve the namespace spark.sql.legacy.* for all legacy configurations. /cc @marmbrus @rxin

Member

Do we plan to fix other things accordingly too?

Member Author

Yes, we would like to change/improve external behavior under flags in the spark.sql.legacy.* namespace. All those flags should be removed in the next major release, 3.0. Please share your thoughts about it.

Member

Removing is fine. I am good with having such a prefix, but I wonder what changed after #21427 (comment). It sounds basically similar to what I suggested. Where did that discussion happen?

Contributor

I've created https://issues.apache.org/jira/browse/SPARK-24625 to track it.

It's similar to #21427 (comment), but as I replied in that PR, having a version-specific config is overkill, while legacy is simpler and makes it more explicit that the flag will be removed in the future.

Member

That's basically the same, except that the postfix includes a specific version, which was just a rough idea.

@@ -1314,6 +1314,13 @@ object SQLConf {
"Other column values can be ignored during parsing even if they are malformed.")
.booleanConf
.createWithDefault(true)

val LEGACY_SIZE_OF_NULL = buildConf("spark.sql.legacy.sizeOfNull")
.internal()
Member

internal? Since this is a user-facing option, should it really be internal?

}

test("map size function - legacy") {
withSQLConf("spark.sql.legacy.sizeOfNull" -> "true") {
Member

SQLConf.LEGACY_SIZE_OF_NULL.key -> "true"
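
A sketch of the suggested form, assuming a QueryTest-style suite where `sql`, `checkAnswer`, and `withSQLConf` are in scope; the assertion body is illustrative, not the PR's actual test code:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.internal.SQLConf

test("map size function - legacy") {
  // reference the conf through its SQLConf constant, not a string literal
  withSQLConf(SQLConf.LEGACY_SIZE_OF_NULL.key -> "true") {
    checkAnswer(sql("SELECT size(cast(NULL AS map<string, int>))"), Row(-1))
  }
}
```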

} else {
child.dataType match {
case _: ArrayType => defineCodeGen(ctx, ev, c => s"($c).numElements()")
case _: MapType => defineCodeGen(ctx, ev, c => s"($c).numElements()")
Member

Can we fold both cases?
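
The fold eventually adopted (it appears in a later diff in this thread) collapses the two arms into a single alternative pattern:

```scala
case _: ArrayType | _: MapType => defineCodeGen(ctx, ev, c => s"($c).numElements()")
```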

@HyukjinKwon (Member)

Shall we update the migration guide too?

@maropu (Member) commented Jun 20, 2018

It seems we don't have any behaviour change in the current PR (IIUC, spark.sql.legacy.sizeOfNull=true keeps the current behaviour). If so, we don't need that update?

But is it okay to set this option to true by default? Based on three-valued logic, isn't size(null)=null more reasonable?
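
For context, most SQL functions already propagate NULL under three-valued logic, which makes legacy size() the odd one out (an illustrative sketch, assuming a SparkSession `spark`):

```scala
// length(NULL) and abs(NULL) propagate NULL, while size(NULL) returns -1
// as long as spark.sql.legacy.sizeOfNull is true (the proposed default).
spark.sql("SELECT length(NULL), abs(NULL), size(NULL)").show()
```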

def this(child: Expression) =
this(
child,
legacySizeOfNull = SQLConf.get.getConf(SQLConf.LEGACY_SIZE_OF_NULL))
Contributor

Since we can now access the conf on the executor side too, do we need these changes? Can't we just get this value as a val?

Member Author

If it works now, I will try to read the config on the executor's side. I was struggling with an issue in tests for another PR where SQL configs were not propagated to executors. For example:

// round-trip the expression through a serializer (as an executor would
// receive it) and resolve time zones against a brand-new SQLConf, so
// session-level SQL confs set in the test are not visible here
val serializer = new JavaSerializer(new SparkConf()).newInstance
val resolver = ResolveTimeZone(new SQLConf)
resolver.resolveTimeZones(serializer.deserialize(serializer.serialize(expression)))

Also, I see other places where configs are read and passed through constructors:

forceNullableSchema = SQLConf.get.getConf(SQLConf.FROM_JSON_FORCE_NULLABLE_SCHEMA))

Contributor

it was made possible in #21376
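
A sketch of the alternative under discussion: with #21376 in place, `SQLConf.get` also works on executors, so the flag could in principle be read as a plain `val` inside the expression instead of being threaded through the constructor (an illustrative fragment, not the code the PR ended up with):

```scala
// read once when the expression is instantiated, on driver or executor
private val legacySizeOfNull: Boolean =
  SQLConf.get.getConf(SQLConf.LEGACY_SIZE_OF_NULL)
```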


val LEGACY_SIZE_OF_NULL = buildConf("spark.sql.legacy.sizeOfNull")
.internal()
.doc("If it is set to true, size of null returns -1. This is legacy behavior of Hive. " +
Member

What do you mean by legacy behavior? If Hive changed its behavior at a certain version, it would be good to state the version number explicitly.

Member Author

I will change the sentence; it seems it is not clear. I just wanted to say that Spark inherited the behavior from Hive: when size() was implemented, Hive's size(null) returned -1, and most likely Hive still behaves that way at the moment.

@SparkQA commented Jun 20, 2018

Test build #92130 has finished for PR 21598 at commit b6539a5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

re: #21598 (comment) I missed the default value. Shall we set it to false? I think the JIRA and this PR claim the new behaviour is more correct?

@MaxGekk (Member Author) commented Jun 20, 2018

I think the JIRA and this PR claim the new behaviour is more correct?

@HyukjinKwon Yes, but changing the current behavior can potentially break existing users' applications. I am not sure we can do it before Spark 3.0. Correct me if I am wrong.

@SparkQA commented Jun 20, 2018

Test build #92142 has finished for PR 21598 at commit 64a1b7d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) commented Jun 20, 2018

We will add a configuration for it as a safeguard. I think we should incrementally fix things toward the more correct behaviour in newer versions, really. If you meant in the JIRA and PR that this behaviour should be considered an option (and are not yet sure it is more correct), then it probably makes sense to leave it true by default.

@MaxGekk (Member Author) commented Jun 21, 2018

@HyukjinKwon I am sure the changes are right, but we would like to keep the current behavior up to release 3.0, in which we will remove the flag and the old implementation.

@HyukjinKwon (Member) commented Jun 21, 2018

Sorry, can you elaborate on why it's special? The more correct behaviour should be kept in newer versions, and we even have a configuration as a safeguard. Who are you referring to by "we", BTW? Where did the discussion happen?

@MaxGekk (Member Author) commented Jun 21, 2018

Sorry, can you elaborate on why it's special?

Actually, nothing special. I just don't want to break existing users' apps when upgrading across Spark's minor releases.

Who are you referring to by "we", BTW?

Sorry, I made a typo. I definitely meant only me.

@HyukjinKwon (Member)

We should keep the right behaviour by default; the old one can still be had by setting the configuration. Let's set it to true only if you believe the legacy behaviour is more correct. That's, at the very least, what I have been used to in Spark so far.

@rxin (Contributor) commented Jun 21, 2018

This is not a "bug" and there is no "right" behavior in APIs. It's been defined as -1 since the very beginning (when was it added?), so we can't just change the default value in a feature release.

@rxin (Contributor) commented Jun 21, 2018

Do we have other "legacy" configs that we haven't released and can change to match this prefix? It's pretty nice to have a single prefix for stuff like this.

@cloud-fan (Contributor)

@rxin yes, we have; I think they are all listed in the 2.4 migration guide.

I've created https://issues.apache.org/jira/browse/SPARK-24625 to track it

@markhamstra (Contributor)

so we can't just change the default value in a feature release

Agreed. Once a particular interface and behavior is in our released public API, then we effectively have a contract not to change that behavior. If we are going to provide another behavior before making a new major-number release (e.g. spark-3.0.0), then we have to provide a user configuration option to select that new behavior, and the default behavior if a user doesn't change configuration must be the same as before the optional new behavior.

If there is a clear, severe bug (such as data loss or corruption), only then we can consider changing the public API before making a new major-number release -- but even then we are likely to either go immediately to a new major-number or to at least preserve the old, buggy behavior with a configuration option.

} else {
child.dataType match {
case _: ArrayType | _: MapType => defineCodeGen(ctx, ev, c => s"($c).numElements()")
case other => throw new UnsupportedOperationException(
Member

Could you reorganize the code? The current flow looks confusing: it appears as if the other types are not supported only when legacySizeOfNull is false, yet the input types are already limited to ArrayType and MapType.

@HyukjinKwon (Member)

I am good with having the configuration, as that's basically what I suggested too. Removing the legacy behaviour in 3.0.0 is fine. My only remaining question is the default value.

We should change the behavior in 3.0. Before 3.0 release, we introduce a conf and make it configurable. The default is to keep the current behavior unchanged.

@gatorsmile, do you think this is specific to this PR and JIRA, or should we do the same for other changes from now on? If it's the latter, it should really be discussed on the dev mailing list in a separate thread.

@markhamstra (Contributor)

@HyukjinKwon this is not new policy. It is what Apache Spark has guaranteed in its version numbering and public API since 1.0.0. It is not a matter of "from now on", but rather of whether committers have started allowing our standards to slip. It may well be time for a discussion of that and of better tools to help guarantee that additions and changes to the public API are adequately discussed and reviewed, appropriate InterfaceStability annotations are applied, etc.

@HyukjinKwon (Member) commented Jun 24, 2018

@markhamstra, Spark has sometimes had behaviour changes for bug fixes, and in a few other cases so far; at least, see the similar configurations added in the migration guide. It sounded like we are setting a hard limit here whether it's a bug or not. If the standards don't reflect the practice, it really should be discussed so that either the standards are corrected or the practice complies. It is a matter of "from now on" to me, since it sounds a bit different from the practice so far, if I understood correctly.

@maropu (Member) commented Jun 24, 2018

IMHO we need clear decision rules for these kinds of behaviour changes in the contribution guide. Judging from past migration guide descriptions, it seems we've already accepted some behaviour changes, e.g., elt and concat, even though these are public APIs.

@gatorsmile (Member) commented Jun 24, 2018

All behavior changes need very careful reviews and discussions. Whenever we decide to make a behavior change, we should document it in the migration guide and provide a conf to restore the original behavior. Before the release, the whole community can review the changes again and decide whether any change should be reverted or adjusted. Based on my understanding, the decision is made case by case. For this specific case, we do not have a very strong reason to change the default value; thus, we can keep it unchanged.

@HyukjinKwon (Member)

Based on my understanding, the decision is made case by case.

I concur.

@markhamstra (Contributor) commented Jun 25, 2018

case by case

Yes, but... this by itself makes the decision appear far too discretionary. Instead, in any PR where you are changing the published interface or behavior of part of Spark's public API, you should be highlighting the change for additional review and providing a really strong argument for why we cannot retain the prior interface and/or default behavior. It is simply not up to an individual committer to decide on their own discretion that the public API should be different than what it, in fact, is. Changing the public API is a big deal -- which is why most additions to the public API should, in my opinion, come in with an InterfaceStability annotation that will allow us to change them before a new major-number release.

This doesn't apply to changes to internal APIs. Neither does it apply to bug fixes where Spark isn't actually doing what the public API says it is supposed to do -- although in cases where we expect that users have come to safely rely upon certain buggy behavior, we may choose to retain that buggy behavior under a new configuration setting.

@HyukjinKwon (Member) commented Jun 26, 2018

I don't think it's too discretionary. We have a safeguard to control the behaviour, and Spark mentions it in the migration guide. In the case of a change to a public interface that breaks binary or source compatibility, there should really be a strong argument, sure. For clarification, I don't think such changes are made often.

In this case, it changes a behaviour, and even then with a safeguard. It sounds pretty minor and isolated, and I wonder why this has suddenly popped up. As I said, if the standards don't reflect the practice, either the standards should be corrected or the practice should comply. A committer's judgement is needed from time to time; we need more committers for proper review iteration, and we usually trust them. Let's roll it forward.

If you prefer a more conservative distribution, using a maintenance release should be an option to consider.

we may choose to retain that buggy behavior

I strongly disagree. We should fix the buggy behavior; there's no point in having newer versions otherwise.

If you strongly doubt it, please open a discussion on the mailing list and see if we can reach agreement at some point.

@rxin (Contributor) commented Jun 26, 2018 via email

@rxin (Contributor) commented Jun 26, 2018 via email

@HyukjinKwon (Member)

@rxin, I understand the concerns about having behaviour changes. I have discussed this with other committers and the community a lot.

So, do you basically imply that we should keep the buggy behaviours as the default and have configurations to change them? I think this is a bit different from what I am used to, and it needs discussion and agreement from the community.

@HyukjinKwon (Member)

For clarification, I don't object to the default value here if it's specific to this JIRA and PR, as I said above. If this is about general policy, it sounds to me like it needs a separate discussion.

@gatorsmile (Member) commented Jun 26, 2018

LGTM

@SparkQA commented Jun 27, 2018

Test build #92354 has finished for PR 21598 at commit ae2c7f1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

thanks, merging to master!

@HyukjinKwon the discussion is specific to this PR. We've changed a bunch of buggy behaviors in this release.

@asfgit asfgit closed this in d08f53d Jun 27, 2018
@maropu (Member) commented Jun 27, 2018

@cloud-fan This merge breaks the build? https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92364/consoleFull
NVM, the build passed.


val LEGACY_SIZE_OF_NULL = buildConf("spark.sql.legacy.sizeOfNull")
.doc("If it is set to true, size of null returns -1. This behavior was inherited from Hive. " +
"The size function returns null for null input if the flag is disabled.")
Contributor

Perhaps you should say this will be updated to false in Spark 3.0?
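
Incorporating that suggestion, the definition might read as follows (a sketch only; the final wording is not shown in this thread):

```scala
val LEGACY_SIZE_OF_NULL = buildConf("spark.sql.legacy.sizeOfNull")
  .doc("If it is set to true, size of null returns -1. This behavior was inherited from Hive. " +
    "The size function returns null for null input if the flag is disabled. " +
    "The default will be changed to false in Spark 3.0.")
  .booleanConf
  .createWithDefault(true)
```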

@MaxGekk MaxGekk deleted the legacy-size-of-null branch August 17, 2019 13:34
HyukjinKwon pushed a commit that referenced this pull request Oct 8, 2019
### What changes were proposed in this pull request?
Set the default value of the `spark.sql.legacy.sizeOfNull` config to `false`. That changes the behavior of the `size()` function for `NULL` input: the function will return `NULL` instead of `-1`.

### Why are the changes needed?
There is agreement in PR #21598 (comment) to change the behavior in Spark 3.0.

### Does this PR introduce any user-facing change?
Yes.
Before:
```sql
spark-sql> select size(NULL);
-1
```
After:
```sql
spark-sql> select size(NULL);
NULL
```

### How was this patch tested?
By the `check outputs of expression examples` test in `SQLQuerySuite`, which runs the expression examples.

Closes #26051 from MaxGekk/sizeof-null-returns-null.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
atronchi pushed a commit to atronchi/spark that referenced this pull request Oct 23, 2019 (same commit message as above).