[SPARK-25829][SQL] Add config spark.sql.legacy.allowDuplicatedMapKeys and change the default behavior #27478

Conversation
Test build #117998 has finished for PR 27478 at commit
Retest this please.
@@ -651,8 +651,10 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSparkSession {
      Row(null)
    )

    checkAnswer(df1.selectExpr("map_concat(map1, map2)"), expected1a)
    checkAnswer(df1.select(map_concat($"map1", $"map2")), expected1a)
    withSQLConf(SQLConf.DEDUPLICATE_MAP_KEY_WITH_LAST_WINS_POLICY.key -> "true") {
Can we have one test that checks the exception when this configuration is false?
+1
Thanks, added in ArrayBasedMapBuilderSuite.
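The exact test is not quoted in this thread; below is a minimal sketch of what such a check could look like, using the config's final name from this PR and assuming `ArrayBasedMapBuilder.put` raises a `RuntimeException` when it sees a duplicated key under the new default. The suite and test names are illustrative, not the ones actually added.

```scala
// Illustrative sketch only -- the real test lives in ArrayBasedMapBuilderSuite and may differ.
import org.apache.spark.SparkFunSuite
import org.apache.spark.sql.catalyst.plans.SQLHelper
import org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.IntegerType

class DuplicatedMapKeySketchSuite extends SparkFunSuite with SQLHelper {
  test("fail on duplicated map keys under the default config") {
    // Assumption: with the legacy config off (the new default), putting the same key twice throws.
    withSQLConf(SQLConf.LEGACY_ALLOW_DUPLICATED_MAP_KEY.key -> "false") {
      val builder = new ArrayBasedMapBuilder(IntegerType, IntegerType)
      builder.put(1, 10)
      intercept[RuntimeException] {
        builder.put(1, 20)
      }
    }
  }
}
```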
Test build #118005 has finished for PR 27478 at commit
docs/sql-migration-guide.md (Outdated)
@@ -49,7 +49,7 @@ license: |

- In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but -0.0 and 0.0 are considered as different values when used in aggregate grouping keys, window partition keys and join keys. Since Spark 3.0, this bug is fixed. For example, `Seq(-0.0, 0.0).toDF("d").groupBy("d").count()` returns `[(0.0, 2)]` in Spark 3.0, and `[(0.0, 1), (-0.0, 1)]` in Spark 2.4 and earlier.

- In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
- In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, new config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` was added, with the default value `false`, Spark will throw RuntimeException while duplicated keys are found. If set to `true`, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
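To make the new migration-guide entry concrete, here is a small spark-shell style sketch of the behavior it describes. It uses the config's final name from this PR, `spark.sql.legacy.allowDuplicatedMapKeys` (the name shown in the diff above was renamed later in the thread), and the exact exception message is omitted.

```scala
// Sketch of the behavior described in the migration-guide entry above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("dup-map-keys").getOrCreate()

// New default (config false): building a map with a duplicated key fails at runtime.
// spark.sql("SELECT map(1, 'a', 1, 'b')").collect()   // throws RuntimeException

// With the config enabled, duplicated keys are removed with a last-wins policy,
// so the result is a single-entry map 1 -> 'b'.
spark.conf.set("spark.sql.legacy.allowDuplicatedMapKeys", "true")
spark.sql("SELECT map(1, 'a', 1, 'b') AS m").show()
```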
Hi, @cloud-fan, @gatorsmile, @rxin, @marmbrus. So, RuntimeException by default is better in Apache Spark 3.0.0?
The principle is still no failure by default, but version upgrade is a different case. We should try our best to avoid "silent result changing".
We may want to introduce a new category of configs that are migration-only and should be removed in the next major version.
Thoughts? @srowen @viirya @maropu @bart-samwel
Should we mention that this spark.sql.deduplicateMapKey.lastWinsPolicy.enabled config is only useful for migration and could be removed in a future version?
This idea sounds reasonable. If we think a behavior change is correct but, for migration reasons, we don't want it to happen silently, a config like this one is an option we can use to notify users of the change.
I agree with the idea of avoiding "silent result changing". Btw, couldn't we keep the old (Spark 2.4) behaviour for duplicate keys by using a legacy option? If we can't for some reason, the proposed one (the runtime exception) looks reasonable to me, too.
The idea sounds fine.
+1 for the idea of highlighting that this config is migration-only.
I'm okay with the rule, @cloud-fan. At this time, could you add a guideline to our website? Actually, many correctness bug fixes in the migration guide have been silent result changes with a legacy fallback config.
Agreed on having a guideline if we need the new rule, to make clear that the silent result change here is caused by a behavior change, not by a bug fix.
What I have in mind is:
- if it's a bug fix (the previous result is definitely wrong), then no config is needed. If the impact is big, we can add a legacy config which is false by default.
- if it makes the behavior better, we should either add a config and use the old behavior by default, or fail by default and ask users to set the config explicitly and pick the desired behavior.
I'm trying to think of more cases and will send an email to the dev list soon.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala (Outdated)
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (Outdated)
Test build #118040 has finished for PR 27478 at commit
Test build #118052 has finished for PR 27478 at commit
I google-searched "deduplicate map key" and there is no matching result.
Thanks Gengliang, as the config is still related to legacy behavior, I renamed it to `spark.sql.legacy.allowDuplicatedMapKeys`.
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (Outdated)
LGTM, can you update PR title as well?
Test build #118423 has finished for PR 27478 at commit
PR title updated from "[SPARK-25829][SQL] Add config spark.sql.deduplicateMapKey.lastWinsPolicy.enabled and change the default behavior" to "[SPARK-25829][SQL] Add config spark.sql.legacy.allowDuplicatedMapKeys and change the default behavior".
Test build #118468 has finished for PR 27478 at commit
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala (Outdated)
      hiveResultString(df.queryExecution.executedPlan).mkString("\n"))
    val expected = unindentAndTrim(output)
    assert(actual === expected)
    withSQLConf(SQLConf.LEGACY_ALLOW_DUPLICATED_MAP_KEY.key -> "true") {
Can you check the map-related built-in expressions? It seems they use wrong examples that have duplicated keys.
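As a hypothetical illustration of that concern (the affected expressions are not listed in this thread): a usage example with overlapping keys, like the one below, would now hit the duplicate-key error under the new default instead of silently deduplicating.

```scala
// Hypothetical example of a built-in function call whose documented usage example has
// overlapping map keys (key 2 appears in both maps). Under the new default
// (spark.sql.legacy.allowDuplicatedMapKeys=false) this should throw a RuntimeException.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("map-concat-dup").getOrCreate()
spark.sql("SELECT map_concat(map(1, 'a', 2, 'b'), map(2, 'c', 3, 'd'))").collect()
```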
Test build #118547 has finished for PR 27478 at commit
retest this please
Test build #118537 has finished for PR 27478 at commit
retest it please
retest this please
Test build #118567 has finished for PR 27478 at commit
retest it please
retest this please
Test build #118551 has finished for PR 27478 at commit
thanks, merging to master/3.0!
[SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

### What changes were proposed in this pull request?
This is a follow-up for #23124: add a new config `spark.sql.legacy.allowDuplicatedMapKeys` to control the behavior of removing duplicated map keys in built-in functions. With the default value `false`, Spark will throw a RuntimeException when duplicated keys are found.

### Why are the changes needed?
Prevent silent behavior changes.

### Does this PR introduce any user-facing change?
Yes, a new config is added and the default behavior for duplicated map keys is changed to throwing a RuntimeException.

### How was this patch tested?
Modify existing UT.

Closes #27478 from xuanyuanking/SPARK-25892-follow.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit ab186e3)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Test build #118578 has finished for PR 27478 at commit
Thanks all for the review.
What changes were proposed in this pull request?
This is a follow-up for #23124: add a new config `spark.sql.legacy.allowDuplicatedMapKeys` to control the behavior of removing duplicated map keys in built-in functions. With the default value `false`, Spark will throw a RuntimeException when duplicated keys are found.

Why are the changes needed?
Prevent silent behavior changes.

Does this PR introduce any user-facing change?
Yes, a new config is added and the default behavior for duplicated map keys is changed to throwing a RuntimeException.

How was this patch tested?
Modify existing UT.