
[SPARK-25829][SQL] Add config spark.sql.legacy.allowDuplicatedMapKeys and change the default behavior #27478

Closed
wants to merge 7 commits

Conversation

Member

@xuanyuanking commented Feb 6, 2020

What changes were proposed in this pull request?

This is a follow-up for #23124, adding a new config `spark.sql.legacy.allowDuplicatedMapKeys` to control the behavior of removing duplicated map keys in built-in functions. With the default value `false`, Spark throws a RuntimeException when duplicated keys are found.
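For illustration, here is a minimal sketch of the behavior change (assuming a local `SparkSession` named `spark`; the exact exception message is paraphrased):

```scala
// Default (spark.sql.legacy.allowDuplicatedMapKeys=false): fail fast on duplicates.
spark.sql("SELECT map(1, 'a', 1, 'b')").collect()
// => throws java.lang.RuntimeException mentioning the duplicate map key 1

// Legacy behavior: deduplicate with the last-wins policy.
spark.conf.set("spark.sql.legacy.allowDuplicatedMapKeys", "true")
spark.sql("SELECT map(1, 'a', 1, 'b')").collect()
// => Array(Row(Map(1 -> "b")))
```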

Why are the changes needed?

Prevent silent behavior changes.

Does this PR introduce any user-facing change?

Yes. A new config is added, and the default behavior for duplicated map keys is changed to throwing a RuntimeException.

How was this patch tested?

Modified existing UTs.

@SparkQA

SparkQA commented Feb 6, 2020

Test build #117998 has finished for PR 27478 at commit 09cfb67.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Retest this please.

@@ -651,8 +651,10 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSparkSession {
Row(null)
)

checkAnswer(df1.selectExpr("map_concat(map1, map2)"), expected1a)
checkAnswer(df1.select(map_concat($"map1", $"map2")), expected1a)
withSQLConf(SQLConf.DEDUPLICATE_MAP_KEY_WITH_LAST_WINS_POLICY.key -> "true") {
Member

Can we have one test that checks the exception when this configuration is false?

Contributor

+1

Member Author

Thanks, added in ArrayBasedMapBuilderSuite.
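For reference, a rough sketch of what such a test could look like, assuming `ArrayBasedMapBuilder` fails on `put` of a duplicate key (a sketch only; the actual test added may differ):

```scala
import org.apache.spark.SparkFunSuite
import org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder
import org.apache.spark.sql.types.IntegerType

class ArrayBasedMapBuilderSuite extends SparkFunSuite {
  test("fail when a duplicated map key is detected") {
    val builder = new ArrayBasedMapBuilder(IntegerType, IntegerType)
    builder.put(1, 1)
    // With spark.sql.legacy.allowDuplicatedMapKeys=false (the new default),
    // inserting a duplicate key should fail fast instead of silently deduplicating.
    val e = intercept[RuntimeException](builder.put(1, 2))
    assert(e.getMessage.contains("Duplicate map key"))
  }
}
```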

@SparkQA

SparkQA commented Feb 7, 2020

Test build #118005 has finished for PR 27478 at commit 09cfb67.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -49,7 +49,7 @@ license: |

- In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but -0.0 and 0.0 are considered as different values when used in aggregate grouping keys, window partition keys and join keys. Since Spark 3.0, this bug is fixed. For example, `Seq(-0.0, 0.0).toDF("d").groupBy("d").count()` returns `[(0.0, 2)]` in Spark 3.0, and `[(0.0, 1), (-0.0, 1)]` in Spark 2.4 and earlier.

- In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of a map with duplicated keys is undefined, e.g. map lookup respects the first occurrence of a duplicated key, `Dataset.collect` keeps only the last occurrence, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, these built-in functions will remove duplicated map keys with the last-wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet); the behavior will be undefined.
- In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of a map with duplicated keys is undefined, e.g. map lookup respects the first occurrence of a duplicated key, `Dataset.collect` keeps only the last occurrence, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, a new config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` was added, with the default value `false`; Spark will throw a RuntimeException when duplicated keys are found. If it is set to `true`, these built-in functions will remove duplicated map keys with the last-wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet); the behavior will be undefined.
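To make the documented change concrete, a hedged sketch using `str_to_map` (the config name is the one under review here; it was later renamed to `spark.sql.legacy.allowDuplicatedMapKeys`):

```scala
// The key 'a' appears twice, so the built map has a duplicated key.
spark.sql("SELECT str_to_map('a:1,b:2,a:3')").collect()
// Spark 2.4: undefined which value wins; Spark 3.0 default: RuntimeException.

// With the last-wins policy enabled, the later value is kept: Map(a -> 3, b -> 2).
spark.conf.set("spark.sql.deduplicateMapKey.lastWinsPolicy.enabled", "true")
spark.sql("SELECT str_to_map('a:1,b:2,a:3')").collect()
```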
Member

Hi, @cloud-fan , @gatorsmile , @rxin , @marmbrus .
So, RuntimeException by default is better in Apache Spark 3.0.0?

Contributor

The principle is still no failure by default, but a version upgrade is a different case. We should try our best to avoid "silent result changing".

We may want to introduce a new category of configs that are migration-only and should be removed in the next major version.

Thoughts? @srowen @viirya @maropu @bart-samwel

Member

Should we mention that this `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` config is only useful for migration and could be removed in a future version?

Member

This idea sounds reasonable. If we think a behavior change is correct but, for migration reasons, we don't want it to happen silently, a config like this one is an option we can use to notify users of the change.

Member

I agree with the idea to avoid "silent result changing". Btw, couldn't we keep the old (Spark 2.4) behaviour for duplicate keys by using a legacy option? If we can't for some reason, the proposed one (the runtime exception) looks reasonable to me, too.

Member

The idea sounds fine.

Member Author

+1 for the idea of highlighting that this config is migration-only.

Member

I'm okay with the rule, @cloud-fan. At this time, could you add a guideline to our website? Actually, many correctness bug fixes in the migration guide have been silent result changes with a legacy fallback config.

Member Author
@xuanyuanking commented Feb 8, 2020

Agreed on having a guideline if we need the new rule, to make clear that the silent result change here is caused by the behavior change, not by bug fixes.

Contributor

What I have in mind is:

  • If it's a bug fix (the previous result is definitely wrong), then no config is needed. If the impact is big, we can add a legacy config that is false by default.
  • If it makes the behavior better, we should either add a config and use the old behavior by default, or fail by default and ask users to set the config explicitly to pick the desired behavior.

I'm trying to think of more cases; I will send an email to the dev list soon.

@SparkQA

SparkQA commented Feb 7, 2020

Test build #118040 has finished for PR 27478 at commit fe1fb0f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 8, 2020

Test build #118052 has finished for PR 27478 at commit 66dc51f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member

gengliangwang commented Feb 12, 2020

I Google-searched "deduplicate map key" and there were no matching results.
Since a runtime exception will be thrown on duplicated keys, how about renaming the config to `spark.sql.allowDuplicatedMapKeys.enabled`?

@xuanyuanking
Member Author

Thanks, Gengliang. As the config is still related to legacy behavior, I renamed it to `spark.sql.legacy.allowDuplicatedMapKeys`.

@cloud-fan
Contributor

LGTM. Can you update the PR title as well?

@SparkQA

SparkQA commented Feb 14, 2020

Test build #118423 has finished for PR 27478 at commit f622180.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking xuanyuanking changed the title [SPARK-25829][SQL] Add config spark.sql.deduplicateMapKey.lastWinsPolicy.enabled and change the default behavior [SPARK-25829][SQL] Add config spark.sql.legacy.allowDuplicatedMapKeys and change the default behavior Feb 15, 2020
@SparkQA

SparkQA commented Feb 15, 2020

Test build #118468 has finished for PR 27478 at commit b102c36.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

hiveResultString(df.queryExecution.executedPlan).mkString("\n"))
val expected = unindentAndTrim(output)
assert(actual === expected)
withSQLConf(SQLConf.LEGACY_ALLOW_DUPLICATED_MAP_KEY.key -> "true") {
Contributor

Can you check the map-related builtin expressions? It seems they use wrong examples that have duplicated keys.
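As a hypothetical illustration of the concern (the actual doc examples may differ), an example along these lines duplicates the key `2` across its inputs, so it would now throw under the new default:

```scala
spark.sql("SELECT map_concat(map(1, 'a', 2, 'b'), map(2, 'c', 3, 'd'))").collect()
// => RuntimeException unless spark.sql.legacy.allowDuplicatedMapKeys=true
```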

@SparkQA

SparkQA commented Feb 17, 2020

Test build #118547 has finished for PR 27478 at commit feb8c0a.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Feb 17, 2020

Test build #118537 has finished for PR 27478 at commit 905ae96.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest it please

@xuanyuanking
Member Author

retest this please

@SparkQA

SparkQA commented Feb 17, 2020

Test build #118567 has finished for PR 27478 at commit feb8c0a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest it please

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Feb 17, 2020

Test build #118551 has finished for PR 27478 at commit feb8c0a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master/3.0!

@cloud-fan cloud-fan closed this in ab186e3 Feb 17, 2020
cloud-fan pushed a commit that referenced this pull request Feb 17, 2020
[SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Closes #27478 from xuanyuanking/SPARK-25892-follow.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit ab186e3)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@SparkQA

SparkQA commented Feb 17, 2020

Test build #118578 has finished for PR 27478 at commit feb8c0a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member Author

Thanks all for the review.

@xuanyuanking xuanyuanking deleted the SPARK-25892-follow branch February 18, 2020 02:11
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
[SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Closes apache#27478 from xuanyuanking/SPARK-25892-follow.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>