
[SPARK-25829][SQL] Add config spark.sql.legacy.allowDuplicatedMapKeys and change the default behavior #27478

Closed
wants to merge 7 commits

Conversation

Member

@xuanyuanking commented Feb 6, 2020

What changes were proposed in this pull request?

This is a follow-up for #23124, adding a new config `spark.sql.legacy.allowDuplicatedMapKeys` to control the behavior of removing duplicated map keys in built-in functions. With the default value `false`, Spark throws a RuntimeException when duplicated keys are found.
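For illustration, here is a minimal sketch of the behavior change (assuming a local `SparkSession` named `spark`; the exact exception message is paraphrased):

```scala
// Default (spark.sql.legacy.allowDuplicatedMapKeys=false): fail fast on duplicates.
spark.sql("SELECT map(1, 'a', 1, 'b')").collect()
// => throws java.lang.RuntimeException mentioning the duplicate map key 1

// Legacy behavior: deduplicate with the last-wins policy.
spark.conf.set("spark.sql.legacy.allowDuplicatedMapKeys", "true")
spark.sql("SELECT map(1, 'a', 1, 'b')").collect()
// => Array(Row(Map(1 -> "b")))
```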

Why are the changes needed?

Prevent silent behavior changes.

Does this PR introduce any user-facing change?

Yes. A new config is added, and the default behavior for duplicated map keys is changed to throwing a RuntimeException.

How was this patch tested?

Modified existing UTs.

@SparkQA

SparkQA commented Feb 6, 2020

Test build #117998 has finished for PR 27478 at commit 09cfb67.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Retest this please.

@@ -651,8 +651,10 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSparkSession {
Row(null)
)

checkAnswer(df1.selectExpr("map_concat(map1, map2)"), expected1a)
checkAnswer(df1.select(map_concat($"map1", $"map2")), expected1a)
withSQLConf(SQLConf.DEDUPLICATE_MAP_KEY_WITH_LAST_WINS_POLICY.key -> "true") {
Member

Can we have one test that checks the exception when this configuration is false?

Contributor

+1

Member Author

Thanks, added in ArrayBasedMapBuilderSuite.
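For reference, a rough sketch of what such a test could look like, assuming `ArrayBasedMapBuilder` fails on `put` of a duplicate key (a sketch only; the actual test added may differ):

```scala
import org.apache.spark.SparkFunSuite
import org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder
import org.apache.spark.sql.types.IntegerType

class ArrayBasedMapBuilderSuite extends SparkFunSuite {
  test("fail when a duplicated map key is detected") {
    val builder = new ArrayBasedMapBuilder(IntegerType, IntegerType)
    builder.put(1, 1)
    // With spark.sql.legacy.allowDuplicatedMapKeys=false (the new default),
    // inserting a duplicate key should fail fast instead of silently deduplicating.
    val e = intercept[RuntimeException](builder.put(1, 2))
    assert(e.getMessage.contains("Duplicate map key"))
  }
}
```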

@SparkQA

SparkQA commented Feb 7, 2020

Test build #118005 has finished for PR 27478 at commit 09cfb67.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -49,7 +49,7 @@ license: |

- In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but -0.0 and 0.0 are considered as different values when used in aggregate grouping keys, window partition keys and join keys. Since Spark 3.0, this bug is fixed. For example, `Seq(-0.0, 0.0).toDF("d").groupBy("d").count()` returns `[(0.0, 2)]` in Spark 3.0, and `[(0.0, 1), (-0.0, 1)]` in Spark 2.4 and earlier.

- In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of a map with duplicated keys is undefined, e.g. map lookup respects the first occurrence of a duplicated key, `Dataset.collect` keeps only the last occurrence, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, these built-in functions will remove duplicated map keys with the last-wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet); the behavior will be undefined.
- In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of a map with duplicated keys is undefined, e.g. map lookup respects the first occurrence of a duplicated key, `Dataset.collect` keeps only the last occurrence, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, a new config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` was added, with the default value `false`; Spark will throw a RuntimeException when duplicated keys are found. If it is set to `true`, these built-in functions will remove duplicated map keys with the last-wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet); the behavior will be undefined.
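To make the documented change concrete, a hedged sketch using `str_to_map` (the config name is the one under review here; it was later renamed to `spark.sql.legacy.allowDuplicatedMapKeys`):

```scala
// The key 'a' appears twice, so the built map has a duplicated key.
spark.sql("SELECT str_to_map('a:1,b:2,a:3')").collect()
// Spark 2.4: undefined which value wins; Spark 3.0 default: RuntimeException.

// With the last-wins policy enabled, the later value is kept: Map(a -> 3, b -> 2).
spark.conf.set("spark.sql.deduplicateMapKey.lastWinsPolicy.enabled", "true")
spark.sql("SELECT str_to_map('a:1,b:2,a:3')").collect()
```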
Member

Hi, @cloud-fan , @gatorsmile , @rxin , @marmbrus .
So, RuntimeException by default is better in Apache Spark 3.0.0?

Contributor

The principle is still no failure by default, but a version upgrade is a different case. We should try our best to avoid "silent result changing".

We may want to introduce a new category of configs that are migration-only and should be removed in the next major version.

Thoughts? @srowen @viirya @maropu @bart-samwel

Member

Should we mention that this `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` config is only useful for migration and could be removed in a future version?

Member

This idea sounds reasonable. If we think a behavior change is correct but, for migration reasons, we don't want it to happen silently, a config like this one is an option we can use to notify users of the change.

Member

I agree with the idea to avoid "silent result changing". Btw, couldn't we keep the old (Spark 2.4) behaviour for duplicate keys by using a legacy option? If we can't for some reason, the proposed one (the runtime exception) looks reasonable to me, too.

Member

The idea sounds fine.

Member Author

+1 for the idea of highlighting that this config is migration-only.

Member

I'm okay with the rule, @cloud-fan. At this time, could you add a guideline to our website? Actually, many correctness bug fixes in the migration guide have been silent result changes with a legacy fallback config.

Member Author
@xuanyuanking commented Feb 8, 2020

Agreed on having a guideline if we need the new rule, to make clear that the silent result change here is caused by the behavior change, not by bug fixes.

Contributor

What I have in mind is:

  • If it's a bug fix (the previous result is definitely wrong), then no config is needed. If the impact is big, we can add a legacy config that is false by default.
  • If it makes the behavior better, we should either add a config and use the old behavior by default, or fail by default and ask users to set the config explicitly to pick the desired behavior.

I'm trying to think of more cases; I will send an email to the dev list soon.

@SparkQA

SparkQA commented Feb 7, 2020

Test build #118040 has finished for PR 27478 at commit fe1fb0f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 8, 2020

Test build #118052 has finished for PR 27478 at commit 66dc51f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member

gengliangwang commented Feb 12, 2020

I Google-searched "deduplicate map key" and there were no matching results.
Since a runtime exception will be thrown on duplicated keys, how about renaming the config to `spark.sql.allowDuplicatedMapKeys.enabled`?

@xuanyuanking
Member Author

Thanks, Gengliang. As the config is still related to legacy behavior, I renamed it to `spark.sql.legacy.allowDuplicatedMapKeys`.

@cloud-fan
Contributor

LGTM. Can you update the PR title as well?

@SparkQA

SparkQA commented Feb 14, 2020

Test build #118423 has finished for PR 27478 at commit f622180.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking xuanyuanking changed the title [SPARK-25829][SQL] Add config spark.sql.deduplicateMapKey.lastWinsPolicy.enabled and change the default behavior [SPARK-25829][SQL] Add config spark.sql.legacy.allowDuplicatedMapKeys and change the default behavior Feb 15, 2020
@SparkQA

SparkQA commented Feb 15, 2020

Test build #118468 has finished for PR 27478 at commit b102c36.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

hiveResultString(df.queryExecution.executedPlan).mkString("\n"))
val expected = unindentAndTrim(output)
assert(actual === expected)
withSQLConf(SQLConf.LEGACY_ALLOW_DUPLICATED_MAP_KEY.key -> "true") {
Contributor

Can you check the map-related builtin expressions? It seems they use wrong examples that have duplicated keys.
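As a hypothetical illustration of the concern (the actual doc examples may differ), an example along these lines duplicates the key `2` across its inputs, so it would now throw under the new default:

```scala
spark.sql("SELECT map_concat(map(1, 'a', 2, 'b'), map(2, 'c', 3, 'd'))").collect()
// => RuntimeException unless spark.sql.legacy.allowDuplicatedMapKeys=true
```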

@SparkQA

SparkQA commented Feb 17, 2020

Test build #118547 has finished for PR 27478 at commit feb8c0a.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Feb 17, 2020

Test build #118537 has finished for PR 27478 at commit 905ae96.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest it please

@xuanyuanking
Member Author

retest this please

@SparkQA

SparkQA commented Feb 17, 2020

Test build #118567 has finished for PR 27478 at commit feb8c0a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest it please

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Feb 17, 2020

Test build #118551 has finished for PR 27478 at commit feb8c0a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master/3.0!

@cloud-fan cloud-fan closed this in ab186e3 Feb 17, 2020
cloud-fan pushed a commit that referenced this pull request Feb 17, 2020
[SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Closes #27478 from xuanyuanking/SPARK-25892-follow.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit ab186e3)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@SparkQA

SparkQA commented Feb 17, 2020

Test build #118578 has finished for PR 27478 at commit feb8c0a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member Author

Thanks all for the review.

@xuanyuanking xuanyuanking deleted the SPARK-25892-follow branch February 18, 2020 02:11
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
[SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior

Closes apache#27478 from xuanyuanking/SPARK-25892-follow.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>