[SPARK-25829][SQL] remove duplicated map keys with last wins policy #23124

cloud-fan · 2018-11-23T05:33:45Z

What changes were proposed in this pull request?

Currently duplicated map keys are not handled consistently. For example, map look up respects the duplicated key appears first, Dataset.collect only keeps the duplicated key appears last, MapKeys returns duplicated keys, etc.

This PR proposes to remove duplicated map keys with last wins policy, to follow Java/Scala and Presto. It only applies to built-in functions, as users can create map with duplicated map keys via private APIs anyway.

updated functions: CreateMap, MapFromArrays, MapFromEntries, StringToMap, MapConcat, TransformKeys.

For other places:

data source v1 doesn't have this problem, as users need to provide a java/scala map, which can't have duplicated keys.
data source v2 may have this problem. I've added a note to ArrayBasedMapData to ask the caller to take care of duplicated keys. In the future we should enforce it in the stable data APIs for data source v2.
UDF doesn't have this problem, as users need to provide a java/scala map. Same as data source v1.
file format. I checked all of them and only parquet does not enforce it. For backward compatibility reasons I change nothing but leave a note saying that the behavior will be undefined if users write map with duplicated keys to parquet files. Maybe we can add a config and fail by default if parquet files have map with duplicated keys. This can be done in followup.

How was this patch tested?

updated tests and new tests

cloud-fan · 2018-11-23T05:35:24Z

cc @dongjoon-hyun @gatorsmile @viirya @kiszk @mgaido91

cloud-fan · 2018-11-23T05:38:02Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/InternalRow.scala

-    case _: MapType => (input, ordinal) => input.getMap(ordinal)
-    case u: UserDefinedType[_] => getAccessor(u.sqlType)
-    case _ => (input, ordinal) => input.get(ordinal, dataType)
+  def getAccessor(dt: DataType, nullable: Boolean = true): (SpecializedGetters, Int) => Any = {


I can move it to a new PR if others think it's necessary. It's a little dangerous to ask the caller side to take care of null values.

cloud-fan · 2018-11-23T05:39:59Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala

-      val isKeyPrimitive = CodeGenerator.isPrimitiveType(dataType.keyType)
-      val isValuePrimitive = CodeGenerator.isPrimitiveType(dataType.valueType)
-      val code = if (isKeyPrimitive && isValuePrimitive) {
-        genCodeForPrimitiveElements(ctx, c, ev.value, numEntries)


It's unclear how we can keep this optimization if we need to remove duplicated keys. Personally I don't think it's worth the effort to keep such a complex optimization for non-critial code path.

This change allows us to focus on optimizing ArrayBasedMapBuilder in another PR.

dongjoon-hyun · 2018-11-23T06:02:23Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala

+ * duplicated map keys w.r.t. the last wins policy.
+ */
+class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Serializable {
+  assert(!keyType.existsRecursively(_.isInstanceOf[MapType]), "key of map cannot be/contain map")


Shall we add assert to prevent NullType here, too?

dongjoon-hyun · 2018-11-23T06:06:07Z

docs/sql-migration-guide-upgrade.md

@@ -19,6 +19,8 @@ displayTitle: Spark SQL Upgrading Guide

  - In Spark version 2.4 and earlier, users can create map values with map type key via built-in function like `CreateMap`, `MapFromArrays`, etc. Since Spark 3.0, it's not allowed to create map values with map type key with these built-in functions. Users can still read map values with map type key from data source or Java/Scala collections, though they are not very useful.

+  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, these built-in functions will remove duplicated map keys with last wins policy.


Can we merge this with the above sentence at line 20? Both are different, but are related very strongly. In fact, it's a change of Map semantics.

They are related, but they are not the same. For example, we don't support map type as key, because we can't check equality of map type correctly. This is just a current implementation limitation, and we may relax it in the future.

Duplicated map keys is a real problem and we will never allow it.

dongjoon-hyun · 2018-11-23T06:07:55Z

.../src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala

-    override def end(): Unit =
+    override def end(): Unit = {
+      // The parquet map may contains null or duplicated map keys. When it happens, the behavior is
+      // undefined.


What about creating a Spark JIRA issue for this and embedded that ID here?

dongjoon-hyun · 2018-11-23T06:18:34Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala

+    case _ =>
+      // for complex types, use interpreted ordering to be able to compare unsafe data with safe
+      // data, e.g. UnsafeRow vs GenericInternalRow.
+      mutable.TreeMap.empty[Any, Int](TypeUtils.getInterpretedOrdering(keyType))


scala> sql("select map(null,2)") res1: org.apache.spark.sql.DataFrame = [map(NULL, 2): map<null,int>] scala> sql("select map(null,2)").collect scala.MatchError: NullType (of class org.apache.spark.sql.types.NullType$) at org.apache.spark.sql.catalyst.util.TypeUtils$.getInterpretedOrdering(TypeUtils.scala:67) at org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder.keyToIndex$lzycompute(ArrayBasedMapBuilder.scala:37)

I think we should fail it at analyzer phase, and other map-producing functions should do it as well. Can you create a JIRA for it? thanks!

After merging this PR, I'll check again and file a JIRA for that.

SparkQA · 2018-11-23T08:05:02Z

Test build #99209 has finished for PR 23124 at commit cbcd5d7.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Serializable

mgaido91 · 2018-11-23T08:09:34Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala

  override def eval(input: InternalRow): Any = {
-    val maps = children.map(_.eval(input))
+    val maps = children.map(_.eval(input).asInstanceOf[MapData]).toArray


why do we need toArray here?

I need to access it by index below, turn it to array so that the access is guaranteed to be O(1).

well, my understanding is that we could do a maps.foreach instead of accessing them by index. I don't see the index to be significant at all, but maybe I am missing something...

in scala, while loop is faster than foreach. If you look at Expression.eval implementations, we use while loop a lot even foreach can produce simpler code.

BTW, if it's not true anymore with scala 2.12, we should update them together with a bechmark, instead of only updating this single one.

Yes, but converting toArray may require an extra O(N) operation for the copy, so I am not sure the difference between while and foreach is significant enough to cover the overhead of the copy...

mgaido91 · 2018-11-23T08:15:19Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala

        |}
+        |if ($numElementsName > ${ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH}) {


this check is not really correct, as we are not considering duplicates IIUC. I think we can change this behavior using putAll and checking the size in the loop.

This check is done before the putAll, so that it can fail fast. I think it's fine to ignore duplicated keys here, to make it a more conservative.

yes, but we could do the putAll before and eventually fail when we reach the limit. We can maybe do that in a followup, though, as it is not introducing any regression..

yup. I actually did what you proposed at first, and then realized it's different from before and may introduce perf regression. We can investigate it in a followup.

I see, I agree doing it in a followup, thanks.

mgaido91 · 2018-11-23T08:21:08Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala

+
+  // write a 2-field row, the first field is key and the second field is value.
+  def put(entry: InternalRow): Unit = {
+    if (entry.isNullAt(0)) {


this is checked only here and not in all the other put...I think we should be consistent and either check it always or never do it..

There are 2 put methods have this null check and other put methods all go through them.

Oh I see now, I missed it, thanks.

viirya · 2018-11-23T08:23:52Z

retest this please.

viirya · 2018-11-23T08:41:06Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala

+  }
+
+  def from(keyArray: ArrayData, valueArray: ArrayData): ArrayBasedMapData = {
+    assert(keyToIndex.isEmpty, "'from' can only be called with a fresh GenericMapBuilder.")


ArrayBasedMapBuilder instead of GenericMapBuilder?

viirya · 2018-11-23T08:48:29Z

docs/sql-migration-guide-upgrade.md

@@ -19,6 +19,8 @@ displayTitle: Spark SQL Upgrading Guide

  - In Spark version 2.4 and earlier, users can create map values with map type key via built-in function like `CreateMap`, `MapFromArrays`, etc. Since Spark 3.0, it's not allowed to create map values with map type key with these built-in functions. Users can still read map values with map type key from data source or Java/Scala collections, though they are not very useful.

+  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, these built-in functions will remove duplicated map keys with last wins policy.


Similar as above, shall we also mention data source can have be read with duplicated map keys?

viirya · 2018-11-23T09:19:09Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala

+    put(keyGetter(entry, 0), valueGetter(entry, 1))
+  }
+
+  def putAll(keyArray: Array[Any], valueArray: Array[Any]): Unit = {


Has this method been used? Looks like only another putAll below is used.

ah good catch!

viirya · 2018-11-23T09:24:13Z

If not too verbose, we can update the ExpressionDescription of the built-in expressions to declare the last wins policy.

SparkQA · 2018-11-23T10:15:18Z

Test build #99213 has finished for PR 23124 at commit cbcd5d7.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Serializable

kiszk · 2018-11-23T13:53:55Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala

+
+  def from(keyArray: ArrayData, valueArray: ArrayData): ArrayBasedMapData = {
+    assert(keyToIndex.isEmpty, "'from' can only be called with a fresh GenericMapBuilder.")
+    putAll(keyArray, valueArray)


Can we call new ArrayBasedMapData(keyArray, valueArray) without calling putAll(keyArray, valueArray) if keyArray.asInstanceOf[ArrayData].containsNull is false? This is a kind of optimizations.

no we can't, as we still need to detect duplicated keys.

Ah, you are right.

kiszk · 2018-11-23T14:07:15Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala

+    if (keyToIndex.size == keyArray.numElements()) {
+      // If there is no duplicated map keys, creates the MapData with the input key and value array,
+      // as they might already in unsafe format and are more efficient.
+      new ArrayBasedMapData(keyArray, valueArray)


ditto in build

kiszk · 2018-11-23T14:16:12Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala

+  }
+
+  def build(): ArrayBasedMapData = {
+    new ArrayBasedMapData(new GenericArrayData(keys.toArray), new GenericArrayData(values.toArray))


Is it better to call reset() after calling new ArrayBasedMapData to reduce memory consumption in Java heap?

At caller side, ArrayBasedMapBuilder is not released. Therefore, until reset() will be called next time, each ArrayBasedMapBuilder keeps unused data in keys, values, and keyToIndex. They consumes Java heap unexpectedly.

SparkQA · 2018-11-26T17:58:32Z

Test build #99276 has finished for PR 23124 at commit b2bfb33.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2018-11-26T18:35:17Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala

@@ -89,7 +89,7 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
    val msg1 = intercept[Exception] {
      df5.select(map_from_arrays($"k", $"v")).collect
    }.getMessage
-    assert(msg1.contains("Cannot use null as map key!"))
+    assert(msg1.contains("Cannot use null as map key"))


Message at Line 98 is also changed now.

SparkQA · 2018-11-27T07:38:29Z

Test build #99304 has finished for PR 23124 at commit 0c77c41.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-11-27T08:05:01Z

Test build #99310 has finished for PR 23124 at commit abd0ec5.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-11-27T08:23:09Z

retest this please

SparkQA · 2018-11-27T11:25:39Z

Test build #99312 has finished for PR 23124 at commit abd0ec5.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-11-27T15:15:45Z

Test build #99325 has finished for PR 23124 at commit b7073b2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

bersprockets · 2018-11-28T06:25:53Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala

+  assert(keyType != NullType, "map key cannot be null type.")
+
+  private lazy val keyToIndex = keyType match {
+    case _: AtomicType | _: CalendarIntervalType => mutable.HashMap.empty[Any, Int]


FYI: I had a test lying around from when I worked on map_concat. With this PR:

map_concat of two small maps (20 string keys per map, no dups) for 2M rows is about 17% slower.

map_concat of two big maps (500 string keys per map, no dups) for 1M rows is about 25% slower.

The baseline code is the same branch as the PR, but without the 4 commits.

Some cost makes sense, as we're checking for dups, but it's odd that the overhead grows disproportionately as the size of the maps grows.

I remember that at one time, mutable.HashMap had some performance issues (rumor has it, anyway). So as a test, I modified ArrayBasedMapBuilder.scala to use java.util.Hashmap instead. After that:

map_concat of two small maps (20 string keys per map, no dups) for 2M rows is about 12% slower.

map_concat of two big maps (500 string keys per map, no dups) for 1M rows is about 15% slower.

It's a little more proportionate. I don't know if switching HashMap implementations would have some negative consequences.

Also, my test is a dumb benchmark that uses System.currentTimeMillis concatenating simple [String,Integer] maps.

I think for performance critical code path we should prefer java collection. thanks for pointing it out!

ueshin · 2018-11-28T06:43:58Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala

+  assert(keyType != NullType, "map key cannot be null type.")
+
+  private lazy val keyToIndex = keyType match {
+    case _: AtomicType | _: CalendarIntervalType => mutable.HashMap.empty[Any, Int]


We need to exclude BinaryType from AtomicType here.

ueshin · 2018-11-28T06:57:34Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilderSuite.scala

+    assert(map.numElements() == 2)
+    assert(ArrayBasedMapData.toScalaMap(map) ==
+      Map(new GenericArrayData(Seq(1, 1)) -> 3, new GenericArrayData(Seq(2, 2)) -> 2))
+  }


We should have a binary type key test as well?

SparkQA · 2018-11-28T08:05:01Z

Test build #99357 has finished for PR 23124 at commit 6dff654.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2018-11-28T08:13:38Z

retest this please

ueshin

LGTM.

ueshin · 2018-11-28T08:01:58Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapData.scala

+ * A simple `MapData` implementation which is backed by 2 arrays.
+ *
+ * Note that, user is responsible to guarantee that the key array does not have duplicated
+ * elements, otherwise the behavior is undefined.


nit: we might need to add the same note to the 3rd and 4th ArrayBasedMapData.apply() method.

mgaido91 · 2018-11-28T09:24:56Z

LGTM too

SparkQA · 2018-11-28T12:26:27Z

Test build #99362 has finished for PR 23124 at commit 6dff654.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-11-28T15:21:50Z

Test build #99368 has finished for PR 23124 at commit 72c771e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-11-28T15:43:04Z

thanks, merging to master!

dongjoon-hyun · 2018-11-28T18:29:24Z

Thank you so much, @cloud-fan !

gatorsmile · 2018-12-10T07:33:28Z

docs/sql-migration-guide-upgrade.md

@@ -27,6 +27,8 @@ displayTitle: Spark SQL Upgrading Guide

  - In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but users can still distinguish them via `Dataset.show`, `Dataset.collect` etc. Since Spark 3.0, float/double -0.0 is replaced by 0.0 internally, and users can't distinguish them any more.

+  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be udefined.


A few typos.

In Spark version 2.4 and earlier, users can create a map with duplicate keys via built-in functions like CreateMap and StringToMap. The behavior of map with duplicate keys is undefined. For example, the map lookup respects the duplicate key that appears first, Dataset.collect only keeps the duplicate key that appears last, and MapKeys returns duplicate keys. Since Spark 3.0, these built-in functions will remove duplicate map keys using the last-one-wins policy. Users may still read map values with duplicate keys from the data sources that do not enforce it (e.g. Parquet), but the behavior will be undefined.

## What changes were proposed in this pull request? Currently duplicated map keys are not handled consistently. For example, map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. This PR proposes to remove duplicated map keys with last wins policy, to follow Java/Scala and Presto. It only applies to built-in functions, as users can create map with duplicated map keys via private APIs anyway. updated functions: `CreateMap`, `MapFromArrays`, `MapFromEntries`, `StringToMap`, `MapConcat`, `TransformKeys`. For other places: 1. data source v1 doesn't have this problem, as users need to provide a java/scala map, which can't have duplicated keys. 2. data source v2 may have this problem. I've added a note to `ArrayBasedMapData` to ask the caller to take care of duplicated keys. In the future we should enforce it in the stable data APIs for data source v2. 3. UDF doesn't have this problem, as users need to provide a java/scala map. Same as data source v1. 4. file format. I checked all of them and only parquet does not enforce it. For backward compatibility reasons I change nothing but leave a note saying that the behavior will be undefined if users write map with duplicated keys to parquet files. Maybe we can add a config and fail by default if parquet files have map with duplicated keys. This can be done in followup. ## How was this patch tested? updated tests and new tests Closes apache#23124 from cloud-fan/map. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

…perly the limit size ## What changes were proposed in this pull request? The PR starts from the [comment](apache#23124 (comment)) in the main one and it aims at: - simplifying the code for `MapConcat`; - be more precise in checking the limit size. ## How was this patch tested? existing tests Closes apache#23217 from mgaido91/SPARK-25829_followup. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

…s` and change the default behavior ### What changes were proposed in this pull request? This is a follow-up for #23124, add a new config `spark.sql.legacy.allowDuplicatedMapKeys` to control the behavior of removing duplicated map keys in build-in functions. With the default value `false`, Spark will throw a RuntimeException while duplicated keys are found. ### Why are the changes needed? Prevent silent behavior changes. ### Does this PR introduce any user-facing change? Yes, new config added and the default behavior for duplicated map keys changed to RuntimeException thrown. ### How was this patch tested? Modify existing UT. Closes #27478 from xuanyuanking/SPARK-25892-follow. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

…s` and change the default behavior ### What changes were proposed in this pull request? This is a follow-up for #23124, add a new config `spark.sql.legacy.allowDuplicatedMapKeys` to control the behavior of removing duplicated map keys in build-in functions. With the default value `false`, Spark will throw a RuntimeException while duplicated keys are found. ### Why are the changes needed? Prevent silent behavior changes. ### Does this PR introduce any user-facing change? Yes, new config added and the default behavior for duplicated map keys changed to RuntimeException thrown. ### How was this patch tested? Modify existing UT. Closes #27478 from xuanyuanking/SPARK-25892-follow. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit ab186e3) Signed-off-by: Wenchen Fan <wenchen@databricks.com>

…s` and change the default behavior ### What changes were proposed in this pull request? This is a follow-up for apache#23124, add a new config `spark.sql.legacy.allowDuplicatedMapKeys` to control the behavior of removing duplicated map keys in build-in functions. With the default value `false`, Spark will throw a RuntimeException while duplicated keys are found. ### Why are the changes needed? Prevent silent behavior changes. ### Does this PR introduce any user-facing change? Yes, new config added and the default behavior for duplicated map keys changed to RuntimeException thrown. ### How was this patch tested? Modify existing UT. Closes apache#27478 from xuanyuanking/SPARK-25892-follow. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

cloud-fan commented Nov 23, 2018

View reviewed changes

dongjoon-hyun reviewed Nov 23, 2018

View reviewed changes

mgaido91 reviewed Nov 23, 2018

View reviewed changes

viirya reviewed Nov 23, 2018

View reviewed changes

kiszk reviewed Nov 23, 2018

View reviewed changes

cloud-fan mentioned this pull request Nov 26, 2018

[SPARK-26021][SQL][followup] add test for special floating point values #23141

Closed

cloud-fan added 2 commits November 26, 2018 22:34

remove duplicated map keys with last wins policy

9df6274

address comments

b2bfb33

cloud-fan force-pushed the map branch from cbcd5d7 to b2bfb33 Compare November 26, 2018 15:06

kiszk reviewed Nov 26, 2018

View reviewed changes

fix test

0c77c41

fix pyspark test

b7073b2

cloud-fan force-pushed the map branch from abd0ec5 to b7073b2 Compare November 27, 2018 11:43

bersprockets reviewed Nov 28, 2018

View reviewed changes

ueshin reviewed Nov 28, 2018

View reviewed changes

address comments

6dff654

ueshin reviewed Nov 28, 2018

View reviewed changes

cloud-fan added 2 commits November 28, 2018 19:23

Merge branch 'master' into map

2b4c03b

add comment

72c771e

asfgit closed this in fa0d4bf Nov 28, 2018

mgaido91 mentioned this pull request Dec 4, 2018

[SPARK-25829][SQL][FOLLOWUP] Refactor MapConcat in order to check properly the limit size #23217

Closed

gatorsmile reviewed Dec 10, 2018

View reviewed changes

asiunov mentioned this pull request Jan 9, 2019

[MINOR] Follow up for SPARK-23936, fixed description of "map_concat" function #23493

Closed

xuanyuanking mentioned this pull request Feb 6, 2020

[SPARK-25829][SQL] Add config spark.sql.legacy.allowDuplicatedMapKeys and change the default behavior #27478

Closed

iRakson mentioned this pull request Feb 12, 2020

[SPARK-30790][SQL] The dataType of map() should be map<null,null> #27542

Closed

		@@ -19,6 +19,8 @@ displayTitle: Spark SQL Upgrading Guide

		- In Spark version 2.4 and earlier, users can create map values with map type key via built-in function like `CreateMap`, `MapFromArrays`, etc. Since Spark 3.0, it's not allowed to create map values with map type key with these built-in functions. Users can still read map values with map type key from data source or Java/Scala collections, though they are not very useful.

		- In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, these built-in functions will remove duplicated map keys with last wins policy.

		\|}
		\|if ($numElementsName > ${ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH}) {

		@@ -27,6 +27,8 @@ displayTitle: Spark SQL Upgrading Guide

		- In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but users can still distinguish them via `Dataset.show`, `Dataset.collect` etc. Since Spark 3.0, float/double -0.0 is replaced by 0.0 internally, and users can't distinguish them any more.

		- In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be udefined.

[SPARK-25829][SQL] remove duplicated map keys with last wins policy #23124

[SPARK-25829][SQL] remove duplicated map keys with last wins policy #23124

Conversation

cloud-fan commented Nov 23, 2018 • edited Loading

What changes were proposed in this pull request?

How was this patch tested?

cloud-fan commented Nov 23, 2018

Choose a reason for hiding this comment

cloud-fan Nov 23, 2018 • edited Loading

Choose a reason for hiding this comment

kiszk Nov 23, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dongjoon-hyun Nov 23, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

SparkQA commented Nov 23, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

viirya commented Nov 23, 2018

viirya Nov 23, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

viirya commented Nov 23, 2018

SparkQA commented Nov 23, 2018

kiszk Nov 23, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

SparkQA commented Nov 26, 2018

Choose a reason for hiding this comment

SparkQA commented Nov 27, 2018

SparkQA commented Nov 27, 2018

cloud-fan commented Nov 27, 2018

SparkQA commented Nov 27, 2018

SparkQA commented Nov 27, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ueshin Nov 28, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

SparkQA commented Nov 28, 2018

kiszk commented Nov 28, 2018

ueshin left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mgaido91 commented Nov 28, 2018

SparkQA commented Nov 28, 2018

SparkQA commented Nov 28, 2018

cloud-fan commented Nov 28, 2018

dongjoon-hyun commented Nov 28, 2018

gatorsmile Dec 10, 2018 • edited Loading

Choose a reason for hiding this comment

cloud-fan commented Nov 23, 2018 •

edited

Loading

cloud-fan Nov 23, 2018 •

edited

Loading

kiszk Nov 23, 2018 •

edited

Loading

dongjoon-hyun Nov 23, 2018 •

edited

Loading

viirya Nov 23, 2018 •

edited

Loading

kiszk Nov 23, 2018 •

edited

Loading

ueshin Nov 28, 2018 •

edited

Loading

gatorsmile Dec 10, 2018 •

edited

Loading