[SPARK-46260][PYTHON][SQL] `DataFrame.withColumnsRenamed` should respect the dict ordering #44177

zhengruifeng · 2023-12-05T04:58:20Z

What changes were proposed in this pull request?

Make DataFrame.withColumnsRenamed apply the renames in the order given by the input dict.

Why are the changes needed?

The ordering in withColumnsRenamed matters, because the renames are applied one after another.

In Scala:

scala> val df = spark.range(1000)
val df: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> df.withColumnsRenamed(Map("id" -> "a", "a" -> "b"))
val res0: org.apache.spark.sql.DataFrame = [b: bigint]

scala> df.withColumnsRenamed(Map("a" -> "b", "id" -> "a"))
val res1: org.apache.spark.sql.DataFrame = [a: bigint]

However, in py4j the Python dict -> JVM map conversion does not guarantee the ordering.
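To make the order dependence concrete, here is a minimal pure-Python sketch of sequential renaming. It does not use Spark; `apply_renames` is a hypothetical helper that only mimics the left fold over rename pairs, and it ignores details such as case-insensitive name resolution.

```python
from typing import Iterable, List, Tuple


def apply_renames(columns: List[str], renames: Iterable[Tuple[str, str]]) -> List[str]:
    """Apply (old, new) renames one after another, like a left fold."""
    result = list(columns)
    for old, new in renames:
        # Each step sees the column names produced by the previous step.
        result = [new if name == old else name for name in result]
    return result


# "id" -> "a" runs first, so the second pair can then rename "a" to "b".
first = apply_renames(["id"], [("id", "a"), ("a", "b")])

# "a" -> "b" runs first and matches nothing; only "id" -> "a" has an effect.
second = apply_renames(["id"], [("a", "b"), ("id", "a")])

print(first, second)
```

The two calls use the same pairs in a different order and end with different column names, which is why losing the dict order during conversion changes the result.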

Does this PR introduce any user-facing change?

Yes, this is a behavior change.

Before this PR:

In [1]: df = spark.range(10)

In [2]: df.withColumnsRenamed({"id": "a", "a": "b"})
Out[2]: DataFrame[a: bigint]

In [3]: df.withColumnsRenamed({"a": "b", "id": "a"})
Out[3]: DataFrame[a: bigint]

After this PR:

In [1]: df = spark.range(10)

In [2]: df.withColumnsRenamed({"id": "a", "a": "b"})
Out[2]: DataFrame[b: bigint]

In [3]: df.withColumnsRenamed({"a": "b", "id": "a"})
Out[3]: DataFrame[a: bigint]

How was this patch tested?

Added unit tests.

Was this patch authored or co-authored using generative AI tooling?

no

zhengruifeng · 2023-12-05T05:00:32Z

python/pyspark/sql/tests/connect/test_parity_dataframe.py

@@ -77,6 +77,11 @@ def test_to_pandas_from_mixed_dataframe(self):
    def test_toDF_with_string(self):
        super().test_toDF_with_string()

+    # TODO(SPARK-46260): DataFrame.withColumnsRenamed should respect the dict ordering


I am hitting a weird proto issue; I will fix it in a separate PR.

zhengruifeng · 2023-12-05T05:02:36Z

cc @HyukjinKwon @cloud-fan

dongjoon-hyun · 2023-12-05T05:08:53Z

python/pyspark/sql/tests/connect/test_parity_dataframe.py

@@ -77,6 +77,11 @@ def test_to_pandas_from_mixed_dataframe(self):
    def test_toDF_with_string(self):
        super().test_toDF_with_string()

+    # TODO(SPARK-46261): DataFrame.withColumnsRenamed should respect the dict ordering


May I ask why the TODO JIRA has the same title as this PR?

Oh, it was a mistake; I changed it to SPARK-46261.

Could you update this comment too?

- # TODO(SPARK-46261): DataFrame.withColumnsRenamed should respect the dict ordering
+ # TODO(SPARK-46261): Python Client DataFrame.withColumnsRenamed should respect the dict ordering

dongjoon-hyun · 2023-12-05T05:11:00Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

@@ -2922,18 +2922,29 @@ class Dataset[T] private[sql](
   */
  @throws[AnalysisException]
  def withColumnsRenamed(colsMap: Map[String, String]): DataFrame = withOrigin {


Since we touch Dataset.scala, could you include [SQL] in the PR title?

sounds good

dongjoon-hyun · 2023-12-05T05:14:21Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

    val resolver = sparkSession.sessionState.analyzer.resolver
    val output: Seq[NamedExpression] = queryExecution.analyzed.output

-    val projectList = colsMap.foldLeft(output) {
+    val projectList = colNames.zip(newColNames).foldLeft(output) {


We believe we need a Scala test case for this change, @zhengruifeng, because the PR description claims that the ordering matters in the Scala code.

Could you add a new simple test case which fails without this change?

This is a Python issue.
To make sure the Scala side is not changed, I added a Scala test.
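The diff above folds over `colNames.zip(newColNames)` rather than over the map itself. As a rough sketch of how a Python caller could produce two such parallel sequences without losing order, assuming Python 3.7+ dict semantics (`split_rename_dict` is a hypothetical helper, not the actual PySpark code):

```python
from typing import Dict, List, Tuple


def split_rename_dict(cols_map: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Split a rename dict into two parallel lists, keeping insertion order."""
    # Python 3.7+ guarantees that dicts iterate in insertion order,
    # and keys() and values() iterate in the same order as each other.
    col_names = list(cols_map.keys())
    new_col_names = list(cols_map.values())
    return col_names, new_col_names


col_names, new_col_names = split_rename_dict({"id": "a", "a": "b"})

# Zipping the two lists restores the ordered (old, new) pairs.
pairs = list(zip(col_names, new_col_names))
print(pairs)
```

Lists keep their order when they cross a language boundary, so the receiver can zip them back into ordered pairs instead of depending on map iteration order.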

dongjoon-hyun · 2023-12-05T05:31:30Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

+    val df = spark.range(10).toDF()
+    assert(df.withColumnsRenamed(Map("id" -> "a", "a" -> "b")).columns === Array("b"))
+    assert(df.withColumnsRenamed(Map("a" -> "b", "id" -> "a")).columns === Array("a"))
+  }


Thank you for this addition. Actually, Map has many incompatibility issues across Scala versions, for example between Scala 2.11, 2.12, and 2.13.

For now, since we have only Scala 2.13, I guess the behavior was consistent on the master branch. And this PR will help further.

Also cc @LuciferYang.

Hmm, to avoid such incompatibility issues, I think I need to use ListMap instead.

dongjoon-hyun · 2023-12-05T08:10:55Z

@zhengruifeng, I believe we can proceed with the PR as-is if this is only for Apache Spark 4.0.0.

HyukjinKwon · 2023-12-06T08:15:57Z

Merged to master.

… dict/map ordering

### What changes were proposed in this pull request?

This is a follow-up of #44177.

### Why are the changes needed?

According to [this](https://protobuf.dev/programming-guides/proto3/#maps-features):

> Wire format ordering and map iteration ordering of map values are undefined, so you cannot rely on your map items being in a particular order.

We should not use `map` in protobufs when the ordering is sensitive.

### Does this PR introduce _any_ user-facing change?

Yes, enabled test.

### How was this patch tested?

Enabled UT.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44231 from zhengruifeng/connect_with_cols_rm.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
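The follow-up's point about protobuf `map` fields can be illustrated without protobuf: an order-sensitive payload is safer as a sequence of pairs than as a map. The sketch below is only an illustration; `Rename`, its field names, and `to_ordered_renames` are hypothetical and are not taken from the Spark Connect proto definitions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Rename:
    """One rename step; the field names here are illustrative only."""
    col_name: str
    new_col_name: str


def to_ordered_renames(cols_map: Dict[str, str]) -> List[Rename]:
    """Turn a rename dict into a list, which carries its order explicitly."""
    return [Rename(old, new) for old, new in cols_map.items()]


cols_map = {"id": "a", "a": "b"}
ordered = to_ordered_renames(cols_map)

# A receiver of an unordered map may iterate in any order; sorting the keys
# is used here as one example of an order that differs from insertion order.
some_other_order = sorted(cols_map)

print([r.col_name for r in ordered], some_other_order)
```

A list of pairs plays the role that a repeated message would play on the wire: the order is part of the data instead of an accident of map iteration.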

…ct the dict ordering

Closes apache#44177 from zhengruifeng/sql_withColumnsRenamed_sql.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

github-actions bot added SQL PYTHON CONNECT labels Dec 5, 2023

init

55fd703

init nit

zhengruifeng force-pushed the sql_withColumnsRenamed_sql branch from 156498b to 55fd703 Compare December 5, 2023 04:59

nit

f9658d3

HyukjinKwon approved these changes Dec 5, 2023

dongjoon-hyun reviewed Dec 5, 2023

zhengruifeng changed the title ~~[SPARK-46260][PYTHON] DataFrame.withColumnsRenamed should respect the dict ordering~~ [SPARK-46260][PYTHON][SQL] DataFrame.withColumnsRenamed should respect the dict ordering Dec 5, 2023

dongjoon-hyun reviewed Dec 5, 2023

scala test

e8f6594

dongjoon-hyun reviewed Dec 5, 2023

listmap

a9c5060

dongjoon-hyun approved these changes Dec 5, 2023

HyukjinKwon closed this in 032e782 Dec 6, 2023

zhengruifeng deleted the sql_withColumnsRenamed_sql branch December 6, 2023 08:22

zhengruifeng mentioned this pull request Dec 7, 2023

[SPARK-46261][CONNECT] DataFrame.withColumnsRenamed should keep the dict/map ordering #44231

Closed
