Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-46260][PYTHON][SQL] DataFrame.withColumnsRenamed should respect the dict ordering #44177

Closed

Conversation

zhengruifeng
Copy link
Contributor

@zhengruifeng zhengruifeng commented Dec 5, 2023

What changes were proposed in this pull request?

Make DataFrame.withColumnsRenamed respect the dict ordering

Why are the changes needed?

the ordering in withColumnsRenamed matters

in scala

scala> val df = spark.range(1000)
val df: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> df.withColumnsRenamed(Map("id" -> "a", "a" -> "b"))
val res0: org.apache.spark.sql.DataFrame = [b: bigint]

scala> df.withColumnsRenamed(Map("a" -> "b", "id" -> "a"))
val res1: org.apache.spark.sql.DataFrame = [a: bigint]

However, in py4j the Python dict -> JVM map conversion can not guarantee the ordering

Does this PR introduce any user-facing change?

yes, behavior change

before this PR

In [1]: df = spark.range(10)

In [2]: df.withColumnsRenamed({"id": "a", "a": "b"})
Out[2]: DataFrame[a: bigint]

In [3]: df.withColumnsRenamed({"a": "b", "id": "a"})
Out[3]: DataFrame[a: bigint]

after this PR

In [1]: df = spark.range(10)

In [2]: df.withColumnsRenamed({"id": "a", "a": "b"})
Out[2]: DataFrame[b: bigint]

In [3]: df.withColumnsRenamed({"a": "b", "id": "a"})
Out[3]: DataFrame[a: bigint]

How was this patch tested?

added ut

Was this patch authored or co-authored using generative AI tooling?

no

init

nit
@@ -77,6 +77,11 @@ def test_to_pandas_from_mixed_dataframe(self):
def test_toDF_with_string(self):
super().test_toDF_with_string()

# TODO(SPARK-46260): DataFrame.withColumnsRenamed should respect the dict ordering
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am hitting a weird proto issue, will fix it in a separate pr

@zhengruifeng
Copy link
Contributor Author

cc @HyukjinKwon @cloud-fan

@@ -77,6 +77,11 @@ def test_to_pandas_from_mixed_dataframe(self):
def test_toDF_with_string(self):
super().test_toDF_with_string()

# TODO(SPARK-46261): DataFrame.withColumnsRenamed should respect the dict ordering
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May I ask why TODO JIRA has the same title with this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, it was a mistake , i changed it to SPARK-46261

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you update this comment too?

- # TODO(SPARK-46261): DataFrame.withColumnsRenamed should respect the dict ordering
+ # TODO(SPARK-46261): Python Client DataFrame.withColumnsRenamed should respect the dict ordering

@@ -2922,18 +2922,29 @@ class Dataset[T] private[sql](
*/
@throws[AnalysisException]
def withColumnsRenamed(colsMap: Map[String, String]): DataFrame = withOrigin {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we touch Dataset.scala, could you include [SQL] into the PR title?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good

@zhengruifeng zhengruifeng changed the title [SPARK-46260][PYTHON] DataFrame.withColumnsRenamed should respect the dict ordering [SPARK-46260][PYTHON][SQL] DataFrame.withColumnsRenamed should respect the dict ordering Dec 5, 2023
val resolver = sparkSession.sessionState.analyzer.resolver
val output: Seq[NamedExpression] = queryExecution.analyzed.output

val projectList = colsMap.foldLeft(output) {
val projectList = colNames.zip(newColNames).foldLeft(output) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We believe we need a Scala test case for this change, @zhengruifeng , because the PR description claims that Scala code ordering matters.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a new simple test case which fails without this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a python issue.
to make sure the scala side is not changed, I add a scala test

val df = spark.range(10).toDF()
assert(df.withColumnsRenamed(Map("id" -> "a", "a" -> "b")).columns === Array("b"))
assert(df.withColumnsRenamed(Map("a" -> "b", "id" -> "a")).columns === Array("a"))
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this addition. Actually, Map has many incompatibility issues across multiple Scala versions. For example, Scala 2.11/2.12/2.13.

For now, since we have only Scala 2.13, I guess the behavior was consistent on master branch. And, this PR will help it more.

Also, cc @LuciferYang , too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, to avoid such incompatibility issues, I think I need to use ListMap instead

@dongjoon-hyun
Copy link
Member

To @zhengruifeng , I believe we can proceed with the AS-IS status if this is only for Apache Spark 4.0.0.

@HyukjinKwon
Copy link
Member

Merged to master.

@zhengruifeng zhengruifeng deleted the sql_withColumnsRenamed_sql branch December 6, 2023 08:22
dongjoon-hyun pushed a commit that referenced this pull request Dec 7, 2023
… dict/map ordering

### What changes were proposed in this pull request?
this is a follow up of #44177

### Why are the changes needed?
according to [this](https://protobuf.dev/programming-guides/proto3/#maps-features):
> Wire format ordering and map iteration ordering of map values are undefined, so you cannot rely on your map items being in a particular order.

we should not use `map` in protobufs when the ordering is sensitive

### Does this PR introduce _any_ user-facing change?
yes, enabled test

### How was this patch tested?
enabled UT

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44231 from zhengruifeng/connect_with_cols_rm.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
dbatomic pushed a commit to dbatomic/spark that referenced this pull request Dec 11, 2023
…ct the dict ordering

### What changes were proposed in this pull request?
Make `DataFrame.withColumnsRenamed` respect the dict ordering

### Why are the changes needed?
the ordering in `withColumnsRenamed` matters

in scala
```
scala> val df = spark.range(1000)
val df: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> df.withColumnsRenamed(Map("id" -> "a", "a" -> "b"))
val res0: org.apache.spark.sql.DataFrame = [b: bigint]

scala> df.withColumnsRenamed(Map("a" -> "b", "id" -> "a"))
val res1: org.apache.spark.sql.DataFrame = [a: bigint]
```

However, in py4j the Python `dict` -> JVM `map` conversion can not guarantee the ordering

### Does this PR introduce _any_ user-facing change?
yes, behavior change

before this PR
```
In [1]: df = spark.range(10)

In [2]: df.withColumnsRenamed({"id": "a", "a": "b"})
Out[2]: DataFrame[a: bigint]

In [3]: df.withColumnsRenamed({"a": "b", "id": "a"})
Out[3]: DataFrame[a: bigint]
```

after this PR
```
In [1]: df = spark.range(10)

In [2]: df.withColumnsRenamed({"id": "a", "a": "b"})
Out[2]: DataFrame[b: bigint]

In [3]: df.withColumnsRenamed({"a": "b", "id": "a"})
Out[3]: DataFrame[a: bigint]
```

### How was this patch tested?
added ut

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#44177 from zhengruifeng/sql_withColumnsRenamed_sql.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
dbatomic pushed a commit to dbatomic/spark that referenced this pull request Dec 11, 2023
… dict/map ordering

### What changes were proposed in this pull request?
this is a follow up of apache#44177

### Why are the changes needed?
according to [this](https://protobuf.dev/programming-guides/proto3/#maps-features):
> Wire format ordering and map iteration ordering of map values are undefined, so you cannot rely on your map items being in a particular order.

we should not use `map` in protobufs when the ordering is sensitive

### Does this PR introduce _any_ user-facing change?
yes, enabled test

### How was this patch tested?
enabled UT

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#44231 from zhengruifeng/connect_with_cols_rm.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants