-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-46260][PYTHON][SQL] DataFrame.withColumnsRenamed
should respect the dict ordering
#44177
[SPARK-46260][PYTHON][SQL] DataFrame.withColumnsRenamed
should respect the dict ordering
#44177
Conversation
156498b
to
55fd703
Compare
@@ -77,6 +77,11 @@ def test_to_pandas_from_mixed_dataframe(self): | |||
def test_toDF_with_string(self): | |||
super().test_toDF_with_string() | |||
|
|||
# TODO(SPARK-46260): DataFrame.withColumnsRenamed should respect the dict ordering |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am hitting a weird proto issue, will fix it in a separate pr
@@ -77,6 +77,11 @@ def test_to_pandas_from_mixed_dataframe(self): | |||
def test_toDF_with_string(self): | |||
super().test_toDF_with_string() | |||
|
|||
# TODO(SPARK-46261): DataFrame.withColumnsRenamed should respect the dict ordering |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May I ask why TODO
JIRA has the same title with this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, it was a mistake , i changed it to SPARK-46261
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you update this comment too?
- # TODO(SPARK-46261): DataFrame.withColumnsRenamed should respect the dict ordering
+ # TODO(SPARK-46261): Python Client DataFrame.withColumnsRenamed should respect the dict ordering
@@ -2922,18 +2922,29 @@ class Dataset[T] private[sql]( | |||
*/ | |||
@throws[AnalysisException] | |||
def withColumnsRenamed(colsMap: Map[String, String]): DataFrame = withOrigin { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we touch Dataset.scala
, could you include [SQL]
into the PR title?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good
DataFrame.withColumnsRenamed
should respect the dict orderingDataFrame.withColumnsRenamed
should respect the dict ordering
val resolver = sparkSession.sessionState.analyzer.resolver | ||
val output: Seq[NamedExpression] = queryExecution.analyzed.output | ||
|
||
val projectList = colsMap.foldLeft(output) { | ||
val projectList = colNames.zip(newColNames).foldLeft(output) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We believe we need a Scala test case for this change, @zhengruifeng , because the PR description claims that Scala code ordering matters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a new simple test case which fails without this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a python issue.
to make sure the scala side is not changed, I add a scala test
val df = spark.range(10).toDF() | ||
assert(df.withColumnsRenamed(Map("id" -> "a", "a" -> "b")).columns === Array("b")) | ||
assert(df.withColumnsRenamed(Map("a" -> "b", "id" -> "a")).columns === Array("a")) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this addition. Actually, Map
has many incompatibility issues across multiple Scala versions. For example, Scala 2.11/2.12/2.13.
For now, since we have only Scala 2.13, I guess the behavior was consistent on master branch. And, this PR will help it more.
Also, cc @LuciferYang , too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, to avoid such incompatibility issues, I think I need to use ListMap
instead
To @zhengruifeng , I believe we can proceed with the AS-IS status if this is only for Apache Spark 4.0.0. |
Merged to master. |
… dict/map ordering ### What changes were proposed in this pull request? this is a follow up of #44177 ### Why are the changes needed? according to [this](https://protobuf.dev/programming-guides/proto3/#maps-features): > Wire format ordering and map iteration ordering of map values are undefined, so you cannot rely on your map items being in a particular order. we should not use `map` in protobufs when the ordering is sensitive ### Does this PR introduce _any_ user-facing change? yes, enabled test ### How was this patch tested? enabled UT ### Was this patch authored or co-authored using generative AI tooling? no Closes #44231 from zhengruifeng/connect_with_cols_rm. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…ct the dict ordering ### What changes were proposed in this pull request? Make `DataFrame.withColumnsRenamed` respect the dict ordering ### Why are the changes needed? the ordering in `withColumnsRenamed` matters in scala ``` scala> val df = spark.range(1000) val df: org.apache.spark.sql.Dataset[Long] = [id: bigint] scala> df.withColumnsRenamed(Map("id" -> "a", "a" -> "b")) val res0: org.apache.spark.sql.DataFrame = [b: bigint] scala> df.withColumnsRenamed(Map("a" -> "b", "id" -> "a")) val res1: org.apache.spark.sql.DataFrame = [a: bigint] ``` However, in py4j the Python `dict` -> JVM `map` conversion can not guarantee the ordering ### Does this PR introduce _any_ user-facing change? yes, behavior change before this PR ``` In [1]: df = spark.range(10) In [2]: df.withColumnsRenamed({"id": "a", "a": "b"}) Out[2]: DataFrame[a: bigint] In [3]: df.withColumnsRenamed({"a": "b", "id": "a"}) Out[3]: DataFrame[a: bigint] ``` after this PR ``` In [1]: df = spark.range(10) In [2]: df.withColumnsRenamed({"id": "a", "a": "b"}) Out[2]: DataFrame[b: bigint] In [3]: df.withColumnsRenamed({"a": "b", "id": "a"}) Out[3]: DataFrame[a: bigint] ``` ### How was this patch tested? added ut ### Was this patch authored or co-authored using generative AI tooling? no Closes apache#44177 from zhengruifeng/sql_withColumnsRenamed_sql. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
… dict/map ordering ### What changes were proposed in this pull request? this is a follow up of apache#44177 ### Why are the changes needed? according to [this](https://protobuf.dev/programming-guides/proto3/#maps-features): > Wire format ordering and map iteration ordering of map values are undefined, so you cannot rely on your map items being in a particular order. we should not use `map` in protobufs when the ordering is sensitive ### Does this PR introduce _any_ user-facing change? yes, enabled test ### How was this patch tested? enabled UT ### Was this patch authored or co-authored using generative AI tooling? no Closes apache#44231 from zhengruifeng/connect_with_cols_rm. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?
Make
DataFrame.withColumnsRenamed
respect the dict orderingWhy are the changes needed?
the ordering in
withColumnsRenamed
mattersin scala
However, in py4j the Python
dict
-> JVMmap
conversion can not guarantee the orderingDoes this PR introduce any user-facing change?
yes, behavior change
before this PR
after this PR
How was this patch tested?
added ut
Was this patch authored or co-authored using generative AI tooling?
no