Create a new pull request by comparing changes across two branches #1555

GulajavaMinistudio · 2023-09-12T07:02:02Z

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

…t_set`

### What changes were proposed in this pull request?
This PR refines the docstrings of `collect_list`/`collect_set` and adds some new examples.

### Why are the changes needed?
To improve the PySpark documentation.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #42866 from LuciferYang/SPARK-45113.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
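For readers unfamiliar with the two aggregates, the difference can be illustrated without a Spark session. The following is a plain-Python analogy, not PySpark code and not one of the docstring examples added by the commit; the helper names `collect_list_by` and `collect_set_by` are hypothetical.

```python
from collections import defaultdict


def collect_list_by(rows, key, value):
    """Group rows by `key` and keep every `value`, duplicates included."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return dict(groups)


def collect_set_by(rows, key, value):
    """Group rows by `key` and keep each distinct `value` once."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[key]].add(row[value])
    return dict(groups)


# Hypothetical sample data: id 1 has a duplicated value "a".
rows = [
    {"id": 1, "v": "a"},
    {"id": 1, "v": "a"},
    {"id": 1, "v": "b"},
    {"id": 2, "v": "c"},
]

as_lists = collect_list_by(rows, "id", "v")
as_sets = collect_set_by(rows, "id", "v")
```

Unlike this in-memory sketch, Spark does not guarantee the order of the collected elements after a shuffle, so results should be compared as unordered collections.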

…nged` information to the parameters

### What changes were proposed in this pull request?
1. For newly added parameters, use `versionadded` instead of `versionchanged`, following pandas: https://github.com/pandas-dev/pandas/blob/cea0cc0a54725ed234e2f51cc21a1182674a6032/pandas/io/sql.py#L317
2. For changed parameters, move `versionchanged` under the corresponding parameter.

### Why are the changes needed?
For better documentation.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
CI

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #42867 from zhengruifeng/py_doc_minor.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
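The convention described in this commit can be sketched with a small numpydoc-style docstring. This is only an illustration of where the directives go: the function `read_table`, its parameters, and all version numbers are hypothetical and do not come from PySpark or pandas.

```python
def read_table(name, columns=None, dtype_backend="numpy"):
    """Read a table into a list of rows (illustrative stub only).

    .. versionadded:: 1.0.0

    Parameters
    ----------
    name : str
        Name of the table.
    columns : list of str, optional
        Columns to select.

        .. versionchanged:: 1.2.0
            Accepts a single column name as well as a list.
    dtype_backend : str, default "numpy"
        Backend used for the resulting values.

        .. versionadded:: 2.0.0
    """
    # The body is a placeholder; only the docstring layout matters here.
    return []
```

The parameter that was added later carries `versionadded`, and the parameter whose behavior changed carries `versionchanged` directly beneath its own description rather than at the top of the docstring.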

… reference

### What changes were proposed in this pull request?
This is a bug fix for the recently added SQL variable feature. It is designed to resolve columns to SQL variables as the last resort, but for columns in Aggregate, we may resolve columns to outer references first.

### Why are the changes needed?
Bug fix.

### Does this PR introduce _any_ user-facing change?
Yes, the query result can be wrong before this fix.

### How was this patch tested?
New tests.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #42803 from cloud-fan/meta-col.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

…with an internal error

### What changes were proposed in this pull request?
Replace the legacy error class `_LEGACY_ERROR_TEMP_2015` with an internal error as it is not triggered by the user space.

### Why are the changes needed?
As the error is not triggered by the user space, the legacy error class can be replaced by an internal error.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing test cases.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #42845 from dengziming/SPARK-43251.

Authored-by: dengziming <dengziming@bytedance.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>

### What changes were proposed in this pull request?
This PR aims to upgrade Maven from 3.8.8 to 3.9.4.

### Why are the changes needed?
The new version [lifts the JDK minimum to JDK 8](https://issues.apache.org/jira/browse/MNG-7452) and [makes the build work on JDK 20](https://issues.apache.org/jira/browse/MNG-7743). It also brings a series of bug fixes, such as [Fix deadlock during forked lifecycle executions](https://issues.apache.org/jira/browse/MNG-7487), along with a number of new optimizations like [Profile activation by packaging](https://issues.apache.org/jira/browse/MNG-6609). On the other hand, the new version replaces 'Wagon' with 'native http' as the new [Maven Resolver transport](https://maven.apache.org/guides/mini/guide-resolver-transport.html), coupled with a range of targeted performance enhancements (see the upgrades related to Maven Resolver).

For other updates, refer to the corresponding release notes:

- https://maven.apache.org/docs/3.9.0/release-notes.html | https://github.com/apache/maven/releases/tag/maven-3.9.0
- https://maven.apache.org/docs/3.9.1/release-notes.html | https://github.com/apache/maven/releases/tag/maven-3.9.1
- https://maven.apache.org/docs/3.9.2/release-notes.html | https://github.com/apache/maven/releases/tag/maven-3.9.2
- https://maven.apache.org/docs/3.9.3/release-notes.html | https://github.com/apache/maven/releases/tag/maven-3.9.3
- https://maven.apache.org/docs/3.9.4/release-notes.html | https://github.com/apache/maven/releases/tag/maven-3.9.4

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass GitHub Actions
- Manual test: running `build/mvn -version` will trigger a download of `apache-maven-3.9.4-bin.tar.gz`

```
exec: curl --silent --show-error -L https://www.apache.org/dyn/closer.lua/maven/maven-3/3.9.4/binaries/apache-maven-3.9.4-bin.tar.gz?action=download
```

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #42827 from LuciferYang/maven-394.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>

### What changes were proposed in this pull request?
This is a follow-up PR to #42863: the one-argument `log` function should also point to `ln`.

### Why are the changes needed?
Bugfix.

### Does this PR introduce _any_ user-facing change?
No, these Spark Connect functions haven't been released.

### How was this patch tested?
Existing UTs.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #42869 from peter-toth/SPARK-45109-fix-log.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Peter Toth <peter.toth@gmail.com>
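The intended semantics can be sketched in plain Python, assuming the usual Spark convention that the two-argument form takes the base first. This is not the Spark Connect code touched by the commit; the standalone function below is a hypothetical stand-in that only shows the dispatch.

```python
import math


def log(arg1, arg2=None):
    """One argument: natural logarithm of arg1 (i.e. ln).
    Two arguments: logarithm of arg2 in base arg1."""
    if arg2 is None:
        # The one-argument form must behave exactly like ln.
        return math.log(arg1)
    return math.log(arg2) / math.log(arg1)


natural = log(math.e)   # one-argument form, natural logarithm
base_two = log(2, 8)    # two-argument form, base 2 logarithm of 8
```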

…keys

### What changes were proposed in this pull request?
- Add new conf spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled
- Change key compatibility checks in EnsureRequirements. Remove checks where all partition keys must be in join keys to allow isKeyCompatible = true in this case (if this flag is enabled)
- Change BatchScanExec/DataSourceV2Relation to group splits by join keys if they differ from partition keys (previously grouped only by partition values). Do the same for all auxiliary data structures, like commonPartValues.
- Implement partiallyClustered skew-handling.
- Group only the replicate side (now by join key as well), replicate by the total size of other-side partitions that share the join key.
- Add an additional sort for partitions based on join key, as when we group the replicate side, partition ordering becomes out of order from the non-replicate side.

### Why are the changes needed?
- Support Storage Partition Join in cases where the join condition does not contain all the partition keys, but just some of them

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Added tests in KeyGroupedPartitioningSuite
- Found two existing problems, will address in separate PR:
  - Because of #37886 we have to select all join keys to trigger SPJ in this case, otherwise DSV2 scan does not report KeyGroupedPartitioning and SPJ does not get triggered. Need to see how to relax this.
  - https://issues.apache.org/jira/browse/SPARK-44641 was found when testing this change. This PR refactors some of that code to add group-by-join-key, but doesn't change the underlying logic, so the issue continues to exist. Hopefully this will also get fixed in another way.

Closes #42306 from szehon-ho/spj_attempt_master.

Authored-by: Szehon Ho <szehon.apache@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
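The idea of grouping splits by join keys rather than by full partition values can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the logic in BatchScanExec: the function name, the tuple-based split representation, and the sample keys are all hypothetical.

```python
from collections import defaultdict


def group_splits_by_join_keys(splits, partition_keys, join_keys):
    """Group input splits by the subset of partition values used in the join.

    splits: list of (partition_values, split_name) pairs, where
    partition_values is a tuple aligned with partition_keys.
    """
    # Positions of the join keys inside each partition-value tuple.
    positions = [partition_keys.index(k) for k in join_keys]
    grouped = defaultdict(list)
    for values, split in splits:
        projected = tuple(values[i] for i in positions)
        grouped[projected].append(split)
    # Sort by projected key so both join sides see groups in the same order.
    return dict(sorted(grouped.items()))


# Hypothetical table partitioned by (id, region) but joined on id only.
splits = [((1, "us"), "s0"), ((2, "eu"), "s1"), ((1, "eu"), "s2")]
grouped = group_splits_by_join_keys(splits, ["id", "region"], ["id"])
```

Splits `s0` and `s2` have different partition values but share the join key `id = 1`, so they land in the same group. The final sort mirrors, in a very reduced form, the additional sort by join key mentioned in the commit message.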

### What changes were proposed in this pull request?
This PR removes the use of `userId` and `sessionId` in `CachedLocalRelation` messages and subsequently makes `SparkConnectPlanner` use the `userId`/`sessionId` of the active session rather than the user-provided information.

### Why are the changes needed?
Allowing a fetch of a local relation using user-provided information is a potential security risk since this allows users to fetch arbitrary local relations.

### Does this PR introduce _any_ user-facing change?
Virtually no. It will ignore the session id or user id that users set (but instead use internal ones that users cannot manipulate).

### How was this patch tested?
Manually.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #42880 from HyukjinKwon/no-local-user.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
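The security reasoning can be illustrated with a minimal plain-Python sketch. It is not the `SparkConnectPlanner` implementation: the class `LocalRelationCache`, the function `fetch_cached_relation`, and the dictionary-shaped session and request are hypothetical. The point is only that the lookup key comes from the server-side session, never from the request.

```python
class LocalRelationCache:
    """Toy cache keyed by (user_id, session_id, block_hash)."""

    def __init__(self):
        self._blocks = {}

    def put(self, user_id, session_id, block_hash, data):
        self._blocks[(user_id, session_id, block_hash)] = data

    def get(self, user_id, session_id, block_hash):
        return self._blocks.get((user_id, session_id, block_hash))


def fetch_cached_relation(cache, active_session, request):
    # Any user_id/session_id carried by the request is deliberately ignored;
    # only the identifiers of the authenticated, active session are trusted.
    return cache.get(
        active_session["user_id"],
        active_session["session_id"],
        request["hash"],
    )


cache = LocalRelationCache()
cache.put("alice", "s1", "h1", "alice-data")

# A request from bob that claims alice's identifiers.
forged = {"hash": "h1", "user_id": "alice", "session_id": "s1"}
bob_result = fetch_cached_relation(
    cache, {"user_id": "bob", "session_id": "s2"}, forged
)
alice_result = fetch_cached_relation(
    cache, {"user_id": "alice", "session_id": "s1"}, {"hash": "h1"}
)
```

In this sketch the forged request resolves against bob's own session and finds nothing, while alice can still fetch her own relation.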

…all columns are object-dtype

### What changes were proposed in this pull request?
This PR proposes to raise `TypeError` for `DataFrame.interpolate` when all columns are object-dtype.

### Why are the changes needed?
To match the behavior of Pandas:

```python
>>> pd.DataFrame({"A": ['a', 'b', 'c'], "B": ['a', 'b', 'c']}).interpolate()
...
TypeError: Cannot interpolate with all object-dtype columns in the DataFrame. Try setting at least one column to a numeric dtype.
```

We currently return an empty DataFrame instead of raising TypeError:

```python
>>> pd.DataFrame({"A": ['a', 'b', 'c'], "B": ['a', 'b', 'c']}).interpolate()
Empty DataFrame
Columns: []
Index: [0, 1, 2]
```

### Does this PR introduce _any_ user-facing change?
Computing `DataFrame.interpolate` on a DataFrame that has all object-dtype columns will raise TypeError instead of returning an empty DataFrame.

### How was this patch tested?
Added UT.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #42878 from itholic/SPARK-45123.

Authored-by: Haejoon Lee <haejoon.lee@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
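The check described above amounts to a guard that runs before any interpolation. The sketch below shows that guard in plain Python, assuming dtypes are available as a simple column-to-name mapping; the function `check_interpolatable` and that mapping are hypothetical and are not the code added to the pandas API on Spark. How an empty DataFrame should be treated is also an assumption here.

```python
def check_interpolatable(dtypes):
    """Raise TypeError if every column is object-dtype.

    dtypes: mapping of column name to dtype name, e.g. {"A": "object"}.
    An empty mapping is allowed through (an assumption of this sketch).
    """
    if dtypes and all(dtype == "object" for dtype in dtypes.values()):
        raise TypeError(
            "Cannot interpolate with all object-dtype columns in the DataFrame. "
            "Try setting at least one column to a numeric dtype."
        )


# A frame with at least one numeric column passes the guard silently.
check_interpolatable({"A": "object", "B": "float64"})
```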

### What changes were proposed in this pull request?
This PR proposes to support `Series.empty` for Spark Connect by removing the JVM dependency.

### Why are the changes needed?
Increase API coverage for Spark Connect.

### Does this PR introduce _any_ user-facing change?
`Series.empty` is available on Spark Connect.

### How was this patch tested?
Added UT.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #42877 from itholic/SPARK-45121.

Authored-by: Haejoon Lee <haejoon.lee@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>

LuciferYang and others added 10 commits September 11, 2023 20:11

github-actions bot added SQL, PYTHON, BUILD, DOCS, PANDAS API ON SPARK, CONNECT labels Sep 12, 2023

GulajavaMinistudio merged commit c84eac4 into GulajavaMinistudio:master Sep 12, 2023
2 of 3 checks passed
