Create a new pull request by comparing changes across two branches #1633

Merged
merged 53 commits into GulajavaMinistudio:master on Mar 21, 2024

Conversation

GulajavaMinistudio
Owner

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

HyukjinKwon and others added 30 commits March 18, 2024 16:02
…hon Data Source

### What changes were proposed in this pull request?

This PR proposes to fix several links within the documentation, and incorrect type hints.

### Why are the changes needed?

For better readability in the documentation, and correct type hints.

### Does this PR introduce _any_ user-facing change?

No, because Python Data Source has not been released yet.
The fix corrects the links and uses a consistent style of types in the documentation.

### How was this patch tested?

Manually verified via `./dev/linter-python`

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45557 from HyukjinKwon/SPARK-47436.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
### What changes were proposed in this pull request?

This PR adds new golden file tests for the collation feature:
1) DESCRIBE
2) Basic array operations
3) Removing the struct test since the same is already covered in golden files.

### Why are the changes needed?

Extending test coverage for collation feature.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #45515 from dbatomic/collation_golden_files_update.

Authored-by: Aleksandar Tomic <aleksandar.tomic@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
…e page

### What changes were proposed in this pull request?

This PR proposes to document Python Data Source API in Python API reference page.

### Why are the changes needed?

For users/developers to know how to use them.

### Does this PR introduce _any_ user-facing change?

Yes, it documents Python Data Source API.

### How was this patch tested?

Manually checked the output from Python API reference build

```bash
cd python/docs
make clean html
open build/html/index.html
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45561 from HyukjinKwon/SPARK-47439.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…onnect.dataframe.DataFrame.writeStream`

### What changes were proposed in this pull request?
Reenable doctest `pyspark.sql.connect.dataframe.DataFrame.writeStream`

### Why are the changes needed?
for test coverage

### Does this PR introduce _any_ user-facing change?
no, test only

### How was this patch tested?
I manually tested it 10 times locally

`python/run-tests -k --python-executables python3 --testnames 'pyspark.sql.connect.dataframe'`

It seems this test is no longer flaky

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #45560 from zhengruifeng/SPARK_43435.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
….sort*`

### What changes were proposed in this pull request?
Correct the error class for `DataFrame.sort*`

### Why are the changes needed?
`DataFrame.sort*` supports negative indices, which mean `sort by desc`

### Does this PR introduce _any_ user-facing change?
yes

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #45559 from zhengruifeng/correct_index_error.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?

Like SPARK-24553, this PR aims to fix redirect issues (incorrect 302s) when one is using proxy settings. It changes the generated link to be consistent with other links and includes a trailing slash.

### Why are the changes needed?

When using a proxy, an invalid redirect is issued if the trailing slash is not included.

### Does this PR introduce _any_ user-facing change?

Only that people will be able to use these links if they are using a proxy

### How was this patch tested?

With a proxy installed, I went to the location this link would generate and could reach the page; the link as it currently exists redirects instead.

Edit: further tested by building a version of our application with this patch applied; the links work now.

### Was this patch authored or co-authored using generative AI tooling?

No.

Page with working link
<img width="913" alt="Screenshot 2024-03-18 at 4 45 27 PM" src="https://github.com/apache/spark/assets/5205457/dbcd1ffc-b7e6-4f84-8ca7-602c41202bf3">

Goes correctly to
<img width="539" alt="Screenshot 2024-03-18 at 4 45 36 PM" src="https://github.com/apache/spark/assets/5205457/89111c82-b24a-4b33-895f-9c0131e8acb5">

Before, it would redirect and we'd get a 404.

<img width="639" alt="image" src="https://github.com/apache/spark/assets/5205457/1adfeba1-a1f6-4c35-9c39-e077c680baef">

Closes #45527 from HuwCampbell/patch-1.

Authored-by: Huw Campbell <huw.campbell@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?

Adds the Web UI to the `Other Documents` list on the main page.

### Why are the changes needed?

I found it difficult to find the Web UI docs: they're only linked inside the Monitoring docs. Adding them to the main page will make it easier for people to find and use the docs.

### Does this PR introduce _any_ user-facing change?

Yes: adds another cross-reference on the main page.

### How was this patch tested?

Visually verified that Markdown still rendered properly.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #45534 from mattayes/patch-2.

Authored-by: Matt Braymer-Hayes <matt.hayes91@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
The PR aims to upgrade Jackson from `2.16.1` to `2.17.0`.

### Why are the changes needed?
The full release notes: https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.17

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #45562 from panbingkun/SPARK-47438.

Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…rSuite

### What changes were proposed in this pull request?

This PR proposes to use port 0 to start the worker server.

### Why are the changes needed?

More stable test by:

* Avoids the port conflicts between workers
* Avoids the port conflicts with local service
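
To illustrate why port 0 avoids these conflicts, here is a minimal sketch using the standard Java API (not the worker code itself): binding to port 0 asks the OS for any currently free ephemeral port.

```scala
import java.net.ServerSocket

// Port 0 delegates the choice to the OS, which returns a free ephemeral
// port, so concurrently starting workers cannot collide on a fixed port.
val socket = new ServerSocket(0)
println(s"OS-assigned port: ${socket.getLocalPort}")
socket.close()
```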

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually tested.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45566 from Ngone51/worker-zero-port.

Lead-authored-by: Yi Wu <yi.wu@databricks.com>
Co-authored-by: wuyi <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…d by SPARK-45561

### What changes were proposed in this pull request?

SPARK-45561 mapped java.sql.Types.TINYINT to ByteType in the MySQL dialect, which caused unsigned TINYINT values to overflow, because java.sql.Types.TINYINT is reported regardless of whether the column is signed or unsigned.

In this PR, we put the signedness info into the metadata to map TINYINT to short or byte.

### Why are the changes needed?

bugfix

### Does this PR introduce _any_ user-facing change?

Users can read MySQL UNSIGNED TINYINT values after this PR, as in versions before 3.5.0; this has been broken since 3.5.1.
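
As a hedged sketch of the post-fix behavior (connection details below are placeholders, not from this PR):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("unsigned-tinyint-demo").getOrCreate()

// Placeholder JDBC options; after this fix, a column declared
// TINYINT UNSIGNED maps to ShortType instead of overflowing ByteType.
val df = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/testdb")
  .option("dbtable", "tbl_with_unsigned_tinyint")
  .option("user", "user")
  .option("password", "password")
  .load()
df.printSchema()
```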

### How was this patch tested?

new tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #45556 from yaooqinn/SPARK-47435.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Make the shutdown hook timeout configurable. If this is not defined, it falls back to the existing behavior, which uses a default timeout of 30 seconds, or whatever is defined in core-site.xml for the hadoop.service.shutdown.timeout property.

### Why are the changes needed?
Spark sometimes times out during the shutdown process. This can result in data left in the queues to be dropped and causes metadata loss (e.g. event logs, anything written by custom listeners).

This is not easily configurable before this change. The underlying `org.apache.hadoop.util.ShutdownHookManager` has a default timeout of 30 seconds.  It can be configured by setting hadoop.service.shutdown.timeout, but this must be done in the core-site.xml/core-default.xml because a new hadoop conf object is created and there is no opportunity to modify it.

### Does this PR introduce _any_ user-facing change?
Yes, a new config `spark.shutdown.timeout` is added.
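
As an illustration only (assuming the config accepts Spark's usual time-string format), a user could raise the timeout like this:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical usage: allow shutdown hooks up to 60 seconds so event-log
// queues and custom listeners can flush before the JVM exits.
val spark = SparkSession.builder()
  .appName("shutdown-timeout-demo")
  .config("spark.shutdown.timeout", "60s")
  .getOrCreate()
```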

### How was this patch tested?
Manual testing in spark-shell. This behavior is not practical to write a unit test for.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #45504 from robreeves/sc_shutdown_timeout.

Authored-by: Rob Reeves <roreeves@linkedin.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…nal`

### What changes were proposed in this pull request?

This PR aims to make `BlockManager` warn before invoking `removeBlockInternal` by switching the log position. To be clear,
1. For the case where `removeBlockInternal` succeeds, the log messages are identical before and after this PR.
2. For the case where `removeBlockInternal` fails, the user will see one additional warning message like the following which was hidden from the users before this PR.
```
logWarning(s"Putting block $blockId failed")
```

### Why are the changes needed?

When a `Put` operation fails, Apache Spark currently tries `removeBlockInternal` first before logging.

https://github.com/apache/spark/blob/ce93c9fd86715e2479552628398f6fc11e83b2af/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1554-L1567

On top of that, if `removeBlockInternal` fails consecutively, Spark shows a warning like the following and fails the job.
```
24/03/18 18:40:46 WARN BlockManager: Putting block broadcast_0 failed due to exception java.nio.file.NoSuchFileException: /data/spark/blockmgr-56a6c418-90be-4d89-9707-ef45f7eaf74c/0e.
24/03/18 18:40:46 WARN BlockManager: Block broadcast_0 was not removed normally.
24/03/18 18:40:46 INFO TaskSchedulerImpl: Cancelling stage 0
24/03/18 18:40:46 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage cancelled
24/03/18 18:40:46 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) failed in 0.264 s due to Job aborted due to stage failure: Task serialization failed: java.nio.file.NoSuchFileException: /data/spark/blockmgr-56a6c418-90be-4d89-9707-ef45f7eaf74c/0e
java.nio.file.NoSuchFileException: /data/spark/blockmgr-56a6c418-90be-4d89-9707-ef45f7eaf74c/0e
```

This is misleading, although they might share the same root cause. Since the `Put` operation fails before the above failure, we had better switch the WARN message order to make it clear.

### Does this PR introduce _any_ user-facing change?

No. This is a warning message change only.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45570 from dongjoon-hyun/SPARK-47446.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…n is the same

### What changes were proposed in this pull request?

In this PR we change the client behaviour to send the previously observed server session id so that the server can validate that the client has been talking to this specific session. Previously this was only validated on the client side, which meant the server actually executed the request for the wrong session before the client threw an error (once the response from the server was obtained).

### Why are the changes needed?
The server could execute the client command on the wrong Spark session before the client figured out it was a different session.

### Does this PR introduce _any_ user-facing change?
The error message now pops up differently (it used to be a slightly different message when validated on the client).

### How was this patch tested?
Existing unit tests, add new unit test, e2e test added, manual testing

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #45499 from nemanja-boric-databricks/workspace.

Authored-by: Nemanja Boric <nemanja.boric@databricks.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
…efault

### What changes were proposed in this pull request?

This PR aims to enable `spark.shuffle.service.removeShuffle` for Apache Spark 4.0.0.

### Why are the changes needed?

Since Apache Spark 3.3.0, Apache Spark has supported `spark.shuffle.service.removeShuffle` via SPARK-37618.

- #35085

We can use it when the external shuffle service is available.
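
For context, a sketch of the opt-in configuration before this change (after this PR, the second setting becomes the default):

```scala
import org.apache.spark.SparkConf

// removeShuffle only takes effect when the external shuffle service is on.
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.shuffle.service.removeShuffle", "true")
```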

### Does this PR introduce _any_ user-facing change?

By default, no, because `spark.shuffle.service.enabled` is still disabled.

This PR only takes effect for existing shuffle service users.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45572 from dongjoon-hyun/SPARK-47448.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?

This PR aims to use R 4.3.3 in `windows` R GitHub Action job.

### Why are the changes needed?

R 4.3.3 is the latest release, published on 2024-02-29.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45574 from dongjoon-hyun/SPARK-47450.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…eachbatch and stateful streaming query to prevent state from being re-loaded in each batch

### What changes were proposed in this pull request?
Add note to persist dataframe while using foreachbatch and stateful streaming query to prevent state from being re-loaded in each batch

### Why are the changes needed?
Without this recommendation, if the user did not persist the dataframe in their UDF and multiple actions are called on the dataframe, the query is evaluated multiple times within the same batch. This leads to the state store invalidating the loaded version and reloading it from cloud storage each time, which severely degrades performance.

For example, without the change we see the same state instance loaded and committed multiple times:
```
sql/core/target/unit-tests.log:596:10:17:17.090 Executor task launch worker for task 2.0 in stage 8.0 (TID 20) INFO HDFSBackedStateStoreProvider: Committed version 2 for HDFSStateStore[id=(op=0,part=2),dir=file:/Users/anish.shrigondekar/spark/spark/target/tmp/temporary-1ff51b75-034b-496d-8fc9-81df5ba78725/state/0/2] to file file:/Users/anish.shrigondekar/spark/spark/target/tmp/temporary-1ff51b75-034b-496d-8fc9-81df5ba78725/state/0/2/2.delta
sql/core/target/unit-tests.log:710:10:17:17.612 Executor task launch worker for task 2.0 in stage 11.0 (TID 26) INFO HDFSBackedStateStoreProvider: Committed version 2 for HDFSStateStore[id=(op=0,part=2),dir=file:/Users/anish.shrigondekar/spark/spark/target/tmp/temporary-1ff51b75-034b-496d-8fc9-81df5ba78725/state/0/2] to file file:/Users/anish.shrigondekar/spark/spark/target/tmp/temporary-1ff51b75-034b-496d-8fc9-81df5ba78725/state/0/2/2.delta
sql/core/target/unit-tests.log:813:10:17:18.123 Executor task launch worker for task 2.0 in stage 13.0 (TID 31) INFO HDFSBackedStateStoreProvider: Committed version 2 for HDFSStateStore[id=(op=0,part=2),dir=file:/Users/anish.shrigondekar/spark/spark/target/tmp/temporary-1ff51b75-034b-496d-8fc9-81df5ba78725/state/0/2] to file file:/Users/anish.shrigondekar/spark/spark/target/tmp/temporary-1ff51b75-034b-496d-8fc9-81df5ba78725/state/0/2/2.delta
```

With the change, we see the same state instance loaded and committed only once:
```
sql/core/target/unit-tests.log:516:10:22:47.150 Executor task launch worker for task 2.0 in stage 8.0 (TID 20) INFO HDFSBackedStateStoreProvider: Committed version 2 for HDFSStateStore[id=(op=0,part=2),dir=file:/Users/anish.shrigondekar/spark/spark/target/tmp/temporary-870e4666-52dc-471e-bb19-e6c8f372d0ca/state/0/2] to file file:/Users/anish.shrigondekar/spark/spark/target/tmp/temporary-870e4666-52dc-471e-bb19-e6c8f372d0ca/state/0/2/2.delta
```

Note that we cannot always call `persist` before the `foreachbatch` function because it might lead to increased cache memory usage, possible disk writes (with the `MEMORY_AND_DISK` default level) and also trigger unwanted eviction of other critical cache blocks. Hence, we leave this responsibility to the caller.
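
A minimal sketch of the recommended pattern (sink paths and names here are hypothetical):

```scala
import org.apache.spark.sql.DataFrame

// Persist the micro-batch once, run every action against the cached data,
// then release it, so state is loaded only once per batch.
def writeTwice(batchDF: DataFrame, batchId: Long): Unit = {
  batchDF.persist()
  batchDF.write.mode("append").parquet(s"/tmp/sink-a/batch=$batchId")
  batchDF.write.mode("append").json(s"/tmp/sink-b/batch=$batchId")
  batchDF.unpersist()
}
// streamingDF.writeStream.foreachBatch(writeTwice _).start()
```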

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing unit tests

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #45432 from anishshri-db/task/feb_persist_issue.

Authored-by: Anish Shrigondekar <anish.shrigondekar@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
…ctionsSuite

### What changes were proposed in this pull request?

This PR is a follow-up of #45466 that addresses #45466 (comment). It corrects a test name from JSON to XML within `XmlFunctionsSuite`.

### Why are the changes needed?

To have the correct test title so we don't get confused about which test failed and why.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

CI in this PR should validate the change.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45577 from HyukjinKwon/SPARK-47345-followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?

Collations need to be properly supported in the following array operations, which currently yield unexpected results: `ArraysOverlap`, `ArrayDistinct`, `ArrayUnion`, `ArrayIntersect`, `ArrayExcept`. Example query:
```
select array_contains(array('aaa' collate utf8_binary_lcase), 'AAA' collate utf8_binary_lcase)
```
We would expect the result of this query to be true.
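
As a further hedged sketch of the expected post-fix behavior for one of the listed operations (assuming a running `spark` session):

```scala
// Under a case-insensitive collation, array_distinct should collapse
// strings that differ only by case, leaving a single element.
spark.sql(
  """select array_distinct(array(
    |  'aaa' collate utf8_binary_lcase,
    |  'AAA' collate utf8_binary_lcase))""".stripMargin).show()
```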

### Why are the changes needed?

To enable collation support in array operations.

### Does this PR introduce _any_ user-facing change?

Yes, with these changes listed array operations return expected results when applied on arrays of collated strings.

### How was this patch tested?

Added test to collections suite and updated collations golden files.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45563 from nikolamand-db/SPARK-47422.

Authored-by: Nikola Mandic <nikola.mandic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…ersion to 8.3.0

### What changes were proposed in this pull request?

Upgrade MySQL docker image version to 8.3.0

### Why are the changes needed?

test dependencies upgrading

### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

passed locally with:
`./build/sbt -Pdocker-integration-tests  "docker-integration-tests/testOnly *MySQLIntegrationSuite"`
### Was this patch authored or co-authored using generative AI tooling?

no

Closes #45581 from yaooqinn/SPARK-47453.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to support ORC Brotli codec.

### Why are the changes needed?
Currently, the master branch of Spark uses ORC 2.0.0 and supports the Brotli compression codec ([SPARK-44115](https://issues.apache.org/jira/browse/SPARK-44115)).
However, due to some hardcoded checks in Spark, Brotli cannot be used normally.

```sql
set spark.sql.orc.compression.codec=BROTLI;
```

```
java.lang.IllegalArgumentException: The value of spark.sql.orc.compression.codec should be one of uncompressed, lz4, lzo, snappy, zlib, none, zstd, but was brotli
```

See apache/orc#1714

### Does this PR introduce _any_ user-facing change?
Yes

This PR adds the corresponding Brotli jar package so that we can use the ORC Brotli encoding.

```xml
    <dependency>
      <groupId>com.aayushatharva.brotli4j</groupId>
      <artifactId>brotli4j</artifactId>
    </dependency>
```

REF: https://github.com/netty/netty/blob/3cd364107167600e8eb4b0b85553ed895519e2ed/codec/pom.xml#L91-L125
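
Once brotli4j is on the classpath, usage is just the codec config plus an ORC write; a small sketch (the output path is hypothetical, and a running `spark` session is assumed):

```scala
// Select the Brotli codec for ORC output, then write a toy dataset.
spark.conf.set("spark.sql.orc.compression.codec", "brotli")
spark.range(1000).write.mode("overwrite").orc("/tmp/orc-brotli-demo")
```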

### How was this patch tested?
local test

<img width="850" alt="image" src="https://github.com/apache/spark/assets/3898450/4e4f9422-1cf2-45ef-8bcc-8bae6188beb7">

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #45584 from cxzl25/orc_brotli_codec.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?

This PR aims to use `Ubuntu 22.04` in `dev/infra/Dockerfile` for Apache Spark 4.0.0.

| Installed SW  | BEFORE | AFTER |
| ------------- | -------- | ------- |
| Ubuntu LTS   | 20.04.5 | 22.04.4  |
| Java                | 17.0.10  | 17.0.10 |
| PyPy 3.8        | 3.8.16    | 3.8.16  |
| Python 3.9     | 3.9.5     | 3.9.18  |
| Python 3.10   | 3.10.13  | 3.10.12 |
| Python 3.11    | 3.11.8    | 3.11.8 |
| Python 3.12   | 3.12.2    | 3.12.2 |
| R                     | 3.6.3     | 4.1.2  |

### Why are the changes needed?

- Since Apache Spark 3.4.0, we have used `Ubuntu 20.04` via SPARK-39522.
- From Apache Spark 4.0.0, this PR aims to use `Ubuntu 22.04` as the main image.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45576 from dongjoon-hyun/SPARK-47452.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…ataframe`

### What changes were proposed in this pull request?
Split `pyspark.sql.tests.test_dataframe`

### Why are the changes needed?
for testing parallelism

### Does this PR introduce _any_ user-facing change?
no, test only

### How was this patch tested?
updated ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #45580 from zhengruifeng/break_test_df.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…nt` to handle Hadoop 3.4+

### What changes were proposed in this pull request?

This PR aims to fix `IsolatedClientLoader.supportsHadoopShadedClient` to handle Hadoop 3.4+ correctly.

### Why are the changes needed?

Apache Hadoop 3.4+ supports shaded clients, but currently `supportsHadoopShadedClient` returns `false` for those versions.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45585 from dongjoon-hyun/SPARK-47457.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…urrent tasks for the barrier stage

### What changes were proposed in this pull request?

This PR addresses the problem of calculating the maximum concurrent tasks while evaluating the number of slots for barrier stages, specifically for the case when the task resource amount is greater than 1.

### Why are the changes needed?

``` scala
  test("problem of calculating the maximum concurrent task") {
    withTempDir { dir =>
      val discoveryScript = createTempScriptWithExpectedOutput(
        dir, "gpuDiscoveryScript", """{"name": "gpu","addresses":["0", "1", "2", "3"]}""")

      val conf = new SparkConf()
        // Set up a local cluster which has only one executor with 6 CPUs and 4 GPUs.
        .setMaster("local-cluster[1, 6, 1024]")
        .setAppName("test-cluster")
        .set(WORKER_GPU_ID.amountConf, "4")
        .set(WORKER_GPU_ID.discoveryScriptConf, discoveryScript)
        .set(EXECUTOR_GPU_ID.amountConf, "4")
        .set(TASK_GPU_ID.amountConf, "2")
        // disable barrier stage retry to fail the application as soon as possible
        .set(BARRIER_MAX_CONCURRENT_TASKS_CHECK_MAX_FAILURES, 1)
      sc = new SparkContext(conf)
      TestUtils.waitUntilExecutorsUp(sc, 1, 60000)

      // Set up a barrier stage which contains 2 tasks, where each task requires 1 CPU and
      // 2 GPUs. The total resource requirement (2 CPUs and 4 GPUs) of this barrier stage
      // can be satisfied since the cluster has 6 CPUs and 4 GPUs in total.
      assert(sc.parallelize(Range(1, 10), 2)
        .barrier()
        .mapPartitions { iter => iter }
        .collect() sameElements Range(1, 10).toArray[Int])
    }
  }
```

In the described test scenario, the executor has 6 CPU cores and 4 GPUs, and each task requires 1 CPU core and 2 GPUs, so the maximum number of concurrent tasks should be 2. However, the issue arises when attempting to launch the subsequent 2 barrier tasks, as the `checkBarrierStageWithNumSlots` function gets an incorrect concurrent task limit of 1 instead of 2. The bug needs to be fixed.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

The existing and newly added unit tests should pass

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #45528 from wbo4958/2-gpu.

Authored-by: Bobby Wang <wbo4958@gmail.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
…urceProfile` from `ExecutorAllocationManager`

### What changes were proposed in this pull request?
This PR removes the private function `totalRunningTasksPerResourceProfile` from `ExecutorAllocationManager`.

This function is only used by tests in `ExecutorAllocationManagerSuite`, so this PR also changes those tests to call `manager.listener.totalRunningTasksPerResourceProfile` directly instead of invoking the private function `ExecutorAllocationManager#totalRunningTasksPerResourceProfile` through reflection.

### Why are the changes needed?
Code clean up

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #45587 from LuciferYang/SPARK-47461.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…and `common/variant`

### What changes were proposed in this pull request?
The PR aims to update `labeler.yml` for the modules `common/sketch` and `common/variant`.

### Why are the changes needed?
Currently, the above modules are not classified in the file `labeler.yml`, so the GitHub Actions labeler cannot automatically tag submitted PRs.

### Does this PR introduce _any_ user-facing change?
Yes, only for dev.

### How was this patch tested?
Manual test: after this PR is merged, we will continue to observe.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #45590 from panbingkun/SPARK-47464.

Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?

Refactor the StatefulProcessorHandle unit test suites, and add list state and timer state unit tests.
As planned in the test plan for state-v2, list/timer state should be tested in both integration and unit tests. The StatefulProcessorHandle-related tests can be refactored to use the base suite class in `ValueStateSuite`, and list/timer state unit tests are needed in addition to the integration tests.

### Why are the changes needed?

Compliance with test plan for state-v2 project.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Test suites refactored and added.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #45573 from jingz-db/split-timer-list-state-v2.

Authored-by: jingz-db <jing.zhan@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
…k.sql.tests.test_dataframe`

### What changes were proposed in this pull request?
Further split `pyspark.sql.tests.test_dataframe`; this is the second (and last) PR to break it up.

### Why are the changes needed?
for testing parallelism

### Does this PR introduce _any_ user-facing change?
No, test only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #45591 from zhengruifeng/furthur_break_test_df.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?

This PR aims to exclude `logback` from the SBT dependencies, as Maven already does, to fix the following SBT issue.

```
[info]   stderr> SLF4J: Class path contains multiple SLF4J bindings.
[info]   stderr> SLF4J: Found binding in [jar:file:/home/runner/work/spark/spark/assembly/target/scala-2.13/jars/logback-classic-1.2.13.jar!/org/slf4j/impl/StaticLoggerBinder.class]
[info]   stderr> SLF4J: Found binding in [jar:file:/home/runner/.cache/coursier/v1/https/maven-central.storage-download.googleapis.com/maven2/ch/qos/logback/logback-classic/1.2.13/logback-classic-1.2.13.jar!/org/slf4j/impl/StaticLoggerBinder.class]
[info]   stderr> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
[info]   stderr> SLF4J: Actual binding is of type [ch.qos.logback.classic.util.ContextSelectorStaticBinder]
```

### Why are the changes needed?

**Maven**
```
$ build/mvn dependency:tree --pl core | grep logback
Using `mvn` from path: /opt/homebrew/bin/mvn
Using SPARK_LOCAL_IP=localhost
```

**SBT (BEFORE)**
```
$ build/sbt "core/test:dependencyTree" | grep logback
Using SPARK_LOCAL_IP=localhost
[info]   |       +-ch.qos.logback:logback-classic:1.2.13
[info]   |       | +-ch.qos.logback:logback-core:1.2.13
[info]   |       +-ch.qos.logback:logback-core:1.2.13
[info]   | | +-ch.qos.logback:logback-classic:1.2.13
[info]   | | | +-ch.qos.logback:logback-core:1.2.13
[info]   | | +-ch.qos.logback:logback-core:1.2.13
[info]   | +-ch.qos.logback:logback-classic:1.2.13
[info]   | | +-ch.qos.logback:logback-core:1.2.13
[info]   | +-ch.qos.logback:logback-core:1.2.13
```

**SBT (AFTER)**
```
$ build/sbt "core/test:dependencyTree" | grep logback
Using SPARK_LOCAL_IP=localhost
```

### Does this PR introduce _any_ user-facing change?

No. This only fixes developer and CI issues.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45594 from dongjoon-hyun/SPARK-47468.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?

Add schema inference tests for corrupt records, null values and value tags. For value tags, this PR adds the following tests:
1. Conflicts between primitive types
2. Root-level value tags
3. Empty value tags in some rows
4. Arrays of value tags:
   1) values split into multiple lines
   2) interspersed in nested structs: empty fields and optional fields in structs
   3) interspersed in arrays and value tags: empty fields and optional fields in structs
   4) name conflicts
   5) CDATA and comments
   6) no spaces / some spaces / whitespace between valueTags and elements

### Why are the changes needed?

This is a test-only change.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

This is a test-only change.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #45538 from shujingyang-db/xml-inference-test.

Lead-authored-by: Shujing Yang <shujing.yang@databricks.com>
Co-authored-by: Shujing Yang <135740748+shujingyang-db@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
yaooqinn and others added 17 commits March 20, 2024 15:56
…ITY timestamps

### What changes were proposed in this pull request?

This PR fixes a bug related to #41843 where epoch seconds were used instead of epoch milliseconds to create a timestamp value

### Why are the changes needed?

bugfix
### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

revised tests
### Was this patch authored or co-authored using generative AI tooling?

no

Closes #45599 from yaooqinn/SPARK-47473.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
…rame`

### What changes were proposed in this pull request?
Enable doctest for `createDataFrame`

### Why are the changes needed?
for test coverage

Commit 536ac30 refined the doctests, and they can be reused in Connect.

### Does this PR introduce _any_ user-facing change?
no, test only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #45606 from zhengruifeng/enable_create_df_doc_test.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…nect.dataframe.DataFrame.isStreaming`

### What changes were proposed in this pull request?
Enable doctest `pyspark.sql.connect.dataframe.DataFrame.isStreaming`

### Why are the changes needed?
for test coverage

the related functions like `readStream` were already implemented

### Does this PR introduce _any_ user-facing change?
no, test-only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #45603 from zhengruifeng/enable_stream_doctest.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?

We have special conversion rules for mapping MySQL boolean synonyms; this PR adds some tests to cover the read and write code paths.

### Why are the changes needed?

test coverage improvements

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

added tests
### Was this patch authored or co-authored using generative AI tooling?

no

Closes #45604 from yaooqinn/SPARK-47478.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…ry collations

### What changes were proposed in this pull request?

### Why are the changes needed?
Currently, all `StringType` arguments passed to built-in string functions in Spark SQL get treated as binary strings. This behaviour is incorrect for almost all collationIds except the default (0), and we should instead warn the user if they try to use an unsupported collation for the given function. Over time, we should implement the appropriate support for these (function, collation) pairs, but until then - we should have a way to fail unsupported statements in query analysis.

### Does this PR introduce _any_ user-facing change?
Yes, users will now get appropriate errors when they try to use an unsupported collation with a given string function.

### How was this patch tested?
Tests in CollationSuite to check if these functions work for binary collations and throw exceptions for others.

### Was this patch authored or co-authored using generative AI tooling?
Yes.

Closes #45422 from uros-db/regexp-functions.

Lead-authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com>
Co-authored-by: Mihailo Milosevic <mihailo.milosevic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?

This PR fixes a regression introduced by [SPARK-46633](https://issues.apache.org/jira/browse/SPARK-46633) (commit 3a6b9ad) where one could not read an empty Avro file, as the reader would be stuck in an infinite loop.

I reverted the reader code to the pre-SPARK-46633 version and updated the handling for empty blocks. When reading empty blocks in Avro, `blockRemaining` could still be read as 0. A call to `hasNext` would load the next block but still return false because of the final `blockRemaining != 0` check. Calling the method again picks up the next non-empty block and fixes the issue.

### Why are the changes needed?

Fixes a regression introduced in [SPARK-46633](https://issues.apache.org/jira/browse/SPARK-46633).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I added a unit test to verify that empty files can be read correctly.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45578 from sadikovi/SPARK-46990.

Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?

This PR introduces support for window aggregates when partitioning is done against expressions with non-binary collation.
The approach is the same as for regular aggregates: instead of doing byte-for-byte comparison against `UnsafeRow`, we fall back to interpreted mode if there is a data type in the grouping expressions that doesn't satisfy the `isBinaryStable` constraint.
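
A hedged sketch of the kind of query this affects (assuming a running `spark` session): rows that compare equal under the collation must land in the same window partition.

```scala
// 'aaa' and 'AAA' are equal under utf8_binary_lcase, so both rows
// should report cnt = 2 once this fix is in place.
spark.sql(
  """select col, count(*) over (partition by col) as cnt
    |from values
    |  ('aaa' collate utf8_binary_lcase),
    |  ('AAA' collate utf8_binary_lcase) as t(col)""".stripMargin).show()
```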

### Why are the changes needed?

The previous implementation returned invalid results.

### Does this PR introduce _any_ user-facing change?

yes - fixes incorrect behavior.

### How was this patch tested?

New test is added in `CollationSuite`.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45568 from dbatomic/win_agg_support_for_collations.

Authored-by: Aleksandar Tomic <aleksandar.tomic@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
…h TINYINT in MySQLDialect

### What changes were proposed in this pull request?

Align mappings of other unsigned numeric types with TINYINT in MySQLDialect. TINYINT maps to ByteType and TINYINT UNSIGNED maps to ShortType.

In this PR, we
- map SMALLINT to ShortType and SMALLINT UNSIGNED to IntegerType. Without this, both of them map to IntegerType
- map MEDIUMINT UNSIGNED to IntegerType, and keep MEDIUMINT as-is. Without this, MEDIUMINT UNSIGNED uses LongType

Other unsigned/signed type mappings remain unchanged; we only improve their test coverage.

### Why are the changes needed?

Consistency and efficiency while reading MySQL numeric values

### Does this PR introduce _any_ user-facing change?

Yes, the mapping changes described in the first section.

### How was this patch tested?

new tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #45588 from yaooqinn/SPARK-47462.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?

This PR aims to upgrade to Apache Hadoop 3.4.0 for Apache Spark 4.0.0.

### Why are the changes needed?

To bring the new features like the following
- https://hadoop.apache.org/docs/r3.4.0
    - [HADOOP-18995](https://issues.apache.org/jira/browse/HADOOP-18995) Upgrade AWS SDK version to 2.21.33 for `S3 Express One Zone`
    - [HADOOP-18328](https://issues.apache.org/jira/browse/HADOOP-18328) Supports `S3 on Outposts`

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45583 from dongjoon-hyun/SPARK-45393.

Lead-authored-by: Dongjoon Hyun <dhyun@apple.com>
Co-authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…tring` method

### What changes were proposed in this pull request?
The private method `getString` in `ArrowDeserializers` is no longer used after SPARK-44449 (#42076); this PR removes it.

### Why are the changes needed?
Code clean up.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #45610 from LuciferYang/SPARK-47486.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…tion warning

### What changes were proposed in this pull request?
Fix RocksDB Logger constructor use to avoid deprecation warning

### Why are the changes needed?
With the latest RocksDB upgrade, the Logger constructor we used was deprecated, which produced a compiler warning.
```
[warn]     val dbLogger = new Logger(dbOptions) {
[warn]                        ^
[warn] one warning found
[warn] two warnings found
[info] compiling 36 Scala sources and 16 Java sources to /Users/anish.shrigondekar/spark/spark/sql/core/target/scala-2.13/classes ...
[warn] -target is deprecated: Use -release instead to compile against the correct platform API.
[warn] Applicable -Wconf / nowarn filters for this warning: msg=<part of the message>, cat=deprecation
[warn] /Users/anish.shrigondekar/spark/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala:851:24: constructor Logger in class Logger is deprecated
[warn] Applicable -Wconf / nowarn filters for this warning: msg=<part of the message>, cat=deprecation, site=org.apache.spark.sql.execution.streaming.state.RocksDB.createLogger.dbLogger, origin=org.rocksdb.Logger.<init>
```

Updated to use the new recommendation as mentioned here - https://javadoc.io/doc/org.rocksdb/rocksdbjni/latest/org/rocksdb/Logger.html

Recommendation:
```
[Logger](https://javadoc.io/static/org.rocksdb/rocksdbjni/8.11.3/org/rocksdb/Logger.html#Logger-org.rocksdb.DBOptions-)([DBOptions](https://javadoc.io/static/org.rocksdb/rocksdbjni/8.11.3/org/rocksdb/DBOptions.html) dboptions)
Deprecated.
Use [Logger(InfoLogLevel)](https://javadoc.io/static/org.rocksdb/rocksdbjni/8.11.3/org/rocksdb/Logger.html#Logger-org.rocksdb.InfoLogLevel-) instead, e.g. new Logger(dbOptions.infoLogLevel()).
```

After the fix, the warning is not seen.
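
A sketch of the non-deprecated pattern from the javadoc above (the log sink is a placeholder, not Spark's actual implementation):

```scala
import org.rocksdb.{DBOptions, InfoLogLevel, Logger, RocksDB}

RocksDB.loadLibrary() // load the native library before creating options

val dbOptions = new DBOptions()
// Construct the Logger from the options' log level rather than from
// DBOptions directly, per the javadoc recommendation.
val dbLogger = new Logger(dbOptions.infoLogLevel()) {
  override protected def log(level: InfoLogLevel, message: String): Unit = {
    println(s"[$level] $message") // placeholder sink
  }
}
```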

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing unit tests

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #45616 from anishshri-db/task/SPARK-47490.

Authored-by: Anish Shrigondekar <anish.shrigondekar@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…et timestamp inference since Spark 3.3

### What changes were proposed in this pull request?

Add migration doc for the behavior change of Parquet timestamp inference since Spark 3.3

### Why are the changes needed?

Show the behavior change to users.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

It's just doc change

### Was this patch authored or co-authored using generative AI tooling?

Yes, there are some doc suggestions from Copilot in docs/sql-migration-guide.md

Closes #45623 from gengliangwang/SPARK-47494.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?

This PR aims to fix a typo: `slf4j-to-jul` should be `jul-to-slf4j`. There is only one occurrence.

```
$ git grep slf4j-to-jul
common/utils/src/main/scala/org/apache/spark/internal/Logging.scala:    // slf4j-to-jul bridge order to route their logs to JUL.
```

Apache Spark uses `jul-to-slf4j`, which includes a `java.util.logging` (JUL) handler, namely `SLF4JBridgeHandler`, which routes all incoming JUL records to the SLF4J API.

https://github.com/apache/spark/blob/bb3e27581887a094ead0d2f7b4a6b2a17ee84b6f/pom.xml#L735
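
For reference, the bridge is typically installed like this (standard slf4j API, shown as an illustration rather than Spark's exact code):

```scala
import org.slf4j.bridge.SLF4JBridgeHandler

// Remove JUL's default handlers, then install the bridge so every
// java.util.logging record is routed to the SLF4J API.
SLF4JBridgeHandler.removeHandlersForRootLogger()
SLF4JBridgeHandler.install()
```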

### Why are the changes needed?

This typo has been there since Apache Spark 1.0.0.

### Does this PR introduce _any_ user-facing change?

No, this is a comment fix.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45625 from dongjoon-hyun/jul-to-slf4j.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?

Improve the executor exit error message on YARN with an additional explanation of the exit codes defined by Spark.

### Why are the changes needed?

Spark defines its own exit codes, which overlap with exit codes defined by YARN, so diagnostics reported by YARN may be misleading. For example, exit code 56 is defined as `HEARTBEAT_FAILURE` in Spark but `INVALID_DOCKER_IMAGE_NAME` in Hadoop, thus the error message displayed in the UI is misleading.

<img width="714" alt="image" src="https://github.com/apache/spark/assets/26535726/b8cf7834-2958-467f-8851-e47ad0f61833">

### Does this PR introduce _any_ user-facing change?

Yes, the UI displays more information when an executor running on YARN exits with a non-zero code.

### How was this patch tested?

Because HEARTBEAT_FAILURE depends on the network and Driver's load, to simplify the test, I just use `select java_method('java.lang.System', 'exit', 56)` to simulate the above case.

<img width="690" alt="image" src="https://github.com/apache/spark/assets/26535726/434f3a82-0bc8-4516-9e0a-f52844bbc9fa">

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44951 from pan3793/SPARK-46920.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
### What changes were proposed in this pull request?
This PR reverts the change from SPARK-47461 and adds some comments to `ExecutorAllocationManager#totalRunningTasksPerResourceProfile` to clarify that the tests in `ExecutorAllocationManagerSuite` need to call `listener.totalRunningTasksPerResourceProfile` within `synchronized`.

### Why are the changes needed?
`ExecutorAllocationManagerSuite` needs to call `listener.totalRunningTasksPerResourceProfile` within `synchronized`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #45602 from LuciferYang/SPARK-47474.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
…bserve`

### What changes were proposed in this pull request?
Enable doctest for `DataFrame.observe`

### Why are the changes needed?
for test coverage

### Does this PR introduce _any_ user-facing change?
no, test-only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #45627 from zhengruifeng/enable_listener_doctest.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…adoop version to 3.4.0

### What changes were proposed in this pull request?

Update IsolatedClientLoader fallback Hadoop version to 3.4.0

### Why are the changes needed?

Sync with the default Hadoop version

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #45628 from pan3793/SPARK-45393-followup.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@GulajavaMinistudio GulajavaMinistudio merged commit 68731cf into GulajavaMinistudio:master Mar 21, 2024
2 of 3 checks passed