Create a new pull request by comparing changes across two branches #1625

Merged
merged 92 commits on Feb 28, 2024

Conversation

GulajavaMinistudio
Owner

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

urosstan-db and others added 30 commits February 20, 2024 22:03
…o explain output

### What changes were proposed in this pull request?
Add the generated JDBC query to the EXPLAIN FORMATTED output when a physical Scan node needs to access a JDBC source to create an RDD.

Output of EXPLAIN FORMATTED with this change, taken from the newly added test:
```
== Physical Plan ==
* Project (2)
+- * Scan org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$14349389d  (1)

(1) Scan org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$14349389d  [codegen id : 1]
Output [1]: [MAX(ID)#x]
Arguments: [MAX(ID)#x], [StructField(MAX(ID),IntegerType,true)], PushedDownOperators(Some(org.apache.spark.sql.connector.expressions.aggregate.Aggregation647d3279),None,None,None,List(),ArraySeq(ID IS NOT NULL, ID > 1)), JDBCRDD[0] at $anonfun$executePhase$2 at LexicalThreadLocal.scala:63, org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$14349389d, Statistics(sizeInBytes=8.0 EiB, ColumnStat: N/A)
External engine query: SELECT MAX("ID") FROM "test"."people"  WHERE ("ID" IS NOT NULL) AND ("ID" > 1)

(2) Project [codegen id : 1]
Output [1]: [MAX(ID)#x AS max(id)#x]
Input [1]: [MAX(ID)#x]
```
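
A hedged usage sketch of how such output can be produced; the catalog name, JDBC URL, and table are illustrative and assume an H2 database reachable from the session:
```
// Illustrative only: register a JDBC v2 catalog and ask for the formatted plan.
// The "External engine query:" line appears for the pushed-down JDBC scan.
spark.conf.set("spark.sql.catalog.h2", "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
spark.conf.set("spark.sql.catalog.h2.url", "jdbc:h2:mem:testdb0")
spark.conf.set("spark.sql.catalog.h2.driver", "org.h2.Driver")

spark.sql("EXPLAIN FORMATTED SELECT MAX(id) FROM h2.test.people WHERE id > 1").show(false)
```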

### Why are the changes needed?
This command will allow customers to see which query text is sent to external JDBC sources.

### Does this PR introduce _any_ user-facing change?
Yes.
Users will see an additional field in the EXPLAIN FORMATTED output for the RowDataSourceScanExec node.

### How was this patch tested?
Tested with a new unit test in the JDBC V2 suite.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #45102 from urosstan-db/add-sql-query-for-external-datasources.

Authored-by: Uros Stankovic <uros.stankovic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?

This PR proposes to upgrade Pandas to 2.2.0.

See [What's new in 2.2.0 (January 19, 2024)](https://pandas.pydata.org/docs/whatsnew/v2.2.0.html)

### Why are the changes needed?

Pandas 2.2.0 is released, and we should support the latest Pandas.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The existing CI should pass

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44881 from itholic/pandas_2.2.0.

Authored-by: Haejoon Lee <haejoon.lee@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…kR tests on Windows

### What changes were proposed in this pull request?

This PR proposes to migrate from AppVeyor to GitHub Actions for SparkR tests on Windows.

### Why are the changes needed?

Reduce the tools we use for better maintenance.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

- [x] Tested in my fork

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45175 from HyukjinKwon/SPARK-47098.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…ecated" compile suppression rules

### What changes were proposed in this pull request?
This PR aims to remove the outdated `Auto-application to () is deprecated` compile suppression rules added by SPARK-45610, because SPARK-47016 already upgraded `scalatest` to 3.2.18 and the issue has been fixed.

### Why are the changes needed?
master has already upgraded `scalatest` to 3.2.18, so the issue described in scalatest/scalatest#2297 has been resolved.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #45179 from LuciferYang/SPARK-45615.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…false` explicitly in CLIs

### What changes were proposed in this pull request?

This PR aims to set `derby.connection.requireAuthentication` to `false` explicitly in CLIs by adding an option at `SparkSubmitCommandBuilder`.

### Why are the changes needed?

Note that the embedded `Apache Derby` is supposed to be used in non-production environments only. However, it's used (or exposed to users) when there is no reachable Hive MetaStore, for example in `spark-shell`, `spark-sql`, and the `Spark ThriftServer`.

https://github.com/apache/spark/blob/9d9675922543e3e5c3b01023e5a756462a1fd308/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L190

Although `derby.connection.requireAuthentication` is supposed to be `false` by default in Apache Derby, this PR makes sure that Apache Spark always sets it explicitly and intentionally.
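
A hedged way to check the effect from a CLI after this change (the property name comes from the PR; the exact launcher plumbing lives in `SparkSubmitCommandBuilder`):
```
// From spark-shell; expected to print "false" once the launcher passes the option explicitly.
println(sys.props.getOrElse("derby.connection.requireAuthentication", "<unset>"))
```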

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45185 from dongjoon-hyun/SPARK-47108.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?

Currently, users get a misleading org.apache.spark.sql.execution.streaming.state.StateSchemaNotCompatible error if they restart a query in the same checkpoint location after changing its stateful operator. This PR catches such errors and throws a new error with an informative message.

After physical planning and before the execution phase, we read the state metadata for the current operator id to fetch the operator name of the committed batch with the same operator id. If the operator name does not match, the new error is thrown.
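
A minimal sketch of the check's shape (method and parameter names are illustrative, not the exact Spark code):
```
def verifyOperatorUnchanged(
    operatorId: Long,
    committedOperatorName: Option[String],   // read from the state metadata, if any
    currentOperatorName: String): Unit = {
  committedOperatorName.foreach { committed =>
    if (committed != currentOperatorName) {
      throw new IllegalStateException(
        s"Streaming query restarted with a different stateful operator for id=$operatorId: " +
        s"the checkpoint has '$committed' but the new plan has '$currentOperatorName'.")
    }
  }
}
```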

### Why are the changes needed?

The current error message is misleading to users. We should provide users with a message that guides them to the real root cause of the error.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44927 from jingz-db/operator-check.

Authored-by: jingz-db <jing.zhan@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
…ATION`

### What changes were proposed in this pull request?

This PR aims to remove redundant `toLowerCase(Locale.ROOT)` transforms during checking `CATALOG_IMPLEMENTATION` values.

### Why are the changes needed?

We already have `checkValues`.

https://github.com/apache/spark/blob/9d9675922543e3e5c3b01023e5a756462a1fd308/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/StaticSQLConf.scala#L52

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Manually I checked the following. I believe these are all occurrences.

```
$ git grep -C1 '.toLowerCase(Locale.ROOT)' | grep '"hive'
repl/src/main/scala/org/apache/spark/repl/Main.scala-            .get(CATALOG_IMPLEMENTATION.key, "hive")
repl/src/main/scala/org/apache/spark/repl/Main.scala:            .toLowerCase(Locale.ROOT) == "hive") {
sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala:          jsc.sc.conf.get(CATALOG_IMPLEMENTATION.key, "hive").toLowerCase(Locale.ROOT) ==
sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala-            "hive" &&
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSchemaInferenceSuite.scala-        provider = Option("hive"),
```
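
A hedged, self-contained illustration of why the simplification is safe (the literal stands in for the conf lookup shown above; `checkValues` restricts the config to its declared values):
```
import java.util.Locale

val catalogImplementation = "hive"  // stand-in for conf.get(CATALOG_IMPLEMENTATION.key, "hive")
val withLowerCase = catalogImplementation.toLowerCase(Locale.ROOT) == "hive"
val withoutLowerCase = catalogImplementation == "hive"
assert(withLowerCase == withoutLowerCase)
```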

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45184 from dongjoon-hyun/SPARK_CATALOG_IMPLEMENTATION.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?

This PR aims to upgrade `commons-compress` to 1.26.0.

### Why are the changes needed?

To bring the latest bug fixes.
- https://commons.apache.org/proper/commons-compress/changes-report.html#a1.26.0

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45189 from dongjoon-hyun/SPARK-47109.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…and docker image to 16.2

### What changes were proposed in this pull request?

This PR aims to upgrade `PostgreSQL` JDBC driver and docker images.
- JDBC Driver: `org.postgresql:postgresql` from 42.7.0 to 42.7.2
- Docker Image: `postgres` from `15.1-alpine` to `16.2-alpine`

### Why are the changes needed?

To use the latest PostgreSQL combination in the following integration tests.

- PostgresIntegrationSuite
- PostgresKrbIntegrationSuite
- GeneratedSubquerySuite
- PostgreSQLQueryTestSuite
- v2/PostgresIntegrationSuite
- v2/PostgresNamespaceSuite

### Does this PR introduce _any_ user-facing change?

No. This is a pure test-environment update.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45191 from dongjoon-hyun/SPARK-47111.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…ecution/StreamExecution

### What changes were proposed in this pull request?

To improve code clarity and maintainability, I propose that we move all the variables that track mutable state and metrics for a streaming query into a separate class.  With this refactor, it would be easy to track and find all the mutable state a microbatch can have.
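
A hedged sketch of the refactor's shape (class and field names are illustrative, not necessarily those used in the PR):
```
// Mutable per-run state and metrics grouped into one holder instead of living as
// scattered fields on StreamExecution / MicroBatchExecution.
class StreamExecutionContext {
  @volatile var currentBatchId: Long = -1L
  @volatile var isCurrentBatchConstructed: Boolean = false
  val executionStats = scala.collection.mutable.Map.empty[String, Long]  // simplified metrics
}
```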

### Why are the changes needed?

To improve code clarity and maintainability. All the state and metrics needed for the execution lifecycle of a microbatch are consolidated into one class. If we decide to modify or add state to a streaming query, it will be easier to determine 1) where to add it and 2) what existing state there is.

### Does this PR introduce _any_ user-facing change?

No
### How was this patch tested?

Existing tests should suffice

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #45109 from jerrypeng/SPARK-47052.

Authored-by: Jerry Peng <jerry.peng@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
…en build

### What changes were proposed in this pull request?

This PR is a followup of #45171, which broke the scheduled build on macos-14.
Here I remove the TTY-specific workaround in the Maven build, and skip the `AmmoniteTest` tests that need the workaround.
We should enable the tests back when the bug is fixed (see #40675 (comment)).

### Why are the changes needed?

To fix up the build; it fails at https://github.com/apache/spark/actions/runs/7979285164

See also #45186 (comment)

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

In my fork.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45186 from HyukjinKwon/SPARK-47095-followup.

Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?

This PR adds the ListState implementation for State API v2. Since a list contains multiple values for a single key, we utilize the RocksDB merge operator to persist multiple values.

Changes include (a hedged encoding sketch follows this list):

1. A new encoder/decoder to encode multiple values inside a single byte[] array (stored in RocksDB). The encoding scheme is compatible with the RocksDB StringAppendOperator merge operator.
2. Support for merge operations in ChangelogCheckpointing v2.
3. Extending StateStore to support the merge operation and reading multiple values for a single key (via an Iterator). Note that these changes are currently only supported for RocksDB.
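
A minimal, self-contained sketch of packing multiple values into one byte array with a per-element length prefix; the actual encoding used by the PR (and its compatibility handling for the StringAppendOperator delimiter) may differ:
```
import java.nio.ByteBuffer

// Pack each value as [4-byte length][bytes] so many values fit in one RocksDB value.
def encode(values: Seq[Array[Byte]]): Array[Byte] = {
  val buf = ByteBuffer.allocate(values.map(_.length + 4).sum)
  values.foreach { v => buf.putInt(v.length); buf.put(v) }
  buf.array()
}

// Walk the buffer and slice the values back out.
def decode(bytes: Array[Byte]): Seq[Array[Byte]] = {
  val buf = ByteBuffer.wrap(bytes)
  val out = Seq.newBuilder[Array[Byte]]
  while (buf.hasRemaining) {
    val v = new Array[Byte](buf.getInt())
    buf.get(v)
    out += v
  }
  out.result()
}

assert(decode(encode(Seq("a".getBytes, "bc".getBytes))).map(new String(_)) == Seq("a", "bc"))
```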

### Why are the changes needed?

These changes are needed to support list values in the State Store. They are part of the work on adding a new stateful streaming operator for arbitrary state management that provides the new features listed in the SPIP JIRA here: https://issues.apache.org/jira/browse/SPARK-45939

### Does this PR introduce _any_ user-facing change?

Yes
This PR introduces a new state type (ListState) that users can use in their Spark streaming queries.

### How was this patch tested?

1. Added a new test suite for ListState to ensure the state produces correct results.
2. Added additional testcases for input validation.
3. Added tests for merge operator with RocksDB.
4. Added tests for changelog checkpointing merge operator.
5. Added tests for reading merged values in RocksDBStateStore.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44961 from sahnib/state-api-v2-list-state.

Authored-by: Bhuwan Sahni <bhuwan.sahni@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?

Revert [SPARK-35878][CORE] Add fs.s3a.endpoint if unset and fs.s3a.endpoint.region is null

Removing the region/endpoint patching code of SPARK-35878 avoids authentication problems with versions of the S3A connector built with the AWS v2 SDK, as is the case in Hadoop 3.4.0.

That is: if fs.s3a.endpoint is unset it will stay unset.

The v2 SDK does its binding to AWS services differently, in what can be described as "region first" binding. Spark setting the endpoint blocks S3 Express support and is incompatible with HADOOP-18975 (S3A: Add option fs.s3a.endpoint.fips to use AWS FIPS endpoints):

- apache/hadoop#6277

The change is compatible with all releases of the s3a connector other than hadoop 3.3.1 binaries deployed outside EC2 and without the endpoint explicitly set.

### Why are the changes needed?

The AWS v2 SDK has a different/complex binding mechanism; it doesn't need the endpoint to be set if the region (fs.s3a.region) value is set. This means the Spark code that fixes an endpoint is not only unneeded, it causes problems when trying to use specific storage options (S3 Express) or security options (FIPS).
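
A hedged configuration sketch (the region value is illustrative): with a v2-SDK-based S3A connector, setting the region is enough, and after this revert Spark no longer injects a default `fs.s3a.endpoint`:
```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-region-first")
  .config("spark.hadoop.fs.s3a.endpoint.region", "eu-west-1")  // region-first binding; endpoint stays unset
  .getOrCreate()
```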

### Does this PR introduce _any_ user-facing change?

Only visible with the Hadoop 3.3.1 s3a connector when deployed outside of EC2, the situation the original patch was added to work around. All other 3.3.x releases are fine.

### How was this patch tested?

Removed some obsolete tests. Relying on GitHub and Jenkins to do the testing, so marking this PR as WIP until they are happy.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44834 from steveloughran/SPARK-46793-revert-region-fixup-SPARK-35878.

Authored-by: Steve Loughran <stevel@cloudera.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?

This PR proposes to use more memory during Maven builds. GitHub Actions runners now have more memory than before (https://docs.github.com/en/actions/using-github-hosted-runners/about-larger-runners/about-larger-runners), so we can increase it.

https://github.com/HyukjinKwon/spark/actions/runs/7984135094/job/21800463337

### Why are the changes needed?

For stable Maven builds.
Some tests consistently fail:

```
*** RUN ABORTED ***
An exception or error caused a run to abort: unable to create native thread: possibly out of memory or process/resource limits reached
  java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
  at java.base/java.lang.Thread.start0(Native Method)
  at java.base/java.lang.Thread.start(Thread.java:1553)
  at java.base/java.lang.System$2.start(System.java:2577)
  at java.base/jdk.internal.vm.SharedThreadContainer.start(SharedThreadContainer.java:152)
  at java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:953)
  at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1364)
  at org.apache.spark.rpc.netty.SharedMessageLoop.$anonfun$threadpool$1(MessageLoop.scala:128)
  at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:190)
  at org.apache.spark.rpc.netty.SharedMessageLoop.<init>(MessageLoop.scala:127)
  at org.apache.spark.rpc.netty.Dispatcher.sharedLoop$lzycompute(Dispatcher.scala:46)
  ...
Warning:  The requested profile "volcano" could not be activated because it does not exist.
Warning:  The requested profile "hive" could not be activated because it does not exist.
Error:  Failed to execute goal org.scalatest:scalatest-maven-plugin:2.2.0:test (test) on project spark-core_2.13: There are test failures -> [Help 1]
Error:
Error:  To see the full stack trace of the errors, re-run Maven with the -e switch.
Error:  Re-run Maven using the -X switch to enable full debug logging.
Error:
Error:  For more information about the errors and possible solutions, please read the following articles:
Error:  [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
Error:
Error:  After correcting the problems, you can resume the build with the command
Error:    mvn <args> -rf :spark-core_2.13
Error: Process completed with exit code 1.
```

### Does this PR introduce _any_ user-facing change?

No, dev-only

### How was this patch tested?

Will monitor the scheduled jobs. It's a simple memory configuration change.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45195 from HyukjinKwon/bigger-macos.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?

Revert [SPARK-35878][CORE] Add fs.s3a.endpoint if unset and fs.s3a.endpoint.region is null

Removing the region/endpoint patching code of SPARK-35878 avoids authentication problems with versions of the S3A connector built with the AWS v2 SDK, as is the case in Hadoop 3.4.0.

That is: if fs.s3a.endpoint is unset it will stay unset.

The v2 SDK does its binding to AWS services differently, in what can be described as "region first" binding. Spark setting the endpoint blocks S3 Express support and is incompatible with HADOOP-18975 (S3A: Add option fs.s3a.endpoint.fips to use AWS FIPS endpoints):

- apache/hadoop#6277

The change is compatible with all releases of the s3a connector other than hadoop 3.3.1 binaries deployed outside EC2 and without the endpoint explicitly set.

### Why are the changes needed?

The AWS v2 SDK has a different/complex binding mechanism; it doesn't need the endpoint to be set if the region (fs.s3a.region) value is set. This means the Spark code that fixes an endpoint is not only unneeded, it causes problems when trying to use specific storage options (S3 Express) or security options (FIPS).

### Does this PR introduce _any_ user-facing change?

Only visible with the Hadoop 3.3.1 s3a connector when deployed outside of EC2, the situation the original patch was added to work around. All other 3.3.x releases are fine.

### How was this patch tested?

Removed some obsolete tests. Relying on GitHub and Jenkins to do the testing, so marking this PR as WIP until they are happy.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #45193 from dongjoon-hyun/SPARK-47113.

Authored-by: Steve Loughran <stevel@cloudera.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?

We used to write Log4j logs into `target/unit-tests.log` instead of the console. This seems to be broken in the SparkR Windows job. This PR fixes it.

### Why are the changes needed?

https://github.com/apache/spark/actions/runs/7977185456/job/21779508822#step:10:89
This writes too many logs, making it difficult to see the real test output.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

In my fork

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45192 from HyukjinKwon/reduce-logs.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…ws build to avoid warnings

### What changes were proposed in this pull request?

This PR installs Python 3.11 in SparkR build on Windows.

### Why are the changes needed?

To remove unrelated warnings (https://github.com/HyukjinKwon/spark/actions/runs/7985005685/job/21802732830):

```
Traceback (most recent call last):
  File "C:\hostedtoolcache\windows\Python\3.7.9\x64\lib\runpy.py", line 183, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "C:\hostedtoolcache\windows\Python\3.7.9\x64\lib\runpy.py", line 109, in _get_module_details
    __import__(pkg_name)
  File "D:\a\spark\spark\python\lib\pyspark.zip\pyspark\__init__.py", line [53](https://github.com/HyukjinKwon/spark/actions/runs/7985005685/job/21802732830#step:10:54), in <module>
  File "D:\a\spark\spark\python\lib\pyspark.zip\pyspark\rdd.py", line [54](https://github.com/HyukjinKwon/spark/actions/runs/7985005685/job/21802732830#step:10:55), in <module>
  File "D:\a\spark\spark\python\lib\pyspark.zip\pyspark\java_gateway.py", line 33, in <module>
  File "D:\a\spark\spark\python\lib\pyspark.zip\pyspark\serializers.py", line 69, in <module>
  File "D:\a\spark\spark\python\lib\pyspark.zip\pyspark\cloudpickle\__init__.py", line 1, in <module>
  File "D:\a\spark\spark\python\lib\pyspark.zip\pyspark\cloudpickle\cloudpickle.py", line 80, in <module>
ImportError: cannot import name 'CellType' from 'types' (C:\hostedtoolcache\windows\Python\3.7.9\x64\lib\types.py)
```

SparkR build does not need Python. However, it shows warnings when the Python version is too low during the attempt to look up Python Data Sources for session initialization. The Windows 2019 runner includes Python 3.7, which Spark does not support. Therefore, we simply install the proper Python for simplicity.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45196 from HyukjinKwon/python-errors.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
* Update SBT build file to remove the exclusion rule for `javax-servlet-api` package.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI build

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #45194 from HiuKwok/ft-hf-SPARK-46938-exclude-javax-rule.

Authored-by: HiuFung Kwok <hiufkwok@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…`paramIndex` for the error class `UNEXPECTED_INPUT_TYPE`

### What changes were proposed in this pull request?
The PR aims to use `ordinalNumber` to uniformly set the value of `paramIndex` for the error class `UNEXPECTED_INPUT_TYPE`.

### Why are the changes needed?
When I was reviewing the Spark code, I found that:
- Some expressions use a starting value of 1 when throwing the error `DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE`, e.g.:
https://github.com/apache/spark/blob/b0aad59f123581b66515c864873f46ea4ec4e762/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala#L300-L302

- Some use 0, e.g.:
https://github.com/apache/spark/blob/b0aad59f123581b66515c864873f46ea4ec4e762/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/bitmapExpressions.scala#L117-L119

**We should unify it to avoid inconsistent interpretations.**
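
A hedged, self-contained illustration of the convention: `ordinalNumber` turns a 0-based parameter index into the word used in the message, so every call site passes 0 for the first parameter instead of mixing 0- and 1-based values (this mirrors, but is not copied from, the helper in `QueryErrorsBase`):
```
def ordinalNumber(i: Int): String = i match {
  case 0 => "first"
  case 1 => "second"
  case 2 => "third"
  case n => s"${n + 1}th"
}

assert(ordinalNumber(0) == "first")   // first argument -> paramIndex = ordinalNumber(0)
```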

### Does this PR introduce _any_ user-facing change?
Yes, the value of `paramIndex` for the error class `UNEXPECTED_INPUT_TYPE` is now uniformly set by `ordinalNumber`.

### How was this patch tested?
- Updated existing UTs.
- Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #45177 from panbingkun/SPARK-47099.

Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
…ed files if files were deleted from local directory

### What changes were proposed in this pull request?

This change cleans up any dangling files tracked as previously uploaded if they were removed from the filesystem. The removal can happen due to a compaction racing in parallel with a commit, where compaction completes after the commit and an older version is loaded on the same executor. A hedged sketch of the idea follows the scenario below.

### Why are the changes needed?

The changes are needed to prevent RocksDB versionId mismatch errors (which require users to clean the checkpoint directory and retry the query).

A particular scenario where this can happen is provided below:

1. Version V1 is loaded on executor A, RocksDB State Store has 195.sst, 196.sst, 197.sst and 198.sst files.
2. State changes are made, which result in creation of a new table file 200.sst.
3. State store is committed as version V2. The SST file 200.sst (as 000200-8c80161a-bc23-4e3b-b175-cffe38e427c7.sst) is uploaded to DFS, and previous 4 files are reused. A new metadata file is created to track the exact SST files with unique IDs, and uploaded with RocksDB Manifest as part of V1.zip.
4. RocksDB compaction is triggered at the same time. The compaction creates a new L1 file (201.sst), and deletes the existing 5 SST files.
5. Spark Stage is retried.
6. Version V1 is reloaded on the same executor. The local files are inspected, and 201.sst is deleted. The 4 SST files in version V1 are downloaded again to local file system.
7. Any local files which are deleted (as part of version load) are also removed from local → DFS file upload tracking. **However, the files already deleted as a result of compaction are not removed from tracking. This is the bug which resulted in the failure.**
8. State store is committed as version V1. However, the local mapping of SST files to DFS file path still has 200.sst in its tracking, hence the SST file is not re-uploaded. A new metadata file is created to track the exact SST files with unique IDs, and uploaded with the new RocksDB Manifest as part of V2.zip. (The V2.zip file is overwritten here atomically)
9. A new executor tried to load version V2. However, the SST files in (1) are now incompatible with Manifest file in (6) resulting in the version Id mismatch failure.
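
A hedged sketch of the fix's idea (names are illustrative, not the exact Spark code): when a version is loaded, any entry in the local-file to DFS-file tracking map whose local file no longer exists is dropped, so the file is re-uploaded on the next commit instead of being assumed present:
```
import java.io.File
import scala.collection.mutable

val localFilesToDfsFiles = mutable.Map.empty[String, String]  // local SST name -> DFS file name

def pruneDanglingTrackedFiles(localDir: File): Unit = {
  val existing = Option(localDir.list()).getOrElse(Array.empty[String]).toSet
  localFilesToDfsFiles.keys.toSeq
    .filterNot(existing.contains)
    .foreach(localFilesToDfsFiles.remove)
}
```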

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added unit test cases to cover the scenario where some files were deleted on the file system.

The test case fails on the existing master with the error `Mismatch in unique ID on table file 16`, and succeeds with the changes in this PR.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #45092 from sahnib/rocksdb-compaction-file-tracking-fix.

Authored-by: Bhuwan Sahni <bhuwan.sahni@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?

This PR aims to provide a new profile, `hive-jackson-provided`, for Apache Spark 4.0.0.

### Why are the changes needed?

Since Apache Hadoop 3.3.5, only Apache Hive requires old CodeHaus Jackson dependencies.

Apache Spark 3.5.0 tried to eliminate them completely, but that was reverted due to Hive UDF support.
- #40893
- #42446

To allow Apache Spark 4.0 users:
- To provide their own CodeHaus Jackson libraries
- To exclude them completely if they don't use `Hive UDF`s.

### Does this PR introduce _any_ user-facing change?

No, this is a new profile.

### How was this patch tested?

Pass the CIs and manual build.

**Without `hive-jackson-provided`**
```
$ dev/make-distribution.sh -Phive,hive-thriftserver
$ ls -al dist/jars/*asl*
-rw-r--r--  1 dongjoon  staff  232248 Feb 21 10:53 dist.org/jars/jackson-core-asl-1.9.13.jar
-rw-r--r--  1 dongjoon  staff  780664 Feb 21 10:53 dist.org/jars/jackson-mapper-asl-1.9.13.jar
```

**With `hive-jackson-provided`**
```
$ dev/make-distribution.sh -Phive,hive-thriftserver,hive-jackson-provided
$ ls -al dist/jars/*asl*
zsh: no matches found: dist/jars/*asl*

$ ls -al dist/jars/*hive*
-rw-r--r--  1 dongjoon  staff    183633 Feb 21 11:00 dist/jars/hive-beeline-2.3.9.jar
-rw-r--r--  1 dongjoon  staff     44704 Feb 21 11:00 dist/jars/hive-cli-2.3.9.jar
-rw-r--r--  1 dongjoon  staff    436169 Feb 21 11:00 dist/jars/hive-common-2.3.9.jar
-rw-r--r--  1 dongjoon  staff  10840949 Feb 21 11:00 dist/jars/hive-exec-2.3.9-core.jar
-rw-r--r--  1 dongjoon  staff    116364 Feb 21 11:00 dist/jars/hive-jdbc-2.3.9.jar
-rw-r--r--  1 dongjoon  staff    326585 Feb 21 11:00 dist/jars/hive-llap-common-2.3.9.jar
-rw-r--r--  1 dongjoon  staff   8195966 Feb 21 11:00 dist/jars/hive-metastore-2.3.9.jar
-rw-r--r--  1 dongjoon  staff    916630 Feb 21 11:00 dist/jars/hive-serde-2.3.9.jar
-rw-r--r--  1 dongjoon  staff   1679366 Feb 21 11:00 dist/jars/hive-service-rpc-3.1.3.jar
-rw-r--r--  1 dongjoon  staff     53902 Feb 21 11:00 dist/jars/hive-shims-0.23-2.3.9.jar
-rw-r--r--  1 dongjoon  staff      8786 Feb 21 11:00 dist/jars/hive-shims-2.3.9.jar
-rw-r--r--  1 dongjoon  staff    120293 Feb 21 11:00 dist/jars/hive-shims-common-2.3.9.jar
-rw-r--r--  1 dongjoon  staff     12923 Feb 21 11:00 dist/jars/hive-shims-scheduler-2.3.9.jar
-rw-r--r--  1 dongjoon  staff    258346 Feb 21 11:00 dist/jars/hive-storage-api-2.8.1.jar
-rw-r--r--  1 dongjoon  staff    581739 Feb 21 11:00 dist/jars/spark-hive-thriftserver_2.13-4.0.0-SNAPSHOT.jar
-rw-r--r--  1 dongjoon  staff    687446 Feb 21 11:00 dist/jars/spark-hive_2.13-4.0.0-SNAPSHOT.jar
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45201 from dongjoon-hyun/SPARK-47119.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…unsafe projection

### What changes were proposed in this pull request?

Change `TakeOrderedAndProjectExec#executeCollect` and `TakeOrderedAndProjectExec#doExecute` to initialize the unsafe projection before using it to produce output rows.

### Why are the changes needed?

Because the unsafe projection is not initialized, non-deterministic expressions also don't get initialized. This results in errors when the projection contains non-deterministic expressions. For example:
```
create or replace temp view v1(id, name) as values
(1, "fred"),
(2, "bob");

cache table v1;

select name, uuid() as _iid from (
  select * from v1 order by name
)
limit 20;
```
This query produces the following error:
```
java.lang.NullPointerException: Cannot invoke "org.apache.spark.sql.catalyst.util.RandomUUIDGenerator.getNextUUIDUTF8String()" because "this.randomGen_0" is null
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
	at org.apache.spark.sql.execution.TakeOrderedAndProjectExec.$anonfun$executeCollect$6(limit.scala:297)
	at scala.collection.ArrayOps$.map$extension(ArrayOps.scala:934)
	at org.apache.spark.sql.execution.TakeOrderedAndProjectExec.$anonfun$executeCollect$1(limit.scala:297)
...
```
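
A self-contained analogue of the bug and the fix (illustrative; the real code calls `initialize` on the generated `UnsafeProjection`, which in turn initializes fields like the `randomGen_0` seen in the stack trace):
```
class NonDeterministicProjection {
  private var rng: java.util.Random = _          // stays null until initialize(), like randomGen_0
  def initialize(partitionIndex: Int): Unit = { rng = new java.util.Random(partitionIndex) }
  def apply(name: String): (String, Long) = (name, rng.nextLong())  // NPE if initialize() is skipped
}

val proj = new NonDeterministicProjection
proj.initialize(0)                               // the fix: initialize before producing output rows
println(proj.apply("fred"))
```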

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45199 from bersprockets/take_ordered_issue.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?

This PR aims to pin `buf-setup-action` to `v1.29.0`.

### Why are the changes needed?

To recover the broken CIs due to the latest `v1.29.0-1` issue.

- https://github.com/apache/spark/actions/runs/7995430821/job/21835769202

![Screenshot 2024-02-21 at 13 09 12](https://github.com/apache/spark/assets/9700541/4de2cbf9-bfb7-4f3c-9d77-dda043c36276)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs on this PR. It passed already.

- https://github.com/dongjoon-hyun/spark/actions/runs/7995539640/job/21836175997

![Screenshot 2024-02-21 at 13 10 56](https://github.com/apache/spark/assets/9700541/f4ae4fa3-60c1-41f3-a839-bad8088badca)

![Screenshot 2024-02-21 at 13 12 02](https://github.com/apache/spark/assets/9700541/df5af647-5666-4fcb-b89f-c63fdcd90170)

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45205 from dongjoon-hyun/buf-setup-action-2.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…neSchedulerBackend shutdown

### What changes were proposed in this pull request?

This PR adds logic to avoid uncaught `RejectedExecutionException`s while `StandaloneSchedulerBackend` is shutting down.

When the backend is shut down, its `stop()` method calls `executorDelayRemoveThread.shutdownNow()`. After this point, though, it's possible that its `StandaloneDriverEndpoint` might still process `onDisconnected` events and those might trigger calls to schedule new tasks on the `executorDelayRemoveThread`. This causes uncaught `java.util.concurrent.RejectedExecutionException`s to be thrown in RPC threads.

This patch adds a `try-catch` to catch those exceptions and log a short warning if they occur while the scheduler is stopping (a hedged sketch follows the links below). This approach is consistent with other similar code in Spark, including:

- https://github.com/apache/spark/blob/9b53b803998001e4b706666e37e5f86f900a7430/core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala#L160-L163
- https://github.com/apache/spark/blob/9b53b803998001e4b706666e37e5f86f900a7430/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala#L754-L756
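
A hedged sketch of the pattern (names are illustrative): the scheduling call is wrapped so a rejection during shutdown is logged as a short warning instead of escaping into the RPC thread:
```
import java.util.concurrent.{Executors, RejectedExecutionException, TimeUnit}

val executorDelayRemoveThread = Executors.newSingleThreadScheduledExecutor()
@volatile var stopping = false

def scheduleExecutorRemoval(task: Runnable): Unit = {
  try {
    executorDelayRemoveThread.schedule(task, 5, TimeUnit.SECONDS)
  } catch {
    case e: RejectedExecutionException if stopping =>
      // Expected while shutting down; keep it to a short warning instead of an uncaught exception.
      println(s"Skipping executor removal scheduling because the backend is stopping: $e")
  }
}
```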

### Why are the changes needed?

Remove log and exception noise.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No new tests: it is difficult to reliably reproduce the scenario that leads to the log noise.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45203 from JoshRosen/reduce-scheduler-backend-shutdown-noise.

Authored-by: Josh Rosen <joshrosen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…cos-14-large)

### What changes were proposed in this pull request?

This PR addresses #45195 (review) by using larger runner.
However, I do not change the memory at `.github/workflows/maven_test.yml` because that is shared by other standard runners.

We're using GitHub Enterprise, so these runners are available.

### Why are the changes needed?

It still fails: https://github.com/apache/spark/actions/runs/7994847558. My speculation is that it is related to the available resources.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Will monitor the scheduled jobs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45211 from HyukjinKwon/SPARK-47115-followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…sitories by default

### What changes were proposed in this pull request?

This PR proposes to skip scheduled SparkR on Windows in fork repositories by default

### Why are the changes needed?

To be consistent with other scheduled jobs. We encourage contributors to enable these GitHub Actions explicitly in their forked repositories if they need them, so it is better to disable them by default.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually. Uses https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idif to be consistent with other scheduled jobs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45208 from HyukjinKwon/SPARK-47124.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…nd remove check nested type definition in `HiveExternalCatalog.verifyDataSchema`

### What changes were proposed in this pull request?

> In Hive 0.13 and later, column names can contain any [Unicode](http://en.wikipedia.org/wiki/List_of_Unicode_characters) character (see [HIVE-6013](https://issues.apache.org/jira/browse/HIVE-6013)), however, dot (.) and colon (:) yield errors on querying, so they are disallowed in Hive 1.2.0 (see [HIVE-10120](https://issues.apache.org/jira/browse/HIVE-10120)). Any column name that is specified within backticks (`) is treated literally. Within a backtick string, use double backticks (``) to represent a backtick character. Backtick quotation also enables the use of reserved keywords for table and column identifiers.

According to the Hive documentation, column names can contain any character from the Unicode set.

This PR changes HiveExternalCatalog.verifyDataSchema to (a hedged sketch follows this list):

- Allow commas to be used in top-level column names
- Remove the check for invalid characters in nested type definitions, which was hard-coded to ",:;" and turned out to be incomplete (for example, "^%" and similar characters are also not allowed). Those checks are now delegated to the Hive API calls instead.
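
A hedged sketch of the direction (a hypothetical helper, not the exact Spark code): fewer characters are rejected up front for top-level column names, and everything else is left for the Hive API calls to reject with a Spark-wrapped error:
```
def verifyTopLevelColumnName(name: String): Unit = {
  // Dot and colon are the characters the quoted Hive doc calls out as disallowed.
  val invalid = Seq(".", ":").filter(c => name.contains(c))
  if (invalid.nonEmpty) {
    throw new IllegalArgumentException(
      s"Column name '$name' contains invalid character(s): ${invalid.mkString(", ")}")
  }
}

verifyTopLevelColumnName("a,b")   // commas are now allowed at the top level
```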

### Why are the changes needed?

improvement

### Does this PR introduce _any_ user-facing change?

Yes, some special characters are now allowed, and some invalid characters now raise Spark errors instead of Hive metastore errors.

### How was this patch tested?

new tests

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #45180 from yaooqinn/SPARK-47101.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?

This PR proposes to prevent `null` from `tokenizer.getContext`. This is similar to #28029. `getContext` comes from the univocity library and can return null if `beginParsing` is not invoked (https://github.com/uniVocity/univocity-parsers/blob/master/src/main/java/com/univocity/parsers/common/AbstractParser.java#L53). This can happen when `parseLine` is not invoked at https://github.com/apache/spark/blob/e081f06ea401a2b6b8c214a36126583d35eaf55f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala#L300, since `parseLine` is what invokes `beginParsing`.
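
A hedged, self-contained illustration of the guard (the actual PR protects the `tokenizer.getContext` access; this only shows the null-safe pattern):
```
def currentInput(context: AnyRef): Option[String] =
  Option(context).map(_.toString)   // never dereference a possibly-null context directly

assert(currentInput(null).isEmpty)              // before beginParsing(): context is null, no NPE
assert(currentInput("row 1").contains("row 1"))
```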

### Why are the changes needed?

To fix up a bug.

### Does this PR introduce _any_ user-facing change?

Yes. In a very rare case, when `CsvToStructs` is used as the sole predicate against an empty row, it might trigger an NPE. This PR fixes it.

### How was this patch tested?

Manually tested; a test case will be added in a separate PR. We should backport this to all branches.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45210 from HyukjinKwon/SPARK-47125.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?

This PR aims to update `SKIP_SPARK_RELEASE_VERSIONS` in Maven CIs in order to prevent failures.

### Why are the changes needed?

- To skip newly released Apache Spark 3.5.1
- To remove deleted Apache Spark 3.3.4 and 3.5.0.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This should be tested after merging.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45212 from dongjoon-hyun/SKIP_SPARK_RELEASE_VERSIONS.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
dongjoon-hyun and others added 14 commits February 26, 2024 20:27
…f registered workers

### What changes were proposed in this pull request?

This PR aims to fix `MasterSuite` to validate the number of registered workers during `SPARK-46881: scheduling with workerSelectionPolicy *` tests.

### Why are the changes needed?

To fix flakiness.
- https://github.com/apache/spark/actions/runs/8042308713/job/21962794853#step:10:17224

```
[info] - SPARK-46881: scheduling with workerSelectionPolicy - CORES_FREE_DESC (false) *** FAILED *** (178 milliseconds)
[info]   List("10004") did not equal List("10005") (MasterSuite.scala:728)
[info]   Analysis:
[info]   List(0: "10004" -> "10005")
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45274 from dongjoon-hyun/SPARK-47181.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?

This PR changes `AssertOnQuery(<condition>, )` to `AssertOnQuery(<condition>)` when the message is empty.

### Why are the changes needed?

Just to make it a little bit prettier and more readable.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45275 from HyukjinKwon/minor-ss.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…unction

### What changes were proposed in this pull request?

This PR proposes a new helper function for tree traversal: `resolveExpressionsUpWithPruning`. This helper traverses all expressions of a query tree bottom up, skipping a subtree if the condition returns false.
### Why are the changes needed?

Without this helper function, a developer will need to combine `plan.resolveOperatorsUpWithPruning` and `case p: LogicalPlan => p.transformExpressionsUpWithPruning` to achieve the same thing.
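
A self-contained analogue of what such a helper provides (Spark's real version works on `TreeNode`/`TreePattern` bits; the tiny expression tree here is only illustrative):
```
sealed trait Expr { def children: Seq[Expr] }
case class Lit(v: Int) extends Expr { val children: Seq[Expr] = Nil }
case class Add(l: Expr, r: Expr) extends Expr { val children: Seq[Expr] = Seq(l, r) }

// Bottom-up transform that skips a whole subtree when the pruning condition is false.
def transformUpWithPruning(e: Expr)(cond: Expr => Boolean)(rule: PartialFunction[Expr, Expr]): Expr =
  if (!cond(e)) e
  else {
    val rewrittenChildren = e match {
      case Add(l, r) => Add(transformUpWithPruning(l)(cond)(rule), transformUpWithPruning(r)(cond)(rule))
      case other => other
    }
    rule.applyOrElse(rewrittenChildren, identity[Expr])
  }

// Fold literal additions across the whole tree (the condition is trivially true here).
val folded = transformUpWithPruning(Add(Lit(1), Add(Lit(2), Lit(3))))(_ => true) {
  case Add(Lit(a), Lit(b)) => Lit(a + b)
}
assert(folded == Lit(6))
```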

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing UT

### Was this patch authored or co-authored using generative AI tooling?

NO

Closes #45270 from amaliujia/analysis_helper.

Authored-by: Rui Wang <rui.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…cies from `commons-compress` and `avro*`

### Why are the changes needed?

This PR aims to exclude `commons-(io|lang3)` transitive dependencies from `commons-compress`, `avro`, and `avro-mapred` dependencies.

### Does this PR introduce _any_ user-facing change?

Apache Spark defines and uses its own versions of these libraries. Excluding the transitive dependencies will clarify that.

https://github.com/apache/spark/blob/1a408033daf458f1ceebbe14a560355a1a2c0a70/pom.xml#L198

https://github.com/apache/spark/blob/1a408033daf458f1ceebbe14a560355a1a2c0a70/pom.xml#L194

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45278 from dongjoon-hyun/SPARK-47182.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…ataframe` reusable

### What changes were proposed in this pull request?
Make `test_repartitionByRange_dataframe` reusable

### Why are the changes needed?
to make it reusable in Spark Connect

### Does this PR introduce _any_ user-facing change?
no, test-only

### How was this patch tested?
updated ut

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #45281 from zhengruifeng/connect_test_repartitionByRange_dataframe.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
…st_pandas_api`

### What changes were proposed in this pull request?
Enable `DataFrameParityTests.test_pandas_api`

### Why are the changes needed?
for testing parity, this method had already been implemented

### Does this PR introduce _any_ user-facing change?
no, test-only

### How was this patch tested?
enabled ut

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #45279 from zhengruifeng/connect_test_pandas_api.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
### What changes were proposed in this pull request?
Fix the error class for `sameSemantics`

### Why are the changes needed?
the expected type should be `DataFrame`

### Does this PR introduce _any_ user-facing change?
yes

### How was this patch tested?
updated test

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #45280 from zhengruifeng/py_fix_error_sameSemantics.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
… better debuggability

### What changes were proposed in this pull request?

This PR proposes to improve error message from SparkThrowableSuite for better debuggability

### Why are the changes needed?

The current error message is not very actionable for developers who need to regenerate the error class documentation.

### Does this PR introduce _any_ user-facing change?

No API change, but the error message is changed:

**Before**
```
The error class document is not up to date. Please regenerate it.
```

**After**
```
The error class document is not up to date. Please regenerate it by running `SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "core/testOnly *SparkThrowableSuite -- -t \"Error classes match with document\""`
```

### How was this patch tested?

The existing CI should pass.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45273 from itholic/improve_error_suite_debuggability.

Authored-by: Haejoon Lee <haejoon.lee@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?

Tweak the names and text for a few errors so they read more naturally (and correctly).

### Why are the changes needed?

Just minor English improvements.

### Does this PR introduce _any_ user-facing change?

Yes, these are user-facing error messages.

### How was this patch tested?

No testing apart from CI.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45276 from nchammas/column-error-tweak.

Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
…tinuousSourceSuite

### What changes were proposed in this pull request?

This PR proposes to increase the timeout between actions in `KafkaContinuousSourceSuite`.

### Why are the changes needed?

In the macOS build, those tests fail nondeterministically; see
- https://github.com/apache/spark/actions/runs/8054862135/job/22000404856
- https://github.com/apache/spark/actions/runs/8040413156/job/21958488693
- https://github.com/apache/spark/actions/runs/8032862212/job/21942732320
- https://github.com/apache/spark/actions/runs/8024427919/job/21937366481

`KafkaContinuousSourceSuite` is specifically slow on macOS. Kafka producers send the messages correctly, but the consumers can't get the messages for some reason, and the offsets stay unavailable for a long time. This is not an issue in micro-batch mode, but I have failed to identify the difference.

I just decided to increase the timeout between actions for now; this is more of a workaround.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Manually tested in my Mac.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45283 from HyukjinKwon/SPARK-47185.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?

In the PR, I propose to remove the legacy error class `_LEGACY_ERROR_TEMP_2021` as it is an internal error.
### Why are the changes needed?

User experience improvement w/ Spark SQL.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tests already exist.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45198 from andrej-db/SPARK-43256.

Authored-by: andrej-db <andrej.gobeljic@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
…e, and reduce the resource usage

### What changes were proposed in this pull request?

This PR is a followup of #45272, #45268, #45264 and #45283 that increases timeouts further and decreases the resources needed during the CI.

### Why are the changes needed?

To make the scheduled build pass https://github.com/apache/spark/actions/runs/8054862135/job/22053180441.

At least as far as I can tell, those changes are effective (tests are less flaky and fail less often).

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

I manually ran them via the IDE.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45297 from HyukjinKwon/SPARK-47185-SPARK-47181-followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…prove the debuggability for docker integration test

### What changes were proposed in this pull request?

This PR adds test-scoped options:
  - Timeout for pulling the Docker image before the tests start. - `spark.test.docker.imagePullTimeout`
  - Timeout for container to spin up. - `spark.test.docker.startContainerTimeout`
  - Timeout for connecting the inner service in the container - `spark.test.docker.connectionTimeout`

This PR also adds logging (excluding the downloading/extracting details) for the image pulling step, which is time-consuming:

```
24/02/27 19:03:17.112 docker-java-stream--1665796424 INFO OracleIntegrationSuite: Pulling from gvenzl/oracle-free 23.3-slim
24/02/27 19:03:17.112 docker-java-stream--1665796424 INFO OracleIntegrationSuite: Pulling fs layer 5cbb6d705282
24/02/27 19:03:17.113 docker-java-stream--1665796424 INFO OracleIntegrationSuite: Pulling fs layer f1544b3116d0
24/02/27 19:03:17.113 docker-java-stream--1665796424 INFO OracleIntegrationSuite: Pulling fs layer 1dff807126c4
24/02/27 19:03:17.113 docker-java-stream--1665796424 INFO OracleIntegrationSuite: Pulling fs layer 603266ad0104
24/02/27 19:03:17.113 docker-java-stream--1665796424 INFO OracleIntegrationSuite: Pulling fs layer 10f286d1795c
24/02/27 19:03:17.113 docker-java-stream--1665796424 INFO OracleIntegrationSuite: Pulling fs layer 7c4de5471fcf
24/02/27 19:03:17.113 docker-java-stream--1665796424 INFO OracleIntegrationSuite: Waiting 603266ad0104
24/02/27 19:03:17.113 docker-java-stream--1665796424 INFO OracleIntegrationSuite: Waiting 10f286d1795c
24/02/27 19:03:17.113 docker-java-stream--1665796424 INFO OracleIntegrationSuite: Waiting 7c4de5471fcf
24/02/27 19:03:59.725 docker-java-stream--1665796424 INFO OracleIntegrationSuite: Verifying Checksum 5cbb6d705282
24/02/27 19:03:59.725 docker-java-stream--1665796424 INFO OracleIntegrationSuite: Download complete 5cbb6d705282
24/02/27 19:04:12.512 docker-java-stream--1665796424 INFO OracleIntegrationSuite: Extracting 5cbb6d705282 62.3 MiB
24/02/27 19:04:12.801 docker-java-stream--1665796424 INFO OracleIntegrationSuite: Pull complete 5cbb6d705282
24/02/27 19:04:25.905 docker-java-stream--1665796424 INFO OracleIntegrationSuite: Verifying Checksum f1544b3116d0
24/02/27 19:04:25.906 docker-java-stream--1665796424 INFO OracleIntegrationSuite: Download complete f1544b3116d0
24/02/27 19:04:39.533 docker-java-stream--1665796424 INFO OracleIntegrationSuite: Extracting f1544b3116d0 103.5 MiB
24/02/27 19:04:39.647 docker-java-stream--1665796424 INFO OracleIntegrationSuite: Pull complete f1544b3116d0
24/02/27 19:04:46.451 docker-java-stream--1665796424 INFO OracleIntegrationSuite: Verifying Checksum 10f286d1795c
24/02/27 19:04:46.452 docker-java-stream--1665796424 INFO OracleIntegrationSuite: Download complete 10f286d1795c
24/02/27 19:05:39.623 docker-java-stream--1665796424 INFO OracleIntegrationSuite: Verifying Checksum 7c4de5471fcf
24/02/27 19:05:39.623 docker-java-stream--1665796424 INFO OracleIntegrationSuite: Download complete 7c4de5471fcf
24/02/27 19:05:40.889 docker-java-stream--1665796424 INFO OracleIntegrationSuite: Verifying Checksum 1dff807126c4
24/02/27 19:05:40.890 docker-java-stream--1665796424 INFO OracleIntegrationSuite: Download complete 1dff807126c4
24/02/27 19:05:51.976 docker-java-stream--1665796424 INFO OracleIntegrationSuite: Verifying Checksum 603266ad0104
24/02/27 19:05:51.976 docker-java-stream--1665796424 INFO OracleIntegrationSuite: Download complete 603266ad0104
24/02/27 19:05:59.357 docker-java-stream--1665796424 INFO OracleIntegrationSuite: Extracting 1dff807126c4 178.3 MiB
24/02/27 19:05:59.429 docker-java-stream--1665796424 INFO OracleIntegrationSuite: Pull complete 1dff807126c4
24/02/27 19:06:10.751 docker-java-stream--1665796424 INFO OracleIntegrationSuite: Extracting 603266ad0104 110.2 MiB
24/02/27 19:06:11.117 docker-java-stream--1665796424 INFO OracleIntegrationSuite: Pull complete 603266ad0104
```

### Why are the changes needed?

Some regions might suffer from network issues with the official Docker registry.

### Does this PR introduce _any_ user-facing change?

no, dev-only

### How was this patch tested?

Docker integration tests.

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #45284 from yaooqinn/SPARK-47186.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
### What changes were proposed in this pull request?
`sameSemantics` now checks input types in the same way as vanilla PySpark.

### Why are the changes needed?
for parity

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
enabled ut

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #45300 from zhengruifeng/connect_sameSemantics_error.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>