Create a new pull request by comparing changes across two branches #1580

Merged
merged 13 commits on Nov 2, 2023

Conversation

GulajavaMinistudio
Owner

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

LuciferYang and others added 13 commits October 31, 2023 22:01
### What changes were proposed in this pull request?
This PR upgrades commons-io from 2.14.0 to 2.15.0.

### Why are the changes needed?
The `commons-io` 2.15.0 release mainly focuses on fixing bugs in file and stream handling, adding new file and stream handling features, and optimizing the performance of file content comparison:

1. Bug fixes: This version fixes multiple bugs, mainly in file and stream handling. For example, it fixes the encoding-matching issue in `XmlStreamReader` (IO-810), the failure of `FileUtils.listFiles` and `FileUtils.iterateFiles` to close their internal streams (IO-811), and the failure of `StreamIterator` to close its internal stream (IO-811). It also fixes the NullPointerException message of `RandomAccessFileMode.create(Path)` and the issue that `UnsynchronizedBufferedInputStream.read(byte[], int, int)` does not use the buffer (IO-816).

2. New features: This version adds new classes and methods such as `org.apache.commons.io.channels.FileChannels`, `RandomAccessFiles#contentEquals(RandomAccessFile, RandomAccessFile)`, `RandomAccessFiles#reset(RandomAccessFile)`, and `org.apache.commons.io.StreamIterator`. It also adds `MessageDigestInputStream` and deprecates `MessageDigestCalculatingInputStream`.

3. Performance optimization: This version optimizes the performance of `PathUtils.fileContentEquals(Path, Path, LinkOption[], OpenOption[])`, `PathUtils.fileContentEquals(Path, Path)`, and `FileUtils.contentEquals(File, File)`. According to the release notes, their performance improves by about 60%.

The full release notes are as follows:

- https://commons.apache.org/proper/commons-io/changes-report.html#a2.15.0
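
For reference, a minimal sketch of the content-comparison APIs covered by the performance work (the file names below are placeholders, not from this PR):

```scala
import java.io.File
import java.nio.file.Paths
import org.apache.commons.io.FileUtils
import org.apache.commons.io.file.PathUtils

// Compare two files byte-for-byte; these are the calls whose performance
// improved by ~60% in commons-io 2.15.0 per the release notes.
val sameFiles: Boolean = FileUtils.contentEquals(new File("a.bin"), new File("b.bin"))
val samePaths: Boolean = PathUtils.fileContentEquals(Paths.get("a.bin"), Paths.get("b.bin"))
```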

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43592 from LuciferYang/commons-io-215.

Lead-authored-by: yangjie01 <yangjie01@baidu.com>
Co-authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…om a superclass shadow symbols defined in an outer scope

### What changes were proposed in this pull request?
After the upgrade to Scala 2.13, when symbols inherited from a superclass shadow symbols defined in an outer scope, the following warning appears:
```
[error] /Users/panbingkun/Developer/spark/spark-community/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala:1315:39: reference to child is ambiguous;
[error] it is both defined in the enclosing method apply and inherited in the enclosing anonymous class as value child (defined in class IsNull)
[error] In Scala 2, symbols inherited from a superclass shadow symbols defined in an outer scope.
[error] Such references are ambiguous in Scala 3. To continue using the inherited symbol, write `this.child`.
[error] Or use `-Wconf:msg=legacy-binding:s` to silence this warning. [quickfixable]
[error] Applicable -Wconf / nowarn filters for this fatal warning: msg=<part of the message>, cat=other, site=org.apache.spark.sql.catalyst.expressions.IsUnknown.apply
[error]       override def sql: String = s"(${child.sql} IS UNKNOWN)"
[error]                                       ^
```
This PR aims to fix it.
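
For illustration, a minimal sketch of the shadowing pattern and the `this.child` fix (the class and member names are hypothetical, not Spark's actual ones):

```scala
class IsNull(val child: String) {
  def sql: String = s"($child IS NULL)"
}

object IsUnknown {
  def apply(child: String): IsNull =
    new IsNull(child) {
      // Without `this.`, `child` is ambiguous under Scala 2.13's stricter
      // rules: it is both the enclosing method's parameter and the member
      // inherited from IsNull. Qualifying it as `this.child` resolves the
      // ambiguity, which is the kind of change this PR applies.
      override def sql: String = s"(${this.child} IS UNKNOWN)"
    }
}
```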

### Why are the changes needed?
Prepare for upgrading to Scala 3.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Pass GA
- Manually test:
   ```
   build/sbt -Phadoop-3 -Pdocker-integration-tests -Pspark-ganglia-lgpl -Pkinesis-asl -Pkubernetes -Phive-thriftserver -Pconnect -Pyarn -Phive -Phadoop-cloud -Pvolcano -Pkubernetes-integration-tests Test/package streaming-kinesis-asl-assembly/assembly connect/assembly
   ```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #43593 from panbingkun/SPARK-45704.

Authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR captures the Dataset APIs used by the user code, along with their call sites in the user code, and provides better error messages.

E.g. consider the following Spark app `SimpleApp.scala`:
```scala
   1  import org.apache.spark.sql.SparkSession
   2  import org.apache.spark.sql.functions._
   3
   4  object SimpleApp {
   5    def main(args: Array[String]) {
   6      val spark = SparkSession.builder.appName("Simple Application").config("spark.sql.ansi.enabled", true).getOrCreate()
   7      import spark.implicits._
   8
   9      val c = col("a") / col("b")
  10
  11      Seq((1, 0)).toDF("a", "b").select(c).show()
  12
  13      spark.stop()
  14    }
  15  }
```

After this PR the error message contains the error context (which Spark Dataset API is called from where in the user code) in the following form:
```
Exception in thread "main" org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error.
== Dataset ==
"div" was called from SimpleApp$.main(SimpleApp.scala:9)

	at org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:201)
	at org.apache.spark.sql.catalyst.expressions.DivModLike.eval(arithmetic.scala:672)
...
```
which is similar to the already provided context in case of SQL queries:
```
org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error.
== SQL(line 1, position 1) ==
a / b
^^^^^

	at org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:201)
	at org.apache.spark.sql.errors.QueryExecutionErrors.divideByZeroError(QueryExecutionErrors.scala)
...
```

Please note that the stack trace in `spark-shell` doesn't contain meaningful elements:
```
scala> Thread.currentThread().getStackTrace.foreach(println)
java.base/java.lang.Thread.getStackTrace(Thread.java:1602)
$line15.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:23)
$line15.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:27)
$line15.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:29)
$line15.$read$$iw$$iw$$iw$$iw$$iw.<init>(<console>:31)
$line15.$read$$iw$$iw$$iw$$iw.<init>(<console>:33)
$line15.$read$$iw$$iw$$iw.<init>(<console>:35)
$line15.$read$$iw$$iw.<init>(<console>:37)
$line15.$read$$iw.<init>(<console>:39)
$line15.$read.<init>(<console>:41)
$line15.$read$.<init>(<console>:45)
$line15.$read$.<clinit>(<console>)
$line15.$eval$.$print$lzycompute(<console>:7)
$line15.$eval$.$print(<console>:6)
$line15.$eval.$print(<console>)
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
...
```
so this change doesn't help with that use case.

### Why are the changes needed?
To provide more user-friendly errors.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
Added new UTs to `QueryExecutionAnsiErrorsSuite`.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #43334 from MaxGekk/context-for-dataset-api-errors.

Lead-authored-by: Max Gekk <max.gekk@gmail.com>
Co-authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
…rolling.maxRetainedFile

**What changes were proposed in this pull request?**
The PR updates the default value of 'spark.executor.logs.rolling.maxRetainedFiles' in configuration.html on the website.

**Why are the changes needed?**
The actual default value of 'spark.executor.logs.rolling.maxRetainedFiles' is -1, but the value documented on the website is incorrect.
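
For context, a sketch of how the rolling-log settings are typically configured; the strategy and retained-file values below are illustrative only:

```scala
import org.apache.spark.SparkConf

// Illustrative values only. With the (correct) default of -1, old rolled
// executor log files are never deleted; a positive value caps how many
// rolled files each executor keeps.
val conf = new SparkConf()
  .set("spark.executor.logs.rolling.strategy", "time")
  .set("spark.executor.logs.rolling.maxRetainedFiles", "10")
```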

**Does this PR introduce any user-facing change?**
No

**How was this patch tested?**
No tests are needed; this is a documentation-only change.

**Was this patch authored or co-authored using generative AI tooling?**
No

Closes #43618 from chenyu-opensource/branch-SPARK-45751.

Authored-by: chenyu <119398199+chenyu-opensource@users.noreply.github.com>
Signed-off-by: Kent Yao <yao@apache.org>
…it `1`

### What changes were proposed in this pull request?

This PR makes `Dataset.isEmpty()` execute a global limit 1 first. `LimitPushDown` may push the global limit 1 down to lower nodes to improve query performance.

Note that we use a global limit 1 here because a local limit cannot be pushed down in the group-only case: https://github.com/apache/spark/blob/89ca8b6065e9f690a492c778262080741d50d94d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L766-L770
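
Conceptually, the emptiness check is equivalent to the following user-level sketch (not the PR's actual implementation; it assumes an active SparkSession `spark` and the tables `t1`/`t2` created in the manual test below):

```scala
// Taking a global limit of 1 before collecting lets LimitPushDown move the
// limit below the union and projections, so far less data has to be produced
// than when materializing the full result.
val df = spark.sql(
  "SELECT * FROM t1 LATERAL VIEW explode_outer(eo) AS e UNION " +
  "SELECT * FROM t2 LATERAL VIEW explode_outer(eo) AS e")
val isEmpty: Boolean = df.limit(1).collect().isEmpty
```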

### Why are the changes needed?

Improve query performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual testing:
```scala
spark.range(300000000).selectExpr("id", "array(id, id % 10, id % 100) as eo").write.saveAsTable("t1")
spark.range(100000000).selectExpr("id", "array(id, id % 10, id % 1000) as eo").write.saveAsTable("t2")
println(spark.sql("SELECT * FROM t1 LATERAL VIEW explode_outer(eo) AS e UNION SELECT * FROM t2 LATERAL VIEW explode_outer(eo) AS e").isEmpty)
```

Before this PR | After this PR
-- | --
<img width="430" alt="image" src="https://github.com/apache/spark/assets/5399861/417adc05-4160-4470-b63c-125faac08c9c"> | <img width="430" alt="image" src="https://github.com/apache/spark/assets/5399861/bdeff231-e725-4c55-9da2-1b4cd59ec8c8">

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43617 from wangyum/SPARK-45755.

Lead-authored-by: Yuming Wang <yumwang@ebay.com>
Co-authored-by: Yuming Wang <yumwang@apache.org>
Signed-off-by: Jiaan Geng <beliefer@163.com>
### What changes were proposed in this pull request?
This PR upgrades Dropwizard Metrics from 4.2.19 to 4.2.21.

### Why are the changes needed?
The new version includes the following major updates:
- dropwizard/metrics#2652
- dropwizard/metrics#3515
- dropwizard/metrics#3523
- dropwizard/metrics#3570

The full release notes are as follows:
- https://github.com/dropwizard/metrics/releases/tag/v4.2.20
- https://github.com/dropwizard/metrics/releases/tag/v4.2.21

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43608 from LuciferYang/SPARK-45743.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?

This PR aims to support `spark.deploy.driverIdPattern` for Apache Spark 4.0.0.

### Why are the changes needed?

This allows users to control the driver ID pattern.
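
A sketch of how such a pattern might be set; the key is the one added by this PR, but the pattern value below and setting it via `SparkConf` (rather than the master's configuration file) are assumptions for illustration:

```scala
import org.apache.spark.SparkConf

// Illustrative pattern only: a format string the master would use when
// generating driver IDs; the exact placeholder semantics are defined by the PR.
val conf = new SparkConf()
  .set("spark.deploy.driverIdPattern", "driver-%s-%04d")
```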

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs with the newly added test case.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43615 from dongjoon-hyun/SPARK-45753.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to upgrade zstd-jni from 1.5.5-5 to 1.5.5-7.

### Why are the changes needed?
1. Version comparison:
- v1.5.5-6 VS v1.5.5-7: luben/zstd-jni@v1.5.5-6...v1.5.5-7
- v1.5.5-5 VS v1.5.5-6: luben/zstd-jni@v.1.5.5-5...v1.5.5-6

2. Some of the changes include the following:
- Add a new method `getFrameContentSize` that also returns the error. luben/zstd-jni@3f6c55e
- Fix error. luben/zstd-jni@e0c66f1

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #43113 from panbingkun/SPARK-45327.

Authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
### What changes were proposed in this pull request?

This PR aims to support `spark.deploy.appIdPattern` for Apache Spark 4.0.0.

### Why are the changes needed?

This allows users to control the application ID pattern.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs with the newly added test case.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43616 from dongjoon-hyun/SPARK-45754.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…n only when it exists

### What changes were proposed in this pull request?

This PR aims to improve `MasterPage` to show `Resource` column only when it exists.

### Why are the changes needed?

For non-GPU clusters, the `Resource` column is always empty.

### Does this PR introduce _any_ user-facing change?

After this PR, `MasterPage` still shows the `Resource` column if resources exist, like the following.

![Screenshot 2023-11-01 at 11 02 43 AM](https://github.com/apache/spark/assets/9700541/104dd4e7-938b-4269-8952-512e8fb5fa39)

If there are no resources on any worker, the `Resource` column is omitted.

![Screenshot 2023-11-01 at 11 03 20 AM](https://github.com/apache/spark/assets/9700541/12c9d4b2-330a-4e36-a6eb-ac2813e0649a)

### How was this patch tested?

Manual test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43628 from dongjoon-hyun/SPARK-45763.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?

Introduce a new `ReleaseSession` Spark Connect RPC, which cancels everything running in the session and removes the session on the server side. Refactor the code that manages the session cache into `SparkConnectSessionManager`.

### Why are the changes needed?

Better session management.

### Does this PR introduce _any_ user-facing change?

Not really. The `SparkSession.stop()` API already existed on the client side. It closed the client's network connection, but the session remained cached for 1 hour on the server side.

Caveats (which were not really supported user behaviour):
* After `session.stop()`, the user could create a new session with the same session_id in the configuration. That session would be a new session on the client side, but it would connect to the old cached session on the server and could therefore access that old session's state, e.g. views or artifacts.
* If a session timed out and was removed on the server, a new request used to re-create the session. The client would see this as the old session, but the server would see a new one and, for example, would not have access to the old session state that was removed.
* The user is no longer allowed to create a new session with the same session_id as before.
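
From the client's perspective the flow is unchanged; a minimal sketch of the Spark Connect Scala client usage (the remote address is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

// Sketch of the client-side flow only.
val spark = SparkSession.builder().remote("sc://localhost:15002").getOrCreate()
spark.range(10).count()
// After this PR, stop() also triggers the new ReleaseSession RPC, cancelling
// anything still running and removing the session state on the server.
spark.stop()
```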

### How was this patch tested?

Tests added.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43546 from juliuszsompolski/release-session.

Lead-authored-by: Juliusz Sompolski <julek@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?

This PR aims to upgrade `Volcano` to 1.8.1 in K8s integration test document and GitHub Action job.

### Why are the changes needed?

To bring in the latest features and bug fixes, in addition to test coverage for Volcano scheduler 1.8.1.

- https://github.com/volcano-sh/volcano/releases/tag/v1.8.1
  - volcano-sh/volcano#3101

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43624 from dongjoon-hyun/SPARK-45761.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…ction for Scala Array to wrap into `immutable.ArraySeq`

### What changes were proposed in this pull request?
Currently, we need to use `immutable.ArraySeq.unsafeWrapArray(array)` to wrap an Array into an `immutable.ArraySeq`, which makes the code look bloated.
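
For comparison, the current wrapping pattern looks like this (a minimal, self-contained sketch):

```scala
import scala.collection.immutable

val arr = Array(1, 2, 3)
// Existing approach: an explicit call on the companion object at every use site.
val wrapped: immutable.ArraySeq[Int] = immutable.ArraySeq.unsafeWrapArray(arr)
```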

So this PR introduces an implicit function, `toImmutableArraySeq`, to make it easier to wrap a Scala Array into an `immutable.ArraySeq`.

After this PR, we can wrap an array into an `immutable.ArraySeq` as follows:

```scala
import org.apache.spark.util.ArrayImplicits._

val dataArray = ...
val immutableArraySeq = dataArray.toImmutableArraySeq
```

At the same time, this PR replaces the existing uses of `immutable.ArraySeq.unsafeWrapArray(array)` with the new method.

This implicit function will also help the work in SPARK-45686 and SPARK-45687.

### Why are the changes needed?
Makes the code for wrapping a Scala Array into an `immutable.ArraySeq` look less bloated.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43607 from LuciferYang/SPARK-45742.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>