Create a new pull request by comparing changes across two branches #1576

Merged
merged 18 commits on Oct 25, 2023

Conversation

GulajavaMinistudio
Owner

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

dongjoon-hyun and others added 18 commits October 23, 2023 23:47
…ecodeVersion` rule

### What changes were proposed in this pull request?

This PR aims to remove an exclusion for `threeten-extra` dependency in `enforceBytecodeVersion` rule.

### Why are the changes needed?

The `threeten-extra` library poses no problem for Apache Spark 4.0.0 because Spark's minimum JDK is 17.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43504 from dongjoon-hyun/SPARK-44032.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
… and offset > rowCount

### What changes were proposed in this pull request?

This fixes a failure that occurs when a function backed by `FramelessOffsetWindowFunctionFrame` is used with `ignoreNulls = true` and `offset > rowCount`.

e.g.

```
select x, lead(x, 5) IGNORE NULLS over (order by x) from (select explode(sequence(1, 3)) x)
```
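
For reference, a minimal Scala sketch of the same query and the expected post-fix behavior (assuming an active `SparkSession` named `spark`; the output is illustrative, showing that an out-of-range offset yields NULL rather than a runtime failure):

```scala
// lead(x, 5) over only 3 rows: with the fix, every row gets a NULL lead value
// instead of the query failing at runtime.
val df = spark.sql(
  """SELECT x, lead(x, 5) IGNORE NULLS OVER (ORDER BY x) AS lead_x
    |FROM (SELECT explode(sequence(1, 3)) AS x)""".stripMargin)
df.show()
// +---+------+
// |  x|lead_x|
// +---+------+
// |  1|  NULL|
// |  2|  NULL|
// |  3|  NULL|
// +---+------+
```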

### Why are the changes needed?

Fix existing bug

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Modify existing unit test to cover this case

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43236 from vitaliili-db/SPARK-45430.

Authored-by: Vitalii Li <vitalii.li@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…th `mutable.MapOps#filterInPlace`

### What changes were proposed in this pull request?
This PR replaces `s.c.mutable.MapOps#retain` with `s.c.mutable.MapOps#filterInPlace`, because `retain` has been marked as deprecated since Scala 2.13.0.

```scala
  @deprecated("Use filterInPlace instead", "2.13.0")
  @inline final def retain(p: (K, V) => Boolean): this.type = filterInPlace(p)
```
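
A minimal sketch of the mechanical change on a throwaway map (not actual Spark code):

```scala
import scala.collection.mutable

val cache = mutable.Map("a" -> 1, "b" -> 2, "c" -> 3)

// Before (deprecated since Scala 2.13.0):
// cache.retain((_, v) => v > 1)

// After: filterInPlace keeps only the entries satisfying the predicate.
cache.filterInPlace((_, v) => v > 1)
// cache is now Map("b" -> 2, "c" -> 3)
```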

### Why are the changes needed?
Clean up deprecated API usage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43482 from LuciferYang/retain2filterInPlace.

Lead-authored-by: yangjie01 <yangjie01@baidu.com>
Co-authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
…_PART_NAMESPACE

### What changes were proposed in this pull request?
This PR converts `_LEGACY_ERROR_TEMP_1055` to `REQUIRES_SINGLE_PART_NAMESPACE`.
In addition, the variable name in the original error class message template was wrong (`quoted` instead of `database`); that is fixed as part of the conversion.
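
A hedged sketch of the kind of statement that should now report the new error class (the namespace name is illustrative, assuming the built-in session catalog, which only supports single-part namespaces):

```scala
// Against the session catalog, a multi-part namespace should now fail with
// REQUIRES_SINGLE_PART_NAMESPACE rather than the legacy _LEGACY_ERROR_TEMP_1055
// message (illustrative, assuming `spark` is an active SparkSession).
spark.sql("CREATE NAMESPACE spark_catalog.ns1.ns2")
// org.apache.spark.sql.AnalysisException: [REQUIRES_SINGLE_PART_NAMESPACE] ...
```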

### Why are the changes needed?
Fix bug.

- Before:
   <img width="1327" alt="image" src="https://github.com/apache/spark/assets/15246973/e7a59837-4f14-4f09-872a-913d78006ede">

- After:
   <img width="1164" alt="image" src="https://github.com/apache/spark/assets/15246973/5d3928bf-e580-44b1-a86d-55654686e8d5">

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Pass GA.
- Manually test:
```
./build/sbt "hive/testOnly org.apache.spark.sql.hive.execution.command.CreateNamespaceSuite -- -z \"REQUIRES_SINGLE_PART_NAMESPACE\""

[info] CreateNamespaceSuite:
[info] - CREATE NAMESPACE using Hive V1 catalog V1 command: REQUIRES_SINGLE_PART_NAMESPACE (392 milliseconds)
[info] - CREATE NAMESPACE using Hive V1 catalog V2 command: REQUIRES_SINGLE_PART_NAMESPACE (3 milliseconds)
15:24:14.648 WARN org.apache.hadoop.hive.metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
15:24:14.648 WARN org.apache.hadoop.hive.metastore.ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore panbingkun172.24.144.16
15:24:14.654 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException
15:24:14.795 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
15:24:15.007 WARN org.apache.hadoop.hive.ql.session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
[info] Run completed in 4 seconds, 819 milliseconds.
[info] Total number of tests run: 2
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 2, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 186 s (03:06), completed Oct 24, 2023, 3:24:15 PM
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #43479 from panbingkun/SPARK-45626.

Authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?

This PR adds content for the application start time on AllJobsPage.

### Why are the changes needed?

The application start time is currently shown only as a raw long value on the Environment tab, which is not convenient for users to read. Showing it on the AllJobsPage is also useful, for example to identify a long pause before job 0 gets scheduled.

### Does this PR introduce _any_ user-facing change?

yes, UI updates

### How was this patch tested?

locally tested

![Screenshot 2023-10-24 at 13 44 19](https://github.com/apache/spark/assets/8326978/d0a7ccf5-43e5-4780-b937-e26ca9a91d36)

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #43495 from yaooqinn/SPARK-45641.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-44752

### Why are the changes needed?
The XML data source is already supported, but the XML example and documentation page are not yet available.
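
A hedged sketch of what the XML read/write example might look like (option names follow the XML data source conventions; the file paths are illustrative):

```scala
// Read XML, treating each <person> element as one row (illustrative path).
val people = spark.read
  .format("xml")
  .option("rowTag", "person")
  .load("examples/src/main/resources/people.xml")
people.show()

// Write the DataFrame back out as XML (illustrative path and tags).
people.write
  .format("xml")
  .option("rootTag", "people")
  .option("rowTag", "person")
  .save("/tmp/people_xml")
```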

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Followed the annotation and comment pattern of the other data source example methods, and ran the new examples manually via 'Run' in IntelliJ IDEA.

### Was this patch authored or co-authored using generative AI tooling?
No; it was written by adapting the existing JSON and CSV examples.

Closes #43350 from laglangyue/xml_example_doc.

Lead-authored-by: tangjiafu <jiafu.tang@qq.com>
Co-authored-by: laglangyue <jiafu.tang@qq.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?

Change the MySQL dialect to convert Catalyst TINYINT into MySQL TINYINT rather than BYTE and INTEGER; BYTE does not exist in MySQL. The same applies to MsSqlServerDialect.
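
A hedged sketch of the dialect-level mapping this describes, in the style of the `JdbcDialect` API (the actual change may differ in detail):

```scala
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcType}
import org.apache.spark.sql.types.{ByteType, DataType}

// Sketch only: map Catalyst ByteType (TINYINT) to the database's TINYINT
// rather than the non-existent BYTE type.
case object TinyIntDialectSketch extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:mysql")

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case ByteType => Some(JdbcType("TINYINT", Types.TINYINT))
    case _ => None // let Spark fall back to the common JDBC type mapping
  }
}
```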

### Why are the changes needed?

Since BYTE type does not exist in MySQL, any casts that could be pushed down involving BYTE type would fail.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

UT pass.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43390 from michaelzhan-db/SPARK-45561.

Lead-authored-by: Michael Zhang <m.zhang@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?

Remove the following hardcoded time variables, which only exist in Hive versions prior to 2.0:

```
hive.stats.jdbc.timeout
hive.stats.retries.wait
```

### Why are the changes needed?

It's kind of a cleanup since Spark 4.0 only supports Hive 2.0 and above.

The removal also reduces the warning message on `spark-sql` bootstrap.

### Does this PR introduce _any_ user-facing change?

Yes, it reduces the warning message on `spark-sql` bootstrap.

```patch
  ➜  $ build/sbt clean package -Phive-thriftserver
  ➜  $ SPARK_PREPEND_CLASSES=true bin/spark-sql
  NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly.
  23/10/24 15:42:22 WARN Utils: Your hostname, pop-os resolves to a loopback address: 127.0.1.1; using 10.221.99.150 instead (on interface wlp61s0)
  23/10/24 15:42:22 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
  To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
  23/10/24 15:42:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
- 23/10/24 15:42:25 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
- 23/10/24 15:42:25 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
  23/10/24 15:42:28 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
  23/10/24 15:42:28 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore chengpan127.0.1.1
  23/10/24 15:42:28 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
  Spark Web UI available at http://10.221.99.150:4040
  Spark master: local[*], Application Id: local-1698133344448
  spark-sql (default)>
```

### How was this patch tested?

Pass GA and manually test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43506 from pan3793/SPARK-45646.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
… not implement getSchema

### What changes were proposed in this pull request?

This is a small follow-up of #43261. When a `FileSystem` does not implement `getSchema`, we fail earlier than before after #43261. This PR restores the original timing of the exception by skipping the new optimization when `getSchema` fails.
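
A hedged sketch of the pattern described (the commit text calls the method `getSchema`; Hadoop's `FileSystem` exposes `getScheme`, which throws `UnsupportedOperationException` when not implemented, so the names below are assumptions):

```scala
import scala.util.Try
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative only: try the new fast path, and if the FileSystem call fails,
// skip the optimization so the failure surfaces at the same point as before #43261.
val fs: FileSystem = new Path("/tmp").getFileSystem(new Configuration())
Try(fs.getScheme).toOption match {
  case Some(scheme) => println(s"apply the new optimization for scheme $scheme")
  case None         => println("skip the optimization; keep the old code path")
}
```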

### Why are the changes needed?

fix small behavior change

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Reverted the test changes in #43261; the tests still pass.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43508 from cloud-fan/follow.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…`s.c.mutable.MapOps#mapValuesInPlace`

### What changes were proposed in this pull request?
This PR replaces `s.c.mutable.MapOps#transform` with `s.c.mutable.MapOps#mapValuesInPlace`, because `transform` has been marked as deprecated since Scala 2.13.0.

```scala
  @deprecated("Use mapValuesInPlace instead", "2.13.0")
  @inline final def transform(f: (K, V) => V): this.type = mapValuesInPlace(f)
```
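
A minimal sketch of the mechanical change on a throwaway map (not actual Spark code):

```scala
import scala.collection.mutable

val counters = mutable.Map("a" -> 1, "b" -> 2)

// Before (deprecated since Scala 2.13.0):
// counters.transform((_, v) => v * 10)

// After: mapValuesInPlace rewrites every value in place.
counters.mapValuesInPlace((_, v) => v * 10)
// counters is now Map("a" -> 10, "b" -> 20)
```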

### Why are the changes needed?
Clean up deprecated API usage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43500 from LuciferYang/transform-2-mapValuesInPlace.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…d API

### What changes were proposed in this pull request?

This PR adds a new Python data source API and adds initial execution support for data source read.

Here is an overview of the Python data source API. Please see the code (`python/pyspark/sql/datasource.py`) for more details.
```python
class DataSource(ABC):
    def __init__(self, options: Dict):
        self.options = options

    @property
    def name(self) -> str:
        ...

    def schema(self) -> Union[StructType, str]:
        ...

    def reader(self, schema: StructType) -> "DataSourceReader":
        ...

    # TODO: SPARK-45525
    def writer(self, schema: StructType, saveMode: str) -> "DataSourceWriter":
        ...
```
DataSourceReader (this PR):
```python
class DataSourceReader(ABC):
    @abstractmethod
    def read(self, partition) -> Iterator:
        ...

    def partitions(self) -> Iterator[Any]:
        ...
```

Note this PR does not support reading a Python data source using `spark.read.format(...).load()`. This will be supported in a subsequent PR: [SPARK-45639](https://issues.apache.org/jira/browse/SPARK-45639).

### Why are the changes needed?

To make Spark more accessible to Python users. For more information, please see this [SPIP](https://docs.google.com/document/d/1oYrCKEKHzznljYfJO4kx5K_Npcgt1Slyfph3NEk7JRU/edit?usp=sharing).

### Does this PR introduce _any_ user-facing change?

Yes. Users can use the new Python data source API to create their own data source.

### How was this patch tested?

New UTs.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43360 from allisonwang-db/spark-45524-data-source-read-api.

Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?

This is a follow-up of #43356.

Refactor the null-checking to have shortcuts.

### Why are the changes needed?

The null-check can have shortcuts for some cases.
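
A purely illustrative sketch of what such a shortcut can look like (not the actual code from this PR): skip the per-field check entirely when no field is non-nullable, and stop at the first offending field otherwise.

```scala
import org.apache.spark.sql.types.StructType

// Returns the name of the first non-nullable field that holds a null, if any.
def firstNullViolation(row: Seq[Any], schema: StructType): Option[String] = {
  if (schema.fields.forall(_.nullable)) {
    None // shortcut: nothing to verify
  } else {
    schema.fields.iterator.zip(row.iterator).collectFirst {
      case (field, null) if !field.nullable => field.name
    }
  }
}
```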

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43492 from ueshin/issues/SPARK-45523/nullcheck.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Add a conf to set RocksDB State Store Instance's compression type, with default to be LZ4

### Why are the changes needed?
LZ4 is generally faster than Snappy, which is probably why we already use LZ4 by default for changelogs and in other places, yet we have never changed RocksDB's default compression style of Snappy. The RocksDB team recommends LZ4 and/or ZSTD (https://github.com/facebook/rocksdb/wiki/Setup-Options-and-Basic-Tuning#compression), and Snappy remains their default only for backward-compatibility reasons. We should make the compression type tunable, with LZ4 as the default.
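
A hedged sketch of how such a setting would typically be applied; the exact config key below is an assumption based on this description, not confirmed here (RocksDB state store options are usually exposed as `spark.sql.streaming.stateStore.rocksdb.*` entries):

```scala
// Assumed key name: override the compression type if the LZ4 default is not desired.
spark.conf.set("spark.sql.streaming.stateStore.rocksdb.compression", "zstd")
```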

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests should cover it. Some benchmarks were run and show positive improvements.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #43338 from siying/rocksdb_lz4.

Authored-by: Siying Dong <siying.dong@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This PR fixes the flaky ProtobufCatalystDataConversionSuite.

As can be seen from the GA test below
https://github.com/panbingkun/spark/actions/runs/6612141762/job/17982780917
<img width="988" alt="image" src="https://github.com/apache/spark/assets/15246973/fd1931ba-d858-4c59-9d5c-3a1353e3e005">

When `data.get(0)` is null, evaluating `data.get(0).asInstanceOf[Array[Byte]].isEmpty` throws a `java.lang.NullPointerException`.

<img width="491" alt="image" src="https://github.com/apache/spark/assets/15246973/4cf83e36-1c4a-4b65-b119-1152b026ed29">

### Why are the changes needed?
Fix flaky ProtobufCatalystDataConversionSuite.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.
Manually test:
```
./build/sbt "protobuf/testOnly org.apache.spark.sql.protobuf.ProtobufCatalystDataConversionSuite"
```
<img width="763" alt="image" src="https://github.com/apache/spark/assets/15246973/95e76b2e-5b09-4268-81bf-df20dd262b66">

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #43493 from panbingkun/SPARK-45640.

Authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
With a manual trigger, the workflow can be executed manually after a fix to the workflow has been merged to master. It also allows running the workflow on only a subset of branches (e.g. those that failed).

### Why are the changes needed?
Sometimes, publishing snapshots fails. If a fix to the workflow file is needed, that change can only be tested by waiting for the next day's cron event to trigger the next publish. This is quite a long turnaround for testing fixes to the workflow (see #43364).

### Does this PR introduce _any_ user-facing change?
No, this is purely build CI related.

### How was this patch tested?
This can only be tested on master. The GitHub workflow syntax was tested in a private repo.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43512 from EnricoMi/publish-snapshot-manually.

Authored-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add `sql/api` and `common/utils` to `modules.py`

### Why are the changes needed?
New modules should be covered in `modules.py`; otherwise, related tests may be wrongly skipped in some cases.

### Does this PR introduce _any_ user-facing change?
no, infra-only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #43501 from zhengruifeng/infra_sql_api.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…rtConf

### What changes were proposed in this pull request?

This change ensures that RPC SSL options settings inheritance works properly after #43238 - we pass `sslOptions` wherever we call `fromSparkConf`.

In addition to that minor mechanical change, this PR duplicates or adds tests for every place that calls this method, adding a test case that runs with SSL support enabled in the config.
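
A hypothetical sketch of the call-site pattern described above; the exact signature of `SparkTransportConf.fromSparkConf` and the `sslOptions` parameter are assumptions based on this description rather than verified code:

```scala
// Assumed signature: pass the SSL options explicitly wherever a TransportConf
// is created, so the RPC SSL settings are inherited by that module.
val transportConf = SparkTransportConf.fromSparkConf(
  sparkConf,
  module = "shuffle",
  numUsableCores = 0,
  sslOptions = sslOptionsForModule) // hypothetical parameter and value
```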

### Why are the changes needed?

These changes are needed to ensure that the RPC SSL functionality can work properly with settings inheritance. In addition, through these tests we can ensure that any changes to these modules are also tested with SSL support and avoid regressions in the future.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Full integration testing also done as part of #42685

Added some tests and ran them:

```
build/sbt
> project core
> testOnly org.apache.spark.*Ssl*
> testOnly org.apache.spark.network.netty.NettyBlockTransferSecuritySuite
```

and

```
build/sbt -Pyarn
> project yarn
> testOnly org.apache.spark.network.yarn.SslYarnShuffleServiceWithRocksDBBackendSuite
```

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43387 from hasnain-db/spark-tls-integrate-everywhere.

Authored-by: Hasnain Lakhani <hasnain.lakhani@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
…ame name on different datasets

### What changes were proposed in this pull request?

Fixes observation when named observations with the same name on different datasets.

### Why are the changes needed?

Currently, if there are observations with the same name on different datasets, one of them is overwritten by the other's execution.

For example,

```py
>>> observation1 = Observation("named")
>>> df1 = spark.range(50)
>>> observed_df1 = df1.observe(observation1, count(lit(1)).alias("cnt"))
>>>
>>> observation2 = Observation("named")
>>> df2 = spark.range(100)
>>> observed_df2 = df2.observe(observation2, count(lit(1)).alias("cnt"))
>>>
>>> observed_df1.collect()
...
>>> observed_df2.collect()
...
>>> observation1.get
{'cnt': 50}
>>> observation2.get
{'cnt': 50}
```

`observation2` should return `{'cnt': 100}`.

### Does this PR introduce _any_ user-facing change?

Yes, the observations with the same name will be available if they observe different datasets.

### How was this patch tested?

Added the related tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43519 from ueshin/issues/SPARK-45656/observation.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>