
Create a new pull request by comparing changes across two branches #1664

Merged
merged 28 commits into from
Jul 26, 2024

Conversation

GulajavaMinistudio
Owner

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

WeichenXu123 and others added 28 commits July 23, 2024 19:19
…n spark ML reader/writer

### What changes were proposed in this pull request?

`SparkSession.getActiveSession` returns a thread-local session, but the Spark ML reader/writer may execute in a different thread, which causes `SparkSession.getActiveSession` to return None.

### Why are the changes needed?

It fixes bugs like the following:
```
        spark = SparkSession.getActiveSession()
>       spark.createDataFrame(  # type: ignore[union-attr]
            [(metadataJson,)], schema=["value"]
        ).coalesce(1).write.text(metadataPath)
E       AttributeError: 'NoneType' object has no attribute 'createDataFrame'
```
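
As a hedged illustration (a defensive pattern, not necessarily the exact change in this PR), the reader/writer can fall back to the default session when the current thread has no active one:

```py
from pyspark.sql import SparkSession


def _get_session():
    # SparkSession.getActiveSession() is thread-local, so a helper thread may
    # see None; falling back to getOrCreate() avoids the AttributeError above.
    spark = SparkSession.getActiveSession()
    if spark is None:
        spark = SparkSession.builder.getOrCreate()
    return spark
```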

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47453 from WeichenXu123/SPARK-48970.

Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
…er in log

### What changes were proposed in this pull request?

This PR follows up on #47145 to rename the log field.

### Why are the changes needed?

`line_no` is not very intuitive, so it is better to rename it to `line_number` explicitly.

### Does this PR introduce _any_ user-facing change?

No API change, but the user-facing log message will be improved.

### How was this patch tested?
The existing CI should pass

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47437 from itholic/logger_followup.

Authored-by: Haejoon Lee <haejoon.lee@databricks.com>
Signed-off-by: Haejoon Lee <haejoon.lee@databricks.com>
…inted RDDs

### What changes were proposed in this pull request?

This pull request proposes logging a warning message when the `.unpersist()` method is called on RDDs that have been locally checkpointed. The goal is to inform users about the potential risks associated with unpersisting locally checkpointed RDDs without changing the current behavior of the method.

### Why are the changes needed?

Local checkpointing truncates the lineage of an RDD, preventing it from being recomputed from its source. If a locally checkpointed RDD is unpersisted, it loses its data and cannot be regenerated, potentially leading to job failures if subsequent actions or transformations are attempted on the RDD (which was seen on some user workloads). Logging a warning message helps users avoid such pitfalls and aids in debugging.
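
A hedged sketch of the scenario the new warning targets:

```py
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize(range(10)).map(lambda x: x * x)

rdd.localCheckpoint()  # truncates the lineage; the data now lives only in cached blocks
rdd.count()            # materializes the local checkpoint

# The warning added by this PR is logged here: once the blocks are dropped the
# lineage is gone, so later actions on `rdd` may fail.
rdd.unpersist()
```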

### Does this PR introduce _any_ user-facing change?

Yes, this PR adds a warning log message when .unpersist() is called on a locally checkpointed RDD, but it does not alter any existing behavior.

### How was this patch tested?

This PR does not change any existing behavior and therefore no tests are added.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47391 from mingkangli-db/warning_unpersist.

Authored-by: Mingkang Li <mingkang.li@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
### What changes were proposed in this pull request?

Fix breaking change in `fromJson` method by having default param values.

### Why are the changes needed?

In order to not break clients that don't care about collations.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing UTs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46737 from stefankandic/fromJsonBreakingChange.

Authored-by: Stefan Kandic <stefan.kandic@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…d of `SQLContext.implicits`

### What changes were proposed in this pull request?
This PR replaces `SQLContext.implicits`  with `SparkSession.implicits` in the Spark codebase.

### Why are the changes needed?
Reduce the usage of code from `SQLContext` within the internal code of Spark.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47457 from LuciferYang/use-sparksession-implicits.

Lead-authored-by: yangjie01 <yangjie01@baidu.com>
Co-authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
### What changes were proposed in this pull request?
The PR aims to make `curl` retry 3 times in `bin/mvn`.

### Why are the changes needed?
Avoid the following issues:
https://github.com/panbingkun/spark/actions/runs/10067831390/job/27832101470
<img width="993" alt="image" src="https://github.com/user-attachments/assets/3fa9a59a-82cb-4e99-b9f7-d128f051d340">

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Continuous manual observation.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47465 from panbingkun/SPARK-48987.

Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…llations

### What changes were proposed in this pull request?

Fixes a bug where, because string interpolation works differently in Python, an accidental dollar sign ended up in the string value.
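
A small hedged reproduction of the bug class (the concrete string in the Spark code differs; this just shows how a Scala-style `$` survives into a Python f-string):

```py
collation_name = "UTF8_BINARY"

# Scala: s"collate($collationName)" -- the '$' is consumed by interpolation.
# Python: the '$' is a literal character, so it leaks into the value.
buggy = f"collate(${collation_name})"
fixed = f"collate({collation_name})"

print(buggy)  # collate($UTF8_BINARY)
print(fixed)  # collate(UTF8_BINARY)
```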

### Why are the changes needed?

To be consistent with the Scala code.

### Does this PR introduce _any_ user-facing change?

A different value will be shown to the user.

### How was this patch tested?

Unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47463 from stefankandic/fixPythonToString.

Authored-by: Stefan Kandic <stefan.kandic@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The PR aims to improve the docs related to `variable` (a usage sketch follows this list), including:
- `docs/sql-ref-syntax-aux-set-var.md`, show the `primitive` error messages.
- `docs/sql-ref-syntax-ddl-declare-variable.md`, add usage of `DECLARE OR REPLACE`.
- `docs/sql-ref-syntax-ddl-drop-variable.md`, show the `primitive` error messages and fix `typo`.
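
A minimal usage sketch of the documented statements, run through `spark.sql` (the variable name is illustrative; exact error messages depend on the Spark version):

```py
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("DECLARE OR REPLACE VARIABLE my_var INT DEFAULT 5")
spark.sql("SET VAR my_var = 10")
spark.sql("SELECT my_var").show()
spark.sql("DROP TEMPORARY VARIABLE my_var")
```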

### Why are the changes needed?
Only improve docs.

### Does this PR introduce _any_ user-facing change?
Yes, make end-user docs clearer.

### How was this patch tested?
Manually test.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47460 from panbingkun/SPARK-48976.

Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…nition from `protobuf`

### What changes were proposed in this pull request?
This PR removes the unused object definition `ScalaReflectionLock` from the `protobuf` module. `ScalaReflectionLock` is a definition scoped to the `protobuf` package; it was introduced in SPARK-40654 | #37972 and became unused in SPARK-41639 | #39147.

### Why are the changes needed?
Clean up unused definitions.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47459 from LuciferYang/remove-ScalaReflectionLock.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…intenance task

### What changes were proposed in this pull request?

Currently, during the state store maintenance process, we find which old version files of the **RocksDB** state store to delete by listing all existing snapshotted version files in the checkpoint directory every 1 minute by default. The frequent list calls in the cloud can result in high costs. To address this concern and reduce the cost associated with state store maintenance, we should aim to minimize the frequency of listing object stores inside the maintenance task. To minimize the frequency, we will try to accumulate versions to delete and only call list when the number of versions to delete reaches a configured threshold.

The changes include:
1.  Adding a new configuration variable `ratioExtraVersionsAllowedInCheckpoint` in **SQLConf**. This determines the ratio of extra version files we want to retain in the checkpoint directory compared to the number of versions to retain for rollbacks (`minBatchesToRetain`).
2. Using this config and `minBatchesToRetain`, set the `minVersionsToDelete` config inside **StateStoreConf** to `minVersionsToDelete = ratioExtraVersionsAllowedInCheckpoint * minBatchesToRetain` (see the sketch after this list).
3.  Using `minSeenVersion` and `maxSeenVersion` variables in **RocksDBFileManager** to estimate min/max version present in directory and control deletion frequency. This is done by ensuring number of stale versions to delete is at least `minVersionsToDelete`
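
A small worked example of the threshold arithmetic, with illustrative values (the shorthand names below are not the actual SQLConf keys):

```py
min_batches_to_retain = 100                       # versions retained for rollbacks
ratio_extra_versions_allowed_in_checkpoint = 0.3  # new config introduced by this PR

# The list call (and batch deletion) is triggered only once the estimated
# number of stale versions reaches this threshold, instead of every minute.
min_versions_to_delete = ratio_extra_versions_allowed_in_checkpoint * min_batches_to_retain
print(min_versions_to_delete)  # 30.0
```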

### Why are the changes needed?

Currently, maintenance operations like snapshotting, purging, deletion, and file management are done asynchronously for each data partition. We want to shift away from periodic deletion and instead rely on the estimated number of files in the checkpoint directory, to reduce list calls and introduce batch deletion.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47393 from riyaverm-db/reduce-cloud-store-list-api-cost-in-maintenance.

Authored-by: Riya Verma <riya.verma@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
… consistent with JVM

### What changes were proposed in this pull request?

This PR proposes to make the parameter naming of `PySparkException` consistent with JVM

### Why are the changes needed?

The parameter names of `PySparkException` are different from `SparkException` so there is an inconsistency when searching those parameters from error logs.

SparkException:
https://github.com/apache/spark/blob/6508b1f5e18731359354af0a7bcc1382bc4f356b/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L27-L33

PySparkException:
https://github.com/apache/spark/blob/6508b1f5e18731359354af0a7bcc1382bc4f356b/python/pyspark/errors/exceptions/base.py#L29-L40

### Does this PR introduce _any_ user-facing change?

The error parameter names are changed as follows (see the sketch after this list):
- `error_class` -> `errorClass`
- `message_parameters` -> `messageParameters`
- `query_contexts` -> `context`
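
A minimal illustration of where these names surface on the Python side (the accessors shown are existing public PySpark API; the rename concerns the underlying parameter and field names):

```py
from pyspark.errors import PySparkException
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
try:
    spark.sql("SELECT * FROM nonexistent_table").collect()
except PySparkException as e:
    # With this change the stored parameters use the JVM-style names
    # (errorClass, messageParameters), so searching error logs is consistent
    # across Scala and Python.
    print(e.getErrorClass())
    print(e.getMessageParameters())
```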

### How was this patch tested?

The existing CI should pass

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47436 from itholic/SPARK-48961.

Authored-by: Haejoon Lee <haejoon.lee@databricks.com>
Signed-off-by: Haejoon Lee <haejoon.lee@databricks.com>
…Collation` expression itself in UT

### What changes were proposed in this pull request?
The PR aims to:
- make `checkEvaluation` directly check the `Collation` expression itself in UTs, rather than `Collation(...).replacement`.
- fix a missed check in a UT.

### Why are the changes needed?
When checking the `RuntimeReplaceable` expression in UT, there is no need to write as `checkEvaluation(Collation(Literal("abc")).replacement, "UTF8_BINARY")`, because it has already undergone a similar replacement internally.
https://github.com/apache/spark/blob/1a428c1606645057ef94ac8a6cadbb947b9208a6/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala#L75

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Updated existing UTs.
- Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47401 from panbingkun/SPARK-48935.

Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?

Check whether variable declarations appear only at the beginning of a BEGIN ... END block.

### Why are the changes needed?

The SQL standard states that variables can be declared only immediately after BEGIN.
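
A hedged sketch of the rule being enforced (script text only; the exact error raised for the invalid case is not shown here):

```py
# Accepted: declarations immediately after BEGIN.
valid_script = """
BEGIN
  DECLARE x INT DEFAULT 0;
  SET VAR x = x + 1;
  SELECT x;
END
"""

# Rejected after this change: a declaration that is not at the top of the block.
invalid_script = """
BEGIN
  SELECT 1;
  DECLARE x INT DEFAULT 0;
END
"""
```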

### Does this PR introduce _any_ user-facing change?

Users will get an error if they try to declare a variable in a scope that does not start with BEGIN and end with END, or if the declarations are not placed immediately after BEGIN.

### How was this patch tested?

Tests are in SqlScriptingParserSuite. There are two tests for now: one where declarations are correctly written, and one where declarations are not written immediately after BEGIN. There is a TODO to write a test where the declaration is located in a scope that is not BEGIN ... END.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47404 from momcilomrk-db/check_variable_declarations.

Authored-by: Momcilo Mrkaic <momcilo.mrkaic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
… the actual StreamingQueryProgress

This reverts commit d067fc6, which reverted 042804a, essentially bringing it back. 042804a failed the 3.5 client <> 4.0 server test, but that test was turned off for cross-version testing in #47468.

### What changes were proposed in this pull request?

This PR is created after discussion in this closed one: #46886
I was trying to fix a bug (in Connect, query.lastProgress doesn't have `numInputRows`, `inputRowsPerSecond`, and `processedRowsPerSecond`), and we reached the conclusion that what is proposed in this PR should be the ultimate fix.

In Python, for both classic Spark and Spark Connect, the return type of `lastProgress` is `Dict` (and `recentProgress` is `List[Dict]`), but in Scala it's the actual `StreamingQueryProgress` object:
https://github.com/apache/spark/blob/1a5d22aa2ffe769435be4aa6102ef961c55b9593/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQuery.scala#L94-L101

This API discrepancy brings some confusion, like in Scala, users can do `query.lastProgress.batchId`, while in Python they have to do `query.lastProgress["batchId"]`.

This PR makes `StreamingQuery.lastProgress` to return the actual `StreamingQueryProgress` (and `StreamingQuery.recentProgress` to return `List[StreamingQueryProgress]`).

To prevent a breaking change, `StreamingQueryProgress` is extended to be a subclass of `dict`, so existing code that uses dictionary-style access (e.g. `query.lastProgress["id"]`) still works.
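
A hedged sketch of the resulting usage (the `rate`/`noop` sources are used only to get a runnable query; attribute names mirror the Scala progress fields):

```py
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
query = spark.readStream.format("rate").load().writeStream.format("noop").start()
query.processAllAvailable()

progress = query.lastProgress
print(progress.batchId, progress.numInputRows)        # new: attribute access, like Scala
print(progress["batchId"], progress["numInputRows"])  # dict-style access still works
query.stop()
```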

### Why are the changes needed?

API parity

### Does this PR introduce _any_ user-facing change?

Yes, now `StreamingQuery.lastProgress` returns the actual `StreamingQueryProgress` (and `StreamingQuery.recentProgress` returns `List[StreamingQueryProgress]`).

### How was this patch tested?

Added unit test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47470 from WweiL/bring-back-lastProgress.

Authored-by: Wei Liu <wei.liu@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?

Adds support for the variant type in `InMemoryTableScan`, i.e. `df.cache()`, by supporting writing variant values to an in-memory buffer.

### Why are the changes needed?

Prior to this PR, calling `df.cache()` on a DataFrame containing a variant column would fail because `InMemoryTableScan` did not support reading variant types.
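
A hedged sketch of the now-working pattern, using `parse_json` to produce a variant column:

```py
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Before this change, caching a DataFrame with a VARIANT column failed because
# InMemoryTableScan could not handle variant values.
df = spark.range(3).select(F.parse_json(F.lit('{"a": 1}')).alias("v"))
df.cache()
print(df.count())  # materializes the in-memory (cached) relation
df.unpersist()
```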

### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

added UTs
### Was this patch authored or co-authored using generative AI tooling?

no

Closes #47252 from richardc-db/variant_dfcache_support.

Authored-by: Richard Chen <r.chen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…ith spark session

### What changes were proposed in this pull request?
`DefaultParamsReader/Writer` handle metadata with spark session

### Why are the changes needed?
In the existing ML implementations, when loading/saving a model, the metadata is loaded/saved with `SparkContext` while the coefficients are loaded/saved with `SparkSession`.

This PR aims to also load/save the metadata with `SparkSession`, by introducing new helper functions.

- Note I: third-party libraries (e.g. [xgboost](https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-spark/src/main/scala/org/apache/spark/ml/util/XGBoostReadWrite.scala#L38-L53)) likely depend on the existing implementation of saveMetadata/loadMetadata, so we cannot simply remove them even though they are `private[ml]`.

- Note II: this PR only handles `loadMetadata` and `saveMetadata`; there are similar cases for meta algorithms and param read/write, but I want to leave the remaining part for later to avoid touching too many files in a single PR.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47467 from zhengruifeng/ml_load_with_spark.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…ileStreamSink.hasMetadata

### What changes were proposed in this pull request?

This pull request proposes to move path initialization into the try-catch block in FileStreamSink.hasMetadata. Exceptions from invalid paths can then be handled consistently with other path-related exceptions in the existing try-catch block, so the errors fall into the correct code branches.

### Why are the changes needed?

Bug fix for improperly handled exceptions in FileStreamSink.hasMetadata.

### Does this PR introduce _any_ user-facing change?

No. An invalid path is still invalid, but it now fails in the correct places.

### How was this patch tested?

new test

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #47471 from yaooqinn/SPARK-48991.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
### What changes were proposed in this pull request?
The PR aims to unify variable-related SQL syntax keywords, enabling the `DECLARE (OR REPLACE)? ...` and `DROP TEMPORARY ...` syntax to support the keyword `VAR` (not only `VARIABLE`).

### Why are the changes needed?
When setting variables we support `(VARIABLE | VAR)`, but when declaring and dropping variables only the keyword `VARIABLE` is supported (not `VAR`).

<img width="597" alt="image" src="https://github.com/user-attachments/assets/07084fef-4080-4410-a74c-e6001ae0a8fa">

https://github.com/apache/spark/blob/285489b0225004e918b6e937f7367e492292815e/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4#L68-L72

https://github.com/apache/spark/blob/285489b0225004e918b6e937f7367e492292815e/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4#L218-L220

This makes the syntax inconsistent and gives end-users an uneven experience with variable-related SQL, so I propose to unify it.
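
A hedged usage sketch of the unified keywords (run via `spark.sql`; the variable name is illustrative):

```py
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("DECLARE VAR v1 INT DEFAULT 1")  # previously only VARIABLE was accepted here
spark.sql("SET VAR v1 = 10")               # SET already accepted VAR
spark.sql("DROP TEMPORARY VAR v1")         # previously only VARIABLE was accepted here
```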

### Does this PR introduce _any_ user-facing change?
Yes, end-users can now use variable-related SQL with consistent keywords.

### How was this patch tested?
Updated existing UTs.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47469 from panbingkun/SPARK-48990.

Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…Shuffle to remove shuffle properly

### What changes were proposed in this pull request?
This is a follow-up for #45930, where we introduced ShuffleCleanupMode and implemented cleaning up of shuffle dependencies.

There was a bug where `ShuffleManager.unregisterShuffle` was used on the driver, which is not effective at all in non-local mode. This change fixes the bug by using `ShuffleDriverComponents.removeShuffle` instead.

### Why are the changes needed?
This is to address the comments in #45930 (comment)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Updated unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46302 from bozhang2820/spark-47764-1.

Authored-by: Bo Zhang <bo.zhang@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…t case failure with ANSI mode off

### What changes were proposed in this pull request?

This PR aims to fix the `h2` filter push-down test case failure with ANSI mode off.

### Why are the changes needed?

Fix test failure.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test of the whole `JDBCV2Suite` with ANSI mode off and on.

1. Method One: with IDEA.
-  ANSI mode off: with `SPARK_ANSI_SQL_MODE=false`
<img width="1066" alt="image" src="https://github.com/user-attachments/assets/13ec8ff4-0699-4f3e-95c4-74f53d9824fe">

-  ANSI mode on: without `SPARK_ANSI_SQL_MODE` env variable
<img width="1066" alt="image" src="https://github.com/user-attachments/assets/8434bf0c-b332-4663-965c-0d17d60da78a">

2. Method Two: with commands.
- ANSI mode off
```
$ export SPARK_ANSI_SQL_MODE=false
$ build/sbt
> project sql
> testOnly org.apache.spark.sql.jdbc.JDBCV2Suite
```

- ANSI mode on
```
$ unset SPARK_ANSI_SQL_MODE
$ build/sbt
> project sql
> testOnly org.apache.spark.sql.jdbc.JDBCV2Suite
```

Test results:
1. The issue on current `master` branch
-  with `SPARK_ANSI_SQL_MODE=false`, test failed
-  without `SPARK_ANSI_SQL_MODE` env variable, test passed
2. Fixed with new test code
-  with `SPARK_ANSI_SQL_MODE=false`, test passed
-  without `SPARK_ANSI_SQL_MODE` env variable, test passed

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47472 from wayneguow/fix_h2.

Authored-by: Wei Guo <guow93@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?

Introduce a new `clusterBy` DataFrame API in Scala. This PR adds the API for both the DataFrameWriter V1 and V2, as well as Spark Connect.

### Why are the changes needed?

Introduce more ways for users to interact with clustered tables.

### Does this PR introduce _any_ user-facing change?

Yes, it adds a new `clusterBy` DataFrame API in Scala to allow specifying the clustering columns when writing DataFrames.

### How was this patch tested?

New unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47301 from zedtang/clusterby-scala-api.

Authored-by: Jiaheng Tang <jiaheng.tang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
… in hive-thriftserver test

### What changes were proposed in this pull request?

A follow-up of SPARK-48844 to clean up duplicated data resource files in the hive-thriftserver tests.

### Why are the changes needed?

code refactoring

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new tests

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #47480 from yaooqinn/SPARK-48844-F.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
### What changes were proposed in this pull request?
The PR aims to update the doc `sql/README.md`.

### Why are the changes needed?
After #41426, we have added a subproject `API` to the SQL module, so the doc `sql/README.md` needs to be updated accordingly.

### Does this PR introduce _any_ user-facing change?
Yes, make the doc clearer and more accurate.

### How was this patch tested?
Manually test.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47476 from panbingkun/minor_docs.

Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…hStateExec operator

### What changes were proposed in this pull request?

Introducing the OperatorStateMetadataV2 format that integrates with the TransformWithStateExec operator. It is used to keep information about the TWS operator and will be used to enforce invariants between query runs. Each OperatorStateMetadataV2 has a pointer to the StateSchemaV3 file for the corresponding operator.
Purging will be introduced in this PR: #47286
### Why are the changes needed?

This is needed for State Metadata integration with the TransformWithState operator.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Added unit tests to StateStoreSuite and TransformWithStateSuite

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47445 from ericm-db/metadata-v2.

Authored-by: Eric Marnadi <eric.marnadi@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
… of Column

### What changes were proposed in this pull request?

Allows bare literals for `__and__` and `__or__` of Column API in Spark Classic.

### Why are the changes needed?

Currently, bare literals are not allowed for `__and__` and `__or__` of the Column API in Spark Classic and must be wrapped with the `lit()` function. They should be allowed, as with other comparable operators.

```py
>>> from pyspark.sql.functions import *
>>> c = col("c")
>>> c & True
Traceback (most recent call last):
...
py4j.Py4JException: Method and([class java.lang.Boolean]) does not exist

>>> c & lit(True)
Column<'and(c, true)'>
```

whereas other operators:

```py
>>> c + 1
Column<'`+`(c, 1)'>
>>> c + lit(1)
Column<'`+`(c, 1)'>
```

Spark Connect allows this.

```py
>>> c & True
Column<'and(c, True)'>
>>> c & lit(True)
Column<'and(c, True)'>
```

### Does this PR introduce _any_ user-facing change?

Yes.
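
For instance, after this change Spark Classic accepts the bare literal and matches the Connect behaviour shown above (a hedged sketch; the exact column string representation may differ):

```py
>>> from pyspark.sql.functions import col
>>> c = col("c")
>>> c & True
Column<'and(c, true)'>
>>> c | False
Column<'or(c, false)'>
```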

### How was this patch tested?

Added the related tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47474 from ueshin/issues/SPARK-48996/literal_and_or.

Authored-by: Takuya Ueshin <ueshin@databricks.com>
Signed-off-by: Takuya Ueshin <ueshin@databricks.com>
…, if they are bound to outer rows

### What changes were proposed in this pull request?

Extends previous work in #46839, allowing the grouping expressions to be bound to outer references.

The most common example is:
`select *, (select count(*) from T_inner where cast(T_inner.x as date) = T_outer.date group by cast(T_inner.x as date))`

Here, we group by `cast(T_inner.x as date)`, which is bound to an outer row. This guarantees that for every outer row there is exactly one value of `cast(T_inner.x as date)`, so it is safe to group on it.
Previously, we required that only columns could be bound to outer expressions, thus forbidding such subqueries.

### Why are the changes needed?

Extends supported subqueries

### Does this PR introduce _any_ user-facing change?

Yes, previously failing queries are now passing

### How was this patch tested?

Query tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47388 from agubichev/group_by_cols.

Authored-by: Andrey Gubichev <andrey.gubichev@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?

Support listColumns API for clustering columns.
### Why are the changes needed?

Clustering columns should be supported by the listColumns API, just like partition and bucket columns.
### Does this PR introduce _any_ user-facing change?

Yes, listColumns will now show an additional field `isCluster` to indicate whether the column is a clustering column.
Old output for `spark.catalog.listColumns`:
```
+----+-----------+--------+--------+-----------+--------+
|name|description|dataType|nullable|isPartition|isBucket|
+----+-----------+--------+--------+-----------+--------+
|   a|       null|     int|    true|      false|   false|
|   b|       null|  string|    true|      false|   false|
|   c|       null|     int|    true|      false|   false|
|   d|       null|  string|    true|      false|   false|
+----+-----------+--------+--------+-----------+--------+
```

New output:
```
+----+-----------+--------+--------+-----------+--------+---------+
|name|description|dataType|nullable|isPartition|isBucket|isCluster|
+----+-----------+--------+--------+-----------+--------+---------+
|   a|       null|     int|    true|      false|   false|    false|
|   b|       null|  string|    true|      false|   false|    false|
|   c|       null|     int|    true|      false|   false|    false|
|   d|       null|  string|    true|      false|   false|    false|
+----+-----------+--------+--------+-----------+--------+---------+
```
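
A hedged usage sketch from Python (the DDL assumes a data source that supports clustered tables, and that the Python catalog exposes the new field in the same way as the Scala API shown above):

```py
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("CREATE TABLE t (a INT, b STRING, c INT, d STRING) USING parquet CLUSTER BY (a, b)")

for column in spark.catalog.listColumns("t"):
    # each entry now also carries an isCluster flag
    print(column.name, column.isPartition, column.isBucket, column.isCluster)
```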

### How was this patch tested?

New unit tests.
### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47451 from zedtang/list-clustering-columns.

Authored-by: Jiaheng Tang <jiaheng.tang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?

This PR aims to improve `MasterPage` to support a custom title.

### Why are the changes needed?

When there are multiple Spark clusters, a custom title can be more helpful than the Spark master address because it can convey semantics such as the role of each cluster. In addition, the URL field on the same page still provides the Spark master address even when a custom title is used.

**BEFORE**
```
sbin/start-master.sh
```
![Screenshot 2024-07-25 at 14 01 11](https://github.com/user-attachments/assets/7055d700-4bd6-4785-a535-2f8ce6dba47d)

**AFTER**
```
SPARK_MASTER_OPTS='-Dspark.master.ui.title="Projext X Staging Cluster"' sbin/start-master.sh
```
![Screenshot 2024-07-25 at 14 05 38](https://github.com/user-attachments/assets/f7e45fd6-fa2b-4547-ae39-1403b1e910d9)

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Pass the CIs with newly added test case.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47491 from dongjoon-hyun/SPARK-49007.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>