Create a new pull request by comparing changes across two branches #1664

Merged
28 commits merged on Jul 26, 2024

Commits on Jul 23, 2024

  1. [SPARK-48970][PYTHON][ML] Avoid using SparkSession.getActiveSession in spark ML reader/writer
    
    ### What changes were proposed in this pull request?
    
    `SparkSession.getActiveSession` returns a thread-local session, but Spark ML readers/writers might be executed in a different thread, which causes `SparkSession.getActiveSession` to return None.
    
    ### Why are the changes needed?
    
    It fixes bugs like the following:
    ```
            spark = SparkSession.getActiveSession()
    >       spark.createDataFrame(  # type: ignore[union-attr]
                [(metadataJson,)], schema=["value"]
            ).coalesce(1).write.text(metadataPath)
    E       AttributeError: 'NoneType' object has no attribute 'createDataFrame'
    ```
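
    A minimal sketch of the safer pattern, assuming a metadata writer invoked from a worker thread (the helper and paths here are illustrative, not the actual Spark ML internals):

    ```python
    from threading import Thread

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    def write_metadata(spark: SparkSession, metadata_json: str, path: str) -> None:
        # The session is passed in explicitly instead of being looked up via the
        # thread-local SparkSession.getActiveSession(), which may return None here.
        spark.createDataFrame([(metadata_json,)], schema=["value"]) \
            .coalesce(1).write.text(path)

    # Runs on a different thread than the one that created the session.
    Thread(target=write_metadata, args=(spark, '{"k": 1}', "/tmp/metadata")).start()
    ```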
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Manually.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes #47453 from WeichenXu123/SPARK-48970.
    
    Authored-by: Weichen Xu <weichen.xu@databricks.com>
    Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
    WeichenXu123 committed Jul 23, 2024
    Commit: fba4c8c
  2. [SPARK-48752][FOLLOWUP][PYTHON][DOCS] Use explicit name for line number in log
    
    ### What changes were proposed in this pull request?
    
    This PR follows up #47145 to rename the log field.
    
    ### Why are the changes needed?
    
    `line_no` is not very intuitive, so we had better rename it to the explicit `line_number`.
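
    For illustration, with the structured logger that SPARK-48752 introduced, the call-site field in the emitted JSON record is now named `line_number` (the surrounding record layout shown here is an assumption, trimmed for brevity):

    ```python
    from pyspark.logger import PySparkLogger

    logger = PySparkLogger.getLogger("ExampleLogger")
    logger.warning("Query took too long", duration_ms=1234)
    # Emits a JSON record along the lines of (layout illustrative):
    #   {..., "msg": "Query took too long", "context": {"duration_ms": 1234},
    #    ..., "line_number": 5}   # previously "line_no"
    ```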
    
    ### Does this PR introduce _any_ user-facing change?
    
    No API change, but the user-facing log messages will be improved.
    
    ### How was this patch tested?
    The existing CI should pass
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes #47437 from itholic/logger_followup.
    
    Authored-by: Haejoon Lee <haejoon.lee@databricks.com>
    Signed-off-by: Haejoon Lee <haejoon.lee@databricks.com>
    itholic committed Jul 23, 2024
    Commit: c69f02e
  3. [SPARK-48928] Log Warning for Calling .unpersist() on Locally Checkpointed RDDs
    
    ### What changes were proposed in this pull request?
    
    This pull request proposes logging a warning message when the `.unpersist()` method is called on RDDs that have been locally checkpointed. The goal is to inform users about the potential risks associated with unpersisting locally checkpointed RDDs without changing the current behavior of the method.
    
    ### Why are the changes needed?
    
    Local checkpointing truncates the lineage of an RDD, preventing it from being recomputed from its source. If a locally checkpointed RDD is unpersisted, it loses its data and cannot be regenerated, potentially leading to job failures if subsequent actions or transformations are attempted on the RDD (as seen in some user workloads). Logging a warning message helps users avoid such pitfalls and aids in debugging.
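
    A minimal sketch of the pitfall (standard PySpark RDD API; the warning on `unpersist()` is the new behavior):

    ```python
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    rdd = sc.parallelize(range(100)).map(lambda x: x * 2)
    rdd.localCheckpoint()  # truncates the lineage; data lives only in block storage
    rdd.count()            # materializes the local checkpoint

    rdd.unpersist()        # now logs a warning: the checkpointed data is dropped
    rdd.count()            # may fail, since the RDD can no longer be recomputed
    ```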
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, this PR adds a warning log message when `.unpersist()` is called on a locally checkpointed RDD, but it does not alter any existing behavior.
    
    ### How was this patch tested?
    
    This PR does not change any existing behavior and therefore no tests are added.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes #47391 from mingkangli-db/warning_unpersist.
    
    Authored-by: Mingkang Li <mingkang.li@databricks.com>
    Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
    mingkangli-db authored and Mridul Muralidharan committed Jul 23, 2024
    Commit: 118167f

Commits on Jul 24, 2024

  1. [SPARK-48414][PYTHON] Fix breaking change in python's fromJson

    ### What changes were proposed in this pull request?
    
    Fix a breaking change in the `fromJson` method by giving the new parameters default values.
    
    ### Why are the changes needed?
    
    To avoid breaking clients that don't care about collations.
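
    As a sketch, a client built before collation support can keep calling `fromJson` with just the JSON payload, since the newly added parameters now have defaults (their exact names are not spelled out in this summary):

    ```python
    from pyspark.sql.types import StructField

    # A pre-collation payload: no collation metadata anywhere.
    field = StructField.fromJson(
        {"name": "city", "type": "string", "nullable": True, "metadata": {}}
    )
    print(field)  # StructField('city', StringType(), True)
    ```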
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Existing UTs.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes #46737 from stefankandic/fromJsonBreakingChange.
    
    Authored-by: Stefan Kandic <stefan.kandic@databricks.com>
    Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
    stefankandic authored and HyukjinKwon committed Jul 24, 2024
    Commit: fdcf975
  2. [SPARK-48974][SQL][SS][ML][MLLIB] Use `SparkSession.implicits` instead of `SQLContext.implicits`
    
    ### What changes were proposed in this pull request?
    This PR replaces `SQLContext.implicits`  with `SparkSession.implicits` in the Spark codebase.
    
    ### Why are the changes needed?
    Reduce the usage of code from `SQLContext` within the internal code of Spark.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Pass GitHub Actions
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes #47457 from LuciferYang/use-sparksession-implicits.
    
    Lead-authored-by: yangjie01 <yangjie01@baidu.com>
    Co-authored-by: YangJie <yangjie01@baidu.com>
    Signed-off-by: yangjie01 <yangjie01@baidu.com>
    LuciferYang committed Jul 24, 2024
    Commit: 877c3f2
  3. [SPARK-48987][INFRA] Make curl retry 3 times in bin/mvn

    ### What changes were proposed in this pull request?
    The PR aims to make `curl` retry 3 times in `bin/mvn`.
    
    ### Why are the changes needed?
    Avoid the following issues:
    https://github.com/panbingkun/spark/actions/runs/10067831390/job/27832101470
    <img width="993" alt="image" src="https://github.com/user-attachments/assets/3fa9a59a-82cb-4e99-b9f7-d128f051d340">
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Continuous manual observation.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes #47465 from panbingkun/SPARK-48987.
    
    Authored-by: panbingkun <panbingkun@baidu.com>
    Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
    panbingkun authored and HyukjinKwon committed Jul 24, 2024
    Commit: b9447db
  4. [SPARK-48981] Fix simpleString method of StringType in pyspark for collations
    
    ### What changes were proposed in this pull request?
    
    Fixing a bug where, because string interpolation works differently in Python, we had an accidental dollar sign in the string value.
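
    The bug class is easy to reproduce in plain Python: Scala-style `s"...${x}"` interpolation pasted into an f-string keeps the dollar sign (a sketch, not the actual Spark source):

    ```python
    collation = "UTF8_LCASE"

    # Scala interpolation syntax copied into an f-string leaves a literal "$":
    print(f"string collate ${collation}")  # string collate $UTF8_LCASE  <- the bug
    print(f"string collate {collation}")   # string collate UTF8_LCASE   <- the fix
    ```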
    
    ### Why are the changes needed?
    
    To be consistent with the scala code.
    
    ### Does this PR introduce _any_ user-facing change?
    
    A different value will be shown to the user.
    
    ### How was this patch tested?
    
    Unit tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes #47463 from stefankandic/fixPythonToString.
    
    Authored-by: Stefan Kandic <stefan.kandic@databricks.com>
    Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
    stefankandic authored and HyukjinKwon committed Jul 24, 2024
    Commit: 3fdad6a
  5. [SPARK-48976][SQL][DOCS] Improve the docs related to variable

    ### What changes were proposed in this pull request?
    The PR aims to improve the docs related to `variable`, including:
    - `docs/sql-ref-syntax-aux-set-var.md`: show the `primitive` error messages.
    - `docs/sql-ref-syntax-ddl-declare-variable.md`: add usage of `DECLARE OR REPLACE`.
    - `docs/sql-ref-syntax-ddl-drop-variable.md`: show the `primitive` error messages and fix a typo.
    
    ### Why are the changes needed?
    Only improve docs.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, make end-user docs clearer.
    
    ### How was this patch tested?
    Manually test.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes #47460 from panbingkun/SPARK-48976.
    
    Authored-by: panbingkun <panbingkun@baidu.com>
    Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
    panbingkun authored and HyukjinKwon committed Jul 24, 2024
    Commit: 881b214
  6. [SPARK-48975][PROTOBUF] Remove unnecessary ScalaReflectionLock definition from `protobuf`
    
    ### What changes were proposed in this pull request?
    This PR removes the unused object definition `ScalaReflectionLock` from the `protobuf` module. `ScalaReflectionLock` is defined at the access scope of the `protobuf` package; it was introduced in SPARK-40654 | #37972 and became unused in SPARK-41639 | #39147.
    
    ### Why are the changes needed?
    Clean up unused definitions.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Pass GitHub Actions
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes #47459 from LuciferYang/remove-ScalaReflectionLock.
    
    Authored-by: yangjie01 <yangjie01@baidu.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    LuciferYang authored and dongjoon-hyun committed Jul 24, 2024
    Commit: f828146
  7. [SPARK-48931][SS] Reduce Cloud Store List API cost for state store maintenance task
    
    ### What changes were proposed in this pull request?
    
    Currently, during the state store maintenance process, we find which old version files of the **RocksDB** state store to delete by listing all existing snapshotted version files in the checkpoint directory, every minute by default. Frequent list calls in the cloud can be costly. To reduce this cost, we minimize how often the maintenance task lists the object store: we accumulate versions to delete and only call list once the number of versions to delete reaches a configured threshold.
    
    The changes include:
    1. Adding a new configuration `ratioExtraVersionsAllowedInCheckpoint` in **SQLConf**. This determines the ratio of extra version files we want to retain in the checkpoint directory, relative to the number of versions retained for rollbacks (`minBatchesToRetain`).
    2. Using this config and `minBatchesToRetain` to set `minVersionsToDelete` inside **StateStoreConf**: `minVersionsToDelete = ratioExtraVersionsAllowedInCheckpoint * minBatchesToRetain`.
    3. Using `minSeenVersion` and `maxSeenVersion` variables in **RocksDBFileManager** to estimate the min/max version present in the directory and control deletion frequency, by ensuring the number of stale versions to delete is at least `minVersionsToDelete` (see the sketch below).
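
    Illustrative arithmetic for the thresholding described above (the names mirror this summary; the actual SQLConf keys and values are assumptions):

    ```python
    min_batches_to_retain = 100          # versions retained for rollback
    ratio_extra_versions_allowed = 0.3   # hypothetical configured ratio

    # LIST/deletion is deferred until this many stale versions accumulate:
    min_versions_to_delete = ratio_extra_versions_allowed * min_batches_to_retain  # 30.0

    # The maintenance task estimates staleness from versions it has seen:
    min_seen_version, max_seen_version = 1, 140
    stale_estimate = (max_seen_version - min_seen_version + 1) - min_batches_to_retain
    if stale_estimate >= min_versions_to_delete:
        print("run one LIST call, then batch-delete the stale version files")
    ```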
    
    ### Why are the changes needed?
    
    Currently, maintenance operations like snapshotting, purging, deletion, and file management are performed asynchronously for each data partition. We want to shift away from periodic deletion and instead rely on the estimated number of files in the checkpoint directory, to reduce list calls and introduce batch deletion.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Added unit tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes #47393 from riyaverm-db/reduce-cloud-store-list-api-cost-in-maintenance.
    
    Authored-by: Riya Verma <riya.verma@databricks.com>
    Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
    riyaverm-db authored and HeartSaVioR committed Jul 24, 2024
    Commit: 4e20a2a
  8. [SPARK-48961][PYTHON] Make the parameter naming of PySparkException consistent with JVM
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to make the parameter naming of `PySparkException` consistent with JVM
    
    ### Why are the changes needed?
    
    The parameter names of `PySparkException` are different from `SparkException`, so there is an inconsistency when searching for those parameters in error logs.
    
    SparkException:
    https://github.com/apache/spark/blob/6508b1f5e18731359354af0a7bcc1382bc4f356b/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L27-L33
    
    PySparkException:
    https://github.com/apache/spark/blob/6508b1f5e18731359354af0a7bcc1382bc4f356b/python/pyspark/errors/exceptions/base.py#L29-L40
    
    ### Does this PR introduce _any_ user-facing change?
    
    The error parameter names are changed as follows (a usage sketch follows the list):
    - `error_class` -> `errorClass`
    - `message_parameters` -> `messageParameters`
    - `query_contexts` -> `context`
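
    For illustration, a sketch of the renamed surface (the query and error class here are incidental; any PySpark error exposes the same getters):

    ```python
    from pyspark.errors import PySparkException
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    try:
        spark.sql("SELECT * FROM nonexistent_table")
    except PySparkException as e:
        # Getter names are unchanged; only the constructor keywords moved to
        # camelCase (errorClass=, messageParameters=) to match SparkException.
        print(e.getErrorClass())         # e.g. TABLE_OR_VIEW_NOT_FOUND
        print(e.getMessageParameters())
    ```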
    
    ### How was this patch tested?
    
    The existing CI should pass
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes #47436 from itholic/SPARK-48961.
    
    Authored-by: Haejoon Lee <haejoon.lee@databricks.com>
    Signed-off-by: Haejoon Lee <haejoon.lee@databricks.com>
    itholic committed Jul 24, 2024
    Commit: 090ad9f
  9. [SPARK-48935][SQL][TESTS] Make `checkEvaluation` directly check the `Collation` expression itself in UT
    
    ### What changes were proposed in this pull request?
    The PR aims to:
    - make `checkEvaluation` directly check the `Collation` expression itself in UT, rather than `Collation(...).replacement`.
    - fix a missed check in a UT.
    
    ### Why are the changes needed?
    When checking a `RuntimeReplaceable` expression in UT, there is no need to write `checkEvaluation(Collation(Literal("abc")).replacement, "UTF8_BINARY")`, because the replacement has already been applied internally.
    https://github.com/apache/spark/blob/1a428c1606645057ef94ac8a6cadbb947b9208a6/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala#L75
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    - Updated existing UTs.
    - Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes #47401 from panbingkun/SPARK-48935.
    
    Authored-by: panbingkun <panbingkun@baidu.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    panbingkun authored and cloud-fan committed Jul 24, 2024
    Commit: 4de4ed1
  10. [SPARK-48338][SQL] Check variable declarations

    ### What changes were proposed in this pull request?
    
    Check whether variable declarations appear only at the beginning of the BEGIN ... END block.
    
    ### Why are the changes needed?
    
    The SQL standard states that variables can be declared only immediately after BEGIN.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Users will get an error if they try to declare a variable in a scope that does not start with BEGIN and end with END, or if the declarations are not immediately after BEGIN.
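
    A sketch of what the check accepts and rejects (illustrative SQL held in Python strings, since SQL scripting was parser-level work at this point):

    ```python
    # Accepted: all declarations sit immediately after BEGIN.
    valid_script = """
    BEGIN
      DECLARE total INT DEFAULT 0;
      SET VAR total = total + 1;
    END
    """

    # Rejected by the new check: a declaration after another statement.
    invalid_script = """
    BEGIN
      SET VAR total = 1;
      DECLARE late INT DEFAULT 0;  -- error: not immediately after BEGIN
    END
    """
    ```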
    
    ### How was this patch tested?
    
    Tests are in SqlScriptingParserSuite. There are two tests for now: one where declarations are correctly written and one where declarations are not immediately after BEGIN. There is a TODO to add a test where a declaration is located in a scope that is not BEGIN ... END.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes #47404 from momcilomrk-db/check_variable_declarations.
    
    Authored-by: Momcilo Mrkaic <momcilo.mrkaic@databricks.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    momcilomrk-db authored and cloud-fan committed Jul 24, 2024
    Commit: 239d77b
  11. [SPARK-48567][SS][FOLLOWUP] StreamingQuery.lastProgress should return the actual StreamingQueryProgress
    
    This reverts commit d067fc6, which reverted 042804a, essentially bringing it back. 042804a failed the 3.5 client <> 4.0 server test, but that test was turned off for cross-version testing in #47468.
    
    ### What changes were proposed in this pull request?
    
    This PR was created after discussion in the closed #46886. I was trying to fix a bug (in Connect, `query.lastProgress` doesn't have `numInputRows`, `inputRowsPerSecond`, and `processedRowsPerSecond`), and we reached the conclusion that what is proposed in this PR should be the ultimate fix.
    
    In python, for both classic spark and spark connect, the return type of `lastProgress` is `Dict` (and `recentProgress` is `List[Dict]`), but in scala it's the actual `StreamingQueryProgress` object:
    https://github.com/apache/spark/blob/1a5d22aa2ffe769435be4aa6102ef961c55b9593/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQuery.scala#L94-L101
    
    This API discrepancy brings some confusion: in Scala, users can write `query.lastProgress.batchId`, while in Python they have to write `query.lastProgress["batchId"]`.
    
    This PR makes `StreamingQuery.lastProgress` to return the actual `StreamingQueryProgress` (and `StreamingQuery.recentProgress` to return `List[StreamingQueryProgress]`).
    
    To prevent a breaking change, we extend `StreamingQueryProgress` to be a subclass of `dict`, so existing code that uses dictionary-style access (e.g. `query.lastProgress["id"]`) still works.
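
    A sketch of both access styles after the change (the rate source and noop sink are used only to get a running query):

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    query = (
        spark.readStream.format("rate").load()
        .writeStream.format("noop").start()
    )
    query.processAllAvailable()

    progress = query.lastProgress       # now a StreamingQueryProgress
    if progress is not None:
        print(progress.batchId)         # attribute access, as in Scala
        print(progress["batchId"])      # dict-style access keeps working
    query.stop()
    ```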
    
    ### Why are the changes needed?
    
    API parity
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, now `StreamingQuery.lastProgress` returns the actual `StreamingQueryProgress` (and `StreamingQuery.recentProgress` returns `List[StreamingQueryProgress]`).
    
    ### How was this patch tested?
    
    Added unit test
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes #47470 from WweiL/bring-back-lastProgress.
    
    Authored-by: Wei Liu <wei.liu@databricks.com>
    Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
    WweiL authored and HyukjinKwon committed Jul 24, 2024
    Commit: 22eb6c4
  12. [SPARK-48833][SQL][VARIANT] Support variant in InMemoryTableScan

    ### What changes were proposed in this pull request?
    
    Adds support for the variant type in `InMemoryTableScan` (i.e. `df.cache()`) by supporting writing variant values to an in-memory buffer.
    
    ### Why are the changes needed?
    
    Prior to this PR, calling `df.cache()` on a DataFrame with a variant column would fail because `InMemoryTableScan` did not support reading variant types.
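
    For instance (a sketch; `parse_json` constructs variant values in recent Spark):

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, parse_json

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(3).select(parse_json(col("id").cast("string")).alias("v"))

    df.cache()   # previously failed: InMemoryTableScan could not handle variant
    df.count()   # materializes the cached variant column
    ```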
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.

    ### How was this patch tested?

    Added UTs.

    ### Was this patch authored or co-authored using generative AI tooling?

    No.
    
    Closes #47252 from richardc-db/variant_dfcache_support.
    
    Authored-by: Richard Chen <r.chen@databricks.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    richardc-db authored and cloud-fan committed Jul 24, 2024
    Commit: 0c9b072
  13. [SPARK-48988][ML] Make DefaultParamsReader/Writer handle metadata with spark session
    
    ### What changes were proposed in this pull request?
    `DefaultParamsReader/Writer` handle metadata with spark session
    
    ### Why are the changes needed?
    In existing ML implementations, when loading/saving a model, the metadata is loaded/saved with `SparkContext` while the coefficients are loaded/saved with `SparkSession`.

    This PR aims to also load/save the metadata with `SparkSession`, by introducing new helper functions.
    
    - Note I: third-party libraries (e.g. [xgboost](https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-spark/src/main/scala/org/apache/spark/ml/util/XGBoostReadWrite.scala#L38-L53)) likely depend on the existing implementation of saveMetadata/loadMetadata, so we cannot simply remove them even though they are `private[ml]`.
    
    - Note II: this PR only handles `loadMetadata` and `saveMetadata`; there are similar cases for meta algorithms and param read/write, but I want to leave the remaining part for later, to avoid touching too many files in a single PR.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    CI
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes #47467 from zhengruifeng/ml_load_with_spark.
    
    Authored-by: Ruifeng Zheng <ruifengz@apache.org>
    Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
    zhengruifeng authored and HyukjinKwon committed Jul 24, 2024
    Commit: 8597b78
  14. [SPARK-48991][SQL] Move path initialization into try-catch block in FileStreamSink.hasMetadata
    
    ### What changes were proposed in this pull request?
    
    This pull request proposes to move path initialization into the try-catch block in `FileStreamSink.hasMetadata`. Exceptions from invalid paths can then be handled consistently with other path-related exceptions in the current try-catch block, so the errors fall into the correct code branches.
    
    ### Why are the changes needed?
    
    Bugfix for improperly handled exceptions in `FileStreamSink.hasMetadata`.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No; an invalid path is still invalid, but it now fails in the correct place.
    
    ### How was this patch tested?
    
    new test
    
    ### Was this patch authored or co-authored using generative AI tooling?
    no
    
    Closes #47471 from yaooqinn/SPARK-48991.
    
    Authored-by: Kent Yao <yao@apache.org>
    Signed-off-by: Kent Yao <yao@apache.org>
    yaooqinn committed Jul 24, 2024
    Commit: d68cde8
  15. [SPARK-48990][SQL] Unified variable related SQL syntax keywords

    ### What changes were proposed in this pull request?
    The PR aims to unify variable-related SQL syntax keywords, enabling the syntaxes `DECLARE (OR REPLACE)? ...` and `DROP TEMPORARY ...` to support the keyword `VAR` (not only `VARIABLE`).
    
    ### Why are the changes needed?
    When setting variables, we support `(VARIABLE | VAR)`, but when declaring and dropping variables, we only support the keyword `VARIABLE` (not `VAR`).
    
    <img width="597" alt="image" src="https://github.com/user-attachments/assets/07084fef-4080-4410-a74c-e6001ae0a8fa">
    
    https://github.com/apache/spark/blob/285489b0225004e918b6e937f7367e492292815e/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4#L68-L72
    
    https://github.com/apache/spark/blob/285489b0225004e918b6e937f7367e492292815e/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4#L218-L220
    
    This is inconsistent and gives end-users an awkward experience with variable-related SQL syntax, so I propose to unify it.
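
    A sketch of the unified syntax, run through `spark.sql` (after this change `VAR` works wherever `VARIABLE` does):

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    spark.sql("DECLARE VAR v INT DEFAULT 1")   # previously required VARIABLE
    spark.sql("SET VAR v = 2")                 # already supported before this PR
    spark.sql("DROP TEMPORARY VAR v")          # previously required VARIABLE
    ```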
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, end-users can now use variable-related SQL with consistent keywords.
    
    ### How was this patch tested?
    Updated existing UTs.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes #47469 from panbingkun/SPARK-48990.
    
    Authored-by: panbingkun <panbingkun@baidu.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    panbingkun authored and cloud-fan committed Jul 24, 2024
    Commit: 34e65a8
  16. [SPARK-47764][FOLLOW-UP] Change to use ShuffleDriverComponents.removeShuffle to remove shuffle properly
    
    ### What changes were proposed in this pull request?
    This is a follow-up for #45930, where we introduced ShuffleCleanupMode and implemented cleanup of shuffle dependencies.

    There was a bug where `ShuffleManager.unregisterShuffle` was used on the driver, which is not effective at all in non-local mode. This change fixes the bug by switching to `ShuffleDriverComponents.removeShuffle`.
    
    ### Why are the changes needed?
    This is to address the comments in #45930 (comment)
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Updated unit tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes #46302 from bozhang2820/spark-47764-1.
    
    Authored-by: Bo Zhang <bo.zhang@databricks.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    bozhang2820 authored and dongjoon-hyun committed Jul 24, 2024
    Commit: 5d787e2

Commits on Jul 25, 2024

  1. [SPARK-48943][SQL][TESTS][FOLLOWUP] Fix the h2 filter push-down test case failure with ANSI mode off
    
    ### What changes were proposed in this pull request?
    
    This PR aims to fix the `h2` filter push-down test case failure with ANSI mode off.
    
    ### Why are the changes needed?
    
    Fix test failure.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Manually tested the whole `JDBCV2Suite` with ANSI mode off and on.
    
    1. Method One: with IDEA.
    -  ANSI mode off: with `SPARK_ANSI_SQL_MODE=false`
    <img width="1066" alt="image" src="https://github.com/user-attachments/assets/13ec8ff4-0699-4f3e-95c4-74f53d9824fe">
    
    -  ANSI mode on: without `SPARK_ANSI_SQL_MODE` env variable
    <img width="1066" alt="image" src="https://github.com/user-attachments/assets/8434bf0c-b332-4663-965c-0d17d60da78a">
    
    2. Method Two: with commands.
    - ANSI mode off
    ```
    $ export SPARK_ANSI_SQL_MODE=false
    $ build/sbt
    > project sql
    > testOnly org.apache.spark.sql.jdbc.JDBCV2Suite
    ```
    
    - ANSI mode on
    ```
    $ unset SPARK_ANSI_SQL_MODE
    $ build/sbt
    > project sql
    > testOnly org.apache.spark.sql.jdbc.JDBCV2Suite
    ```
    
    Test results:
    1. The issue on current `master` branch
    -  with `SPARK_ANSI_SQL_MODE=false`, test failed
    -  without `SPARK_ANSI_SQL_MODE` env variable, test passed
    2. Fixed with new test code
    -  with `SPARK_ANSI_SQL_MODE=false`, test passed
    -  without `SPARK_ANSI_SQL_MODE` env variable, test passed
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes #47472 from wayneguow/fix_h2.
    
    Authored-by: Wei Guo <guow93@gmail.com>
    Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
    wayneguow authored and HyukjinKwon committed Jul 25, 2024
    Commit: 8c625ea
  2. [SPARK-48761][SQL] Introduce clusterBy DataFrameWriter API for Scala

    ### What changes were proposed in this pull request?
    
    Introduce a new `clusterBy` DataFrame API in Scala. This PR adds the API for both the DataFrameWriter V1 and V2, as well as Spark Connect.
    
    ### Why are the changes needed?
    
    Introduce more ways for users to interact with clustered tables.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, it adds a new `clusterBy` DataFrame API in Scala to allow specifying the clustering columns when writing DataFrames.
    
    ### How was this patch tested?
    
    New unit tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes #47301 from zedtang/clusterby-scala-api.
    
    Authored-by: Jiaheng Tang <jiaheng.tang@databricks.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    zedtang authored and cloud-fan committed Jul 25, 2024
    Commit: bafce5d
  3. [SPARK-48844][FOLLOWUP][TESTS] Cleanup duplicated data resource files in hive-thriftserver test
    
    ### What changes were proposed in this pull request?
    
    A follow-up of SPARK-48844 to clean up duplicated data resource files in the hive-thriftserver tests.
    
    ### Why are the changes needed?
    
    code refactoring
    
    ### Does this PR introduce _any_ user-facing change?
    
    no
    
    ### How was this patch tested?
    
    new tests
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    no
    
    Closes #47480 from yaooqinn/SPARK-48844-F.
    
    Authored-by: Kent Yao <yao@apache.org>
    Signed-off-by: Kent Yao <yao@apache.org>
    yaooqinn committed Jul 25, 2024
    Commit: 5c19505
  4. [MINOR][DOCS] Update doc sql/README.md

    ### What changes were proposed in this pull request?
    The PR aims to update the doc `sql/README.md`.
    
    ### Why are the changes needed?
    After #41426, we added a subproject `API` to the SQL module, so the doc `sql/README.md` needs to be updated accordingly.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, make the doc clearer and more accurate.
    
    ### How was this patch tested?
    Manually test.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes #47476 from panbingkun/minor_docs.
    
    Authored-by: panbingkun <panbingkun@baidu.com>
    Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
    panbingkun authored and HyukjinKwon committed Jul 25, 2024
    Commit: cf95e75
  5. [SPARK-48849][SS] Create OperatorStateMetadataV2 for the TransformWithStateExec operator
    
    ### What changes were proposed in this pull request?
    
    Introducing the OperatorStateMetadataV2 format that integrates with the TransformWithStateExec operator. This keeps information about the TWS operator and will be used to enforce invariants between query runs. Each OperatorStateMetadataV2 has a pointer to the StateSchemaV3 file for the corresponding operator.
    Purging will be introduced in this PR: #47286

    ### Why are the changes needed?
    
    This is needed for State Metadata integration with the TransformWithState operator.
    
    ### Does this PR introduce _any_ user-facing change?
    
    ### How was this patch tested?
    
    Added unit tests to StateStoreSuite and TransformWithStateSuite
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes #47445 from ericm-db/metadata-v2.
    
    Authored-by: Eric Marnadi <eric.marnadi@databricks.com>
    Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
    ericm-db authored and HeartSaVioR committed Jul 25, 2024
    Commit: 4999469
  6. [SPARK-48996][SQL][PYTHON] Allow bare literals for __and__ and __or__ of Column
    
    ### What changes were proposed in this pull request?
    
    Allows bare literals for `__and__` and `__or__` of Column API in Spark Classic.
    
    ### Why are the changes needed?
    
    Currently bare literals are not allowed for `__and__` and `__or__` of the Column API in Spark Classic; they need to be wrapped with the `lit()` function. They should be allowed, as with other similar operators.
    
    ```py
    >>> from pyspark.sql.functions import *
    >>> c = col("c")
    >>> c & True
    Traceback (most recent call last):
    ...
    py4j.Py4JException: Method and([class java.lang.Boolean]) does not exist
    
    >>> c & lit(True)
    Column<'and(c, true)'>
    ```
    
    whereas other operators:
    
    ```py
    >>> c + 1
    Column<'`+`(c, 1)'>
    >>> c + lit(1)
    Column<'`+`(c, 1)'>
    ```
    
    Spark Connect allows this.
    
    ```py
    >>> c & True
    Column<'and(c, True)'>
    >>> c & lit(True)
    Column<'and(c, True)'>
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes.
    
    ### How was this patch tested?
    
    Added the related tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes #47474 from ueshin/issues/SPARK-48996/literal_and_or.
    
    Authored-by: Takuya Ueshin <ueshin@databricks.com>
    Signed-off-by: Takuya Ueshin <ueshin@databricks.com>
    ueshin committed Jul 25, 2024
    Commit: 78b83fa

Commits on Jul 26, 2024

  1. [SPARK-48503][SQL] Allow grouping on expressions in scalar subqueries, if they are bound to outer rows
    
    ### What changes were proposed in this pull request?
    
    Extends previous work in #46839, allowing the grouping expressions to be bound to outer references.
    
    The most common example is
    `select *, (select count(*) from T_inner where cast(T_inner.x as date) = T_outer.date group by cast(T_inner.x as date)) from T_outer`

    Here, we group by `cast(T_inner.x as date)`, which is bound to an outer row. This guarantees that for every outer row there is exactly one value of `cast(T_inner.x as date)`, so it is safe to group on it.
    Previously, we required that only columns could be bound to outer expressions, thus forbidding such subqueries.
    
    ### Why are the changes needed?
    
    Extends supported subqueries
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, previously failing queries are now passing
    
    ### How was this patch tested?
    
    Query tests
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes #47388 from agubichev/group_by_cols.
    
    Authored-by: Andrey Gubichev <andrey.gubichev@databricks.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    agubichev authored and cloud-fan committed Jul 26, 2024
    Commit: f3b819e
  2. [SPARK-45787][SQL] Support Catalog.listColumns for clustering columns

    ### What changes were proposed in this pull request?
    
    Support the `listColumns` API for clustering columns.

    ### Why are the changes needed?

    Clustering columns should be supported by the `listColumns` API, just like partition and bucket columns.
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, listColumns will now show an additional field `isCluster` to indicate whether the column is a clustering column.
    Old output for `spark.catalog.listColumns`:
    ```
    +----+-----------+--------+--------+-----------+--------+
    |name|description|dataType|nullable|isPartition|isBucket|
    +----+-----------+--------+--------+-----------+--------+
    |   a|       null|     int|    true|      false|   false|
    |   b|       null|  string|    true|      false|   false|
    |   c|       null|     int|    true|      false|   false|
    |   d|       null|  string|    true|      false|   false|
    +----+-----------+--------+--------+-----------+--------+
    ```
    
    New output:
    ```
    +----+-----------+--------+--------+-----------+--------+---------+
    |name|description|dataType|nullable|isPartition|isBucket|isCluster|
    +----+-----------+--------+--------+-----------+--------+---------+
    |   a|       null|     int|    true|      false|   false|    false|
    |   b|       null|  string|    true|      false|   false|    false|
    |   c|       null|     int|    true|      false|   false|    false|
    |   d|       null|  string|    true|      false|   false|    false|
    +----+-----------+--------+--------+-----------+--------+---------+
    ```
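
    A usage sketch (the `CLUSTER BY` create-table clause and the field name follow the description above; treat the DDL details as illustrative):

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.sql(
        "CREATE TABLE t (a INT, b STRING, c INT, d STRING) USING parquet CLUSTER BY (a)"
    )

    for column in spark.catalog.listColumns("t"):
        print(column.name, column.isCluster)   # a -> True, the rest -> False
    ```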
    
    ### How was this patch tested?
    
    New unit tests.
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes #47451 from zedtang/list-clustering-columns.
    
    Authored-by: Jiaheng Tang <jiaheng.tang@databricks.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    zedtang authored and cloud-fan committed Jul 26, 2024
    Commit: e73ede7
  3. [SPARK-49007][CORE] Improve MasterPage to support custom title

    ### What changes were proposed in this pull request?
    
    This PR aims to improve `MasterPage` to support custom title.
    
    ### Why are the changes needed?
    
    When there are multiple Spark clusters, a custom title can be more helpful than the Spark master address because it can convey semantics like the role of the cluster. In addition, the URL field on the same page already provides the Spark master information even when a custom title is used.
    
    **BEFORE**
    ```
    sbin/start-master.sh
    ```
    ![Screenshot 2024-07-25 at 14 01 11](https://github.com/user-attachments/assets/7055d700-4bd6-4785-a535-2f8ce6dba47d)
    
    **AFTER**
    ```
    SPARK_MASTER_OPTS='-Dspark.master.ui.title="Project X Staging Cluster"' sbin/start-master.sh
    ```
    ![Screenshot 2024-07-25 at 14 05 38](https://github.com/user-attachments/assets/f7e45fd6-fa2b-4547-ae39-1403b1e910d9)
    
    ### Does this PR introduce _any_ user-facing change?
    
    ### How was this patch tested?
    
    Pass the CIs with newly added test case.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes #47491 from dongjoon-hyun/SPARK-49007.
    
    Authored-by: Dongjoon Hyun <dhyun@apple.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    dongjoon-hyun committed Jul 26, 2024
    Commit: 2363aec