# Create a new pull request by comparing changes across two branches #1664
## Commits on Jul 23, 2024
**[SPARK-48970][PYTHON][ML] Avoid using SparkSession.getActiveSession in spark ML reader/writer**

### What changes were proposed in this pull request?
`SparkSession.getActiveSession` returns the thread-local session, but spark ML reader/writer might be executed in different threads, which causes `SparkSession.getActiveSession` to return None.

### Why are the changes needed?
It fixes a bug like:
```
spark = SparkSession.getActiveSession()
>   spark.createDataFrame(  # type: ignore[union-attr]
        [(metadataJson,)], schema=["value"]
    ).coalesce(1).write.text(metadataPath)
E   AttributeError: 'NoneType' object has no attribute 'createDataFrame'
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47453 from WeichenXu123/SPARK-48970.

Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Commit: `fba4c8c`
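The failure mode above is generic to thread-local state, not specific to Spark. A minimal sketch in plain Python (using `threading.local` as a hypothetical stand-in for the active-session registry, not pyspark's actual code) shows why a session registered in one thread is invisible to another:

```python
import threading

# Hypothetical stand-in for the active-session registry, which is
# thread-local: each thread sees only its own value.
_active = threading.local()

def set_active_session(session):
    _active.session = session

def get_active_session():
    # Another thread never set `.session`, so it gets None.
    return getattr(_active, "session", None)

set_active_session("session-from-main-thread")

seen_in_worker = []
worker = threading.Thread(
    target=lambda: seen_in_worker.append(get_active_session()))
worker.start()
worker.join()

print(get_active_session())  # the main thread still sees its session
print(seen_in_worker[0])     # the worker thread sees None
```

This is why a reader/writer running on a different thread cannot rely on `getActiveSession` and must be handed the session explicitly.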
**[SPARK-48752][FOLLOWUP][PYTHON][DOCS] Use explicit name for line number in log**

### What changes were proposed in this pull request?
This PR follows up #47145 to rename the log field.

### Why are the changes needed?
`line_no` is not very intuitive, so we rename it to the more explicit `line_number`.

### Does this PR introduce _any_ user-facing change?
No API change, but the user-facing log message will be improved.

### How was this patch tested?
The existing CI should pass.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47437 from itholic/logger_followup.

Authored-by: Haejoon Lee <haejoon.lee@databricks.com>
Signed-off-by: Haejoon Lee <haejoon.lee@databricks.com>
Commit: `c69f02e`
**[SPARK-48928] Log Warning for Calling .unpersist() on Locally Checkpointed RDDs**

### What changes were proposed in this pull request?
This pull request proposes logging a warning message when the `.unpersist()` method is called on RDDs that have been locally checkpointed. The goal is to inform users about the potential risks associated with unpersisting locally checkpointed RDDs without changing the current behavior of the method.

### Why are the changes needed?
Local checkpointing truncates the lineage of an RDD, preventing it from being recomputed from its source. If a locally checkpointed RDD is unpersisted, it loses its data and cannot be regenerated, potentially leading to job failures if subsequent actions or transformations are attempted on the RDD (which was seen on some user workloads). Logging a warning message helps users avoid such pitfalls and aids in debugging.

### Does this PR introduce _any_ user-facing change?
Yes, this PR adds a warning log message when .unpersist() is called on a locally checkpointed RDD, but it does not alter any existing behavior.

### How was this patch tested?
This PR does not change any existing behavior and therefore no tests are added.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47391 from mingkangli-db/warning_unpersist.

Authored-by: Mingkang Li <mingkang.li@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
Commit: `118167f`
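The shape of the guard — warn but do not change behavior when unpersisting data that cannot be recomputed from lineage — can be sketched as follows (a toy stand-in class, not Spark's RDD implementation):

```python
import warnings

class FakeRDD:
    """Toy sketch: warn on unpersist() after a local checkpoint."""

    def __init__(self):
        self.locally_checkpointed = False

    def local_checkpoint(self):
        # Truncates lineage: from here on, the data exists only in the cache.
        self.locally_checkpointed = True

    def unpersist(self):
        if self.locally_checkpointed:
            warnings.warn(
                "unpersisting a locally checkpointed RDD: its data cannot "
                "be recomputed, and later actions on it may fail")
        # ... proceed with unpersisting exactly as before (behavior unchanged)

rdd = FakeRDD()
rdd.local_checkpoint()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    rdd.unpersist()
```

The warning fires only on the risky path; callers who never locally checkpoint see no change at all.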
## Commits on Jul 24, 2024
**[SPARK-48414][PYTHON] Fix breaking change in python's `fromJson`**

### What changes were proposed in this pull request?
Fix breaking change in `fromJson` method by having default param values.

### Why are the changes needed?
In order to not break clients that don't care about collations.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing UTs.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46737 from stefankandic/fromJsonBreakingChange.

Authored-by: Stefan Kandic <stefan.kandic@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Commit: `fdcf975`
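The compatibility technique here is standard: a newly added parameter gets a default value so pre-existing callers keep working. A minimal sketch (class, field, and parameter names are illustrative, not the actual pyspark signature):

```python
class StringTypeSketch:
    """Illustrative type with a collation field added after the fact."""

    def __init__(self, collation="UTF8_BINARY"):
        self.collation = collation

    @classmethod
    def fromJson(cls, json, collations_map=None):
        # `collations_map` is the new parameter; defaulting it to None
        # keeps old call sites (which pass only `json`) working.
        collation = (collations_map or {}).get(json.get("name"), "UTF8_BINARY")
        return cls(collation)

old_style = StringTypeSketch.fromJson({"name": "col1"})  # pre-collation caller
new_style = StringTypeSketch.fromJson({"name": "col1"}, {"col1": "UNICODE"})
```

Callers that never heard of collations silently get the binary default; collation-aware callers opt in by passing the map.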
**[SPARK-48974][SQL][SS][ML][MLLIB] Use `SparkSession.implicits` instead of `SQLContext.implicits`**

### What changes were proposed in this pull request?
This PR replaces `SQLContext.implicits` with `SparkSession.implicits` in the Spark codebase.

### Why are the changes needed?
Reduce the usage of code from `SQLContext` within the internal code of Spark.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47457 from LuciferYang/use-sparksession-implicits.

Lead-authored-by: yangjie01 <yangjie01@baidu.com>
Co-authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
Commit: `877c3f2`
**[SPARK-48987][INFRA] Make `curl` retry 3 times in `bin/mvn`**

### What changes were proposed in this pull request?
The pr aims to make `curl` retry `3 times` in `bin/mvn`.

### Why are the changes needed?
Avoid issues like the following: https://github.com/panbingkun/spark/actions/runs/10067831390/job/27832101470

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Continuous manual observation.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47465 from panbingkun/SPARK-48987.

Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Commit: `b9447db`
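`curl --retry 3` re-attempts a transient failure a few times with backoff instead of failing the build immediately. The same retry idea, sketched in plain Python (a generic illustration of the pattern, not the `bin/mvn` script):

```python
import time

def with_retries(fn, attempts=3, delay=0.01):
    # Retry a flaky call up to `attempts` extra times, mirroring
    # what `curl --retry 3` does for transient network errors.
    for i in range(attempts + 1):
        try:
            return fn()
        except OSError:
            if i == attempts:
                raise  # out of retries: propagate the failure
            time.sleep(delay * (2 ** i))  # back off between attempts

calls = {"n": 0}

def flaky_download():
    # Fails twice with a transient error, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("transient network error")
    return b"maven-binary-bytes"

result = with_retries(flaky_download)
```

Two transient failures are absorbed; only a third consecutive failure would surface to the caller.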
**[SPARK-48981] Fix simpleString method of StringType in pyspark for collations**

### What changes were proposed in this pull request?
Fixing the bug where, because string interpolation works differently in Python, we had an accidental dollar sign in the string value.

### Why are the changes needed?
To be consistent with the scala code.

### Does this PR introduce _any_ user-facing change?
A different value will be shown to the user.

### How was this patch tested?
Unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47463 from stefankandic/fixPythonToString.

Authored-by: Stefan Kandic <stefan.kandic@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Commit: `3fdad6a`
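The class of bug is easy to reproduce: Scala's `s"..."` interpolator expands `$name`, while the identical literal pasted into Python keeps the dollar sign verbatim. A sketch (the string content is illustrative, not the exact `simpleString` output):

```python
collation = "UTF8_LCASE"

# In Scala, s"string collate $collation" would interpolate the variable.
# The same literal in Python leaves the "$" in the output:
broken = "string collate $collation"

# Python interpolation needs an f-string (or format/%) instead:
fixed = f"string collate {collation}"

print(broken)  # accidental dollar sign survives
print(fixed)
```

The fix is simply switching to the interpolation syntax the language actually supports.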
**[SPARK-48976][SQL][DOCS] Improve the docs related to `variable`**

### What changes were proposed in this pull request?
The pr aims to improve the docs related to `variable`, including:
- `docs/sql-ref-syntax-aux-set-var.md`: show the `primitive` error messages.
- `docs/sql-ref-syntax-ddl-declare-variable.md`: add usage of `DECLARE OR REPLACE`.
- `docs/sql-ref-syntax-ddl-drop-variable.md`: show the `primitive` error messages and fix a `typo`.

### Why are the changes needed?
Only improve docs.

### Does this PR introduce _any_ user-facing change?
Yes, make end-user docs clearer.

### How was this patch tested?
Manual test.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47460 from panbingkun/SPARK-48976.

Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Commit: `881b214`
**[SPARK-48975][PROTOBUF] Remove unnecessary `ScalaReflectionLock` definition from `protobuf`**

### What changes were proposed in this pull request?
This PR removes the unused object definition `ScalaReflectionLock` from the `protobuf` module. `ScalaReflectionLock` is a definition with `protobuf`-package access scope, which was defined in SPARK-40654 | #37972 and became unused in SPARK-41639 | #39147.

### Why are the changes needed?
Clean up unused definitions.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47459 from LuciferYang/remove-ScalaReflectionLock.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Commit: `f828146`
**[SPARK-48931][SS] Reduce Cloud Store List API cost for state store maintenance task**

### What changes were proposed in this pull request?
Currently, during the state store maintenance process, we find which old version files of the **RocksDB** state store to delete by listing all existing snapshotted version files in the checkpoint directory, every 1 minute by default. The frequent list calls in the cloud can result in high costs. To address this concern and reduce the cost associated with state store maintenance, we should aim to minimize the frequency of listing object stores inside the maintenance task. To minimize the frequency, we will try to accumulate versions to delete and only call list when the number of versions to delete reaches a configured threshold.

The changes include:
1. Adding a new configuration variable `ratioExtraVersionsAllowedInCheckpoint` in **SQLConf**. This determines the ratio of extra version files we want to retain in the checkpoint directory compared to the number of versions to retain for rollbacks (`minBatchesToRetain`).
2. Using this config and `minBatchesToRetain`, setting the `minVersionsToDelete` config inside **StateStoreConf** to `minVersionsToDelete = ratioExtraVersionsAllowedInCheckpoint * minBatchesToRetain`.
3. Using the `minSeenVersion` and `maxSeenVersion` variables in **RocksDBFileManager** to estimate the min/max version present in the directory and control deletion frequency. This is done by ensuring the number of stale versions to delete is at least `minVersionsToDelete`.

### Why are the changes needed?
Currently, maintenance operations like snapshotting, purging, deletion, and file management are done asynchronously for each data partition. We want to shift away from periodic deletion and instead rely on the estimated number of files in the checkpoint directory to reduce list calls and introduce batch deletion.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47393 from riyaverm-db/reduce-cloud-store-list-api-cost-in-maintenance.

Authored-by: Riya Verma <riya.verma@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Commit: `4e20a2a`
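The batching logic described in the steps above — track min/max seen versions, and only pay for an expensive LIST once enough stale versions have accumulated — can be sketched as follows (names and arithmetic are a simplified illustration, not the RocksDBFileManager code):

```python
class MaintenanceSketch:
    """Only issue a LIST + batch delete once at least
    `min_versions_to_delete` stale versions have piled up."""

    def __init__(self, min_versions_to_delete):
        self.min_versions_to_delete = min_versions_to_delete
        self.min_seen_version = 1   # oldest version possibly still on disk
        self.max_seen_version = 1   # newest version written so far
        self.list_calls = 0         # how many cloud LIST calls we paid for

    def record_version(self, v):
        self.max_seen_version = max(self.max_seen_version, v)

    def maybe_delete(self, versions_to_retain):
        oldest_needed = self.max_seen_version - versions_to_retain + 1
        stale = oldest_needed - self.min_seen_version
        if stale < self.min_versions_to_delete:
            return False            # not enough garbage yet: skip the LIST
        self.list_calls += 1        # one LIST, then delete the whole batch
        self.min_seen_version = oldest_needed
        return True

m = MaintenanceSketch(min_versions_to_delete=10)
deletions = []
for v in range(1, 31):              # simulate 30 maintenance rounds
    m.record_version(v)
    deletions.append(m.maybe_delete(versions_to_retain=5))
```

Over 30 versions this pays for only 2 LIST calls, versus one per round with the old listing-every-interval approach.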
**[SPARK-48961][PYTHON] Make the parameter naming of `PySparkException` consistent with JVM**

### What changes were proposed in this pull request?
This PR proposes to make the parameter naming of `PySparkException` consistent with JVM.

### Why are the changes needed?
The parameter names of `PySparkException` are different from `SparkException`, so there is an inconsistency when searching those parameters from error logs.

SparkException: https://github.com/apache/spark/blob/6508b1f5e18731359354af0a7bcc1382bc4f356b/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L27-L33

PySparkException: https://github.com/apache/spark/blob/6508b1f5e18731359354af0a7bcc1382bc4f356b/python/pyspark/errors/exceptions/base.py#L29-L40

### Does this PR introduce _any_ user-facing change?
The error parameter names are changed:
- `error_class` -> `errorClass`
- `message_parameters` -> `messageParameters`
- `query_contexts` -> `context`

### How was this patch tested?
The existing CI should pass.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47436 from itholic/SPARK-48961.

Authored-by: Haejoon Lee <haejoon.lee@databricks.com>
Signed-off-by: Haejoon Lee <haejoon.lee@databricks.com>
Commit: `090ad9f`
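The point of the rename is log-searchability: the same field name appears in both JVM and Python stack traces. A toy sketch of an exception carrying the JVM-style names (illustrative only, not the real `PySparkException`):

```python
class PySparkExceptionSketch(Exception):
    """Toy exception whose parameter names mirror the JVM SparkException."""

    def __init__(self, errorClass=None, messageParameters=None, context=None):
        self.errorClass = errorClass
        self.messageParameters = messageParameters or {}
        self.context = context or []
        super().__init__(f"[{errorClass}] {self.messageParameters}")

e = PySparkExceptionSketch(
    errorClass="COLUMN_NOT_FOUND",
    messageParameters={"colName": "age"},
)
```

Grepping logs for `errorClass` now matches errors raised on either side of the py4j boundary.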
**[SPARK-48935][SQL][TESTS] Make `checkEvaluation` directly check the `Collation` expression itself in UT**

### What changes were proposed in this pull request?
The pr aims to:
- make `checkEvaluation` directly check the `Collation` expression itself in UT, rather than `Collation(...).replacement`.
- fix a missed check in UT.

### Why are the changes needed?
When checking a `RuntimeReplaceable` expression in UT, there is no need to write `checkEvaluation(Collation(Literal("abc")).replacement, "UTF8_BINARY")`, because it has already undergone a similar replacement internally: https://github.com/apache/spark/blob/1a428c1606645057ef94ac8a6cadbb947b9208a6/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala#L75

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Update existed UT.
- Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47401 from panbingkun/SPARK-48935.

Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: `4de4ed1`
**[SPARK-48338][SQL] Check variable declarations**

### What changes were proposed in this pull request?
Checking whether variable declarations appear only at the beginning of the BEGIN END block.

### Why are the changes needed?
The SQL standard states that variables can be declared only immediately after BEGIN.

### Does this PR introduce _any_ user-facing change?
Users will get an error if they try to declare a variable in a scope that is not started with BEGIN and ended with END, or if the declarations are not immediately after BEGIN.

### How was this patch tested?
Tests are in SqlScriptingParserSuite. There are 2 tests for now: one where declarations are correctly written, and one where declarations are not immediately after BEGIN. There is a TODO to write the test where a declaration is located in a scope that is not BEGIN END.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47404 from momcilomrk-db/check_variable_declarations.

Authored-by: Momcilo Mrkaic <momcilo.mrkaic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: `239d77b`
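The rule being enforced — DECLAREs only in the prefix of a BEGIN ... END block, before any other statement — reduces to a single scan. A sketch over a list-of-statements model (illustrative; the real check lives in the ANTLR-based parser):

```python
def check_declarations(statements):
    """Raise if a DECLARE appears after any non-DECLARE statement,
    i.e. not immediately after BEGIN."""
    seen_non_declare = False
    for stmt in statements:
        if stmt.upper().startswith("DECLARE"):
            if seen_non_declare:
                raise ValueError("variable declarations must come "
                                 "immediately after BEGIN")
        else:
            seen_non_declare = True

# Valid: all DECLAREs lead the block.
check_declarations(["DECLARE x INT", "DECLARE y INT", "SET x = 1"])

# Invalid: a DECLARE follows another statement.
try:
    check_declarations(["SET x = 1", "DECLARE y INT"])
    rejected = False
except ValueError:
    rejected = True
```

The first pattern parses cleanly; the second is the case the new tests expect to fail.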
**[SPARK-48567][SS][FOLLOWUP] StreamingQuery.lastProgress should return the actual StreamingQueryProgress**

This reverts commit d067fc6, which reverted 042804a, essentially bringing it back. 042804a failed the 3.5 client <> 4.0 server test, but that test was turned off for cross-version testing in #47468.

### What changes were proposed in this pull request?
This PR is created after discussion in this closed one: #46886. I was trying to fix a bug (in connect, query.lastProgress doesn't have `numInputRows`, `inputRowsPerSecond`, and `processedRowsPerSecond`), and we reached the conclusion that what is proposed in this PR should be the ultimate fix.

In python, for both classic spark and spark connect, the return type of `lastProgress` is `Dict` (and `recentProgress` is `List[Dict]`), but in scala it's the actual `StreamingQueryProgress` object: https://github.com/apache/spark/blob/1a5d22aa2ffe769435be4aa6102ef961c55b9593/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQuery.scala#L94-L101

This API discrepancy brings some confusion: in Scala, users can do `query.lastProgress.batchId`, while in Python they have to do `query.lastProgress["batchId"]`. This PR makes `StreamingQuery.lastProgress` return the actual `StreamingQueryProgress` (and `StreamingQuery.recentProgress` return `List[StreamingQueryProgress]`). To prevent a breaking change, we extend `StreamingQueryProgress` to be a subclass of `dict`, so existing code accessing it with dictionary methods (e.g. `query.lastProgress["id"]`) is still functional.

### Why are the changes needed?
API parity.

### Does this PR introduce _any_ user-facing change?
Yes, now `StreamingQuery.lastProgress` returns the actual `StreamingQueryProgress` (and `StreamingQuery.recentProgress` returns `List[StreamingQueryProgress]`).

### How was this patch tested?
Added unit test.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47470 from WweiL/bring-back-lastProgress.

Authored-by: Wei Liu <wei.liu@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Commit: `22eb6c4`
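The backward-compatibility trick — attribute access without breaking existing dictionary access — boils down to subclassing `dict`. A minimal sketch (property names borrowed from the progress fields mentioned above; not pyspark's full class):

```python
class ProgressSketch(dict):
    """A dict subclass: p["batchId"] keeps working for old code,
    while p.batchId works like the Scala API."""

    @property
    def batchId(self):
        return self["batchId"]

    @property
    def numInputRows(self):
        return self["numInputRows"]

p = ProgressSketch(batchId=7, numInputRows=120)

print(p["batchId"])      # old dict-style access still works
print(p.batchId)         # new attribute-style access, matching Scala
```

Because the object *is* a dict, `json.dumps(p)`, iteration, and `in` checks in existing user code continue to behave as before.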
**[SPARK-48833][SQL][VARIANT] Support variant in `InMemoryTableScan`**

### What changes were proposed in this pull request?
Adds support for the variant type in `InMemoryTableScan`, or `df.cache()`, by supporting writing variant values to an in-memory buffer.

### Why are the changes needed?
Prior to this PR, calling `df.cache()` on a df that has a variant would fail because `InMemoryTableScan` does not support reading variant types.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UTs.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47252 from richardc-db/variant_dfcache_support.

Authored-by: Richard Chen <r.chen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: `0c9b072`
**[SPARK-48988][ML] Make `DefaultParamsReader/Writer` handle metadata with spark session**

### What changes were proposed in this pull request?
`DefaultParamsReader/Writer` handle metadata with the spark session.

### Why are the changes needed?
In existing ml implementations, when loading/saving a model, it loads/saves the metadata with `SparkContext`, then loads/saves the coefficients with `SparkSession`. This PR aims to also load/save the metadata with `SparkSession`, by introducing new helper functions.

- Note I: 3rd-party libraries (e.g. [xgboost](https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-spark/src/main/scala/org/apache/spark/ml/util/XGBoostReadWrite.scala#L38-L53)) likely depend on the existing implementation of saveMetadata/loadMetadata, so we cannot simply remove them even though they are `private[ml]`.
- Note II: this PR only handles `loadMetadata` and `saveMetadata`; there are similar cases for meta algorithms and param read/write, but I want to leave the remaining part for later, to avoid touching too many files in a single PR.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47467 from zhengruifeng/ml_load_with_spark.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Commit: `8597b78`
**[SPARK-48991][SQL] Move path initialization into try-catch block in FileStreamSink.hasMetadata**

### What changes were proposed in this pull request?
This pull request proposes to move path initialization into the try-catch block in FileStreamSink.hasMetadata. Then, exceptions from invalid paths can be handled consistently, like other path-related exceptions in the current try-catch block. As a result, the errors fall into the correct code branches to be handled.

### Why are the changes needed?
Bugfix for improperly handled exceptions in FileStreamSink.hasMetadata.

### Does this PR introduce _any_ user-facing change?
No, an invalid path is still invalid, but fails in the correct places.

### How was this patch tested?
New test.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47471 from yaooqinn/SPARK-48991.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
Commit: `d68cde8`
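The shape of this fix generalizes: construct the value that can throw *inside* the try block, so its failure lands in the same handler as the related errors. A sketch (toy `parse_path` stand-in for `new Path(...)`; not the Scala code):

```python
def parse_path(s):
    # Stand-in for path construction, which may raise on invalid input.
    if not s or "://" not in s:
        raise ValueError(f"invalid path: {s!r}")
    return s.rstrip("/")

def has_metadata(path_str, exists):
    try:
        path = parse_path(path_str)  # moved inside the try block
        return exists(path + "/_spark_metadata")
    except ValueError:
        # An invalid path is handled like any other path error:
        # "no metadata here", not an unrelated crash in the caller.
        return False

ok = has_metadata("s3://bucket/out", lambda p: p.endswith("/_spark_metadata"))
bad = has_metadata("not a uri", lambda p: True)
```

Before the fix, the equivalent of `parse_path` ran outside the try, so a malformed path escaped as a raw exception instead of taking the "no metadata" branch.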
**[SPARK-48990][SQL] Unified variable related SQL syntax keywords**

### What changes were proposed in this pull request?
The pr aims to unify `variable`-related `SQL syntax` keywords, enabling the syntax `DECLARE (OR REPLACE)? ...` and `DROP TEMPORARY ...` to support the keyword `VAR` (not only `VARIABLE`).

### Why are the changes needed?
When `setting` variables, we support `(VARIABLE | VAR)`, but when `declaring` and `dropping` variables, we only support the keyword `VARIABLE` (not the keyword `VAR`):

https://github.com/apache/spark/blob/285489b0225004e918b6e937f7367e492292815e/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4#L68-L72

https://github.com/apache/spark/blob/285489b0225004e918b6e937f7367e492292815e/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4#L218-L220

The syntax seems a bit weird and gives end-users an inconsistent experience in variable-related SQL syntax, so I propose to unify it.

### Does this PR introduce _any_ user-facing change?
Yes, it enables end-users to use variable-related SQL with consistent keywords.

### How was this patch tested?
Updated existed UT.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47469 from panbingkun/SPARK-48990.

Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: `34e65a8`
**[SPARK-47764][FOLLOW-UP] Change to use ShuffleDriverComponents.removeShuffle to remove shuffle properly**

### What changes were proposed in this pull request?
This is a follow-up for #45930, where we introduced ShuffleCleanupMode and implemented cleaning up of shuffle dependencies. There was a bug where `ShuffleManager.unregisterShuffle` was used on Driver, and in non-local mode it is not effective at all. This change fixed the bug by changing to use `ShuffleDriverComponents.removeShuffle` instead.

### Why are the changes needed?
This is to address the comments in #45930 (comment).

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Updated unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46302 from bozhang2820/spark-47764-1.

Authored-by: Bo Zhang <bo.zhang@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Commit: `5d787e2`
## Commits on Jul 25, 2024
**[SPARK-48943][SQL][TESTS][FOLLOWUP] Fix the `h2` filter push-down test case failure with ANSI mode off**

### What changes were proposed in this pull request?
This PR aims to fix the `h2` filter push-down test case failure with ANSI mode off.

### Why are the changes needed?
Fix test failure.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test of the whole `JDBCV2Suite` with ANSI mode off and on.

1. Method One: with IDEA.
   - ANSI mode off: with `SPARK_ANSI_SQL_MODE=false`
   - ANSI mode on: without the `SPARK_ANSI_SQL_MODE` env variable
2. Method Two: with commands.
   - ANSI mode off
     ```
     SPARK_ANSI_SQL_MODE=false
     $ build/sbt
     > project sql
     > testOnly org.apache.spark.sql.jdbc.JDBCV2Suite
     ```
   - ANSI mode on
     ```
     UNSET SPARK_ANSI_SQL_MODE
     $ build/sbt
     > project sql
     > testOnly org.apache.spark.sql.jdbc.JDBCV2Suite
     ```

Test results:
1. The issue on the current `master` branch:
   - with `SPARK_ANSI_SQL_MODE=false`, the test failed
   - without the `SPARK_ANSI_SQL_MODE` env variable, the test passed
2. Fixed with the new test code:
   - with `SPARK_ANSI_SQL_MODE=false`, the test passed
   - without the `SPARK_ANSI_SQL_MODE` env variable, the test passed

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47472 from wayneguow/fix_h2.

Authored-by: Wei Guo <guow93@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Commit: `8c625ea`
**[SPARK-48761][SQL] Introduce clusterBy DataFrameWriter API for Scala**

### What changes were proposed in this pull request?
Introduce a new `clusterBy` DataFrame API in Scala. This PR adds the API for both the DataFrameWriter V1 and V2, as well as Spark Connect.

### Why are the changes needed?
Introduce more ways for users to interact with clustered tables.

### Does this PR introduce _any_ user-facing change?
Yes, it adds a new `clusterBy` DataFrame API in Scala to allow specifying the clustering columns when writing DataFrames.

### How was this patch tested?
New unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47301 from zedtang/clusterby-scala-api.

Authored-by: Jiaheng Tang <jiaheng.tang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Commit: `bafce5d`
**[SPARK-48844][FOLLOWUP][TESTS] Cleanup duplicated data resource files in hive-thriftserver test**

### What changes were proposed in this pull request?
A follow-up of SPARK-48844 to clean up duplicated data resource files in the hive-thriftserver test.

### Why are the changes needed?
Code refactoring.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New tests.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47480 from yaooqinn/SPARK-48844-F.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
Commit: `5c19505`
**[MINOR][DOCS] Update doc `sql/README.md`**

### What changes were proposed in this pull request?
The pr aims to update the doc `sql/README.md`.

### Why are the changes needed?
After #41426, we have added a subproject `API` to our `SQL` module, so we need to update the doc `sql/README.md` accordingly.

### Does this PR introduce _any_ user-facing change?
Yes, make the doc clearer and more accurate.

### How was this patch tested?
Manual test.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47476 from panbingkun/minor_docs.

Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Commit: `cf95e75`
**[SPARK-48849][SS] Create OperatorStateMetadataV2 for the TransformWithStateExec operator**

### What changes were proposed in this pull request?
Introducing the OperatorStateMetadataV2 format that integrates with the TransformWithStateExec operator. This is used to keep information about the TWS operator and will be used to enforce invariants between query runs. Each OperatorStateMetadataV2 has a pointer to the StateSchemaV3 file for the corresponding operator. Purging will be introduced in this PR: #47286

### Why are the changes needed?
This is needed for State Metadata integration with the TransformWithState operator.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Added unit tests to StateStoreSuite and TransformWithStateSuite.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47445 from ericm-db/metadata-v2.

Authored-by: Eric Marnadi <eric.marnadi@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Commit: `4999469`
**[SPARK-48996][SQL][PYTHON] Allow bare literals for __and__ and __or__ of Column**

### What changes were proposed in this pull request?
Allows bare literals for `__and__` and `__or__` of the Column API in Spark Classic.

### Why are the changes needed?
Currently bare literals are not allowed for `__and__` and `__or__` of the Column API in Spark Classic and need to be wrapped with the `lit()` function. They should be allowed, similar to other operators.

```py
>>> from pyspark.sql.functions import *
>>> c = col("c")
>>> c & True
Traceback (most recent call last):
...
py4j.Py4JException: Method and([class java.lang.Boolean]) does not exist
>>> c & lit(True)
Column<'and(c, true)'>
```

whereas other operators:

```py
>>> c + 1
Column<'`+`(c, 1)'>
>>> c + lit(1)
Column<'`+`(c, 1)'>
```

Spark Connect allows this.

```py
>>> c & True
Column<'and(c, True)'>
>>> c & lit(True)
Column<'and(c, True)'>
```

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
Added the related tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47474 from ueshin/issues/SPARK-48996/literal_and_or.

Authored-by: Takuya Ueshin <ueshin@databricks.com>
Signed-off-by: Takuya Ueshin <ueshin@databricks.com>
Commit: `78b83fa`
Commits on Jul 26, 2024
[SPARK-48503][SQL] Allow grouping on expressions in scalar subqueries, if they are bound to outer rows

### What changes were proposed in this pull request?
Extends previous work in #46839, allowing the grouping expressions to be bound to outer references. The most common example is `select *, (select count(*) from T_inner where cast(T_inner.x as date) = T_outer.date group by cast(T_inner.x as date))`. Here, we group by `cast(T_inner.x as date)`, which is bound to an outer row. This guarantees that for every outer row there is exactly one value of `cast(T_inner.x as date)`, so it is safe to group on it. Previously, we required that only columns can be bound to outer expressions, thus forbidding such subqueries.

### Why are the changes needed?
Extends the set of supported subqueries.

### Does this PR introduce _any_ user-facing change?
Yes, previously failing queries now pass.

### How was this patch tested?
Query tests.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47388 from agubichev/group_by_cols.

Authored-by: Andrey Gubichev <andrey.gubichev@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
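The safety argument above can be checked with a plain-Python model of the query: because the WHERE clause pins the grouping expression to the outer value, the correlated subquery can produce at most one group per outer row, and it agrees with the decorrelated "group once, then look up" evaluation. The data and helper names below are hypothetical, chosen only to mirror the example query:

```python
from collections import Counter
from datetime import date

# Hypothetical tables: T_inner.x is a timestamp string, T_outer is a list of dates.
t_inner = [("2024-07-01 10:00", 1), ("2024-07-01 12:00", 2), ("2024-07-02 09:00", 3)]
t_outer = [date(2024, 7, 1), date(2024, 7, 2), date(2024, 7, 3)]

def to_date(ts: str) -> date:
    # Stands in for cast(T_inner.x as date).
    return date.fromisoformat(ts.split(" ")[0])

def scalar_subquery(outer_date: date):
    """Correlated form: WHERE cast(x as date) = outer_date, GROUP BY cast(x as date)."""
    groups = {}
    for ts, _ in t_inner:
        if to_date(ts) == outer_date:
            groups.setdefault(to_date(ts), []).append(ts)
    assert len(groups) <= 1  # the invariant that makes grouping on the expression safe
    # An empty group produces no row, i.e. NULL in SQL — modeled here as None.
    return len(groups[outer_date]) if groups else None

# Decorrelated form: group the inner table once, then look up per outer row.
counts = Counter(to_date(ts) for ts, _ in t_inner)
results = {d: scalar_subquery(d) for d in t_outer}
```

Both evaluation strategies give the same per-outer-row counts, including the NULL for the outer date with no matching inner rows.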
Commit: f3b819e
[SPARK-45787][SQL] Support Catalog.listColumns for clustering columns
### What changes were proposed in this pull request?
Support the listColumns API for clustering columns.

### Why are the changes needed?
Clustering columns should be supported by the listColumns API, just like partition and bucket columns.

### Does this PR introduce _any_ user-facing change?
Yes, listColumns will now show an additional field `isCluster` to indicate whether the column is a clustering column.

Old output for `spark.catalog.listColumns`:

```
+----+-----------+--------+--------+-----------+--------+
|name|description|dataType|nullable|isPartition|isBucket|
+----+-----------+--------+--------+-----------+--------+
|   a|       null|     int|    true|      false|   false|
|   b|       null|  string|    true|      false|   false|
|   c|       null|     int|    true|      false|   false|
|   d|       null|  string|    true|      false|   false|
+----+-----------+--------+--------+-----------+--------+
```

New output:

```
+----+-----------+--------+--------+-----------+--------+---------+
|name|description|dataType|nullable|isPartition|isBucket|isCluster|
+----+-----------+--------+--------+-----------+--------+---------+
|   a|       null|     int|    true|      false|   false|    false|
|   b|       null|  string|    true|      false|   false|    false|
|   c|       null|     int|    true|      false|   false|    false|
|   d|       null|  string|    true|      false|   false|    false|
+----+-----------+--------+--------+-----------+--------+---------+
```

### How was this patch tested?
New unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47451 from zedtang/list-clustering-columns.

Authored-by: Jiaheng Tang <jiaheng.tang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
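From the caller's point of view, each row returned by listColumns simply gains one boolean field. A minimal stand-in for that row shape (an assumption for illustration, not Spark's actual `pyspark.sql.catalog.Column` class, and with a hypothetical table where column `a` is a clustering column):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class CatalogColumn:
    """Illustrative stand-in for a row returned by Catalog.listColumns."""
    name: str
    description: Optional[str]
    dataType: str
    nullable: bool
    isPartition: bool
    isBucket: bool
    isCluster: bool  # the new field described above

# Hypothetical result for a table clustered by column "a".
cols = [
    CatalogColumn("a", None, "int", True, False, False, True),
    CatalogColumn("b", None, "string", True, False, False, False),
]

# Callers can now filter for clustering columns the same way they
# already filter for partition or bucket columns.
clustering = [c.name for c in cols if c.isCluster]
```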
Commit: e73ede7
[SPARK-49007][CORE] Improve `MasterPage` to support custom title

### What changes were proposed in this pull request?
This PR aims to improve `MasterPage` to support a custom title.

### Why are the changes needed?
When there exist multiple Spark clusters, a custom title can be more helpful than the Spark master address because it can carry semantics like the role of the cluster. In addition, the URL field on the same page already provides the Spark master information even when a custom title is used.

**BEFORE**

```
sbin/start-master.sh
```

![Screenshot 2024-07-25 at 14 01 11](https://github.com/user-attachments/assets/7055d700-4bd6-4785-a535-2f8ce6dba47d)

**AFTER**

```
SPARK_MASTER_OPTS='-Dspark.master.ui.title="Project X Staging Cluster"' sbin/start-master.sh
```

![Screenshot 2024-07-25 at 14 05 38](https://github.com/user-attachments/assets/f7e45fd6-fa2b-4547-ae39-1403b1e910d9)

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Pass the CIs with the newly added test case.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47491 from dongjoon-hyun/SPARK-49007.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Commit: 2363aec