forked from apache/spark
Create a new pull request by comparing changes across two branches #1516
Merged
…lace for pandas 2.0.0

### What changes were proposed in this pull request?
The PR aims to enable `SeriesStringTests.test_string_replace` for pandas 2.0.0.

### Why are the changes needed?
Improve UT coverage.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Pass GA.
- Manual test:

```
(base) panbingkun:~/Developer/spark/spark-community$ python/run-tests --testnames 'pyspark.pandas.tests.test_series_string SeriesStringTests.test_string_replace'
Running PySpark tests. Output is in /Users/panbingkun/Developer/spark/spark-community/python/unit-tests.log
Will test against the following Python executables: ['python3.9']
Will test the following Python tests: ['pyspark.pandas.tests.test_series_string SeriesStringTests.test_string_replace']
python3.9 python_implementation is CPython
python3.9 version is: Python 3.9.13
Starting test(python3.9): pyspark.pandas.tests.test_series_string SeriesStringTests.test_string_replace (temp output: /Users/panbingkun/Developer/spark/spark-community/python/target/d51a913a-b400-4d1b-adb3-97765bb463bd/python3.9__pyspark.pandas.tests.test_series_string_SeriesStringTests.test_string_replace__izk1fx8o.log)
Finished test(python3.9): pyspark.pandas.tests.test_series_string SeriesStringTests.test_string_replace (13s)
Tests passed in 13 seconds
```

Closes #41823 from panbingkun/SPARK-43476.

Authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…GroupedData

### What changes were proposed in this pull request?
Be more explicit in the `Callable` type annotation for `dfapi` and `df_varargs_api` so that they explicitly return a `DataFrame`.

### Why are the changes needed?
In PySpark 3.3.x, type hints infer the return value of something like `df.groupBy(...).count()` to be `Any`, whereas it should be `DataFrame`. This breaks type checking.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
No runtime changes were introduced, so this relies on the existing CI tests.

Closes #40460 from j03wang/grouped-data-type.

Authored-by: Joe Wang <joe.wang@afreshtechnologies.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
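The pattern behind this fix can be sketched in plain Python. The sketch below is illustrative only: `DataFrame` is a stand-in class and `dfapi_untyped`/`dfapi_typed` are hypothetical names, not PySpark's actual internals. It shows how widening a factory's return annotation from `Callable[..., Any]` to `Callable[..., DataFrame]` lets a type checker see the concrete return type of the generated method.

```python
from typing import Any, Callable


class DataFrame:
    """Stand-in for pyspark.sql.DataFrame (illustrative only)."""


# Before: the factory erases the return type, so a type checker sees
# the generated method (e.g. grouped.count()) as returning Any.
def dfapi_untyped(name: str) -> Callable[..., Any]:
    def _api(*args: Any) -> DataFrame:
        return DataFrame()
    return _api


# After: the Callable annotation states the DataFrame return explicitly,
# so downstream code keeps full type information.
def dfapi_typed(name: str) -> Callable[..., DataFrame]:
    def _api(*args: Any) -> DataFrame:
        return DataFrame()
    return _api
```

At runtime both behave identically; the change only affects what static checkers such as mypy can infer.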
…uctsToCsv)

### What changes were proposed in this pull request?
This PR enhances the `StructsToCsv` class with a `doGenCode` function instead of extending it from the `CodegenFallback` trait (a performance improvement).

### Why are the changes needed?
It improves performance.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
An additional test case was added to the `org.apache.spark.sql.CsvFunctionsSuite` class.

Closes #39719 from NarekDW/SPARK-42169.

Authored-by: narek_karapetian <narek.karapetian93@yandex.ru>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
…Function

### What changes were proposed in this pull request?
Adds a new SQL syntax for `TableValuedFunction`. The syntax supports passing such relations in one of two ways:

1. `SELECT ... FROM tvf_call(TABLE t)`
2. `SELECT ... FROM tvf_call(TABLE (<query>))`

In the former case, the relation argument directly refers to the name of a table in the catalog. In the latter case, the relation argument comprises a table subquery that may itself refer to one or more tables in its own FROM clause.

For example, given the following user-defined table function:

```py
@udtf(returnType="a: int")
class TestUDTF:
    def eval(self, row: Row):
        if row[0] > 5:
            yield row[0],

spark.udtf.register("test_udtf", TestUDTF)
spark.sql("CREATE OR REPLACE TEMPORARY VIEW v as SELECT id FROM range(0, 8)")
```

the following SQL statements should work:

```py
>>> spark.sql("SELECT * FROM test_udtf(TABLE v)").collect()
[Row(a=6), Row(a=7)]
```

or

```py
>>> spark.sql("SELECT * FROM test_udtf(TABLE (SELECT id + 1 FROM v))").collect()
[Row(a=6), Row(a=7), Row(a=8)]
```

### Why are the changes needed?
To support the `TABLE` argument parser rule for `TableValuedFunction`.

### Does this PR introduce _any_ user-facing change?
Yes, new SQL syntax.

### How was this patch tested?
Added the related tests.

Closes #41750 from ueshin/issues/SPARK-44200/table_argument.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…E_ARGUMENTS` error into doc

### What changes were proposed in this pull request?
This is a follow-up PR for #41750: after #41813 we added a test that keeps the docs and `error-classes.json` in sync, so `TABLE_VALUED_FUNCTION_TOO_MANY_TABLE_ARGUMENTS` (added in #41750) should be added to the doc.

### Why are the changes needed?
Keep the error classes and documentation in sync.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #41827 from Hisoka-X/SPARK-44200_follow_up_error_json_doc.

Authored-by: Jia Fan <fanjiaeminem@qq.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The `CacheManager.refreshFileIndexIfNecessary` logic checks whether the fileIndex root paths start with the input path. This is problematic when the input path and a root path share a prefix but the root path is not a subdirectory of the input path. In such cases, the `CacheManager` can unnecessarily refresh the fileIndex, which can fail the query if the SparkSession does not have access to that root path.

### Why are the changes needed?
Fixes a bug where queries on cached DataFrame APIs can fail if the cached path shares a prefix with a different path.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit test.

Closes #41749 from vihangk1/master_cachemanager.

Lead-authored-by: Vihang Karajgaonkar <vihang.karajgaonkar@databricks.com>
Co-authored-by: Vihang Karajgaonkar <vihangk1@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
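The failure mode described above is the classic string-prefix-versus-path-component pitfall. The sketch below is a minimal illustration (the function names and paths are hypothetical, not Spark's code): a raw `startswith` check wrongly treats a sibling directory such as `/data/table_v2` as a child of `/data/table`, while comparing whole path components does not.

```python
from pathlib import PurePosixPath


def is_subpath_naive(root: str, parent: str) -> bool:
    # Buggy: a plain string prefix check confuses siblings for children.
    return root.startswith(parent)


def is_subpath(root: str, parent: str) -> bool:
    # Correct: compare whole path components, not raw characters.
    root_parts = PurePosixPath(root).parts
    parent_parts = PurePosixPath(parent).parts
    return root_parts[:len(parent_parts)] == parent_parts


# "/data/table_v2" merely shares a character prefix with "/data/table";
# it is not a subdirectory of it.
print(is_subpath_naive("/data/table_v2", "/data/table"))  # True (wrong)
print(is_subpath("/data/table_v2", "/data/table"))        # False (correct)
print(is_subpath("/data/table/part=0", "/data/table"))    # True (correct)
```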
…return type in Arrow Python UDF

### What changes were proposed in this pull request?
Explicit Arrow casting for the mismatched return type of Arrow Python UDF.

### Why are the changes needed?
A more standardized and coherent type coercion. Please refer to #41706 for a comprehensive comparison between the type coercion rules of Arrow and Pickle (used by the default Python UDF). See more at [[Design] Type-coercion in Arrow Python UDFs](https://docs.google.com/document/d/e/2PACX-1vTEGElOZfhl9NfgbBw4CTrlm-8F_xQCAKNOXouz-7mg5vYobS7lCGUsGkDZxPY0wV5YkgoZmkYlxccU/pub).

### Does this PR introduce _any_ user-facing change?
Yes.

FROM
```py
>>> df = spark.createDataFrame(['1', '2'], schema='string')
>>> df.select(pandas_udf(lambda x: x, 'int')('value')).show()
...
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
...
pyarrow.lib.ArrowInvalid: Could not convert '1' with type str: tried to convert to int32
```

TO
```py
>>> df = spark.createDataFrame(['1', '2'], schema='string')
>>> df.select(pandas_udf(lambda x: x, 'int')('value')).show()
+---------------+
|<lambda>(value)|
+---------------+
|              1|
|              2|
+---------------+
```

### How was this patch tested?
Unit tests.

Closes #41800 from xinrong-meng/snd_type_coersion.

Authored-by: Xinrong Meng <xinrong@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Move `Util.truncatedString` to `sql/api`.

### Why are the changes needed?
Make `StructType` depend less on Catalyst, moving toward a simpler `DataType` interface.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #41811 from amaliujia/move_out_truncatedString.

Authored-by: Rui Wang <rui.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
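For context, the utility being moved formats long field sequences with a cap so that wide schemas do not flood plan output. The sketch below is a pure-Python approximation for illustration only; the real helper is Scala code inside Spark and its exact output format may differ.

```python
def truncated_string(items, start, sep, end, max_fields):
    """Join items, eliding the tail once max_fields is exceeded.

    Rough approximation of Spark's truncatedString utility
    (illustrative; not the actual Scala implementation).
    """
    if len(items) > max_fields:
        shown = list(items[:max_fields - 1])
        rest = len(items) - len(shown)
        shown.append(f"... {rest} more fields")
        return start + sep.join(shown) + end
    return start + sep.join(items) + end


# A wide struct is abbreviated instead of printed in full:
print(truncated_string(["a", "b", "c", "d"], "struct<", ",", ">", 3))
# struct<a,b,... 2 more fields>
print(truncated_string(["a", "b"], "struct<", ",", ">", 3))
# struct<a,b>
```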
…r to common/utils

### What changes were proposed in this pull request?
Move the util functions used by `ArtifactManager` to `common/utils`. Specifically, move `resolveURI` and `awaitResult` to `common/utils`.

### Why are the changes needed?
So that the Spark Connect Scala client does not need to depend on Spark.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #41825 from amaliujia/SPARK-44273.

Authored-by: Rui Wang <rui.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to add:

- `SparkContext.setInterruptOnCancel(interruptOnCancel: Boolean): Unit`
- `SparkContext.addJobTag(tag: String): Unit`
- `SparkContext.removeJobTag(tag: String): Unit`
- `SparkContext.getJobTags(): Set[String]`
- `SparkContext.clearJobTags(): Unit`
- `SparkContext.cancelJobsWithTag(tag: String): Unit`

into PySpark. See also SPARK-43952.

### Why are the changes needed?
For PySpark users, and feature parity.

### Does this PR introduce _any_ user-facing change?
Yes, it adds new APIs in PySpark.

### How was this patch tested?
Unit tests were added.

Closes #41841 from HyukjinKwon/SPARK-44194.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
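The semantics these APIs imply can be modeled in a few lines of plain Python: a per-context set of active tags, with each job inheriting the tags in effect when it is submitted, and tag-based cancellation matching on those inherited tags. The `JobTagRegistry` class below is purely an illustrative model, not PySpark's implementation.

```python
class JobTagRegistry:
    """Illustrative model of job-tag bookkeeping (addJobTag, removeJobTag,
    getJobTags, clearJobTags, cancelJobsWithTag); not PySpark code."""

    def __init__(self):
        self._tags = set()
        self._jobs = {}  # job_id -> frozenset of tags active at submit time

    def add_job_tag(self, tag):
        self._tags.add(tag)

    def remove_job_tag(self, tag):
        self._tags.discard(tag)

    def get_job_tags(self):
        return set(self._tags)

    def clear_job_tags(self):
        self._tags.clear()

    def submit_job(self, job_id):
        # A job inherits whatever tags are set at submission time.
        self._jobs[job_id] = frozenset(self._tags)

    def cancel_jobs_with_tag(self, tag):
        cancelled = [j for j, tags in self._jobs.items() if tag in tags]
        for j in cancelled:
            del self._jobs[j]
        return cancelled


reg = JobTagRegistry()
reg.add_job_tag("etl")
reg.submit_job(1)           # job 1 carries the "etl" tag
reg.clear_job_tags()
reg.submit_job(2)           # job 2 carries no tags
print(reg.cancel_jobs_with_tag("etl"))  # [1]
```

Note that clearing the tags after job 1 is submitted does not detach job 1 from its tag: cancellation matches the tags a job was submitted with.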
…ptions in RocksDB state store provider

### What changes were proposed in this pull request?
Set the column family options before passing them to `DBOptions` in the RocksDB state store provider.

### Why are the changes needed?
Bug fix to ensure that column family options around memory usage are passed correctly to `dbOptions`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #41840 from anishshri-db/task/SPARK-44288.

Authored-by: Anish Shrigondekar <anish.shrigondekar@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
github-actions bot added the `CORE`, `SQL`, `PYTHON`, `STRUCTURED STREAMING`, `DOCS`, `PANDAS API ON SPARK`, and `CONNECT` labels on Jul 4, 2023.
…_[2310-2314]

### What changes were proposed in this pull request?
The PR aims to assign names to the error class `_LEGACY_ERROR_TEMP_[2310-2314]`.

### Why are the changes needed?
Improve the error framework.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing test cases were updated and new test cases were added.

Closes #41816 from beliefer/SPARK-44269.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Implement the classification evaluators.

### Why are the changes needed?
Distributed ML <> Spark Connect project.

### Does this PR introduce _any_ user-facing change?
Yes. `BinaryClassificationEvaluator` and `MulticlassClassificationEvaluator` are added.

### How was this patch tested?

Closes #41793 from WeichenXu123/classification-evaluator.

Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>