
Create a new pull request by comparing changes across two branches #1516

Merged
merged 13 commits into from
Jul 4, 2023

Conversation

GulajavaMinistudio

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

panbingkun and others added 11 commits July 3, 2023 15:30
…lace for pandas 2.0.0

### What changes were proposed in this pull request?
The pr aims to enable SeriesStringTests.test_string_replace for pandas 2.0.0.
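For context, a minimal sketch (not part of the PR) of the `Series.str.replace` behavior the test exercises; passing `regex` explicitly keeps behavior stable across pandas versions, since the default changed in pandas 2.0:

```python
import pandas as pd

# Literal (non-regex) replacement on a string Series.
s = pd.Series(["a_b", "c_d"])
assert s.str.replace("_", "-", regex=False).tolist() == ["a-b", "c-d"]
```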

### Why are the changes needed?
Improve UT coverage.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Pass GA.
- Manually test:
```
(base) panbingkun:~/Developer/spark/spark-community$ python/run-tests --testnames 'pyspark.pandas.tests.test_series_string SeriesStringTests.test_string_replace'
Running PySpark tests. Output is in /Users/panbingkun/Developer/spark/spark-community/python/unit-tests.log
Will test against the following Python executables: ['python3.9']
Will test the following Python tests: ['pyspark.pandas.tests.test_series_string SeriesStringTests.test_string_replace']
python3.9 python_implementation is CPython
python3.9 version is: Python 3.9.13
Starting test(python3.9): pyspark.pandas.tests.test_series_string SeriesStringTests.test_string_replace (temp output: /Users/panbingkun/Developer/spark/spark-community/python/target/d51a913a-b400-4d1b-adb3-97765bb463bd/python3.9__pyspark.pandas.tests.test_series_string_SeriesStringTests.test_string_replace__izk1fx8o.log)
Finished test(python3.9): pyspark.pandas.tests.test_series_string SeriesStringTests.test_string_replace (13s)
Tests passed in 13 seconds
```

Closes #41823 from panbingkun/SPARK-43476.

Authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…GroupedData

### What changes were proposed in this pull request?

Be more explicit in the `Callable` type annotation for `dfapi` and `df_varargs_api` to explicitly return a `DataFrame`.

### Why are the changes needed?

In PySpark 3.3.x, type hints now infer the return value of something like `df.groupBy(...).count()` to be `Any`, whereas it should be `DataFrame`. This breaks type checking.
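A minimal sketch (hypothetical names, not the actual PySpark internals) of why the annotation matters to type checkers:

```python
from typing import Callable

class DataFrame:
    """Stand-in for pyspark.sql.DataFrame."""

# Before: the wrapper's return type is a bare Callable, so a checker
# infers the result of the generated method (e.g. count()) as Any.
def dfapi_before(f: Callable) -> Callable:
    return f

# After: the wrapper is annotated to return Callable[..., DataFrame],
# so df.groupBy(...).count() is inferred as DataFrame.
def dfapi_after(f: Callable[..., DataFrame]) -> Callable[..., DataFrame]:
    return f

make_df = dfapi_after(lambda: DataFrame())
assert isinstance(make_df(), DataFrame)
```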

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

No runtime changes were introduced, so this relies on existing CI tests.

Closes #40460 from j03wang/grouped-data-type.

Authored-by: Joe Wang <joe.wang@afreshtechnologies.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…uctsToCsv)

### What changes were proposed in this pull request?
This PR implements `doGenCode` in the `StructsToCsv` class instead of extending the `CodegenFallback` trait, which improves performance.

### Why are the changes needed?
It will improve performance.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
An additional test case was added to the `org.apache.spark.sql.CsvFunctionsSuite` class.

Closes #39719 from NarekDW/SPARK-42169.

Authored-by: narek_karapetian <narek.karapetian93@yandex.ru>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
…Function

### What changes were proposed in this pull request?

Adds a new SQL syntax for `TableValuedFunction`.

The syntax supports passing table relations as arguments in one of two ways:

1. `SELECT ... FROM tvf_call(TABLE t)`
2. `SELECT ... FROM tvf_call(TABLE (<query>))`

In the former case, the relation argument directly refers to the name of a table in the catalog. In the latter case, the relation argument comprises a table subquery that may itself refer to one or more tables in its own FROM clause.

For example, given the following user-defined table function:

```py
@udtf(returnType="a: int")
class TestUDTF:
    def eval(self, row: Row):
        if row[0] > 5:
            yield row[0],

spark.udtf.register("test_udtf", TestUDTF)

spark.sql("CREATE OR REPLACE TEMPORARY VIEW v as SELECT id FROM range(0, 8)")
```

the following SQL statements work:

```py
>>> spark.sql("SELECT * FROM test_udtf(TABLE v)").collect()
[Row(a=6), Row(a=7)]
```

or

```py
>>> spark.sql("SELECT * FROM test_udtf(TABLE (SELECT id + 1 FROM v))").collect()
[Row(a=6), Row(a=7), Row(a=8)]
```

### Why are the changes needed?

To support `TABLE` argument parser rule for TableValuedFunction.

### Does this PR introduce _any_ user-facing change?

Yes, new syntax for SQL.

### How was this patch tested?

Added the related tests.

Closes #41750 from ueshin/issues/SPARK-44200/table_argument.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…E_ARGUMENTS` error into doc

### What changes were proposed in this pull request?
This is a follow-up PR for #41750. After #41813, a test was added to keep the docs and `error-classes.json` in sync, so the `TABLE_VALUED_FUNCTION_TOO_MANY_TABLE_ARGUMENTS` error class (added in #41750) must also be added to the docs.

### Why are the changes needed?
To keep the error classes and documentation in sync.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests.

Closes #41827 from Hisoka-X/SPARK-44200_follow_up_error_json_doc.

Authored-by: Jia Fan <fanjiaeminem@qq.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The `CacheManager.refreshFileIndexIfNecessary` logic checks whether the fileIndex root paths start with the input path. This is problematic when the input path and a root path share a prefix but the root path is not actually a subdirectory of the input path. In such cases, the `CacheManager` can unnecessarily refresh the fileIndex, which can fail the query if the SparkSession does not have access to that root path.
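A small illustration (not the PR's actual code) of why a plain string-prefix check is wrong for paths, and how comparing path components avoids the false positive:

```python
import os

cached_root = "/data/warehouse/table1_backup"
refreshed = "/data/warehouse/table1"

# The naive check matches a sibling directory that merely shares a prefix:
assert cached_root.startswith(refreshed)

# Comparing path components avoids the false positive:
def is_sub_path(child: str, parent: str) -> bool:
    child_parts = os.path.normpath(child).split(os.sep)
    parent_parts = os.path.normpath(parent).split(os.sep)
    return child_parts[:len(parent_parts)] == parent_parts

assert not is_sub_path(cached_root, refreshed)
assert is_sub_path("/data/warehouse/table1/part=0", refreshed)
```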

### Why are the changes needed?
Fixes a bug where queries on cached DataFrames can fail when the cached path shares a prefix with a different path.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
Unit test

Closes #41749 from vihangk1/master_cachemanager.

Lead-authored-by: Vihang Karajgaonkar <vihang.karajgaonkar@databricks.com>
Co-authored-by: Vihang Karajgaonkar <vihangk1@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…return type in Arrow Python UDF

### What changes were proposed in this pull request?
Explicit Arrow casting for the mismatched return type of Arrow Python UDF.

### Why are the changes needed?
A more standardized and coherent type coercion.

Please refer to #41706 for a comprehensive comparison between the type coercion rules of Arrow and Pickle (used by the default Python UDF).

See more at [[Design] Type-coercion in Arrow Python UDFs](https://docs.google.com/document/d/e/2PACX-1vTEGElOZfhl9NfgbBw4CTrlm-8F_xQCAKNOXouz-7mg5vYobS7lCGUsGkDZxPY0wV5YkgoZmkYlxccU/pub).
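A pure-Python sketch of the idea (the real change uses Arrow's cast kernels, not this hypothetical helper): coerce UDF results to the declared return type instead of failing on a mismatch:

```python
def coerce_results(values, target_type):
    # Hypothetical helper: cast each non-null result to the declared
    # return type, mirroring what an explicit Arrow cast would do.
    return [target_type(v) if v is not None else None for v in values]

# A str-returning UDF declared with an 'int' return type now yields ints:
assert coerce_results(["1", "2", None], int) == [1, 2, None]
```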

### Does this PR introduce _any_ user-facing change?
Yes.

FROM
```py
>>> df = spark.createDataFrame(['1', '2'], schema='string')
>>> df.select(pandas_udf(lambda x: x, 'int')('value')).show()
...
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
...
pyarrow.lib.ArrowInvalid: Could not convert '1' with type str: tried to convert to int32
```

TO
```py
>>> df = spark.createDataFrame(['1', '2'], schema='string')
>>> df.select(pandas_udf(lambda x: x, 'int')('value')).show()
+---------------+
|<lambda>(value)|
+---------------+
|              1|
|              2|
+---------------+
```
### How was this patch tested?
Unit tests.

Closes #41800 from xinrong-meng/snd_type_coersion.

Authored-by: Xinrong Meng <xinrong@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?

Move Util.truncatedString to sql/api.

### Why are the changes needed?

Make `StructType` depend less on Catalyst, moving toward a simpler `DataType` interface.
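For reference, a hedged Python approximation of what a `truncatedString`-style helper does (the real utility is Scala code being moved to `sql/api`):

```python
def truncated_string(fields, sep, max_fields):
    # Show at most max_fields entries; summarize the rest to keep
    # schema strings of wide structs readable.
    if len(fields) <= max_fields:
        return sep.join(fields)
    shown = fields[:max_fields - 1]
    return sep.join(shown + [f"... {len(fields) - len(shown)} more fields"])

assert truncated_string(["a", "b", "c", "d"], ", ", 3) == "a, b, ... 2 more fields"
```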

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing test

Closes #41811 from amaliujia/move_out_truncatedString.

Authored-by: Rui Wang <rui.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…r to common/utils

### What changes were proposed in this pull request?

Move out util functions used by ArtifactManager to `common/utils`. More specifically, move `resolveURI` and `awaitResult` to `common/utils`.

### Why are the changes needed?

So that Spark Connect Scala client does not need to depend on Spark.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing test

Closes #41825 from amaliujia/SPARK-44273.

Authored-by: Rui Wang <rui.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?

This PR proposes to add:

- `SparkContext.setInterruptOnCancel(interruptOnCancel: Boolean): Unit`
- `SparkContext.addJobTag(tag: String): Unit`
- `SparkContext.removeJobTag(tag: String): Unit`
- `SparkContext.getJobTags(): Set[String]`
- `SparkContext.clearJobTags(): Unit`
- `SparkContext.cancelJobsWithTag(tag: String): Unit`

into PySpark.

See also SPARK-43952.
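A minimal pure-Python model (illustrative only, not Spark code) of the tag-set semantics these APIs expose:

```python
class TaggedContext:
    """Toy stand-in for SparkContext's per-context job-tag bookkeeping."""
    def __init__(self):
        self._tags = set()
    def addJobTag(self, tag):
        self._tags.add(tag)
    def removeJobTag(self, tag):
        self._tags.discard(tag)
    def getJobTags(self):
        return set(self._tags)
    def clearJobTags(self):
        self._tags.clear()

ctx = TaggedContext()
ctx.addJobTag("etl")
ctx.addJobTag("nightly")
ctx.removeJobTag("etl")
assert ctx.getJobTags() == {"nightly"}
```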

### Why are the changes needed?

For PySpark users, and feature parity.

### Does this PR introduce _any_ user-facing change?

Yes, it adds new API in PySpark.

### How was this patch tested?

Unittests were added.

Closes #41841 from HyukjinKwon/SPARK-44194.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…ptions in RocksDB state store provider

### What changes were proposed in this pull request?
Set the column family options before passing to DBOptions in RocksDB state store provider

### Why are the changes needed?
Fixes a bug to ensure that column family options around memory usage are passed correctly to `DBOptions`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

Closes #41840 from anishshri-db/task/SPARK-44288.

Authored-by: Anish Shrigondekar <anish.shrigondekar@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
beliefer and others added 2 commits July 4, 2023 08:07
…_[2310-2314]

### What changes were proposed in this pull request?
The pr aims to assign names to the error class _LEGACY_ERROR_TEMP_[2310-2314].

### Why are the changes needed?
Improve the error framework.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing test cases were updated and new test cases were added.

Closes #41816 from beliefer/SPARK-44269.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?

Implement classification evaluator

### Why are the changes needed?

Distributed ML <> spark connect project.

### Does this PR introduce _any_ user-facing change?

Yes.
`BinaryClassificationEvaluator` and `MulticlassClassificationEvaluator` are added.
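For illustration only (not the PR's implementation), a sketch of the kind of metric a multiclass classification evaluator computes:

```python
def accuracy(predictions, labels):
    # Fraction of predictions that match their labels.
    assert len(predictions) == len(labels) and labels
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)

assert accuracy([1, 0, 1, 1], [1, 0, 0, 1]) == 0.75
```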

### How was this patch tested?

Closes #41793 from WeichenXu123/classification-evaluator.

Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
@github-actions github-actions bot added the ML label Jul 4, 2023
@GulajavaMinistudio GulajavaMinistudio merged commit 1166ae6 into GulajavaMinistudio:master Jul 4, 2023