
Create a new pull request by comparing changes across two branches #1516

Merged
merged 13 commits into from
Jul 4, 2023

Conversation

GulajavaMinistudio

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

panbingkun and others added 11 commits July 3, 2023 15:30
…lace for pandas 2.0.0

### What changes were proposed in this pull request?
The pr aims to enable SeriesStringTests.test_string_replace for pandas 2.0.0.
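For context, a minimal sketch (not part of the PR) of the `Series.str.replace` behavior the test exercises; passing `regex` explicitly keeps behavior stable across pandas versions, since the default changed in pandas 2.0:

```python
import pandas as pd

# Literal (non-regex) replacement on a string Series.
s = pd.Series(["a_b", "c_d"])
assert s.str.replace("_", "-", regex=False).tolist() == ["a-b", "c-d"]
```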

### Why are the changes needed?
Improve UT coverage.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Pass GA.
- Manually test:
```
(base) panbingkun:~/Developer/spark/spark-community$ python/run-tests --testnames 'pyspark.pandas.tests.test_series_string SeriesStringTests.test_string_replace'
Running PySpark tests. Output is in /Users/panbingkun/Developer/spark/spark-community/python/unit-tests.log
Will test against the following Python executables: ['python3.9']
Will test the following Python tests: ['pyspark.pandas.tests.test_series_string SeriesStringTests.test_string_replace']
python3.9 python_implementation is CPython
python3.9 version is: Python 3.9.13
Starting test(python3.9): pyspark.pandas.tests.test_series_string SeriesStringTests.test_string_replace (temp output: /Users/panbingkun/Developer/spark/spark-community/python/target/d51a913a-b400-4d1b-adb3-97765bb463bd/python3.9__pyspark.pandas.tests.test_series_string_SeriesStringTests.test_string_replace__izk1fx8o.log)
Finished test(python3.9): pyspark.pandas.tests.test_series_string SeriesStringTests.test_string_replace (13s)
Tests passed in 13 seconds
```

Closes #41823 from panbingkun/SPARK-43476.

Authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…GroupedData

### What changes were proposed in this pull request?

Be more explicit in the `Callable` type annotation for `dfapi` and `df_varargs_api` to explicitly return a `DataFrame`.

### Why are the changes needed?

In PySpark 3.3.x, type hints now infer the return value of something like `df.groupBy(...).count()` to be `Any`, whereas it should be `DataFrame`. This breaks type checking.
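A minimal sketch (hypothetical names, not the actual PySpark internals) of why the annotation matters to type checkers:

```python
from typing import Callable

class DataFrame:
    """Stand-in for pyspark.sql.DataFrame."""

# Before: the wrapper's return type is a bare Callable, so a checker
# infers the result of the generated method (e.g. count()) as Any.
def dfapi_before(f: Callable) -> Callable:
    return f

# After: the wrapper is annotated to return Callable[..., DataFrame],
# so df.groupBy(...).count() is inferred as DataFrame.
def dfapi_after(f: Callable[..., DataFrame]) -> Callable[..., DataFrame]:
    return f

make_df = dfapi_after(lambda: DataFrame())
assert isinstance(make_df(), DataFrame)
```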

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

No runtime changes were introduced, so this relies on existing CI tests.

Closes #40460 from j03wang/grouped-data-type.

Authored-by: Joe Wang <joe.wang@afreshtechnologies.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…uctsToCsv)

### What changes were proposed in this pull request?
This PR implements `doGenCode` in the `StructsToCsv` class instead of extending the `CodegenFallback` trait, which improves performance.

### Why are the changes needed?
It will improve performance.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
An additional test case was added to the `org.apache.spark.sql.CsvFunctionsSuite` class.

Closes #39719 from NarekDW/SPARK-42169.

Authored-by: narek_karapetian <narek.karapetian93@yandex.ru>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
…Function

### What changes were proposed in this pull request?

Adds a new SQL syntax for `TableValuedFunction`.

The syntax supports passing table relations as arguments in one of two ways:

1. `SELECT ... FROM tvf_call(TABLE t)`
2. `SELECT ... FROM tvf_call(TABLE (<query>))`

In the former case, the relation argument directly refers to the name of a table in the catalog. In the latter case, the relation argument comprises a table subquery that may itself refer to one or more tables in its own FROM clause.

For example, given the following user-defined table function:

```py
@udtf(returnType="a: int")
class TestUDTF:
    def eval(self, row: Row):
        if row[0] > 5:
            yield row[0],

spark.udtf.register("test_udtf", TestUDTF)

spark.sql("CREATE OR REPLACE TEMPORARY VIEW v as SELECT id FROM range(0, 8)")
```

the following SQL statements work:

```py
>>> spark.sql("SELECT * FROM test_udtf(TABLE v)").collect()
[Row(a=6), Row(a=7)]
```

or

```py
>>> spark.sql("SELECT * FROM test_udtf(TABLE (SELECT id + 1 FROM v))").collect()
[Row(a=6), Row(a=7), Row(a=8)]
```

### Why are the changes needed?

To support `TABLE` argument parser rule for TableValuedFunction.

### Does this PR introduce _any_ user-facing change?

Yes, new syntax for SQL.

### How was this patch tested?

Added the related tests.

Closes #41750 from ueshin/issues/SPARK-44200/table_argument.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…E_ARGUMENTS` error into doc

### What changes were proposed in this pull request?
This is a follow-up PR for #41750. After #41813, a test was added to keep the docs and `error-classes.json` in sync, so the `TABLE_VALUED_FUNCTION_TOO_MANY_TABLE_ARGUMENTS` error class (added in #41750) must also be added to the docs.

### Why are the changes needed?
To keep the error classes and documentation in sync.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests.

Closes #41827 from Hisoka-X/SPARK-44200_follow_up_error_json_doc.

Authored-by: Jia Fan <fanjiaeminem@qq.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The `CacheManager.refreshFileIndexIfNecessary` logic checks whether the fileIndex root paths start with the input path. This is problematic when the input path and a root path share a prefix but the root path is not actually a subdirectory of the input path. In such cases, the `CacheManager` can unnecessarily refresh the fileIndex, which can fail the query if the SparkSession does not have access to that root path.
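A small illustration (not the PR's actual code) of why a plain string-prefix check is wrong for paths, and how comparing path components avoids the false positive:

```python
import os

cached_root = "/data/warehouse/table1_backup"
refreshed = "/data/warehouse/table1"

# The naive check matches a sibling directory that merely shares a prefix:
assert cached_root.startswith(refreshed)

# Comparing path components avoids the false positive:
def is_sub_path(child: str, parent: str) -> bool:
    child_parts = os.path.normpath(child).split(os.sep)
    parent_parts = os.path.normpath(parent).split(os.sep)
    return child_parts[:len(parent_parts)] == parent_parts

assert not is_sub_path(cached_root, refreshed)
assert is_sub_path("/data/warehouse/table1/part=0", refreshed)
```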

### Why are the changes needed?
Fixes a bug where queries on cached DataFrames can fail when the cached path shares a prefix with a different path.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
Unit test

Closes #41749 from vihangk1/master_cachemanager.

Lead-authored-by: Vihang Karajgaonkar <vihang.karajgaonkar@databricks.com>
Co-authored-by: Vihang Karajgaonkar <vihangk1@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…return type in Arrow Python UDF

### What changes were proposed in this pull request?
Explicit Arrow casting for the mismatched return type of Arrow Python UDF.

### Why are the changes needed?
A more standardized and coherent type coercion.

Please refer to #41706 for a comprehensive comparison between the type coercion rules of Arrow and Pickle (used by the default Python UDF).

See more at [[Design] Type-coercion in Arrow Python UDFs](https://docs.google.com/document/d/e/2PACX-1vTEGElOZfhl9NfgbBw4CTrlm-8F_xQCAKNOXouz-7mg5vYobS7lCGUsGkDZxPY0wV5YkgoZmkYlxccU/pub).
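A pure-Python sketch of the idea (the real change uses Arrow's cast kernels, not this hypothetical helper): coerce UDF results to the declared return type instead of failing on a mismatch:

```python
def coerce_results(values, target_type):
    # Hypothetical helper: cast each non-null result to the declared
    # return type, mirroring what an explicit Arrow cast would do.
    return [target_type(v) if v is not None else None for v in values]

# A str-returning UDF declared with an 'int' return type now yields ints:
assert coerce_results(["1", "2", None], int) == [1, 2, None]
```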

### Does this PR introduce _any_ user-facing change?
Yes.

FROM
```py
>>> df = spark.createDataFrame(['1', '2'], schema='string')
>>> df.select(pandas_udf(lambda x: x, 'int')('value')).show()
...
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
...
pyarrow.lib.ArrowInvalid: Could not convert '1' with type str: tried to convert to int32
```

TO
```py
>>> df = spark.createDataFrame(['1', '2'], schema='string')
>>> df.select(pandas_udf(lambda x: x, 'int')('value')).show()
+---------------+
|<lambda>(value)|
+---------------+
|              1|
|              2|
+---------------+
```
### How was this patch tested?
Unit tests.

Closes #41800 from xinrong-meng/snd_type_coersion.

Authored-by: Xinrong Meng <xinrong@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?

Move Util.truncatedString to sql/api.

### Why are the changes needed?

Make `StructType` depend less on Catalyst, moving toward a simpler `DataType` interface.
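For reference, a hedged Python approximation of what a `truncatedString`-style helper does (the real utility is Scala code being moved to `sql/api`):

```python
def truncated_string(fields, sep, max_fields):
    # Show at most max_fields entries; summarize the rest to keep
    # schema strings of wide structs readable.
    if len(fields) <= max_fields:
        return sep.join(fields)
    shown = fields[:max_fields - 1]
    return sep.join(shown + [f"... {len(fields) - len(shown)} more fields"])

assert truncated_string(["a", "b", "c", "d"], ", ", 3) == "a, b, ... 2 more fields"
```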

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing test

Closes #41811 from amaliujia/move_out_truncatedString.

Authored-by: Rui Wang <rui.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…r to common/utils

### What changes were proposed in this pull request?

Move out util functions used by ArtifactManager to `common/utils`. More specifically, move `resolveURI` and `awaitResult` to `common/utils`.

### Why are the changes needed?

So that Spark Connect Scala client does not need to depend on Spark.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing test

Closes #41825 from amaliujia/SPARK-44273.

Authored-by: Rui Wang <rui.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?

This PR proposes to add:

- `SparkContext.setInterruptOnCancel(interruptOnCancel: Boolean): Unit`
- `SparkContext.addJobTag(tag: String): Unit`
- `SparkContext.removeJobTag(tag: String): Unit`
- `SparkContext.getJobTags(): Set[String]`
- `SparkContext.clearJobTags(): Unit`
- `SparkContext.cancelJobsWithTag(tag: String): Unit`

into PySpark.

See also SPARK-43952.
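A minimal pure-Python model (illustrative only, not Spark code) of the tag-set semantics these APIs expose:

```python
class TaggedContext:
    """Toy stand-in for SparkContext's per-context job-tag bookkeeping."""
    def __init__(self):
        self._tags = set()
    def addJobTag(self, tag):
        self._tags.add(tag)
    def removeJobTag(self, tag):
        self._tags.discard(tag)
    def getJobTags(self):
        return set(self._tags)
    def clearJobTags(self):
        self._tags.clear()

ctx = TaggedContext()
ctx.addJobTag("etl")
ctx.addJobTag("nightly")
ctx.removeJobTag("etl")
assert ctx.getJobTags() == {"nightly"}
```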

### Why are the changes needed?

For PySpark users, and feature parity.

### Does this PR introduce _any_ user-facing change?

Yes, it adds new API in PySpark.

### How was this patch tested?

Unittests were added.

Closes #41841 from HyukjinKwon/SPARK-44194.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…ptions in RocksDB state store provider

### What changes were proposed in this pull request?
Set the column family options before passing to DBOptions in RocksDB state store provider

### Why are the changes needed?
Fixes a bug to ensure that column family options around memory usage are passed correctly to `DBOptions`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

Closes #41840 from anishshri-db/task/SPARK-44288.

Authored-by: Anish Shrigondekar <anish.shrigondekar@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
beliefer and others added 2 commits July 4, 2023 08:07
…_[2310-2314]

### What changes were proposed in this pull request?
The pr aims to assign names to the error class _LEGACY_ERROR_TEMP_[2310-2314].

### Why are the changes needed?
Improve the error framework.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing test cases were updated and new test cases were added.

Closes #41816 from beliefer/SPARK-44269.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?

Implement classification evaluator

### Why are the changes needed?

Distributed ML <> spark connect project.

### Does this PR introduce _any_ user-facing change?

Yes.
`BinaryClassificationEvaluator` and `MulticlassClassificationEvaluator` are added.
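For illustration only (not the PR's implementation), a sketch of the kind of metric a multiclass classification evaluator computes:

```python
def accuracy(predictions, labels):
    # Fraction of predictions that match their labels.
    assert len(predictions) == len(labels) and labels
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)

assert accuracy([1, 0, 1, 1], [1, 0, 0, 1]) == 0.75
```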

### How was this patch tested?

Closes #41793 from WeichenXu123/classification-evaluator.

Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
@github-actions github-actions bot added the ML label Jul 4, 2023
@GulajavaMinistudio GulajavaMinistudio merged commit 1166ae6 into GulajavaMinistudio:master Jul 4, 2023