
Create a new pull request by comparing changes across two branches #1554

Merged
merged 13 commits into from
Sep 11, 2023

Conversation

GulajavaMinistudio
Owner

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

panbingkun and others added 13 commits September 8, 2023 10:03
### What changes were proposed in this pull request?
This PR aims to add a gap at the bottom of the HTML documentation pages.

### Why are the changes needed?
The old documentation style had comfortable white space at the bottom, but the latest documentation has lost it, which looks unattractive and leaves the page without a bottom margin.
<img width="918" alt="image" src="https://github.com/apache/spark/assets/15246973/c7d4e1c9-f83a-4a4b-a22f-240f3ea534c9">

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual testing.
```
SKIP_API=1 bundle exec jekyll serve --watch
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #42702 from panbingkun/SPARK-44986.

Authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…docstring

### What changes were proposed in this pull request?
This PR proposes a simple change in the documentation for the `transform` function in `sql`. I believe where it currently reads "filter" it should read "transform".
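For context, a quick illustration of why "filter" is a misnomer here (a hedged sketch, assuming a running `SparkSession` named `spark`): `transform` maps a lambda over every array element, while `filter` keeps only the matching elements.

```scala
// Hedged illustration, assuming a SparkSession named `spark` is available.
// transform maps a lambda over each element; filter drops non-matching ones.
spark.sql("SELECT transform(array(1, 2, 3), x -> x * 2) AS doubled").show()
// doubled: [2, 4, 6]
spark.sql("SELECT filter(array(1, 2, 3), x -> x > 1) AS kept").show()
// kept: [2, 3]
```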

### Why are the changes needed?
I believe this change might not be needed per se, but it would be a slight improvement to the current version to avoid the misnomer.

### Does this PR introduce _any_ user-facing change?
Yes, the documentation for the `transform` SQL function now shows the word "transform" instead of "filter".

### How was this patch tested?
This patch was not tested because it only changes documentation.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #42858 from gdahia/patch-1.

Lead-authored-by: Gabriel Dahia <gdahia@protonmail.com>
Co-authored-by: Gabriel Dahia <gdahia@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
… to fix doc redirecting

### What changes were proposed in this pull request?

In https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc4-docs/_site/, these links are supposed to redirect to the correct targets, but failed because there are no `.html` extensions.

- [building-with-maven.html](https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc4-docs/_site/building-with-maven.html)   ---> [building-spark.html](https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc4-docs/_site/building-spark.html)
- [sql-reference.html](https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc4-docs/_site/sql-reference.html) ---> [sql-ref.html](https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc4-docs/_site/sql-ref.html)

This PR customizes the redirect template to add the `.html` extensions and fix this issue. Referencing https://github.com/jekyll/jekyll-redirect-from#customizing-the-redirect-template

### Why are the changes needed?

Fix doc links, such as https://spark.apache.org/docs/latest/sql-reference.html

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Built the docs and verified locally.

```html
<!DOCTYPE html>
<html lang="en-US">
<meta charset="utf-8">
<title>Redirecting&hellip;</title>
<link rel="canonical" href="/building-spark.html">
<script>location="/building-spark.html"</script>
<meta http-equiv="refresh" content="0; url=/building-spark.html">
<meta name="robots" content="noindex">
<h1>Redirecting&hellip;</h1>
<a href="/building-spark.html">Click here if you are not redirected.</a>
</html>
```

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #42848 from yaooqinn/SPARK-45098.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?

Change `PercentileCont` to explicitly check user-supplied input by calling `checkInputDataTypes` on the replacement.

### Why are the changes needed?

`PercentileCont` does not currently check the user's input. If the runtime replacement (an instance of `Percentile`) rejects the user's input, the runtime replacement ends up unresolved.

For example, this query throws an internal error rather than producing a useful error message:
```
select percentile_cont(b) WITHIN GROUP (ORDER BY a DESC) as x
from (values (12, 0.25), (13, 0.25), (22, 0.25)) as (a, b);

[INTERNAL_ERROR] Cannot resolve the runtime replaceable expression "percentile_cont(a, b)". The replacement is unresolved: "percentile(a, b, 1)".
org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot resolve the runtime replaceable expression "percentile_cont(a, b)". The replacement is unresolved: "percentile(a, b, 1)".
	at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
	at org.apache.spark.SparkException$.internalError(SparkException.scala:96)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6(CheckAnalysis.scala:313)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6$adapted(CheckAnalysis.scala:277)
...
```
With this PR, the above query will produce the following error message:
```
[DATATYPE_MISMATCH.NON_FOLDABLE_INPUT] Cannot resolve "percentile_cont(a, b)" due to data type mismatch: the input percentage should be a foldable "DOUBLE" expression; however, got "b".; line 1 pos 7;
```
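A minimal sketch of the approach described above, assuming Spark Catalyst's `Expression` and `TypeCheckResult` APIs (not necessarily the literal patch): the wrapper forwards input validation to its runtime replacement, so the mismatch surfaces at analysis time instead of as an internal error.

```scala
// Hedged sketch, not the literal Spark patch: delegate input validation to
// the runtime replacement so invalid input fails analysis with a clear error.
import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
import org.apache.spark.sql.catalyst.expressions.Expression

trait DelegatesInputCheckToReplacement {
  // `replacement` is whatever expression the wrapper is rewritten to at runtime.
  def replacement: Expression

  def checkInputDataTypes(): TypeCheckResult = replacement.checkInputDataTypes()
}
```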

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42857 from bersprockets/pc_checkinputtype_issue.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…olumn vector that has a dictionary

### What changes were proposed in this pull request?

Change getBytes/getShorts/getInts/getLongs/getFloats/getDoubles in `OnHeapColumnVector` and `OffHeapColumnVector` to use the dictionary, if present.

### Why are the changes needed?

The following query gets incorrect results:
```
drop table if exists t1;

create table t1 using parquet as
select * from values
(named_struct('f1', array(1, 2, 3), 'f2', array(1, 1, 2)))
as (value);

select cast(value as struct<f1:array<double>,f2:array<int>>) AS value from t1;

{"f1":[1.0,2.0,3.0],"f2":[0,0,0]}

```
The result should be:
```
{"f1":[1.0,2.0,3.0],"f2":[1,2,3]}
```
The cast operation copies the second array by calling `ColumnarArray#copy`, which in turn calls `ColumnarArray#toIntArray`, which in turn calls `ColumnVector#getInts` on the underlying column vector (which is either an `OnHeapColumnVector` or an `OffHeapColumnVector`). The implementation of `getInts` in either concrete class assumes there is no dictionary and does not use it if it is present (in fact, it even asserts that there is no dictionary). However, in the above example, the column vector associated with the second array does have a dictionary:
```
java -cp ~/github/parquet-mr/parquet-tools/target/parquet-tools-1.10.1.jar org.apache.parquet.tools.Main meta ./spark-warehouse/t1/part-00000-122fdd53-8166-407b-aec5-08e0c2845c3d-c000.snappy.parquet
...
row group 1: RC:1 TS:112 OFFSET:4
-------------------------------------------------------------------------------------------------------------------------------------------------------
value:
.f1:
..list:
...element:   INT32 SNAPPY DO:0 FPO:4 SZ:47/47/1.00 VC:3 ENC:RLE,PLAIN ST:[min: 1, max: 3, num_nulls: 0]
.f2:
..list:
...element:   INT32 SNAPPY DO:51 FPO:80 SZ:69/65/0.94 VC:3 ENC:RLE,PLAIN_DICTIONARY ST:[min: 1, max: 2, num_nulls: 0]

```
The same bug also occurs when field f2 is a map. This PR fixes that case as well.
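A hedged sketch of the idea (the helper name below is made up; this uses Spark's public `ColumnVector` API rather than the exact patch): the bulk getters must decode through the dictionary like the single-value getters already do, instead of reading the raw buffer directly.

```scala
// Hedged sketch; getIntsRespectingDictionary is a made-up helper, not Spark code.
// ColumnVector.getInt already decodes through the dictionary when one is present;
// the bug was that the bulk getBytes/getShorts/getInts/... variants bypassed it.
import org.apache.spark.sql.vectorized.ColumnVector

def getIntsRespectingDictionary(vector: ColumnVector, rowId: Int, count: Int): Array[Int] = {
  val result = new Array[Int](count)
  var i = 0
  while (i < count) {
    result(i) = vector.getInt(rowId + i) // dictionary-aware single-value read
    i += 1
  }
  result
}
```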

### Does this PR introduce _any_ user-facing change?

No, except for fixing the correctness issue.

### How was this patch tested?

New tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42850 from bersprockets/vector_oddity.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…t report error

### What changes were proposed in this pull request?
This PR makes sure that ALTER TABLE ALTER COLUMN with an invalid default value on DataSource V2 reports an error; before this PR, the ALTER would succeed silently.
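A hedged illustration of the new behavior (the catalog, table, and provider names are hypothetical, and a `SparkSession` named `spark` with a DataSource V2 catalog configured is assumed): an invalid default, such as a string literal for an INT column, should now be rejected when the ALTER statement is analyzed.

```scala
// Hedged illustration; testcat.t, its schema, and the "foo" provider are hypothetical.
spark.sql("CREATE TABLE testcat.t (i INT) USING foo")
// Expected to fail analysis after this PR: 'abc' is not a valid default for an INT column.
spark.sql("ALTER TABLE testcat.t ALTER COLUMN i SET DEFAULT 'abc'")
```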

### Why are the changes needed?
Fixes the error behavior of the ALTER TABLE statement on DataSource V2.

### Does this PR introduce _any_ user-facing change?
Yes, an invalid default value now reports an error.

### How was this patch tested?
Add new test.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #42810 from Hisoka-X/SPARK-45075_alter_invalid_default_value_on_v2.

Authored-by: Jia Fan <fanjiaeminem@qq.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?

This PR updates the `graphlib-dot` library (dagrejs/graphlib-dot@v0.5.2...v1.0.2); this library is used to read and parse DOT files into graphs.

### Why are the changes needed?

To update the UI JavaScript libraries.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Built and verified locally.

![image](https://github.com/apache/spark/assets/8326978/d9133b44-8a95-4bb4-a2e9-3a47010ab500)

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #42853 from yaooqinn/SPARK-45104.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…ectly

### What changes were proposed in this pull request?

In Snowflake, the BOOLEAN data type exists but the BIT data type does not.
This PR adds `SnowflakeDialect` to override the default JdbcDialect and redefine the default mapping behaviour for the _boolean_ type, which is currently mapped to the `BIT(1)` type.

https://github.com/apache/spark/blob/a663c0bf0c5b104170c0612f37a0b0cdf75cd45b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L149
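A minimal sketch of what such a dialect can look like, assuming Spark's public `JdbcDialect` API (names and details may differ from the actual patch):

```scala
// Hedged sketch assuming Spark's public JdbcDialect API; not the literal patch.
import java.util.Locale

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcType}
import org.apache.spark.sql.types.{BooleanType, DataType}

object SnowflakeDialectSketch extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.toLowerCase(Locale.ROOT).startsWith("jdbc:snowflake")

  // Map Catalyst's BooleanType to Snowflake's BOOLEAN instead of the default BIT(1).
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case BooleanType => Some(JdbcType("BOOLEAN", java.sql.Types.BOOLEAN))
    case _ => None // fall back to the default mappings in JdbcUtils
  }
}
```

Once registered via `JdbcDialects.registerDialect`, such an override takes effect for matching Snowflake JDBC URLs.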

### Why are the changes needed?

The BIT type does not exist in Snowflake, which causes the Spark job to fail on table creation.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit tests, plus direct testing on Snowflake.

Closes #42545 from hayssams/master.

Authored-by: Hayssam Saleh <Hayssam.saleh@starlake.ai>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to make hyperlinks in the documentation clickable, including running-on-mesos.html and running-on-yarn.html.

### Why are the changes needed?
Improve the convenience of using the Spark documentation.

Before:
<img width="1372" alt="image" src="https://github.com/apache/spark/assets/15246973/eea24735-babe-4008-ab96-ec2c29ebafd5">

After:
<img width="571" alt="image" src="https://github.com/apache/spark/assets/15246973/1ff1098b-c412-4f3d-b66c-825046691408">

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manually test.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #42854 from panbingkun/SPARK-45105.

Authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Fix the `aes_decrypt` and `ln` implementations in Spark Connect. The previous `aes_decrypt` reference to `aes_encrypt` is clearly a bug. The `ln` reference to `log` is more of a cosmetic issue, but because the `ln` and `log` function implementations differ in Spark SQL, Spark Connect should use the same implementation too.
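A hedged round-trip check that illustrates why the wrong wiring matters (assumes a `SparkSession` named `spark`): `aes_decrypt` must invert `aes_encrypt`, so pointing it at the encrypt implementation would break the round trip.

```scala
// Hedged illustration, assuming a SparkSession named `spark`.
// aes_decrypt(aes_encrypt(x, key), key) should round-trip back to x.
spark.sql(
  """SELECT cast(
    |  aes_decrypt(aes_encrypt('Spark', '0000111122223333'), '0000111122223333')
    |AS STRING) AS roundtrip""".stripMargin).show()
// roundtrip: Spark
```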

### Why are the changes needed?
Bugfix.

### Does this PR introduce _any_ user-facing change?
No, these Spark Connect functions haven't been released.

### How was this patch tested?
Existing UTs.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #42863 from peter-toth/SPARK-45109-fix-eas_decrypt-and-ln.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
…ql.functions` from auto-completion

### What changes were proposed in this pull request?
Hide internal functions/variables in `pyspark.sql.functions` from auto-completion

### Why are the changes needed?
To hide internal functions/variables that can be confusing, e.g. the internal helper functions `to_str` and `get_active_spark_context`.

before this PR:

<img width="560" alt="image" src="https://github.com/apache/spark/assets/7322292/ab87d0e8-3ba2-4c71-8c06-aeef939778cf">

<img width="915" alt="image" src="https://github.com/apache/spark/assets/7322292/e138804f-8a7a-4526-9b1a-8338438e14e3">

after this PR:
<img width="562" alt="image" src="https://github.com/apache/spark/assets/7322292/e1710729-cf8f-49d4-b276-4632a88ea5ec">

<img width="774" alt="image" src="https://github.com/apache/spark/assets/7322292/50b8e6f7-9dba-46e6-97f5-5cf8b115bffb">

### Does this PR introduce _any_ user-facing change?
yes

### How was this patch tested?
manually check

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #42745 from zhengruifeng/hide_private_from_completion.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
### What changes were proposed in this pull request?
This PR aims to refine the docstrings of `DataFrame.groupBy/rollup/cube` and fix a potentially wrong underline length.

### Why are the changes needed?
- To improve PySpark documentation.

- Fix potentially wrong underline length.
   <img width="951" alt="image" src="https://github.com/apache/spark/assets/15246973/8f5e8648-7670-4dce-860b-bd12c52e73f3">

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Pass GA.
- Manually test.
```
cd python/docs
make clean html
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #42834 from panbingkun/SPARK-45044.

Authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
### What changes were proposed in this pull request?

This PR proposes to support string type columns for `DataFrameGroupBy.sum`.

### Why are the changes needed?

To match the behavior of the latest pandas.

### Does this PR introduce _any_ user-facing change?

Yes, from now on `DataFrameGroupBy.sum` follows the behavior of the latest pandas, as shown below:

**Test DataFrame**
```python
>>> psdf
   A    B  C      D
0  1  3.1  a   True
1  2  4.1  b  False
2  1  4.1  b  False
3  2  3.1  a   True
```

**Before**
```python
>>> psdf.groupby("A").sum().sort_index()
     B  D
A
1  7.2  1
2  7.2  1
```

**After**
```python
>>> psdf.groupby("A").sum().sort_index()
     B   C  D
A
1  7.2  ab  1
2  7.2  ba  1
```

### How was this patch tested?

Updated the existing UTs to support string type columns.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42798 from itholic/SPARK-43295.

Authored-by: Haejoon Lee <haejoon.lee@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
@GulajavaMinistudio GulajavaMinistudio merged commit 46fe32d into GulajavaMinistudio:master Sep 11, 2023
1 of 2 checks passed