Create a new pull request by comparing changes across two branches #1554

Merged · 13 commits · Sep 11, 2023

Commits on Sep 8, 2023

  1. [SPARK-44986][DOCS] There should be a gap at the bottom of the HTML

    ### What changes were proposed in this pull request?
    This PR aims to add a gap at the bottom of the rendered HTML pages.
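The fix amounts to a small stylesheet tweak; a hedged sketch of the idea (the selector and spacing value here are illustrative, not the exact change):

```css
/* Hypothetical: restore breathing room below the rendered page content. */
.container-wrapper {
  margin-bottom: 48px;
}
```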
    
    ### Why are the changes needed?
    The old documentation style had comfortable white space at the bottom, but the latest documentation has lost it, which looks unattractive and borderless.
    <img width="918" alt="image" src="https://github.com/apache/spark/assets/15246973/c7d4e1c9-f83a-4a4b-a22f-240f3ea534c9">
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Manual testing.
    ```
    SKIP_API=1 bundle exec jekyll serve --watch
    ```
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes #42702 from panbingkun/SPARK-44986.
    
    Authored-by: panbingkun <pbk1982@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    panbingkun authored and dongjoon-hyun committed Sep 8, 2023
    Commit: 4c8d398
  2. [MINOR][DOCS] Change "filter" to "transform" in transform function docstring
    
    ### What changes were proposed in this pull request?
    This PR proposes a simple change to the documentation of the `transform` function in SQL. Where it currently reads "filter", it should read "transform".
    
    ### Why are the changes needed?
    I believe this change might not be needed per se, but it would be a slight improvement to the current version to avoid the misnomer.
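The distinction matters because the two higher-order functions do different things; a plain-Python analogy (not the Spark implementation):

```python
# transform maps a function over every element; filter keeps only the
# elements that satisfy a predicate.
def transform(arr, f):
    return [f(x) for x in arr]

def filter_(arr, pred):
    return [x for x in arr if pred(x)]

print(transform([1, 2, 3], lambda x: x + 1))  # [2, 3, 4]
print(filter_([1, 2, 3], lambda x: x > 1))    # [2, 3]
```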
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, it shows the word "transform" instead of "filter" in the documentation for the `transform` SQL function.
    
    ### How was this patch tested?
    This patch was not tested because it only changes documentation.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes #42858 from gdahia/patch-1.
    
    Lead-authored-by: Gabriel Dahia <gdahia@protonmail.com>
    Co-authored-by: Gabriel Dahia <gdahia@users.noreply.github.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    2 people authored and dongjoon-hyun committed Sep 8, 2023
    Commit: e4df5d1
  3. [SPARK-45098][DOCS] Custom jekyll-redirect-from redirect.html template to fix doc redirecting
    
    ### What changes were proposed in this pull request?
    
    In https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc4-docs/_site/, these links are supposed to redirect to the correct targets, but failed because there are no `.html` extensions.
    
    - [building-with-maven.html](https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc4-docs/_site/building-with-maven.html)   ---> [building-spark.html](https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc4-docs/_site/building-spark.html)
    - [sql-reference.html](https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc4-docs/_site/sql-reference.html) ---> [sql-ref.html](https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc4-docs/_site/sql-ref.html)
    
    This PR customizes the redirect template to add the extensions and fix this issue. Referencing https://github.com/jekyll/jekyll-redirect-from#customizing-the-redirect-template
    
    ### Why are the changes needed?
    
    Fix doc links, such as https://spark.apache.org/docs/latest/sql-reference.html
    
    ### Does this PR introduce _any_ user-facing change?
    
    no
    
    ### How was this patch tested?
    
    Build doc and verify locally.
    
    ```html
    <!DOCTYPE html>
    <html lang="en-US">
    <meta charset="utf-8">
    <title>Redirecting&hellip;</title>
    <link rel="canonical" href="/building-spark.html">
    <script>location="/building-spark.html"</script>
    <meta http-equiv="refresh" content="0; url=/building-spark.html">
    <meta name="robots" content="noindex">
    <h1>Redirecting&hellip;</h1>
    <a href="/building-spark.html">Click here if you are not redirected.</a>
    </html>
    ```
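A hedged sketch of what the customized `_layouts/redirect.html` could look like, appending `.html` to the redirect target (the exact template details are illustrative; see the jekyll-redirect-from customization link above):

```html
<!DOCTYPE html>
<html lang="en-US">
<meta charset="utf-8">
{% assign target = page.redirect.to | append: ".html" %}
<title>Redirecting&hellip;</title>
<link rel="canonical" href="{{ target }}">
<script>location="{{ target }}"</script>
<meta http-equiv="refresh" content="0; url={{ target }}">
<meta name="robots" content="noindex">
<h1>Redirecting&hellip;</h1>
<a href="{{ target }}">Click here if you are not redirected.</a>
</html>
```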
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    no
    
    Closes #42848 from yaooqinn/SPARK-45098.
    
    Authored-by: Kent Yao <yao@apache.org>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    yaooqinn authored and dongjoon-hyun committed Sep 8, 2023
    Commit: 81bc38e
  4. [SPARK-45106][SQL] PercentileCont should check user supplied input

    ### What changes were proposed in this pull request?
    
    Change `PercentileCont` to explicitly check user-supplied input by calling `checkInputDataTypes` on the replacement.
    
    ### Why are the changes needed?
    
    `PercentileCont` does not currently check the user's input. If the runtime replacement (an instance of `Percentile`) rejects the user's input, the runtime replacement ends up unresolved.
    
    For example, this query throws an internal error rather than producing a useful error message:
    ```
    select percentile_cont(b) WITHIN GROUP (ORDER BY a DESC) as x
    from (values (12, 0.25), (13, 0.25), (22, 0.25)) as (a, b);
    
    [INTERNAL_ERROR] Cannot resolve the runtime replaceable expression "percentile_cont(a, b)". The replacement is unresolved: "percentile(a, b, 1)".
    org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot resolve the runtime replaceable expression "percentile_cont(a, b)". The replacement is unresolved: "percentile(a, b, 1)".
    	at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
    	at org.apache.spark.SparkException$.internalError(SparkException.scala:96)
    	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6(CheckAnalysis.scala:313)
    	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6$adapted(CheckAnalysis.scala:277)
    ...
    ```
    With this PR, the above query will produce the following error message:
    ```
    [DATATYPE_MISMATCH.NON_FOLDABLE_INPUT] Cannot resolve "percentile_cont(a, b)" due to data type mismatch: the input percentage should be a foldable "DOUBLE" expression; however, got "b".; line 1 pos 7;
    ```
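The fix follows a general pattern: validate user input on the user-facing expression itself instead of letting an unresolved replacement surface as an internal error. A minimal Python analogy of that delegation (names are illustrative, not Spark's API):

```python
class DataTypeMismatchError(Exception):
    pass

class Percentile:
    """Stand-in for the runtime replacement expression."""
    def __init__(self, percentage):
        self.percentage = percentage

    def check_input_data_types(self):
        # The percentage must be a foldable (constant) double.
        if not isinstance(self.percentage, float):
            raise DataTypeMismatchError(
                "the input percentage should be a foldable DOUBLE expression")

class PercentileCont:
    """Stand-in for the user-facing expression: delegate the check to the
    replacement, so the user sees a clear error message instead of an
    unresolved-expression internal error."""
    def __init__(self, percentage):
        self.replacement = Percentile(percentage)

    def check_input_data_types(self):
        self.replacement.check_input_data_types()
```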
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    New tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes #42857 from bersprockets/pc_checkinputtype_issue.
    
    Authored-by: Bruce Robbins <bersprockets@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    bersprockets authored and dongjoon-hyun committed Sep 8, 2023
    Commit: 2b4387f
  5. [SPARK-44805][SQL] getBytes/getShorts/getInts/etc. should work in a column vector that has a dictionary
    
    ### What changes were proposed in this pull request?
    
    Change getBytes/getShorts/getInts/getLongs/getFloats/getDoubles in `OnHeapColumnVector` and `OffHeapColumnVector` to use the dictionary, if present.
    
    ### Why are the changes needed?
    
    The following query gets incorrect results:
    ```
    drop table if exists t1;
    
    create table t1 using parquet as
    select * from values
    (named_struct('f1', array(1, 2, 3), 'f2', array(1, 1, 2)))
    as (value);
    
    select cast(value as struct<f1:array<double>,f2:array<int>>) AS value from t1;
    
    {"f1":[1.0,2.0,3.0],"f2":[0,0,0]}
    
    ```
    The result should be:
    ```
    {"f1":[1.0,2.0,3.0],"f2":[1,2,3]}
    ```
    The cast operation copies the second array by calling `ColumnarArray#copy`, which in turn calls `ColumnarArray#toIntArray`, which in turn calls `ColumnVector#getInts` on the underlying column vector (which is either an `OnHeapColumnVector` or an `OffHeapColumnVector`). The implementation of `getInts` in either concrete class assumes there is no dictionary and does not use it if it is present (in fact, it even asserts that there is no dictionary). However, in the above example, the column vector associated with the second array does have a dictionary:
    ```
    java -cp ~/github/parquet-mr/parquet-tools/target/parquet-tools-1.10.1.jar org.apache.parquet.tools.Main meta ./spark-warehouse/t1/part-00000-122fdd53-8166-407b-aec5-08e0c2845c3d-c000.snappy.parquet
    ...
    row group 1: RC:1 TS:112 OFFSET:4
    -------------------------------------------------------------------------------------------------------------------------------------------------------
    value:
    .f1:
    ..list:
    ...element:   INT32 SNAPPY DO:0 FPO:4 SZ:47/47/1.00 VC:3 ENC:RLE,PLAIN ST:[min: 1, max: 3, num_nulls: 0]
    .f2:
    ..list:
    ...element:   INT32 SNAPPY DO:51 FPO:80 SZ:69/65/0.94 VC:3 ENC:RLE,PLAIN_DICTIONARY ST:[min: 1, max: 2, num_nulls: 0]
    
    ```
    The same bug also occurs when field f2 is a map. This PR fixes that case as well.
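Conceptually, a dictionary-encoded vector stores small integer codes plus a dictionary of distinct values, so a batch getter must decode through the dictionary when one is present. A simplified sketch of the corrected behavior (not the actual `ColumnVector` API):

```python
class IntColumnVector:
    def __init__(self, data, dictionary=None):
        # With a dictionary, `data` holds dictionary ids; without one,
        # it holds the raw values directly.
        self.data = data
        self.dictionary = dictionary

    def get_ints(self, start, count):
        if self.dictionary is None:
            return self.data[start:start + count]
        # The buggy implementations skipped this branch and returned the
        # raw dictionary ids instead of the decoded values.
        return [self.dictionary[i] for i in self.data[start:start + count]]
```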
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, except for fixing the correctness issue.
    
    ### How was this patch tested?
    
    New tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes #42850 from bersprockets/vector_oddity.
    
    Authored-by: Bruce Robbins <bersprockets@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    bersprockets authored and dongjoon-hyun committed Sep 8, 2023
    Commit: fac236e
  6. [SPARK-45075][SQL] Fix alter table with invalid default value will not report error
    
    ### What changes were proposed in this pull request?
    This PR makes sure that ALTER TABLE ALTER COLUMN with an invalid default value on DataSource V2 reports an error; before this PR, the ALTER would succeed silently.
    
    ### Why are the changes needed?
    Fixes the erroneous behavior of the ALTER TABLE statement on DataSource V2.
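The underlying idea is to validate the default value against the column's type at ALTER time rather than accepting it silently; a rough Python sketch of that validation (type names and structure are illustrative, not Spark's implementation):

```python
class InvalidDefaultValueError(Exception):
    pass

# Hypothetical mapping of column types to value checks.
TYPE_CHECKS = {
    "int": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "string": lambda v: isinstance(v, str),
}

def alter_column_default(column_type, default_value):
    check = TYPE_CHECKS.get(column_type)
    if check is None or not check(default_value):
        # Fail fast instead of "altering successfully" with a bad default.
        raise InvalidDefaultValueError(
            f"invalid default {default_value!r} for type {column_type}")
    return {"type": column_type, "default": default_value}
```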
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, an invalid default value will now report an error.
    
    ### How was this patch tested?
    Add new test.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes #42810 from Hisoka-X/SPARK-45075_alter_invalid_default_value_on_v2.
    
    Authored-by: Jia Fan <fanjiaeminem@qq.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    Hisoka-X authored and dongjoon-hyun committed Sep 8, 2023
    Commit: 4dd4737
  7. [SPARK-45104][UI] Upgrade graphlib-dot.min.js to 1.0.2

    ### What changes were proposed in this pull request?
    
    This PR updates the `graphlib-dot` library (dagrejs/graphlib-dot@v0.5.2...v1.0.2), which is used to read and parse DOT files into graphs.
    
    ### Why are the changes needed?
    
    Update UI js libraries
    
    ### Does this PR introduce _any_ user-facing change?
    
    no
    
    ### How was this patch tested?
    
    build and verify locally
    
    ![image](https://github.com/apache/spark/assets/8326978/d9133b44-8a95-4bb4-a2e9-3a47010ab500)
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    no
    
    Closes #42853 from yaooqinn/SPARK-45104.
    
    Authored-by: Kent Yao <yao@apache.org>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    yaooqinn authored and dongjoon-hyun committed Sep 8, 2023
    Commit: a663c0b
  8. [SPARK-44866][SQL] Add SnowflakeDialect to handle BOOLEAN type correctly
    
    ### What changes were proposed in this pull request?
    
    In Snowflake, a BOOLEAN data type exists, but the BIT data type does not.
    This PR adds `SnowflakeDialect` to override the default `JdbcDialect` and redefine the default mapping behaviour for the _boolean_ type, which is currently mapped to the `BIT(1)` type.
    
    https://github.com/apache/spark/blob/a663c0bf0c5b104170c0612f37a0b0cdf75cd45b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L149
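The dialect mechanism boils down to overriding one type mapping; a simplified Python sketch of the pattern (not Spark's Scala `JdbcDialect` API):

```python
class JdbcDialect:
    def get_jdbc_type(self, catalyst_type):
        # Default mapping: booleans become BIT(1), which Snowflake rejects.
        return {"boolean": "BIT(1)", "int": "INTEGER"}.get(catalyst_type)

class SnowflakeDialect(JdbcDialect):
    def get_jdbc_type(self, catalyst_type):
        if catalyst_type == "boolean":
            return "BOOLEAN"  # Snowflake has BOOLEAN but no BIT type.
        return super().get_jdbc_type(catalyst_type)
```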
    
    ### Why are the changes needed?
    
    The BIT type does not exist in Snowflake. This causes the Spark job to fail on table creation.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Unit tests, plus verification directly against Snowflake.
    
    Closes #42545 from hayssams/master.
    
    Authored-by: Hayssam Saleh <Hayssam.saleh@starlake.ai>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    Hayssam Saleh authored and dongjoon-hyun committed Sep 8, 2023
    Commit: c8fa821

Commits on Sep 9, 2023

  1. [SPARK-45105][DOCS] Make hyperlinks in documents clickable

    ### What changes were proposed in this pull request?
    This PR aims to make hyperlinks in the documents clickable, including running-on-mesos.html and running-on-yarn.html.
    
    ### Why are the changes needed?
    Improve the convenience of using Spark documents.
    
    Before:
    <img width="1372" alt="image" src="https://github.com/apache/spark/assets/15246973/eea24735-babe-4008-ab96-ec2c29ebafd5">
    
    After:
    <img width="571" alt="image" src="https://github.com/apache/spark/assets/15246973/1ff1098b-c412-4f3d-b66c-825046691408">
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Manual testing.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes #42854 from panbingkun/SPARK-45105.
    
    Authored-by: panbingkun <pbk1982@gmail.com>
    Signed-off-by: Sean Owen <srowen@gmail.com>
    panbingkun authored and srowen committed Sep 9, 2023
    Commit: 445c541

Commits on Sep 11, 2023

  1. [SPARK-45109][SQL][CONNECT] Fix aes_decrypt and ln functions in Connect

    ### What changes were proposed in this pull request?
    Fix the `aes_decrypt` and `ln` implementations in Spark Connect. The previous `aes_decrypt` reference to `aes_encrypt` is clearly a bug. The `ln` reference to `log` is more of a cosmetic issue, but because the `ln` and `log` implementations differ in Spark SQL, Spark Connect should use the same implementation too.
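The bug class here is a name-to-implementation mapping pointing at the wrong target; a tiny registry sketch of why the mapping must be exact (illustrative, not Spark Connect's code):

```python
import math

# Correct registry: each SQL function name maps to its own implementation.
# The bug being fixed is the equivalent of mapping "ln" to the wrong entry.
FUNCTIONS = {
    "ln": math.log,        # natural logarithm
    "log10": math.log10,   # base-10 logarithm
}

def call(name, *args):
    return FUNCTIONS[name](*args)
```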
    
    ### Why are the changes needed?
    Bugfix.
    
    ### Does this PR introduce _any_ user-facing change?
    No, these Spark Connect functions haven't been released.
    
    ### How was this patch tested?
    Existing UTs.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes #42863 from peter-toth/SPARK-45109-fix-eas_decrypt-and-ln.
    
    Authored-by: Peter Toth <peter.toth@gmail.com>
    Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
    peter-toth authored and zhengruifeng committed Sep 11, 2023
    Commit: 5e97c79
  2. [SPARK-45027][PYTHON] Hide internal functions/variables in `pyspark.sql.functions` from auto-completion
    
    ### What changes were proposed in this pull request?
    Hide internal functions/variables in `pyspark.sql.functions` from auto-completion
    
    ### Why are the changes needed?
    To hide internal functions/variables that can be confusing, e.g. the internal helper functions `to_str` and `get_active_spark_context`.
    
    before this PR:
    
    <img width="560" alt="image" src="https://github.com/apache/spark/assets/7322292/ab87d0e8-3ba2-4c71-8c06-aeef939778cf">
    
    <img width="915" alt="image" src="https://github.com/apache/spark/assets/7322292/e138804f-8a7a-4526-9b1a-8338438e14e3">
    
    after this PR:
    <img width="562" alt="image" src="https://github.com/apache/spark/assets/7322292/e1710729-cf8f-49d4-b276-4632a88ea5ec">
    
    <img width="774" alt="image" src="https://github.com/apache/spark/assets/7322292/50b8e6f7-9dba-46e6-97f5-5cf8b115bffb">
    
    ### Does this PR introduce _any_ user-facing change?
    yes
    
    ### How was this patch tested?
    manually check
    
    ### Was this patch authored or co-authored using generative AI tooling?
    no
    
    Closes #42745 from zhengruifeng/hide_private_from_completion.
    
    Authored-by: Ruifeng Zheng <ruifengz@apache.org>
    Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
    zhengruifeng committed Sep 11, 2023
    Commit: 1052960
  3. [SPARK-45044][PYTHON][DOCS] Refine docstring of groupBy/rollup/cube

    ### What changes were proposed in this pull request?
    This PR aims to refine the docstrings of `DataFrame.groupBy/rollup/cube` and fix potentially wrong underline lengths.
    
    ### Why are the changes needed?
    - To improve PySpark documentation.
    
    - Fix potentially wrong underline length.
       <img width="951" alt="image" src="https://github.com/apache/spark/assets/15246973/8f5e8648-7670-4dce-860b-bd12c52e73f3">
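In reST/numpydoc sections, the underline is a run of one repeated punctuation character at least as long as the heading text, otherwise Sphinx warns; a quick check sketch:

```python
def underline_ok(heading, underline):
    # One repeated punctuation character, at least as long as the heading.
    return (len(set(underline)) == 1
            and len(underline) >= len(heading))

print(underline_ok("Parameters", "----------"))  # True
print(underline_ok("Parameters", "------"))      # False: too short
```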
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    - Pass GA.
    - Manually test.
    ```
    cd python/docs
    make clean html
    ```
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes #42834 from panbingkun/SPARK-45044.
    
    Authored-by: panbingkun <pbk1982@gmail.com>
    Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
    panbingkun authored and zhengruifeng committed Sep 11, 2023
    Commit: eb0b09f
  4. [SPARK-43295][PS] Support string type columns for DataFrameGroupBy.sum

    ### What changes were proposed in this pull request?
    
    This PR proposes to support string type columns for `DataFrameGroupBy.sum`.
    
    ### Why are the changes needed?
    
    To match the behavior with latest pandas.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, from now on `DataFrameGroupBy.sum` follows the behavior of the latest pandas, as below:
    
    **Test DataFrame**
    ```python
    >>> psdf
       A    B  C      D
    0  1  3.1  a   True
    1  2  4.1  b  False
    2  1  4.1  b  False
    3  2  3.1  a   True
    ```
    
    **Before**
    ```python
    >>> psdf.groupby("A").sum().sort_index()
         B  D
    A
    1  7.2  1
    2  7.2  1
    ```
    
    **After**
    ```python
    >>> psdf.groupby("A").sum().sort_index()
         B   C  D
    A
    1  7.2  ab  1
    2  7.2  ba  1
    ```
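The new behavior mirrors Python's own semantics, where "summing" strings means concatenation; a plain-Python sketch of per-group reduction reproducing the table above (not the pandas-on-Spark implementation):

```python
from collections import defaultdict
from functools import reduce
from operator import add

def group_sum(rows, key):
    """Sum every non-key column per group; `+` concatenates strings."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    out = {}
    for k, members in groups.items():
        cols = [c for c in members[0] if c != key]
        out[k] = {c: reduce(add, (m[c] for m in members)) for c in cols}
    return out

rows = [
    {"A": 1, "B": 3.1, "C": "a", "D": True},
    {"A": 2, "B": 4.1, "C": "b", "D": False},
    {"A": 1, "B": 4.1, "C": "b", "D": False},
    {"A": 2, "B": 3.1, "C": "a", "D": True},
]
```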
    
    ### How was this patch tested?
    
    Updated the existing UTs to support string type columns.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes #42798 from itholic/SPARK-43295.
    
    Authored-by: Haejoon Lee <haejoon.lee@databricks.com>
    Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
    itholic authored and zhengruifeng committed Sep 11, 2023
    Commit: 3d119a5