Merge branch 'branch-22.08' of github.com:NVIDIA/spark-rapids into add-regex-choice-to-fuzzer
anthony-chang committed Aug 2, 2022
2 parents 3368512 + 19a6957 commit a25b0ea
Showing 204 changed files with 15,088 additions and 2,198 deletions.
8 changes: 4 additions & 4 deletions .github/workflows/auto-merge.yml
@@ -18,7 +18,7 @@ name: auto-merge HEAD to BASE
on:
pull_request_target:
branches:
- branch-22.06
- branch-22.08
types: [closed]

jobs:
@@ -29,13 +29,13 @@ jobs:
steps:
- uses: actions/checkout@v2
with:
ref: branch-22.06 # force to fetch from latest upstream instead of PR ref
ref: branch-22.08 # force to fetch from latest upstream instead of PR ref

- name: auto-merge job
uses: ./.github/workflows/auto-merge
env:
OWNER: NVIDIA
REPO_NAME: spark-rapids
HEAD: branch-22.06
BASE: branch-22.08
HEAD: branch-22.08
BASE: branch-22.10
AUTOMERGE_TOKEN: ${{ secrets.AUTOMERGE_TOKEN }} # use to merge PR
4 changes: 2 additions & 2 deletions CONTRIBUTING.md
@@ -86,8 +86,8 @@ There is a build script `build/buildall` that automates the local build process.
`./build/buildall --help` for up-to-date usage information.

By default, it builds everything that is needed to create a distribution jar for all released (noSnapshots) Spark versions except for Databricks. Other profiles that you can pass using `--profile=<distribution profile>` include
- `snapshots`
- `minimumFeatureVersionMix`, which currently includes 321cdh, 312, 320, and is recommended for catching incompatibilities early in the local development cycle
- `snapshots` that includes all released (noSnapshots) and snapshots Spark versions except for Databricks
- `minimumFeatureVersionMix`, which currently includes 321cdh, 312, 320, 330, and is recommended for catching incompatibilities early in the local development cycle

For initial quick iterations we can use `--profile=<buildver>` to build a single-shim version. e.g., `--profile=311` for Spark 3.1.1.
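
For example, a few sketch invocations of the build script using the profiles described above (run from the repository root):

```shell
./build/buildall --help                              # up-to-date usage information
./build/buildall                                     # default: all released Spark versions except Databricks
./build/buildall --profile=snapshots                 # released plus snapshot Spark versions
./build/buildall --profile=minimumFeatureVersionMix  # minimal version mix for catching incompatibilities
./build/buildall --profile=311                       # single-shim build for Spark 3.1.1
```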

30 changes: 30 additions & 0 deletions NOTICE
@@ -1,6 +1,8 @@
RAPIDS plugin for Apache Spark
Copyright (c) 2019-2022, NVIDIA CORPORATION

--------------------------------------------------------------------------------

// ------------------------------------------------------------------
// NOTICE file corresponding to the section 4d of The Apache License,
// Version 2.0, in this case for
@@ -12,6 +14,34 @@ Copyright 2014 and onwards The Apache Software Foundation
This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).

--------------------------------------------------------------------------------

Apache Iceberg
Copyright 2017-2022 The Apache Software Foundation

This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).

--------------------------------------------------------------------------------

This project includes code from Kite, developed at Cloudera, Inc. with
the following copyright notice:

| Copyright 2013 Cloudera Inc.
|
| Licensed under the Apache License, Version 2.0 (the "License");
| you may not use this file except in compliance with the License.
| You may obtain a copy of the License at
|
| http://www.apache.org/licenses/LICENSE-2.0
|
| Unless required by applicable law or agreed to in writing, software
| distributed under the License is distributed on an "AS IS" BASIS,
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
| See the License for the specific language governing permissions and
| limitations under the License.

--------------------------------------------------------------------------------

This product bundles various third-party components under other open source licenses.

30 changes: 29 additions & 1 deletion NOTICE-binary
@@ -12,7 +12,35 @@ Copyright 2014 and onwards The Apache Software Foundation
This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).

---------------------------------------------------------------------
--------------------------------------------------------------------------------

Apache Iceberg
Copyright 2017-2022 The Apache Software Foundation

This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).

--------------------------------------------------------------------------------

This project includes code from Kite, developed at Cloudera, Inc. with
the following copyright notice:

| Copyright 2013 Cloudera Inc.
|
| Licensed under the Apache License, Version 2.0 (the "License");
| you may not use this file except in compliance with the License.
| You may obtain a copy of the License at
|
| http://www.apache.org/licenses/LICENSE-2.0
|
| Unless required by applicable law or agreed to in writing, software
| distributed under the License is distributed on an "AS IS" BASIS,
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
| See the License for the specific language governing permissions and
| limitations under the License.

--------------------------------------------------------------------------------

UCF Consortium - Unified Communication X (UCX)

Copyright (c) 2014-2015 UT-Battelle, LLC. All rights reserved.
2 changes: 2 additions & 0 deletions build/buildall
@@ -159,6 +159,7 @@ case $DIST_PROFILE in
320
321
322
330
331
)
;;
@@ -171,6 +172,7 @@ case $DIST_PROFILE in
313
320
321
322
330
)
;;
2 changes: 1 addition & 1 deletion dist/pom.xml
@@ -47,11 +47,11 @@
320,
321,
321cdh,
322,
330
</noSnapshot.buildvers>
<snapshot.buildvers>
314,
322,
331
</snapshot.buildvers>
<databricks.buildvers>
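
For reference, a single shim from these lists can be selected for a local Maven build via the `buildver` property, e.g. (a sketch; exact goals and flags may vary):

```shell
mvn package -Dbuildver=322 -DskipTests
```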
2 changes: 1 addition & 1 deletion dist/unshimmed-common-from-spark311.txt
@@ -28,6 +28,6 @@ com/nvidia/spark/rapids/SparkShimVersion*
com/nvidia/spark/rapids/SparkShims*
com/nvidia/spark/udf/Plugin*
org/apache/spark/sql/rapids/ProxyRapidsShuffleInternalManagerBase*
org/apache/spark/sql/rapids/VisibleShuffleManager*
org/apache/spark/sql/rapids/RapidsShuffleManagerLike*
rapids/*.py
rapids4spark-version-info.properties
1 change: 1 addition & 0 deletions dist/unshimmed-from-each-spark3xx.txt
@@ -1,5 +1,6 @@
com/nvidia/spark/rapids/*/RapidsShuffleManager*
com/nvidia/spark/rapids/AvroProvider.class
com/nvidia/spark/rapids/HiveProvider.class
com/nvidia/spark/rapids/iceberg/IcebergProvider.class
org/apache/spark/sql/rapids/shims/*/ProxyRapidsShuffleInternalManager*
spark-*-info.properties
7 changes: 5 additions & 2 deletions docs/FAQ.md
@@ -86,8 +86,11 @@ of focus right now. Other areas like GraphX or RDDs are not accelerated.
### Is the Spark `Dataset` API supported?

The RAPIDS Accelerator supports the `DataFrame` API which is implemented in Spark as `Dataset[Row]`.
If you are using `Dataset[Row]` that is equivalent to the `DataFrame` API. However using custom
classes or types with `Dataset` is not supported. Such queries should still execute correctly when
If you are using `Dataset[Row]`, that is equivalent to the `DataFrame` API. In either case the
operations that can be accelerated on the GPU are limited. For example, using custom
classes or types with `Dataset` is not supported, nor are APIs that take a `Row` as input
or that take Scala or Java functions to operate on. This includes operators like `flatMap`,
`foreach`, or `foreachPartition`. Such queries will still execute correctly when
using the RAPIDS Accelerator, but it is likely most query operations will not be performed on the
GPU.
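
A minimal Scala sketch of the distinction (illustrative only; assumes a `SparkSession` named `spark`):

```scala
import spark.implicits._

val df = spark.range(10).toDF("n")                 // DataFrame, i.e. Dataset[Row]
val projected = df.selectExpr("n * 2 AS n2")       // DataFrame operation, eligible for GPU acceleration
val doubled = df.as[Long].flatMap(n => Seq(n, -n)) // typed flatMap takes a Scala function, so it runs on the CPU
```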

62 changes: 62 additions & 0 deletions docs/additional-functionality/iceberg-support.md
@@ -0,0 +1,62 @@
---
layout: page
title: Apache Iceberg Support
parent: Additional Functionality
nav_order: 7
---

# Apache Iceberg Support

The RAPIDS Accelerator for Apache Spark provides limited support for Apache Iceberg tables.
This document details the Apache Iceberg features that are supported.

## Apache Iceberg Versions

The RAPIDS Accelerator supports Apache Iceberg 0.13.x. Earlier versions of Apache Iceberg are
not supported.

## Reading Tables

### Metadata Queries

Reads of Apache Iceberg metadata, i.e., the `history`, `snapshots`, and other metadata tables
associated with a table, will not be GPU-accelerated. The CPU will continue to process these
metadata-level queries.
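
For example, metadata-table queries of this form (standard Iceberg metadata-table syntax; catalog and table names hypothetical) will run on the CPU:

```sql
SELECT * FROM mycatalog.db.mytable.history;
SELECT * FROM mycatalog.db.mytable.snapshots;
```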

### Row-level Delete and Update Support

Apache Iceberg supports row-level deletions and updates. Tables configured with
`write.delete.mode=merge-on-read` are not supported.
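
As an illustration, a table created with merge-on-read deletes, as sketched below with a hypothetical table name, would not be supported:

```sql
CREATE TABLE mycatalog.db.events (id BIGINT, data STRING)
USING iceberg
TBLPROPERTIES ('write.delete.mode' = 'merge-on-read');
```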

### Schema Evolution

Columns that are added and removed at the top level of the table schema are supported. Columns
that are added or removed within struct columns are not supported.

### Data Formats

Apache Iceberg can store data in various formats. Each section below details the levels of support
for each of the underlying data formats.

#### Parquet

Data stored in Parquet is supported, with the same limitations as loading data from raw Parquet
files. See the [Input/Output](../supported_ops.md#inputoutput) documentation for details. The
following compression codecs applied to the Parquet data are supported:
- gzip (Apache Iceberg default)
- snappy
- uncompressed
- zstd
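
The codec can be set through the standard Iceberg table property `write.parquet.compression-codec`, for example (table name hypothetical):

```sql
ALTER TABLE mycatalog.db.events
SET TBLPROPERTIES ('write.parquet.compression-codec' = 'snappy');
```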

#### ORC

The RAPIDS Accelerator does not support Apache Iceberg tables using the ORC data format.

#### Avro

The RAPIDS Accelerator does not support Apache Iceberg tables using the Avro data format.

## Writing Tables

The RAPIDS Accelerator for Apache Spark does not accelerate Apache Iceberg writes. Writes
to Iceberg tables will be processed by the CPU.
1 change: 1 addition & 0 deletions docs/additional-functionality/rapids-shuffle.md
@@ -286,6 +286,7 @@ In this section, we are using a docker container built using the sample dockerfi
| 3.2.0 | com.nvidia.spark.rapids.spark320.RapidsShuffleManager |
| 3.2.1 | com.nvidia.spark.rapids.spark321.RapidsShuffleManager |
| 3.2.1 CDH | com.nvidia.spark.rapids.spark321cdh.RapidsShuffleManager |
| 3.2.2 | com.nvidia.spark.rapids.spark322.RapidsShuffleManager |
| 3.3.0 | com.nvidia.spark.rapids.spark330.RapidsShuffleManager |
| Databricks 9.1 | com.nvidia.spark.rapids.spark312db.RapidsShuffleManager |
| Databricks 10.4 | com.nvidia.spark.rapids.spark321db.RapidsShuffleManager |
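
A sketch of selecting one of these classes, e.g. for Spark 3.2.2 (the additional RAPIDS shuffle configuration described elsewhere in this document still applies):

```shell
--conf spark.shuffle.manager=com.nvidia.spark.rapids.spark322.RapidsShuffleManager
```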
@@ -29,6 +29,7 @@ FROM nvidia/cuda:${CUDA_VER}-runtime-ubuntu18.04
ARG UCX_VER
ARG UCX_CUDA_VER

RUN apt-get update && apt-get install -y gnupg2
# https://forums.developer.nvidia.com/t/notice-cuda-linux-repository-key-rotation/212771
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub

@@ -36,6 +36,7 @@ ARG UCX_CUDA_VER=11
FROM ubuntu:18.04 as rdma_core
ARG RDMA_CORE_VERSION

RUN apt-get update && apt-get install -y gnupg2
# https://forums.developer.nvidia.com/t/notice-cuda-linux-repository-key-rotation/212771
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub
RUN apt update && apt install -y dh-make wget build-essential cmake gcc libudev-dev libnl-3-dev libnl-route-3-dev ninja-build pkg-config valgrind python3-dev cython3 python3-docutils pandoc
2 changes: 2 additions & 0 deletions docs/compatibility.md
@@ -58,6 +58,8 @@ Spark getting a value of `1.03` but under the RAPIDS accelerator it produces `1.
Python will produce `1.02`, Java does not have the ability to do a round like this built in, but if
you do the simple operation of `Math.round(1.025 * 100.0)/100.0` you also get `1.02`.

For the `degrees` function, Spark's implementation relies on the JDK's built-in `Math.toDegrees`, which computes `angrad * 180.0 / PI` in Java 8 but `angrad * (180d / PI)` in Java 9+, so results can differ across JDK runtime versions where overflow is involved. The RAPIDS Accelerator follows the behavior of Java 9+. Therefore, on JDK 8 or below, `degrees` on the GPU will not overflow for some very large inputs while the CPU version does.
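
A small Java sketch of the difference (hypothetical input chosen to straddle the overflow threshold):

```java
double angrad = 2e306;
double java8Style = angrad * 180.0 / Math.PI;   // 2e306 * 180.0 overflows to Infinity first
double java9Style = angrad * (180.0 / Math.PI); // multiplies by ~57.3 only, stays finite (~1.15e308)
```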

For aggregations the underlying implementation is doing the aggregations in parallel and due to race
conditions within the computation itself the result may not be the same each time the query is
run. This is inherent in how the plugin speeds up the calculations and cannot be "fixed." If a query