Merge branch 'branch-22.08' of github.com:NVIDIA/spark-rapids into add-regex-choice-to-fuzzer
anthony-chang committed Aug 2, 2022
2 parents 3368512 + 19a6957 commit a25b0ea
Showing 204 changed files with 15,088 additions and 2,198 deletions.
8 changes: 4 additions & 4 deletions .github/workflows/auto-merge.yml
@@ -18,7 +18,7 @@ name: auto-merge HEAD to BASE
on:
pull_request_target:
branches:
- branch-22.06
- branch-22.08
types: [closed]

jobs:
@@ -29,13 +29,13 @@ jobs:
steps:
- uses: actions/checkout@v2
with:
ref: branch-22.06 # force to fetch from latest upstream instead of PR ref
ref: branch-22.08 # force to fetch from latest upstream instead of PR ref

- name: auto-merge job
uses: ./.github/workflows/auto-merge
env:
OWNER: NVIDIA
REPO_NAME: spark-rapids
HEAD: branch-22.06
BASE: branch-22.08
HEAD: branch-22.08
BASE: branch-22.10
AUTOMERGE_TOKEN: ${{ secrets.AUTOMERGE_TOKEN }} # use to merge PR
4 changes: 2 additions & 2 deletions CONTRIBUTING.md
@@ -86,8 +86,8 @@ There is a build script `build/buildall` that automates the local build process.
`./build/buildall --help` for up-to-date usage information.

By default, it builds everything that is needed to create a distribution jar for all released (noSnapshots) Spark versions except for Databricks. Other profiles that you can pass using `--profile=<distribution profile>` include
- `snapshots`
- `minimumFeatureVersionMix`, which currently includes 321cdh, 312, 320, and is recommended for catching incompatibilities early in the local development cycle
- `snapshots` that includes all released (noSnapshots) and snapshots Spark versions except for Databricks
- `minimumFeatureVersionMix`, which currently includes 321cdh, 312, 320, 330, and is recommended for catching incompatibilities early in the local development cycle

For initial quick iterations we can use `--profile=<buildver>` to build a single-shim version. e.g., `--profile=311` for Spark 3.1.1.
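
For example, a few sketch invocations of the build script using the profiles described above (run from the repository root):

```shell
./build/buildall --help                              # up-to-date usage information
./build/buildall                                     # default: all released Spark versions except Databricks
./build/buildall --profile=snapshots                 # released plus snapshot Spark versions
./build/buildall --profile=minimumFeatureVersionMix  # minimal version mix for catching incompatibilities
./build/buildall --profile=311                       # single-shim build for Spark 3.1.1
```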

30 changes: 30 additions & 0 deletions NOTICE
@@ -1,6 +1,8 @@
RAPIDS plugin for Apache Spark
Copyright (c) 2019-2022, NVIDIA CORPORATION

--------------------------------------------------------------------------------

// ------------------------------------------------------------------
// NOTICE file corresponding to the section 4d of The Apache License,
// Version 2.0, in this case for
@@ -12,6 +14,34 @@ Copyright 2014 and onwards The Apache Software Foundation
This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).

--------------------------------------------------------------------------------

Apache Iceberg
Copyright 2017-2022 The Apache Software Foundation

This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).

--------------------------------------------------------------------------------

This project includes code from Kite, developed at Cloudera, Inc. with
the following copyright notice:

| Copyright 2013 Cloudera Inc.
|
| Licensed under the Apache License, Version 2.0 (the "License");
| you may not use this file except in compliance with the License.
| You may obtain a copy of the License at
|
| http://www.apache.org/licenses/LICENSE-2.0
|
| Unless required by applicable law or agreed to in writing, software
| distributed under the License is distributed on an "AS IS" BASIS,
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
| See the License for the specific language governing permissions and
| limitations under the License.

--------------------------------------------------------------------------------

This product bundles various third-party components under other open source licenses.

30 changes: 29 additions & 1 deletion NOTICE-binary
@@ -12,7 +12,35 @@ Copyright 2014 and onwards The Apache Software Foundation
This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).

---------------------------------------------------------------------
--------------------------------------------------------------------------------

Apache Iceberg
Copyright 2017-2022 The Apache Software Foundation

This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).

--------------------------------------------------------------------------------

This project includes code from Kite, developed at Cloudera, Inc. with
the following copyright notice:

| Copyright 2013 Cloudera Inc.
|
| Licensed under the Apache License, Version 2.0 (the "License");
| you may not use this file except in compliance with the License.
| You may obtain a copy of the License at
|
| http://www.apache.org/licenses/LICENSE-2.0
|
| Unless required by applicable law or agreed to in writing, software
| distributed under the License is distributed on an "AS IS" BASIS,
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
| See the License for the specific language governing permissions and
| limitations under the License.

--------------------------------------------------------------------------------

UCF Consortium - Unified Communication X (UCX)

Copyright (c) 2014-2015 UT-Battelle, LLC. All rights reserved.
2 changes: 2 additions & 0 deletions build/buildall
@@ -159,6 +159,7 @@ case $DIST_PROFILE in
320
321
322
330
331
)
;;
@@ -171,6 +172,7 @@ case $DIST_PROFILE in
313
320
321
322
330
)
;;
2 changes: 1 addition & 1 deletion dist/pom.xml
@@ -47,11 +47,11 @@
320,
321,
321cdh,
322,
330
</noSnapshot.buildvers>
<snapshot.buildvers>
314,
322,
331
</snapshot.buildvers>
<databricks.buildvers>
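
For reference, a single shim from these lists can be selected for a local Maven build via the `buildver` property, e.g. (a sketch; exact goals and flags may vary):

```shell
mvn package -Dbuildver=322 -DskipTests
```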
2 changes: 1 addition & 1 deletion dist/unshimmed-common-from-spark311.txt
@@ -28,6 +28,6 @@ com/nvidia/spark/rapids/SparkShimVersion*
com/nvidia/spark/rapids/SparkShims*
com/nvidia/spark/udf/Plugin*
org/apache/spark/sql/rapids/ProxyRapidsShuffleInternalManagerBase*
org/apache/spark/sql/rapids/VisibleShuffleManager*
org/apache/spark/sql/rapids/RapidsShuffleManagerLike*
rapids/*.py
rapids4spark-version-info.properties
1 change: 1 addition & 0 deletions dist/unshimmed-from-each-spark3xx.txt
@@ -1,5 +1,6 @@
com/nvidia/spark/rapids/*/RapidsShuffleManager*
com/nvidia/spark/rapids/AvroProvider.class
com/nvidia/spark/rapids/HiveProvider.class
com/nvidia/spark/rapids/iceberg/IcebergProvider.class
org/apache/spark/sql/rapids/shims/*/ProxyRapidsShuffleInternalManager*
spark-*-info.properties
7 changes: 5 additions & 2 deletions docs/FAQ.md
@@ -86,8 +86,11 @@ of focus right now. Other areas like GraphX or RDDs are not accelerated.
### Is the Spark `Dataset` API supported?

The RAPIDS Accelerator supports the `DataFrame` API which is implemented in Spark as `Dataset[Row]`.
If you are using `Dataset[Row]` that is equivalent to the `DataFrame` API. However using custom
classes or types with `Dataset` is not supported. Such queries should still execute correctly when
If you are using `Dataset[Row]`, that is equivalent to the `DataFrame` API. In either case the
operations that can be accelerated on the GPU are limited. For example, using custom
classes or types with `Dataset` is not supported, nor are APIs that take a `Row` as input
or that take Scala or Java functions to operate on. This includes operators like `flatMap`,
`foreach`, or `foreachPartition`. Such queries will still execute correctly when
using the RAPIDS Accelerator, but it is likely most query operations will not be performed on the
GPU.
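
A minimal Scala sketch of the distinction (illustrative only; assumes a `SparkSession` named `spark`):

```scala
import spark.implicits._

val df = spark.range(10).toDF("n")                 // DataFrame, i.e. Dataset[Row]
val projected = df.selectExpr("n * 2 AS n2")       // DataFrame operation, eligible for GPU acceleration
val doubled = df.as[Long].flatMap(n => Seq(n, -n)) // typed flatMap takes a Scala function, so it runs on the CPU
```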

62 changes: 62 additions & 0 deletions docs/additional-functionality/iceberg-support.md
@@ -0,0 +1,62 @@
---
layout: page
title: Apache Iceberg Support
parent: Additional Functionality
nav_order: 7
---

# Apache Iceberg Support

The RAPIDS Accelerator for Apache Spark provides limited support for Apache Iceberg tables.
This document details the Apache Iceberg features that are supported.

## Apache Iceberg Versions

The RAPIDS Accelerator supports Apache Iceberg 0.13.x. Earlier versions of Apache Iceberg are
not supported.

## Reading Tables

### Metadata Queries

Reads of Apache Iceberg metadata, i.e., the `history`, `snapshots`, and other metadata tables
associated with a table, will not be GPU-accelerated. The CPU will continue to process these
metadata-level queries.
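
For example, metadata-table queries of this form (standard Iceberg metadata-table syntax; catalog and table names hypothetical) will run on the CPU:

```sql
SELECT * FROM mycatalog.db.mytable.history;
SELECT * FROM mycatalog.db.mytable.snapshots;
```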

### Row-level Delete and Update Support

Apache Iceberg supports row-level deletions and updates. Tables configured with
`write.delete.mode=merge-on-read` are not supported.
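
As an illustration, a table created with merge-on-read deletes, as sketched below with a hypothetical table name, would not be supported:

```sql
CREATE TABLE mycatalog.db.events (id BIGINT, data STRING)
USING iceberg
TBLPROPERTIES ('write.delete.mode' = 'merge-on-read');
```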

### Schema Evolution

Columns that are added and removed at the top level of the table schema are supported. Columns
that are added or removed within struct columns are not supported.

### Data Formats

Apache Iceberg can store data in various formats. Each section below details the levels of support
for each of the underlying data formats.

#### Parquet

Data stored in Parquet is supported, with the same limitations as loading data from raw Parquet
files. See the [Input/Output](../supported_ops.md#inputoutput) documentation for details. The
following compression codecs applied to the Parquet data are supported:
- gzip (Apache Iceberg default)
- snappy
- uncompressed
- zstd
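
The codec can be set through the standard Iceberg table property `write.parquet.compression-codec`, for example (table name hypothetical):

```sql
ALTER TABLE mycatalog.db.events
SET TBLPROPERTIES ('write.parquet.compression-codec' = 'snappy');
```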

#### ORC

The RAPIDS Accelerator does not support Apache Iceberg tables using the ORC data format.

#### Avro

The RAPIDS Accelerator does not support Apache Iceberg tables using the Avro data format.

## Writing Tables

The RAPIDS Accelerator for Apache Spark does not accelerate Apache Iceberg writes. Writes
to Iceberg tables will be processed by the CPU.
1 change: 1 addition & 0 deletions docs/additional-functionality/rapids-shuffle.md
@@ -286,6 +286,7 @@ In this section, we are using a docker container built using the sample dockerfi
| 3.2.0 | com.nvidia.spark.rapids.spark320.RapidsShuffleManager |
| 3.2.1 | com.nvidia.spark.rapids.spark321.RapidsShuffleManager |
| 3.2.1 CDH | com.nvidia.spark.rapids.spark321cdh.RapidsShuffleManager |
| 3.2.2 | com.nvidia.spark.rapids.spark322.RapidsShuffleManager |
| 3.3.0 | com.nvidia.spark.rapids.spark330.RapidsShuffleManager |
| Databricks 9.1 | com.nvidia.spark.rapids.spark312db.RapidsShuffleManager |
| Databricks 10.4 | com.nvidia.spark.rapids.spark321db.RapidsShuffleManager |
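
A sketch of selecting one of these classes, e.g. for Spark 3.2.2 (the additional RAPIDS shuffle configuration described elsewhere in this document still applies):

```shell
--conf spark.shuffle.manager=com.nvidia.spark.rapids.spark322.RapidsShuffleManager
```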
@@ -29,6 +29,7 @@ FROM nvidia/cuda:${CUDA_VER}-runtime-ubuntu18.04
ARG UCX_VER
ARG UCX_CUDA_VER

RUN apt-get update && apt-get install -y gnupg2
# https://forums.developer.nvidia.com/t/notice-cuda-linux-repository-key-rotation/212771
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub

@@ -36,6 +36,7 @@ ARG UCX_CUDA_VER=11
FROM ubuntu:18.04 as rdma_core
ARG RDMA_CORE_VERSION

RUN apt-get update && apt-get install -y gnupg2
# https://forums.developer.nvidia.com/t/notice-cuda-linux-repository-key-rotation/212771
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub
RUN apt update && apt install -y dh-make wget build-essential cmake gcc libudev-dev libnl-3-dev libnl-route-3-dev ninja-build pkg-config valgrind python3-dev cython3 python3-docutils pandoc
2 changes: 2 additions & 0 deletions docs/compatibility.md
@@ -58,6 +58,8 @@ Spark getting a value of `1.03` but under the RAPIDS accelerator it produces `1.
Python will produce `1.02`, Java does not have the ability to do a round like this built in, but if
you do the simple operation of `Math.round(1.025 * 100.0)/100.0` you also get `1.02`.

For the `degrees` function, Spark's implementation relies on the JDK's built-in `Math.toDegrees`, which computes `angrad * 180.0 / PI` in Java 8 but `angrad * (180d / PI)` in Java 9+, so results can differ across JDK runtime versions where overflow is involved. The RAPIDS Accelerator follows the behavior of Java 9+. Therefore, on JDK 8 or below, `degrees` on the GPU will not overflow for some very large inputs while the CPU version does.
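
A small Java sketch of the difference (hypothetical input chosen to straddle the overflow threshold):

```java
double angrad = 2e306;
double java8Style = angrad * 180.0 / Math.PI;   // 2e306 * 180.0 overflows to Infinity first
double java9Style = angrad * (180.0 / Math.PI); // multiplies by ~57.3 only, stays finite (~1.15e308)
```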

For aggregations the underlying implementation is doing the aggregations in parallel and due to race
conditions within the computation itself the result may not be the same each time the query is
run. This is inherent in how the plugin speeds up the calculations and cannot be "fixed." If a query