This repository has been archived by the owner on Sep 18, 2023. It is now read-only.

[NSE-206]Update installation guide and configuration guide. #289

Merged
merged 2 commits on Apr 27, 2021
27 changes: 21 additions & 6 deletions README.md
@@ -40,7 +40,20 @@ We implemented columnar shuffle to improve the shuffle performance. With the col

Please check the operator support details [here](./docs/operators.md)

## Build the Plugin
## How to use OAP: Native SQL Engine

There are three ways to use OAP: Native SQL Engine:
1. Use precompiled jars
2. Build in a Conda environment
3. Build from source yourself

### Use precompiled jars

Please go to [OAP's Maven Central Repository](https://repo1.maven.org/maven2/com/intel/oap/) to find Native SQL Engine jars.
For usage, you will need the two jar files below:
1. `spark-arrow-datasource-standard-<version>-jar-with-dependencies.jar`, located in `com/intel/oap/spark-arrow-datasource-standard/<version>/`
2. `spark-columnar-core-<version>-jar-with-dependencies.jar`, located in `com/intel/oap/spark-columnar-core/<version>/`

Please note that these are fat jars shipped with our custom Arrow library and pre-compiled on our server (using GCC 9.3.0 and LLVM 7.0.1), which means you need GCC 9.3.0 and LLVM 7.0.1 pre-installed on your system for normal usage.
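
For example, you could fetch both jars for a given release directly from that repository. The sketch below is illustrative only; `1.1.0` is used as a stand-in for whatever `<version>` you actually need:

```
# Illustrative only: replace 1.1.0 with the Native SQL Engine release you need.
VERSION=1.1.0
BASE=https://repo1.maven.org/maven2/com/intel/oap

wget ${BASE}/spark-arrow-datasource-standard/${VERSION}/spark-arrow-datasource-standard-${VERSION}-jar-with-dependencies.jar
wget ${BASE}/spark-columnar-core/${VERSION}/spark-columnar-core-${VERSION}-jar-with-dependencies.jar
```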

### Building by Conda

@@ -51,18 +64,18 @@ Then you can just skip the steps below and jump to [Get Started](#get-started).

If you prefer to build from the source code yourself, please follow the steps below to set up your environment.

### Prerequisite
#### Prerequisite

There are some requirements before you build the project.
Please check the document [Prerequisite](./docs/Prerequisite.md) and make sure you have already installed the required software on your system.
If you are running a Spark cluster, please make sure all the software is installed on every single node.

### Installation
Please check the document [Installation Guide](./docs/Installation.md)
#### Installation

### Configuration & Testing
Please check the document [Configuration Guide](./docs/Configuration.md)
Please check the document [Installation Guide](./docs/Installation.md)

## Get started

To enable OAP Native SQL Engine, the previously built jar `spark-columnar-core-<version>-jar-with-dependencies.jar` should be added to the Spark configuration. We also recommend using `spark-arrow-datasource-standard-<version>-jar-with-dependencies.jar`. We will demonstrate an example using both jar files.
Spark-related options are:

@@ -75,6 +88,8 @@ Spark-related options are:
For Spark Standalone Mode, please set the above value as a relative path to the jar file.
For Spark Yarn Cluster Mode, please set the above value as an absolute path to the jar file.
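
As a sketch of the classpath settings mentioned above (the full option list is in the Configuration Guide), the corresponding entries in spark-defaults.conf could look like the following, assuming the two jars were copied to a hypothetical /path/to/oap-jars directory:

```
spark.driver.extraClassPath   /path/to/oap-jars/spark-arrow-datasource-standard-<version>-jar-with-dependencies.jar:/path/to/oap-jars/spark-columnar-core-<version>-jar-with-dependencies.jar
spark.executor.extraClassPath /path/to/oap-jars/spark-arrow-datasource-standard-<version>-jar-with-dependencies.jar:/path/to/oap-jars/spark-columnar-core-<version>-jar-with-dependencies.jar
```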

For more configuration options, please check the document [Configuration Guide](./docs/Configuration.md)

Example: run Spark Shell with the ArrowDataSource jar file
```
${SPARK_HOME}/bin/spark-shell \
  ...
```
42 changes: 40 additions & 2 deletions docs/Configuration.md
@@ -1,6 +1,45 @@
# Spark Configurations for Native SQL Engine

Add below configuration to spark-defaults.conf
There are many configurations that could impact the Native SQL Engine performance, and they can be fine-tuned in Spark.
You can add these configurations to spark-defaults.conf to enable or disable each setting.

| Parameters | Description | Recommend Setting |
| ---------- | ----------- | --------------- |
| spark.driver.extraClassPath | To add Arrow Data Source and Native SQL Engine jar file in Spark Driver | /path/to/jar_file1:/path/to/jar_file2 |
| spark.executor.extraClassPath | To add Arrow Data Source and Native SQL Engine jar file in Spark Executor | /path/to/jar_file1:/path/to/jar_file2 |
| spark.executorEnv.LIBARROW_DIR | To set up the location of the Arrow library; by default it will search the location where the jar is uncompressed | /path/to/arrow_library/ |
| spark.executorEnv.CC | To set up the location of gcc | /path/to/gcc/ |
| spark.executor.memory| To set up how much memory to be used for Spark Executor. | |
| spark.memory.offHeap.size| To set up how much memory to be used for Java OffHeap.<br /> Please notice Native SQL Engine will leverage this setting to allocate memory space for native usage even offHeap is disabled. <br /> The value is based on your system and it is recommended to set it larger if you are facing Out of Memory issue in Native SQL Engine | 30G |
| spark.executor.extraJavaOptions | To set up how much Direct Memory to be used for Native SQL Engine. The value is based on your system and it is recommended to set it larger if you are facing Out of Memory issue in Native SQL Engine | -XX:MaxDirectMemorySize=30G |
| spark.sql.sources.useV1SourceList | Choose to use V1 source | avro |
| spark.sql.join.preferSortMergeJoin | To turn off preferSortMergeJoin in Spark | false |
| spark.sql.extensions | To turn on Native SQL Engine Plugin | com.intel.oap.ColumnarPlugin |
| spark.shuffle.manager | To turn on Native SQL Engine Columnar Shuffle Plugin | org.apache.spark.shuffle.sort.ColumnarShuffleManager |
| spark.oap.sql.columnar.batchscan | Enable or Disable Columnar Batchscan, default is true | true |
| spark.oap.sql.columnar.hashagg | Enable or Disable Columnar Hash Aggregate, default is true | true |
| spark.oap.sql.columnar.projfilter | Enable or Disable Columnar Project and Filter, default is true | true |
| spark.oap.sql.columnar.codegen.sort | Enable or Disable Columnar Sort, default is true | true |
| spark.oap.sql.columnar.window | Enable or Disable Columnar Window, default is true | true |
| spark.oap.sql.columnar.shuffledhashjoin | Enable or Disable Columnar ShuffledHashJoin, default is true | true |
| spark.oap.sql.columnar.sortmergejoin | Enable or Disable Columnar Sort Merge Join, default is true | true |
| spark.oap.sql.columnar.union | Enable or Disable Columnar Union, default is true | true |
| spark.oap.sql.columnar.expand | Enable or Disable Columnar Expand, default is true | true |
| spark.oap.sql.columnar.broadcastexchange | Enable or Disable Columnar Broadcast Exchange, default is true | true |
| spark.oap.sql.columnar.nanCheck | Enable or Disable Nan Check, default is true | true |
| spark.oap.sql.columnar.hashCompare | Enable or Disable Hash Compare in HashJoins or HashAgg, default is true | true |
| spark.oap.sql.columnar.broadcastJoin | Enable or Disable Columnar BroadcastHashJoin, default is true | true |
| spark.oap.sql.columnar.wholestagecodegen | Enable or Disable Columnar WholeStageCodeGen, default is true | true |
| spark.oap.sql.columnar.preferColumnar | Enable or Disable Columnar Operators, default is false.<br /> This parameter could impact performance differently in different cases. In some cases, setting it to false can give a performance boost. | false |
| spark.oap.sql.columnar.joinOptimizationLevel | Fall back to row operators if there are several continuous joins | 6 |
| spark.sql.execution.arrow.maxRecordsPerBatch | Set up the Max Records per Batch | 10000 |
| spark.oap.sql.columnar.wholestagecodegen.breakdownTime | Enable or Disable metrics in Columnar WholeStageCodeGen | false |
| spark.oap.sql.columnar.tmp_dir | Set up a folder to store the codegen files | /tmp |
| spark.oap.sql.columnar.shuffle.customizedCompression.codec | Set up the codec to be used for Columnar Shuffle, default is lz4 | lz4 |
| spark.oap.sql.columnar.numaBinding | Set up NUMABinding, default is false | true |
| spark.oap.sql.columnar.coreRange | Set up the core range for NUMABinding; only works when numaBinding is set to true. <br /> The setting is based on the number of cores in your system. Use 72 cores as an example. | 0-17,36-53 &#124;18-35,54-71 |

Below is an example spark-defaults.conf, if you are using Conda to install the OAP project.

```
##### Columnar Process Configuration
@@ -26,4 +65,3 @@ export CC=$HOME/miniconda2/envs/oapenv/bin/gcc
export LIBARROW_DIR=$HOME/miniconda2/envs/oapenv/
```
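
The same settings can also be passed per job with `--conf` instead of editing spark-defaults.conf. A hypothetical spark-shell invocation, assuming the jars live under /path/to/oap-jars, might look like:

```
${SPARK_HOME}/bin/spark-shell \
  --conf spark.driver.extraClassPath=/path/to/oap-jars/spark-arrow-datasource-standard-<version>-jar-with-dependencies.jar:/path/to/oap-jars/spark-columnar-core-<version>-jar-with-dependencies.jar \
  --conf spark.executor.extraClassPath=/path/to/oap-jars/spark-arrow-datasource-standard-<version>-jar-with-dependencies.jar:/path/to/oap-jars/spark-columnar-core-<version>-jar-with-dependencies.jar \
  --conf spark.sql.extensions=com.intel.oap.ColumnarPlugin \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
  --conf spark.memory.offHeap.size=30G \
  --conf "spark.executor.extraJavaOptions=-XX:MaxDirectMemorySize=30G" \
  --conf spark.oap.sql.columnar.sortmergejoin=false
```

Setting `spark.oap.sql.columnar.sortmergejoin=false` here is only an example of toggling one of the columnar operators listed in the table above.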

About arrow-data-source.jar, you can refer to [Unified Arrow Data Source](https://oap-project.github.io/arrow-data-source/).
21 changes: 12 additions & 9 deletions docs/Installation.md
@@ -14,17 +14,20 @@ yum install gmock
``` shell
git clone -b ${version} https://github.com/oap-project/native-sql-engine.git
cd oap-native-sql
cd cpp/
mkdir build/
cd build/
cmake .. -DTESTS=ON
make -j
mvn clean package -DskipTests -Dcpp_tests=OFF -Dbuild_arrow=ON -Dcheckstyle.skip
```

``` shell
cd ../../core/
mvn clean package -DskipTests
```
Depending on your environment, there are some parameters that can be set via -D with mvn.

| Parameters | Description | Default Value |
| ---------- | ----------- | ------------- |
| cpp_tests | Enable or Disable CPP Tests | False |
| build_arrow | Build Arrow from Source | True |
| arrow_root | When build_arrow is set to False, arrow_root will be used to find the location of your existing Arrow library. | /usr/local |
| build_protobuf | Build Protobuf from source. If set to False, the default library path will be used to find the Protobuf library. | True |

When build_arrow is set to True, build_arrow.sh will be launched to compile a custom Arrow library from [OAP Arrow](https://github.com/oap-project/arrow).
If you wish to change any Arrow parameters, you can change them in the build_arrow.sh script under native-sql-engine/arrow-data-source/script/.
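
For instance, a build that reuses an Arrow library already installed under /usr/local instead of compiling Arrow from source might look like the sketch below (adjust arrow_root to your own installation):

``` shell
mvn clean package -DskipTests -Dcpp_tests=OFF -Dbuild_arrow=OFF -Darrow_root=/usr/local -Dbuild_protobuf=ON -Dcheckstyle.skip
```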

### Additional Notes
[Notes for Installation Issues](./InstallationNotes.md)