diff --git a/README.md b/README.md
index 6e90b2659..20b0a6f55 100644
--- a/README.md
+++ b/README.md
@@ -24,7 +24,7 @@ With [Spark 27396](https://issues.apache.org/jira/browse/SPARK-27396) its possib
 ![Overview](./docs/image/dataset.png)
-A native parquet reader was developed to speed up the data loading. it's based on Apache Arrow Dataset. For details please check [Arrow Data Source](https://github.com/oap-project/arrow-data-source)
+A native parquet reader was developed to speed up the data loading. It's based on Apache Arrow Dataset. For details please check [Arrow Data Source](https://github.com/oap-project/native-sql-engine/tree/master/arrow-data-source)
 ### Apache Arrow Compute/Gandiva based operators
@@ -101,7 +101,7 @@ orders.createOrReplaceTempView("orders")
 spark.sql("select * from orders where o_orderdate > date '1998-07-26'").show(20000, false)
 ```
-The result should showup on Spark console and you can check the DAG diagram with some Columnar Processing stage.
+The result should show up on the Spark console, and you can check the DAG diagram for the Columnar Processing stages. Native SQL Engine still lacks some features; please check out the [limitations](./docs/limitations.md).
 ## Performance data
diff --git a/docs/ApacheArrowInstallation.md b/docs/ApacheArrowInstallation.md
new file mode 100644
index 000000000..4e0647f74
--- /dev/null
+++ b/docs/ApacheArrowInstallation.md
@@ -0,0 +1,70 @@
+# llvm-7.0:
+Arrow Gandiva depends on LLVM, and the current version strictly requires LLVM 7.0; if any other version is installed, the build will fail.
+``` shell
+wget http://releases.llvm.org/7.0.1/llvm-7.0.1.src.tar.xz
+tar xf llvm-7.0.1.src.tar.xz
+cd llvm-7.0.1.src/
+cd tools
+wget http://releases.llvm.org/7.0.1/cfe-7.0.1.src.tar.xz
+tar xf cfe-7.0.1.src.tar.xz
+mv cfe-7.0.1.src clang
+cd ..
+mkdir build
+cd build
+cmake .. -DCMAKE_BUILD_TYPE=Release
+cmake --build . -j
+cmake --build . --target install
+# check if clang has also been compiled; if not:
+cd tools/clang
+mkdir build
+cd build
+cmake ..
+make -j
+make install
+```
+
+# cmake:
+Arrow will download packages during compilation, so cmake needs SSL support; building cmake yourself as below is optional if your existing cmake already supports it.
+``` shell
+wget https://github.com/Kitware/CMake/releases/download/v3.15.0-rc4/cmake-3.15.0-rc4.tar.gz
+tar xf cmake-3.15.0-rc4.tar.gz
+cd cmake-3.15.0-rc4/
+./bootstrap --system-curl --parallel=64 # parallel num depends on your server core number
+make -j
+make install
+cmake --version
+cmake version 3.15.0-rc4
+```
+
+# Apache Arrow
+``` shell
+git clone https://github.com/Intel-bigdata/arrow.git
+cd arrow && git checkout branch-0.17.0-oap-1.0
+mkdir -p cpp/release-build
+cd cpp/release-build
+cmake -DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_HDFS=ON -DARROW_BOOST_USE_SHARED=ON -DARROW_JNI=ON -DARROW_DATASET=ON -DARROW_WITH_PROTOBUF=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_LZ4=ON -DARROW_FILESYSTEM=ON -DARROW_JSON=ON ..
+make -j
+make install
+
+# build java
+cd ../../java
+# change the property 'arrow.cpp.build.dir' to the relative path of the cpp build dir in gandiva/pom.xml
+mvn clean install -P arrow-jni -am -Darrow.cpp.build.dir=../cpp/release-build/release/ -DskipTests
+# if you are behind a proxy, please also add a socks proxy
+mvn clean install -P arrow-jni -am -Darrow.cpp.build.dir=../cpp/release-build/release/ -DskipTests -DsocksProxyHost=${proxyHost} -DsocksProxyPort=1080
+```
+
+Run the tests:
+``` shell
+mvn test -pl adapter/parquet -P arrow-jni
+mvn test -pl gandiva -P arrow-jni
+```
+
+# Copy binary files to oap-native-sql resources directory
+Because the oap-native-sql plugin builds a stand-alone jar file with the Arrow dependency, if you choose to build Arrow yourself, you have to copy the files below to replace the original ones.
+You can find those files in the Apache Arrow installation directory or release directory.
The example below assumes Apache Arrow has been installed in /usr/local/lib64.
+``` shell
+cp /usr/local/lib64/libarrow.so.17 $native-sql-engine-dir/cpp/src/resources
+cp /usr/local/lib64/libgandiva.so.17 $native-sql-engine-dir/cpp/src/resources
+cp /usr/local/lib64/libparquet.so.17 $native-sql-engine-dir/cpp/src/resources
+```
diff --git a/docs/Configuration.md b/docs/Configuration.md
new file mode 100644
index 000000000..b20b46f0e
--- /dev/null
+++ b/docs/Configuration.md
@@ -0,0 +1,29 @@
+# Spark Configurations for Native SQL Engine
+
+Add the below configuration to spark-defaults.conf:
+
+```
+##### Columnar Process Configuration
+
+spark.sql.sources.useV1SourceList avro
+spark.sql.join.preferSortMergeJoin false
+spark.sql.extensions com.intel.oap.ColumnarPlugin
+spark.shuffle.manager org.apache.spark.shuffle.sort.ColumnarShuffleManager
+
+# note: native sql engine depends on arrow data source
+spark.driver.extraClassPath $HOME/miniconda2/envs/oapenv/oap_jars/spark-columnar-core--jar-with-dependencies.jar:$HOME/miniconda2/envs/oapenv/oap_jars/spark-arrow-datasource-standard--jar-with-dependencies.jar
+spark.executor.extraClassPath $HOME/miniconda2/envs/oapenv/oap_jars/spark-columnar-core--jar-with-dependencies.jar:$HOME/miniconda2/envs/oapenv/oap_jars/spark-arrow-datasource-standard--jar-with-dependencies.jar
+
+spark.executorEnv.LIBARROW_DIR $HOME/miniconda2/envs/oapenv
+spark.executorEnv.CC $HOME/miniconda2/envs/oapenv/bin/gcc
+######
+```
+
+Before you start Spark, you must use the below commands to set some environment variables.
+
+```
+export CC=$HOME/miniconda2/envs/oapenv/bin/gcc
+export LIBARROW_DIR=$HOME/miniconda2/envs/oapenv/
+```
+
+For arrow-data-source.jar, you can refer to [Unified Arrow Data Source](https://oap-project.github.io/arrow-data-source/).
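The two `extraClassPath` entries above chain the jars with a colon. A minimal sketch of composing that value in a shell, in case you script your spark-defaults.conf (paths and jar names follow the Conda layout above and are assumptions; adjust them to the versions actually on disk):

```shell
# Sketch: build the colon-separated classpath value used by
# spark.driver.extraClassPath and spark.executor.extraClassPath above.
# The oap_jars location and jar names are assumptions from the Conda layout.
OAP_JARS=$HOME/miniconda2/envs/oapenv/oap_jars
CORE_JAR=$OAP_JARS/spark-columnar-core--jar-with-dependencies.jar
DS_JAR=$OAP_JARS/spark-arrow-datasource-standard--jar-with-dependencies.jar
EXTRA_CLASSPATH="$CORE_JAR:$DS_JAR"
echo "$EXTRA_CLASSPATH"
```

The same value can then be pasted into both the driver and executor entries, which must stay in sync.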
diff --git a/docs/Installation.md b/docs/Installation.md
new file mode 100644
index 000000000..604829663
--- /dev/null
+++ b/docs/Installation.md
@@ -0,0 +1,30 @@
+# Spark Native SQL Engine Installation
+
+For detailed testing scripts, please refer to the [solution guide](https://github.com/Intel-bigdata/Solution_navigator/tree/master/nativesql)
+
+## Install Googletest and Googlemock
+
+``` shell
+yum install gtest-devel
+yum install gmock
+```
+
+## Build Native SQL Engine
+
+``` shell
+git clone -b ${version} https://github.com/oap-project/native-sql-engine.git
+cd native-sql-engine
+cd cpp/
+mkdir build/
+cd build/
+cmake .. -DTESTS=ON
+make -j
+```
+
+``` shell
+cd ../../core/
+mvn clean package -DskipTests
+```
+
+### Additional Notes
+[Notes for Installation Issues](./InstallationNotes.md)
diff --git a/docs/InstallationNotes.md b/docs/InstallationNotes.md
new file mode 100644
index 000000000..cf7120be9
--- /dev/null
+++ b/docs/InstallationNotes.md
@@ -0,0 +1,47 @@
+### Notes for Installation Issues
+* Before installation, if you have another version of oap-native-sql installed, remove all of its installed libs and includes from the system path: libarrow*, libgandiva*, libspark-columnar-jni*
+
+* libgandiva_jni.so was not found inside JAR
+
+Change the property 'arrow.cpp.build.dir' to $ARROW_DIR/cpp/release-build/release/ in gandiva/pom.xml. If you do not want to change the contents of pom.xml, specify it like this:
+
+```
+mvn clean install -P arrow-jni -am -Darrow.cpp.build.dir=/root/git/t/arrow/cpp/release-build/release/ -DskipTests -Dcheckstyle.skip
+```
+
+* No rule to make target '../src/protobuf_ep', needed by `src/proto/Exprs.pb.cc'
+
+Remove the existing libprotobuf installation; then the find_package() script will be able to download protobuf.
+
+* can't find libprotobuf.so.13 in the shared lib
+
+Copy libprotobuf.so.13 from $OAP_DIR/oap-native-sql/cpp/src/resources to /usr/lib64/
+
+* unable to load libhdfs: libgsasl.so.7: cannot open shared object file
+
+libgsasl is missing; run `yum install libgsasl`
+
+* CentOS 7.7 does not provide the glibc version we require, so binaries packaged on Fedora 30 won't work.
+
+```
+20/04/21 17:46:17 WARN TaskSetManager: Lost task 0.1 in stage 1.0 (TID 2, 10.0.0.143, executor 6): java.lang.UnsatisfiedLinkError: /tmp/libgandiva_jni.sobe729912-3bbe-4bd0-bb96-4c7ce2e62336: /lib64/libm.so.6: version `GLIBC_2.29' not found (required by /tmp/libgandiva_jni.sobe729912-3bbe-4bd0-bb96-4c7ce2e62336)
+```
+
+* Missing symbols due to old GCC version.
+
+```
+[root@vsr243 release-build]# nm /usr/local/lib64/libparquet.so | grep ZN5boost16re_detail_10710012perl_matcherIN9__gnu_cxx17__normal_iteratorIPKcSsEESaINS_9sub_matchIS6_EEENS_12regex_traitsIcNS_16cpp_regex_traitsIcEEEEE14construct_initERKNS_11basic_regexIcSD_EENS_15regex_constants12_match_flagsE
+_ZN5boost16re_detail_10710012perl_matcherIN9__gnu_cxx17__normal_iteratorIPKcSsEESaINS_9sub_matchIS6_EEENS_12regex_traitsIcNS_16cpp_regex_traitsIcEEEEE14construct_initERKNS_11basic_regexIcSD_EENS_15regex_constants12_match_flagsE
+```
+
+Need to compile all packages with a newer GCC:
+
+```
+[root@vsr243 ~]# export CXX=/usr/local/bin/g++
+[root@vsr243 ~]# export CC=/usr/local/bin/gcc
+```
+
+* Can not connect to hdfs @sr602
+
+vsr606 and vsr243 are both unable to connect to hdfs @sr602; need to skipTests to generate the jar
+
diff --git a/docs/OAP-Developer-Guide.md b/docs/OAP-Developer-Guide.md
new file mode 100644
index 000000000..8d7ac6abf
--- /dev/null
+++ b/docs/OAP-Developer-Guide.md
@@ -0,0 +1,109 @@
+# OAP Developer Guide
+
+This document contains the instructions & scripts for installing necessary dependencies and building OAP.
+You can get more detailed information from each OAP module below.
+
+* [SQL Index and Data Source Cache](https://github.com/oap-project/sql-ds-cache/blob/master/docs/Developer-Guide.md)
+* [PMem Common](https://github.com/oap-project/pmem-common)
+* [PMem Shuffle](https://github.com/oap-project/pmem-shuffle#5-install-dependencies-for-shuffle-remote-pmem-extension)
+* [Remote Shuffle](https://github.com/oap-project/remote-shuffle)
+* [OAP MLlib](https://github.com/oap-project/oap-mllib)
+* [Arrow Data Source](https://github.com/oap-project/arrow-data-source)
+* [Native SQL Engine](https://github.com/oap-project/native-sql-engine)
+
+## Building OAP
+
+### Prerequisites for Building
+
+OAP is built with [Apache Maven](http://maven.apache.org/) and Oracle Java 8; the main tools required on your cluster are listed below.
+
+- [Cmake](https://help.directadmin.com/item.php?id=494)
+- [GCC > 7](https://gcc.gnu.org/wiki/InstallingGCC)
+- [Memkind](https://github.com/memkind/memkind/tree/v1.10.1-rc2)
+- [Vmemcache](https://github.com/pmem/vmemcache)
+- [HPNL](https://github.com/Intel-bigdata/HPNL)
+- [PMDK](https://github.com/pmem/pmdk)
+- [OneAPI](https://software.intel.com/content/www/us/en/develop/tools/oneapi.html)
+- [Arrow](https://github.com/Intel-bigdata/arrow)
+
+- **Requirements for Shuffle Remote PMem Extension**
+If you enable the Shuffle Remote PMem extension with RDMA, you can refer to [PMem Shuffle](https://github.com/oap-project/pmem-shuffle) to configure and validate RDMA in advance.
+
+We provide the scripts below to help automatically install the dependencies above **except RDMA**. Change to the **root** account and run:
+
+```
+# git clone https://github.com/Intel-bigdata/OAP.git
+# cd OAP
+# sh $OAP_HOME/dev/install-compile-time-dependencies.sh
+```
+
+Run the following command to learn more.
+
+```
+# sh $OAP_HOME/dev/scripts/prepare_oap_env.sh --help
+```
+
+Run the following command to automatically install a specific dependency such as Maven.
+
+```
+# sh $OAP_HOME/dev/scripts/prepare_oap_env.sh --prepare_maven
+```
+
+
+### Building
+
+To build the OAP package, run the command below; you can then find a tarball named `oap-$VERSION-bin-spark-$VERSION.tar.gz` under the directory `$OAP_HOME/dev/release-package`.
+```
+$ sh $OAP_HOME/dev/compile-oap.sh
+```
+
+To build a specified OAP module, such as `oap-cache`, run:
+```
+$ sh $OAP_HOME/dev/compile-oap.sh --oap-cache
+```
+
+
+### Running OAP Unit Tests
+
+Set up the build environment manually for Intel MLlib; if your default GCC version is older than 7.0, you also need to export `CC` & `CXX` before using `mvn`:
+
+```
+$ export CXX=$OAP_HOME/dev/thirdparty/gcc7/bin/g++
+$ export CC=$OAP_HOME/dev/thirdparty/gcc7/bin/gcc
+$ export ONEAPI_ROOT=/opt/intel/inteloneapi
+$ source /opt/intel/inteloneapi/daal/2021.1-beta07/env/vars.sh
+$ source /opt/intel/inteloneapi/tbb/2021.1-beta07/env/vars.sh
+$ source /tmp/oneCCL/build/_install/env/setvars.sh
+```
+
+Run all the tests:
+
+```
+$ mvn clean test
+```
+
+Run a specified OAP module's unit tests, such as `oap-cache`:
+
+```
+$ mvn clean -pl com.intel.oap:oap-cache -am test
+```
+
+### Building SQL Index and Data Source Cache with PMem
+
+#### Prerequisites for building with PMem support
+
+When using SQL Index and Data Source Cache with PMem, finish the steps in [Prerequisites for Building](#prerequisites-for-building) to ensure the needed dependencies have been installed.
+
+#### Building package
+
+You can build OAP with PMem support with the command below:
+
+```
+$ sh $OAP_HOME/dev/compile-oap.sh
+```
+Or run:
+
+```
+$ mvn clean -q -Ppersistent-memory -Pvmemcache -DskipTests package
+```
diff --git a/docs/OAP-Installation-Guide.md b/docs/OAP-Installation-Guide.md
new file mode 100644
index 000000000..e3b229805
--- /dev/null
+++ b/docs/OAP-Installation-Guide.md
@@ -0,0 +1,69 @@
+# OAP Installation Guide
+This document introduces how to install OAP and its dependencies on your cluster nodes via ***Conda***.
+Follow the steps below on ***every node*** of your cluster to set up the right environment on each machine.
+
+## Contents
+  - [Prerequisites](#prerequisites)
+  - [Installing OAP](#installing-oap)
+  - [Configuration](#configuration)
+
+## Prerequisites
+
+- **OS Requirements**
+We have tested OAP on Fedora 29 and CentOS 7.6 (kernel-4.18.16). We recommend you use **Fedora 29 or CentOS 7.6 or above**. Besides, for [Memkind](https://github.com/memkind/memkind/tree/v1.10.1-rc2) we recommend you use a **kernel above 3.10**.
+
+- **Conda Requirements**
+Install Conda on your cluster nodes with the below commands and follow the prompts on the installer screens:
+```bash
+$ wget -c https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
+$ chmod +x Miniconda2-latest-Linux-x86_64.sh
+$ bash Miniconda2-latest-Linux-x86_64.sh
+```
+For changes to take effect, close and re-open your current shell. To test your installation, run the command `conda list` in your terminal window. A list of installed packages appears if it has been installed correctly.
+
+## Installing OAP
+
+The dependencies below are required by OAP; all of them are included in the OAP Conda package and will be installed automatically in your cluster when you Conda-install OAP. Ensure you have activated the environment you created in the previous steps.
+
+- [Arrow](https://github.com/Intel-bigdata/arrow)
+- [Plasma](http://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/)
+- [Memkind](https://anaconda.org/intel/memkind)
+- [Vmemcache](https://anaconda.org/intel/vmemcache)
+- [HPNL](https://anaconda.org/intel/hpnl)
+- [PMDK](https://github.com/pmem/pmdk)
+- [OneAPI](https://software.intel.com/content/www/us/en/develop/tools/oneapi.html)
+
+
+Create a conda environment and install the OAP Conda package.
+```bash
+$ conda create -n oapenv -y python=3.7
+$ conda activate oapenv
+$ conda install -c conda-forge -c intel -y oap=1.0.0
+```
+
+Once the steps above are finished, you have completed the OAP dependency installation and OAP build, and you will find the built OAP jars under `$HOME/miniconda2/envs/oapenv/oap_jars`
+
+#### Extra Steps for Shuffle Remote PMem Extension
+
+If you use one of the OAP features -- [PMem Shuffle](https://github.com/oap-project/pmem-shuffle) with **RDMA** -- you need to configure and validate RDMA; please refer to [PMem Shuffle](https://github.com/oap-project/pmem-shuffle#4-configure-and-validate-rdma) for the details.
+
+
+## Configuration
+
+Once the steps above are finished, make sure the libraries installed by Conda can be linked by Spark by adding the following configuration settings to `$SPARK_HOME/conf/spark-defaults.conf`.
+
+```
+spark.executorEnv.LD_LIBRARY_PATH $HOME/miniconda2/envs/oapenv/lib
+spark.executor.extraLibraryPath $HOME/miniconda2/envs/oapenv/lib
+spark.driver.extraLibraryPath $HOME/miniconda2/envs/oapenv/lib
+spark.executor.extraClassPath $HOME/miniconda2/envs/oapenv/oap_jars/$OAP_FEATURE.jar
+spark.driver.extraClassPath $HOME/miniconda2/envs/oapenv/oap_jars/$OAP_FEATURE.jar
+```
+
+Then you can follow the corresponding feature documents for more details on how to use them.
+
diff --git a/docs/Prerequisite.md b/docs/Prerequisite.md
new file mode 100644
index 000000000..5ff82aa1b
--- /dev/null
+++ b/docs/Prerequisite.md
@@ -0,0 +1,151 @@
+# Prerequisite
+
+There are some requirements before you build the project.
+Please make sure you have already installed the software below on your system.
+
+1. gcc 9.3 or higher version
+2. java8 OpenJDK -> yum install java-1.8.0-openjdk
+3. cmake 3.2 or higher version
+4. maven 3.1.1 or higher version
+5. Hadoop 2.7.5 or higher version
+6. Spark 3.0.0 or higher version
+7. Intel Optimized Arrow 0.17.0
+
+## gcc installation
+
+// installing gcc 9.3 or higher version
+
+Please note: for better performance, gcc 9.3 is the minimal requirement on Intel microarchitectures such as Skylake, Cascade Lake, and Ice Lake.
+https://gcc.gnu.org/install/index.html
+
+Follow the above website to download gcc.
+The C++ library may require a certain version; if you are using gcc 9.3 the version would be libstdc++.so.6.0.28.
+You may have to launch the ./contrib/download_prerequisites command to install all the prerequisites for gcc.
+If you face download issues with the download_prerequisites command, you can try changing ftp to http.
+
+// Follow the steps to configure gcc
+https://gcc.gnu.org/install/configure.html
+
+If you are facing a multilib issue, you can try adding the --disable-multilib parameter to ../configure
+
+// Follow the steps to build gcc
+https://gcc.gnu.org/install/build.html
+
+// Follow the steps to install gcc
+https://gcc.gnu.org/install/finalinstall.html
+
+// Set up the environment for the new gcc
+```
+export PATH=$YOUR_GCC_INSTALLATION_DIR/bin:$PATH
+export LD_LIBRARY_PATH=$YOUR_GCC_INSTALLATION_DIR/lib64:$LD_LIBRARY_PATH
+```
+Please remember to add and source the setup in your environment files such as /etc/profile or /etc/bashrc
+
+// Verify the gcc installation
+Use the gcc -v command to verify that your gcc version is correct (must be 9.3 or higher).
+
+## cmake installation
+If you are facing some trouble when installing cmake, please follow the steps below to install cmake.
+
+```
+// installing cmake 3.2
+sudo yum install cmake3
+
+// If you have an existing cmake, you can use the below command to set it as an option within the alternatives command
+sudo alternatives --install /usr/local/bin/cmake cmake /usr/bin/cmake 10 --slave /usr/local/bin/ctest ctest /usr/bin/ctest --slave /usr/local/bin/cpack cpack /usr/bin/cpack --slave /usr/local/bin/ccmake ccmake /usr/bin/ccmake --family cmake
+
+// Set cmake3 as an option within the alternatives command
+sudo alternatives --install /usr/local/bin/cmake cmake /usr/bin/cmake3 20 --slave /usr/local/bin/ctest ctest /usr/bin/ctest3 --slave /usr/local/bin/cpack cpack /usr/bin/cpack3 --slave /usr/local/bin/ccmake ccmake /usr/bin/ccmake3 --family cmake
+
+// Use alternatives to choose the cmake version
+sudo alternatives --config cmake
+```
+
+## maven installation
+
+If you are facing some trouble when installing maven, please follow the steps below to install maven.
+
+// installing maven 3.6.3
+
+Go to https://maven.apache.org/download.cgi and download the specific version of maven
+
+// The below commands use maven 3.6.3 as an example
+```
+wget https://ftp.wayne.edu/apache/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz
+tar xzf apache-maven-3.6.3-bin.tar.gz
+mkdir /usr/local/maven
+mv apache-maven-3.6.3/ /usr/local/maven/
+```
+
+// Set maven 3.6.3 as an option within the alternatives command
+```
+sudo alternatives --install /usr/bin/mvn mvn /usr/local/maven/apache-maven-3.6.3/bin/mvn 1
+```
+
+// Use alternatives to choose the mvn version
+
+```
+sudo alternatives --config mvn
+```
+
+## HADOOP/SPARK Installation
+
+If there is no existing Hadoop/Spark installed, please follow the guide to install your Hadoop/Spark: [SPARK/HADOOP Installation](./SparkInstallation.md)
+
+### Hadoop Native Library(Default)
+
+Please make sure you have set up your Hadoop directory properly with the Hadoop native libraries.
+By default, Apache Arrow
scans `$HADOOP_HOME` and finds the native Hadoop library `libhdfs.so` (under the `$HADOOP_HOME/lib/native` directory) to use for the Hadoop client.
+
+You can also use `ARROW_LIBHDFS_DIR` to configure the location of `libhdfs.so` if it is installed in a directory other than `$HADOOP_HOME/lib/native`.
+
+If your Spark and Hadoop are on separate nodes, please find `libhdfs.so` in your Hadoop cluster and copy it to the Spark cluster, then use one of the above methods to set it properly.
+
+For more information, please check the
+Arrow HDFS interface [documentation](https://github.com/apache/arrow/blob/master/cpp/apidoc/HDFS.md) and, for the
+Hadoop native library, the official Hadoop website [documentation](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/NativeLibraries.html)
+
+### Use libhdfs3 library for better performance(Optional)
+
+For better performance, ArrowDataSource reads HDFS files using the third-party library `libhdfs3`. The library must be pre-installed on the machines the Spark executor nodes are running on.
+
+To install the library, use of [Conda](https://docs.conda.io/en/latest/) is recommended.
+
+```
+// installing libhdfs3
+conda install -c conda-forge libhdfs3
+
+// check the installed library file
+ll ~/miniconda/envs/$(YOUR_ENV_NAME)/lib/libhdfs3.so
+```
+
+We also provide a libhdfs3 binary in the cpp/src/resources directory.
+
+To set up libhdfs3, there are two different ways:
+
+Option 1: Overwrite the soft link for libhdfs.so
+To install libhdfs3.so, you have to create a soft link for libhdfs.so in your Hadoop directory (`$HADOOP_HOME/lib/native` by default).
+
+```
+ln -f -s libhdfs3.so libhdfs.so
+```
+
+Option 2: Add an environment variable to the system
+```
+export ARROW_LIBHDFS3_DIR="PATH_TO_LIBHDFS3_DIR/"
+```
+
+Add the following Spark configuration options before running the DataSource so the library can be recognized:
+
+* `spark.executorEnv.ARROW_LIBHDFS3_DIR = "PATH_TO_LIBHDFS3_DIR/"`
+* `spark.executorEnv.LD_LIBRARY_PATH = "PATH_TO_LIBHDFS3_DEPENDENCIES_DIR/"`
+
+Please note: if you choose to use libhdfs3.so, there are some other dependency libraries you have to install, such as libprotobuf or libcrypto.
+
+
+## Intel Optimized Apache Arrow Installation
+
+Intel Optimized Apache Arrow is MANDATORY. However, we have bundled compiled Arrow libraries (libarrow, libgandiva, libparquet) built by GCC 9.3 in the cpp/src/resources directory.
+If you wish to build Apache Arrow by yourself, please follow the guide to build and install Apache Arrow: [ArrowInstallation](./ApacheArrowInstallation.md)
+
diff --git a/docs/SparkInstallation.md b/docs/SparkInstallation.md
new file mode 100644
index 000000000..9d2a864ae
--- /dev/null
+++ b/docs/SparkInstallation.md
@@ -0,0 +1,44 @@
+### Download Spark 3.0.1
+
+Currently Native SQL Engine works with Spark 3.0.1.
+
+```
+wget http://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz
+sudo mkdir -p /opt/spark && sudo mv spark-3.0.1-bin-hadoop3.2.tgz /opt/spark
+cd /opt/spark && sudo tar -xf spark-3.0.1-bin-hadoop3.2.tgz
+export SPARK_HOME=/opt/spark/spark-3.0.1-bin-hadoop3.2/
+```
+
+### [Or building Spark from source](https://spark.apache.org/docs/latest/building-spark.html)
+
+``` shell
+git clone https://github.com/intel-bigdata/spark.git
+cd spark && git checkout native-sql-engine-clean
+# check the hadoop versions supported by spark
+grep hadoop.version -r pom.xml
+#   <hadoop.version>2.7.4</hadoop.version>
+#   <hadoop.version>3.2.0</hadoop.version>
+# so we should build spark specifying the hadoop version as 3.2
+./build/mvn -Pyarn -Phadoop-3.2 -Dhadoop.version=3.2.0 -DskipTests clean install
+```
+Set SPARK_HOME to the Spark path
+
+``` shell
+export SPARK_HOME=${SPARK_PATH}
+```
+
+### Hadoop building from source
+
+``` shell
+git clone https://github.com/apache/hadoop.git
+cd hadoop
+git checkout rel/release-3.2.0
+# only build the binary for hadoop
+mvn clean install -Pdist -DskipTests -Dtar
+# build the binary and native libraries such as libhdfs.so for hadoop
+# mvn clean install -Pdist,native -DskipTests -Dtar
+```
+
+``` shell
+export HADOOP_HOME=${HADOOP_PATH}/hadoop-dist/target/hadoop-3.2.0/
+```
diff --git a/docs/User-Guide.md b/docs/User-Guide.md
new file mode 100644
index 000000000..c3c05cebf
--- /dev/null
+++ b/docs/User-Guide.md
@@ -0,0 +1,118 @@
+# Spark Native SQL Engine
+
+A Native Engine for Spark SQL with vectorized SIMD optimizations
+
+## Introduction
+
+![Overview](./image/nativesql_arch.png)
+
+Spark SQL works very well with structured row-based data. It uses WholeStageCodeGen to improve performance via Java JIT code. However, Java JIT is usually not very good at utilizing the latest SIMD instructions, especially under complicated queries.
[Apache Arrow](https://arrow.apache.org/) provides a CPU-cache-friendly columnar in-memory layout, and its SIMD-optimized kernels and LLVM-based SQL engine Gandiva are also very efficient. Native SQL Engine uses these technologies to bring better performance to Spark SQL.
+
+## Key Features
+
+### Apache Arrow formatted intermediate data among Spark operators
+
+![Overview](./image/columnar.png)
+
+With [Spark 27396](https://issues.apache.org/jira/browse/SPARK-27396) it's possible to pass an RDD of ColumnarBatch to operators. We implemented this API with the Arrow columnar format.
+
+### Apache Arrow based Native Readers for Parquet and other formats
+
+![Overview](./image/dataset.png)
+
+A native parquet reader was developed to speed up the data loading. It's based on Apache Arrow Dataset. For details please check [Arrow Data Source](https://github.com/oap-project/arrow-data-source)
+
+### Apache Arrow Compute/Gandiva based operators
+
+![Overview](./image/kernel.png)
+
+We implemented common operators based on Apache Arrow Compute and Gandiva. The SQL expression is compiled to an expression tree with protobuf and passed to the native kernels. The native kernels then evaluate these expressions based on the input columnar batch.
+
+### Native Columnar Shuffle Operator with efficient compression support
+
+![Overview](./image/shuffle.png)
+
+We implemented a columnar shuffle to improve the shuffle performance. With the columnar layout we can do very efficient data compression for different data formats.
+
+## Build the Plugin
+
+### Building by Conda
+
+If you already have a working Hadoop Spark cluster, we provide a Conda package which will automatically install the dependencies needed by OAP; you can refer to the [OAP-Installation-Guide](./OAP-Installation-Guide.md) for more information. Once you have finished the [OAP-Installation-Guide](./OAP-Installation-Guide.md), you can find the built `spark-columnar-core--jar-with-dependencies.jar` under `$HOME/miniconda2/envs/oapenv/oap_jars`.
+Then you can just skip the steps below and jump to [Get Started](#get-started).
+
+### Building by yourself
+
+If you prefer to build from the source code yourself, please follow the steps below to set up your environment.
+
+### Prerequisite
+There are some requirements before you build the project.
+Please check the document [Prerequisite](./Prerequisite.md) and make sure you have already installed the software on your system.
+If you are running a Spark cluster, please make sure all the software is installed on every single node.
+
+### Installation
+Please check the document [Installation Guide](./Installation.md)
+
+### Configuration & Testing
+Please check the document [Configuration Guide](./Configuration.md)
+
+## Get started
+To enable OAP Native SQL Engine, the previously built jar `spark-columnar-core--jar-with-dependencies.jar` should be added to the Spark configuration. We also recommend using `spark-arrow-datasource-standard--jar-with-dependencies.jar`. We will demonstrate an example using both jar files.
+Spark-related options are:
+
+* `spark.driver.extraClassPath` : Set to load the jar file to the driver.
+* `spark.executor.extraClassPath` : Set to load the jar file to the executors.
+* `jars` : Set to copy the jar file to the executors when using yarn cluster mode.
+* `spark.executorEnv.ARROW_LIBHDFS3_DIR` : Optional if you are using a custom libhdfs3.so.
+* `spark.executorEnv.LD_LIBRARY_PATH` : Optional if you are using a custom libhdfs3.so.
+
+For Spark Standalone mode, please set the above values as relative paths to the jar files.
+For Spark Yarn cluster mode, please set the above values as absolute paths to the jar files.
+
+Example to run Spark Shell with the ArrowDataSource jar file:
+```
+${SPARK_HOME}/bin/spark-shell \
+        --verbose \
+        --master yarn \
+        --driver-memory 10G \
+        --conf spark.driver.extraClassPath=$PATH_TO_JAR/spark-arrow-datasource-standard--jar-with-dependencies.jar:$PATH_TO_JAR/spark-columnar-core--jar-with-dependencies.jar \
+        --conf spark.executor.extraClassPath=$PATH_TO_JAR/spark-arrow-datasource-standard--jar-with-dependencies.jar:$PATH_TO_JAR/spark-columnar-core--jar-with-dependencies.jar \
+        --conf spark.driver.cores=1 \
+        --conf spark.executor.instances=12 \
+        --conf spark.executor.cores=6 \
+        --conf spark.executor.memory=20G \
+        --conf spark.memory.offHeap.size=80G \
+        --conf spark.task.cpus=1 \
+        --conf spark.locality.wait=0s \
+        --conf spark.sql.shuffle.partitions=72 \
+        --conf spark.executorEnv.ARROW_LIBHDFS3_DIR="$PATH_TO_LIBHDFS3_DIR/" \
+        --conf spark.executorEnv.LD_LIBRARY_PATH="$PATH_TO_LIBHDFS3_DEPENDENCIES_DIR" \
+        --jars $PATH_TO_JAR/spark-arrow-datasource-standard--jar-with-dependencies.jar,$PATH_TO_JAR/spark-columnar-core--jar-with-dependencies.jar
+```
+
+Here is one example to verify that Native SQL Engine works; make sure you have the TPC-H dataset. We can do a simple projection on one parquet table. For detailed testing scripts, please refer to the [Solution Guide](https://github.com/Intel-bigdata/Solution_navigator/tree/master/nativesql).
+```
+val orders = spark.read.format("arrow").load("hdfs:////user/root/date_tpch_10/orders")
+orders.createOrReplaceTempView("orders")
+spark.sql("select * from orders where o_orderdate > date '1998-07-26'").show(20000, false)
+```
+
+The result should show up on the Spark console and you can check the DAG diagram for the Columnar Processing stages.
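The verification above can also be scripted instead of typed into the shell. A hedged sketch (the HDFS path is carried over from the example, the script file name is arbitrary, and `$SPARK_HOME` is assumed to be set as in the earlier steps):

```shell
# Sketch: save the verification query as a script for spark-shell -i, then
# the plan output can be grepped for Columnar operators.
cat > /tmp/verify_native_sql.scala <<'EOF'
val orders = spark.read.format("arrow").load("hdfs:////user/root/date_tpch_10/orders")
orders.createOrReplaceTempView("orders")
spark.sql("select * from orders where o_orderdate > date '1998-07-26'").explain()
EOF
echo "wrote /tmp/verify_native_sql.scala"
# Then, with the jars configured as above:
#   $SPARK_HOME/bin/spark-shell -i /tmp/verify_native_sql.scala 2>/dev/null | grep -i columnar
```

If the native engine is active, the explain output should contain Columnar stages; if the grep finds nothing, re-check the classpath configuration.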
+ + +## Performance data + +For initial microbenchmark performance, we add 10 fields up with spark, data size is 200G data + +![Performance](./image/performance.png) + +## Coding Style + +* For Java code, we used [google-java-format](https://github.com/google/google-java-format) +* For Scala code, we used [Spark Scala Format](https://github.com/apache/spark/blob/master/dev/.scalafmt.conf), please use [scalafmt](https://github.com/scalameta/scalafmt) or run ./scalafmt for scala codes format +* For Cpp codes, we used Clang-Format, check on this link [google-vim-codefmt](https://github.com/google/vim-codefmt) for details. + +## Contact + +chendi.xue@intel.com +binwei.yang@intel.com diff --git a/docs/image/columnar.png b/docs/image/columnar.png new file mode 100644 index 000000000..d89074905 Binary files /dev/null and b/docs/image/columnar.png differ diff --git a/docs/image/core_arch.jpg b/docs/image/core_arch.jpg new file mode 100644 index 000000000..4f732a4ff Binary files /dev/null and b/docs/image/core_arch.jpg differ diff --git a/docs/image/dataset.png b/docs/image/dataset.png new file mode 100644 index 000000000..5d3e607ab Binary files /dev/null and b/docs/image/dataset.png differ diff --git a/docs/image/decision_support_bench1_result_by_query.png b/docs/image/decision_support_bench1_result_by_query.png new file mode 100644 index 000000000..af1c67e8d Binary files /dev/null and b/docs/image/decision_support_bench1_result_by_query.png differ diff --git a/docs/image/decision_support_bench1_result_in_total.png b/docs/image/decision_support_bench1_result_in_total.png new file mode 100644 index 000000000..9674abc9a Binary files /dev/null and b/docs/image/decision_support_bench1_result_in_total.png differ diff --git a/docs/image/decision_support_bench2_result_by_query.png b/docs/image/decision_support_bench2_result_by_query.png new file mode 100644 index 000000000..4578dd307 Binary files /dev/null and b/docs/image/decision_support_bench2_result_by_query.png differ diff 
--git a/docs/image/decision_support_bench2_result_in_total.png b/docs/image/decision_support_bench2_result_in_total.png
new file mode 100644
index 000000000..88db8f768
Binary files /dev/null and b/docs/image/decision_support_bench2_result_in_total.png differ
diff --git a/docs/image/kernel.png b/docs/image/kernel.png
new file mode 100644
index 000000000..f88b002aa
Binary files /dev/null and b/docs/image/kernel.png differ
diff --git a/docs/image/nativesql_arch.png b/docs/image/nativesql_arch.png
new file mode 100644
index 000000000..a8304f5af
Binary files /dev/null and b/docs/image/nativesql_arch.png differ
diff --git a/docs/image/performance.png b/docs/image/performance.png
new file mode 100644
index 000000000..a4351cd9a
Binary files /dev/null and b/docs/image/performance.png differ
diff --git a/docs/image/shuffle.png b/docs/image/shuffle.png
new file mode 100644
index 000000000..504234536
Binary files /dev/null and b/docs/image/shuffle.png differ
diff --git a/docs/index.md b/docs/index.md
new file mode 100644
index 000000000..a0662883f
--- /dev/null
+++ b/docs/index.md
@@ -0,0 +1,118 @@
+# Spark Native SQL Engine
+
+A Native Engine for Spark SQL with vectorized SIMD optimizations
+
+## Introduction
+
+![Overview](./image/nativesql_arch.png)
+
+Spark SQL works very well with structured row-based data. It uses WholeStageCodeGen to improve performance through Java JIT code. However, Java JIT is usually not very good at utilizing the latest SIMD instructions, especially under complicated queries. [Apache Arrow](https://arrow.apache.org/) provides a CPU-cache-friendly columnar in-memory layout, and its SIMD-optimized kernels and LLVM-based SQL engine Gandiva are also very efficient. Native SQL Engine uses these technologies and brings better performance to Spark SQL.
+ 
+## Key Features
+
+### Apache Arrow formatted intermediate data among Spark operators
+
+![Overview](./image/columnar.png)
+
+With [Spark 27396](https://issues.apache.org/jira/browse/SPARK-27396) it's possible to pass an RDD of ColumnarBatch to operators. We implemented this API with the Arrow columnar format.
+
+### Apache Arrow based Native Readers for Parquet and other formats
+
+![Overview](./image/dataset.png)
+
+A native parquet reader was developed to speed up the data loading. It's based on Apache Arrow Dataset. For details please check [Arrow Data Source](https://github.com/oap-project/native-sql-engine/tree/master/arrow-data-source).
+
+### Apache Arrow Compute/Gandiva based operators
+
+![Overview](./image/kernel.png)
+
+We implemented common operators based on Apache Arrow Compute and Gandiva. The SQL expression is compiled to an expression tree with protobuf and passed to the native kernels. The native kernels then evaluate these expressions against the input columnar batch.
+
+### Native Columnar Shuffle Operator with efficient compression support
+
+![Overview](./image/shuffle.png)
+
+We implemented columnar shuffle to improve the shuffle performance. With the columnar layout we can do very efficient data compression for different data formats.
+
+## Build the Plugin
+
+### Building by Conda
+
+If you already have a working Hadoop Spark Cluster, we provide a Conda package which will automatically install the dependencies needed by OAP; you can refer to [OAP-Installation-Guide](./OAP-Installation-Guide.md) for more information. Once you have finished the [OAP-Installation-Guide](./OAP-Installation-Guide.md), you can find the built `spark-columnar-core-1.0.0-jar-with-dependencies.jar` under `$HOME/miniconda2/envs/oapenv/oap_jars`.
+Then you can skip the steps below and jump to [Get Started](#get-started).
+
+### Building by yourself
+
+If you prefer to build from the source code yourself, please follow the steps below to set up your environment.
+ 
+### Prerequisite
+There are some requirements before you build the project.
+Please check the document [Prerequisite](./Prerequisite.md) and make sure you have already installed the software in your system.
+If you are running a Spark cluster, please make sure all the software is installed on every single node.
+
+### Installation
+Please check the document [Installation Guide](./Installation.md).
+
+### Configuration & Testing
+Please check the document [Configuration Guide](./Configuration.md).
+
+## Get started
+To enable OAP Native SQL Engine, the previously built jar `spark-columnar-core--jar-with-dependencies.jar` should be added to the Spark configuration. We also recommend using `spark-arrow-datasource-standard--jar-with-dependencies.jar`. We will demonstrate an example using both jar files.
+Spark-related options are:
+
+* `spark.driver.extraClassPath` : Set to load the jar file on the driver.
+* `spark.executor.extraClassPath` : Set to load the jar file on the executors.
+* `jars` : Set to copy the jar files to the executors when using yarn cluster mode.
+* `spark.executorEnv.ARROW_LIBHDFS3_DIR` : Optional, if you are using a custom libhdfs3.so.
+* `spark.executorEnv.LD_LIBRARY_PATH` : Optional, if you are using a custom libhdfs3.so.
+
+For Spark Standalone Mode, please set the above values as relative paths to the jar files.
+For Spark Yarn Cluster Mode, please set the above values as absolute paths to the jar files.
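The options above can also be kept in `spark-defaults.conf` instead of being passed on the command line. The following is only a sketch: every `/path/to/...` value is a placeholder for your own jar and library locations, not a path this project ships.

```
# spark-defaults.conf sketch (placeholders, adjust to your deployment)
spark.driver.extraClassPath         /path/to/spark-arrow-datasource-standard--jar-with-dependencies.jar:/path/to/spark-columnar-core--jar-with-dependencies.jar
spark.executor.extraClassPath       /path/to/spark-arrow-datasource-standard--jar-with-dependencies.jar:/path/to/spark-columnar-core--jar-with-dependencies.jar
# Optional, only when using a custom libhdfs3.so
spark.executorEnv.ARROW_LIBHDFS3_DIR   /path/to/libhdfs3_dir
spark.executorEnv.LD_LIBRARY_PATH      /path/to/libhdfs3_dependencies_dir
```

Remember that for yarn cluster mode these must be absolute paths, as noted above.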
+ 
+Example to run Spark Shell with the ArrowDataSource jar file:
+```
+${SPARK_HOME}/bin/spark-shell \
+ --verbose \
+ --master yarn \
+ --driver-memory 10G \
+ --conf spark.driver.extraClassPath=$PATH_TO_JAR/spark-arrow-datasource-standard--jar-with-dependencies.jar:$PATH_TO_JAR/spark-columnar-core--jar-with-dependencies.jar \
+ --conf spark.executor.extraClassPath=$PATH_TO_JAR/spark-arrow-datasource-standard--jar-with-dependencies.jar:$PATH_TO_JAR/spark-columnar-core--jar-with-dependencies.jar \
+ --conf spark.driver.cores=1 \
+ --conf spark.executor.instances=12 \
+ --conf spark.executor.cores=6 \
+ --conf spark.executor.memory=20G \
+ --conf spark.memory.offHeap.size=80G \
+ --conf spark.task.cpus=1 \
+ --conf spark.locality.wait=0s \
+ --conf spark.sql.shuffle.partitions=72 \
+ --conf spark.executorEnv.ARROW_LIBHDFS3_DIR="$PATH_TO_LIBHDFS3_DIR/" \
+ --conf spark.executorEnv.LD_LIBRARY_PATH="$PATH_TO_LIBHDFS3_DEPENDENCIES_DIR" \
+ --jars $PATH_TO_JAR/spark-arrow-datasource-standard--jar-with-dependencies.jar,$PATH_TO_JAR/spark-columnar-core--jar-with-dependencies.jar
+```
+
+Here is one example to verify that Native SQL engine works; make sure you have the TPC-H dataset. We can do a simple projection on one parquet table. For detailed testing scripts, please refer to [Solution Guide](https://github.com/Intel-bigdata/Solution_navigator/tree/master/nativesql).
+```
+val orders = spark.read.format("arrow").load("hdfs:////user/root/date_tpch_10/orders")
+orders.createOrReplaceTempView("orders")
+spark.sql("select * from orders where o_orderdate > date '1998-07-26'").show(20000, false)
+```
+
+The result should show up on the Spark console, and you can check the DAG diagram for the Columnar Processing stages.
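Besides the DAG diagram in the Spark UI, the plan can also be inspected directly from the same spark-shell session; `explain()` prints the physical plan, where columnar operators should replace the row-based ones if the plugin is active (this sketch assumes the `orders` view registered above):

```
// Print the physical plan for the same query; with the plugin loaded,
// columnar operators (instead of the vanilla row-based ones) should
// appear in the output.
spark.sql("select * from orders where o_orderdate > date '1998-07-26'").explain()
```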
+ 
+
+## Performance data
+
+For the initial microbenchmark, we sum up 10 fields with Spark on a 200 GB dataset.
+
+![Performance](./image/performance.png)
+
+## Coding Style
+
+* For Java code, we use [google-java-format](https://github.com/google/google-java-format).
+* For Scala code, we use [Spark Scala Format](https://github.com/apache/spark/blob/master/dev/.scalafmt.conf); please use [scalafmt](https://github.com/scalameta/scalafmt) or run ./scalafmt to format Scala code.
+* For C++ code, we use Clang-Format; check [google-vim-codefmt](https://github.com/google/vim-codefmt) for details.
+
+## Contact
+
+chendi.xue@intel.com
+binwei.yang@intel.com
diff --git a/docs/limitation.md b/docs/limitation.md
new file mode 100644
index 000000000..a4b66f5e1
--- /dev/null
+++ b/docs/limitation.md
@@ -0,0 +1,17 @@
+# Limitations for Native SQL Engine
+
+## Spark compatibility
+Native SQL engine currently works with Spark 3.0.0 only. There are still some issues with the newer Shuffle/AQE APIs in Spark 3.0.1, 3.0.2 and 3.1.x.
+
+## Operator limitations
+All performance-critical operators in TPC-H/TPC-DS should be supported. For unsupported operators, Native SQL engine will automatically fall back to the row-based operators in vanilla Spark.
+
+### Columnar Projection with Filter
+We use a 16-bit selection vector for the filter, so the max batch size needs to be < 65536.
+
+### Columnar Sort
+Columnar Sort does not support spilling to disk yet. To reduce the peak memory usage, we use a smaller data structure (uint16_t), which limits
+- the max batch size to be < 65536
+- the number of batches in one partition to be < 65536
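Both caps follow from the same arithmetic: a uint16_t index can address at most 2^16 = 65536 distinct rows or batches. If a data source produces larger batches, the batch size can usually be capped from the Spark side. As a hypothetical illustration only: vanilla Spark exposes `spark.sql.execution.arrow.maxRecordsPerBatch` for its Arrow conversions, and keeping such a setting well below 65536 stays inside the selection-vector limit. Whether this particular engine honors that property is an assumption here; check the Configuration Guide for the authoritative knob.

```
# Hypothetical illustration: keep columnar batches safely below the
# 65536-row cap imposed by the 16-bit selection vector.
spark.sql.execution.arrow.maxRecordsPerBatch  10000
```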