This repository has been archived by the owner on Sep 18, 2023. It is now read-only.

[NSE-206] fix doc link, update limitations #205

Merged · 4 commits · Mar 29, 2021
4 changes: 2 additions & 2 deletions README.md
@@ -24,7 +24,7 @@ With [Spark 27396](https://issues.apache.org/jira/browse/SPARK-27396) it's possible

![Overview](./docs/image/dataset.png)

A native Parquet reader was developed to speed up data loading. It is based on Apache Arrow Dataset; for details, please check [Arrow Data Source](https://github.com/oap-project/arrow-data-source)
A native Parquet reader was developed to speed up data loading. It is based on Apache Arrow Dataset; for details, please check [Arrow Data Source](https://github.com/oap-project/native-sql-engine/tree/master/arrow-data-source)

### Apache Arrow Compute/Gandiva based operators

@@ -101,7 +101,7 @@ orders.createOrReplaceTempView("orders")
spark.sql("select * from orders where o_orderdate > date '1998-07-26'").show(20000, false)
```

The result should show up on the Spark console, and you can check the DAG diagram for the Columnar Processing stages.
The result should show up on the Spark console, and you can check the DAG diagram for the Columnar Processing stages. Native SQL Engine still lacks some features; please check out the [limitations](./docs/limitations.md).


## Performance data
70 changes: 70 additions & 0 deletions docs/ApacheArrowInstallation.md
@@ -0,0 +1,70 @@
# llvm-7.0:
Arrow Gandiva depends on LLVM, and the current version strictly requires LLVM 7.0: if any other version is installed, the build will fail.
``` shell
wget http://releases.llvm.org/7.0.1/llvm-7.0.1.src.tar.xz
tar xf llvm-7.0.1.src.tar.xz
cd llvm-7.0.1.src/
cd tools
wget http://releases.llvm.org/7.0.1/cfe-7.0.1.src.tar.xz
tar xf cfe-7.0.1.src.tar.xz
mv cfe-7.0.1.src clang
cd ..
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . -j
cmake --build . --target install
# check whether clang was also compiled; if not, build it separately:
cd tools/clang
mkdir build
cd build
cmake ..
make -j
make install
```
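
If LLVM is already installed, a quick version check can save a failed build (a suggested sanity check, not part of the original steps):
``` shell
# Gandiva strictly requires LLVM 7.0.x; verify before building Arrow
llvm-config --version   # expect 7.0.1
clang --version         # expect clang version 7.0.1
```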

# cmake:
Arrow downloads packages during compilation, so CMake must support SSL. Building CMake from source as shown below is optional if your system CMake already supports SSL.
``` shell
wget https://github.com/Kitware/CMake/releases/download/v3.15.0-rc4/cmake-3.15.0-rc4.tar.gz
tar xf cmake-3.15.0-rc4.tar.gz
cd cmake-3.15.0-rc4/
./bootstrap --system-curl --parallel=64  # set the parallelism to match your server's core count
make -j
make install
cmake --version
# expected output: cmake version 3.15.0-rc4
```

# Apache Arrow
``` shell
git clone https://github.com/Intel-bigdata/arrow.git
cd arrow && git checkout branch-0.17.0-oap-1.0
mkdir -p cpp/release-build
cd cpp/release-build
cmake -DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_HDFS=ON -DARROW_BOOST_USE_SHARED=ON -DARROW_JNI=ON -DARROW_DATASET=ON -DARROW_WITH_PROTOBUF=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_LZ4=ON -DARROW_FILESYSTEM=ON -DARROW_JSON=ON ..
make -j
make install

# build java
cd ../../java
# change property 'arrow.cpp.build.dir' to the relative path of cpp build dir in gandiva/pom.xml
mvn clean install -P arrow-jni -am -Darrow.cpp.build.dir=../cpp/release-build/release/ -DskipTests
# if you are behind a proxy, also add the SOCKS proxy settings
mvn clean install -P arrow-jni -am -Darrow.cpp.build.dir=../cpp/release-build/release/ -DskipTests -DsocksProxyHost=${proxyHost} -DsocksProxyPort=1080
```

Run the tests:
``` shell
mvn test -pl adapter/parquet -P arrow-jni
mvn test -pl gandiva -P arrow-jni
```
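
If the tests pass, the Arrow Java artifacts should now be in your local Maven repository. A quick way to confirm (a sketch; the exact artifact layout depends on the Arrow version you built):
``` shell
# list the locally installed Arrow artifacts
ls ~/.m2/repository/org/apache/arrow/
```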

# Copy binary files to oap-native-sql resources directory
Because the oap-native-sql plugin builds a stand-alone jar file with the Arrow dependency, if you choose to build Arrow by yourself, you have to copy the files below to replace the original ones.
You can find these files in the Apache Arrow installation or release directory. The example below assumes Apache Arrow has been installed under /usr/local/lib64.
``` shell
cp /usr/local/lib64/libarrow.so.17 $native-sql-engine-dir/cpp/src/resources
cp /usr/local/lib64/libgandiva.so.17 $native-sql-engine-dir/cpp/src/resources
cp /usr/local/lib64/libparquet.so.17 $native-sql-engine-dir/cpp/src/resources
```
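
As an optional check (an assumption, not part of the original steps), verify that the shared libraries resolve all of their dependencies before packaging them:
``` shell
# report any unresolved dependencies of the Gandiva library
ldd /usr/local/lib64/libgandiva.so.17 | grep "not found" || echo "all dependencies resolved"
```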
29 changes: 29 additions & 0 deletions docs/Configuration.md
@@ -0,0 +1,29 @@
# Spark Configurations for Native SQL Engine

Add the configuration below to `spark-defaults.conf`:

```
##### Columnar Process Configuration

spark.sql.sources.useV1SourceList avro
spark.sql.join.preferSortMergeJoin false
spark.sql.extensions com.intel.oap.ColumnarPlugin
spark.shuffle.manager org.apache.spark.shuffle.sort.ColumnarShuffleManager

# note native sql engine depends on arrow data source
spark.driver.extraClassPath $HOME/miniconda2/envs/oapenv/oap_jars/spark-columnar-core-<version>-jar-with-dependencies.jar:$HOME/miniconda2/envs/oapenv/oap_jars/spark-arrow-datasource-standard-<version>-jar-with-dependencies.jar
spark.executor.extraClassPath $HOME/miniconda2/envs/oapenv/oap_jars/spark-columnar-core-<version>-jar-with-dependencies.jar:$HOME/miniconda2/envs/oapenv/oap_jars/spark-arrow-datasource-standard-<version>-jar-with-dependencies.jar

spark.executorEnv.LIBARROW_DIR $HOME/miniconda2/envs/oapenv
spark.executorEnv.CC $HOME/miniconda2/envs/oapenv/bin/gcc
######
```

Before you start Spark, you must run the commands below to set some environment variables:

```
export CC=$HOME/miniconda2/envs/oapenv/bin/gcc
export LIBARROW_DIR=$HOME/miniconda2/envs/oapenv/
```
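
For a quick test without editing `spark-defaults.conf`, the same settings can be passed on the command line (a sketch; `<version>` and the Conda paths are the placeholders used above):

```
$SPARK_HOME/bin/spark-shell \
  --conf spark.sql.extensions=com.intel.oap.ColumnarPlugin \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
  --conf spark.sql.join.preferSortMergeJoin=false \
  --driver-class-path "$HOME/miniconda2/envs/oapenv/oap_jars/spark-columnar-core-<version>-jar-with-dependencies.jar:$HOME/miniconda2/envs/oapenv/oap_jars/spark-arrow-datasource-standard-<version>-jar-with-dependencies.jar"
```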

For more details about the arrow-data-source jar, refer to [Unified Arrow Data Source](https://oap-project.github.io/arrow-data-source/).
30 changes: 30 additions & 0 deletions docs/Installation.md
@@ -0,0 +1,30 @@
# Spark Native SQL Engine Installation

For detailed testing scripts, please refer to [solution guide](https://github.com/Intel-bigdata/Solution_navigator/tree/master/nativesql)

## Install Googletest and Googlemock

``` shell
yum install gtest-devel
yum install gmock
```

## Build Native SQL Engine

``` shell
git clone -b ${version} https://github.com/oap-project/native-sql-engine.git
cd native-sql-engine
cd cpp/
mkdir build/
cd build/
cmake .. -DTESTS=ON
make -j
```

``` shell
cd ../../core/
mvn clean package -DskipTests
```
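
After a successful build, the plugin jar should be under the module's `target/` directory (the exact file name is an assumption based on the artifact naming used elsewhere in these docs):
``` shell
ls target/spark-columnar-core-*-jar-with-dependencies.jar
```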

### Additional Notes
[Notes for Installation Issues](./InstallationNotes.md)
47 changes: 47 additions & 0 deletions docs/InstallationNotes.md
@@ -0,0 +1,47 @@
### Notes for Installation Issues
* Before the installation, if you have installed another version of oap-native-sql, remove all previously installed libraries and headers from the system paths: libarrow*, libgandiva*, libspark-columnar-jni*.

* libgandiva_jni.so was not found inside JAR

Change the property 'arrow.cpp.build.dir' to $ARROW_DIR/cpp/release-build/release/ in gandiva/pom.xml. If you do not want to change the contents of pom.xml, specify it on the command line instead:

```
mvn clean install -P arrow-jni -am -Darrow.cpp.build.dir=/root/git/t/arrow/cpp/release-build/release/ -DskipTests -Dcheckstyle.skip
```

* No rule to make target '../src/protobuf_ep', needed by `src/proto/Exprs.pb.cc'

Remove the existing libprotobuf installation; the find_package() script will then be able to download protobuf itself.

* can't find the libprotobuf.so.13 in the shared lib

Copy libprotobuf.so.13 from $OAP_DIR/oap-native-sql/cpp/src/resources to /usr/lib64/.
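
For example (a sketch, assuming the checkout location referenced above):

```
cp $OAP_DIR/oap-native-sql/cpp/src/resources/libprotobuf.so.13 /usr/lib64/
ldconfig
```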

* unable to load libhdfs: libgsasl.so.7: cannot open shared object file

libgsasl is missing; run `yum install libgsasl`.

* CentOS 7.7 does not provide the glibc version we require, so binaries packaged on Fedora 30 won't work.

```
20/04/21 17:46:17 WARN TaskSetManager: Lost task 0.1 in stage 1.0 (TID 2, 10.0.0.143, executor 6): java.lang.UnsatisfiedLinkError: /tmp/libgandiva_jni.sobe729912-3bbe-4bd0-bb96-4c7ce2e62336: /lib64/libm.so.6: version `GLIBC_2.29' not found (required by /tmp/libgandiva_jni.sobe729912-3bbe-4bd0-bb96-4c7ce2e62336)
```
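
To check which glibc a node provides before deploying prebuilt binaries (a suggested diagnostic, not part of the original notes):

```
ldd --version | head -1                              # e.g. "ldd (GNU libc) 2.17" on CentOS 7
strings /lib64/libm.so.6 | grep '^GLIBC_' | tail -3  # highest GLIBC symbol versions provided
```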

* Missing symbols due to old GCC version.

```
[root@vsr243 release-build]# nm /usr/local/lib64/libparquet.so | grep ZN5boost16re_detail_10710012perl_matcherIN9__gnu_cxx17__normal_iteratorIPKcSsEESaINS_9sub_matchIS6_EEENS_12regex_traitsIcNS_16cpp_regex_traitsIcEEEEE14construct_initERKNS_11basic_regexIcSD_EENS_15regex_constants12_match_flagsE
_ZN5boost16re_detail_10710012perl_matcherIN9__gnu_cxx17__normal_iteratorIPKcSsEESaINS_9sub_matchIS6_EEENS_12regex_traitsIcNS_16cpp_regex_traitsIcEEEEE14construct_initERKNS_11basic_regexIcSD_EENS_15regex_constants12_match_flagsE
```

You need to compile all packages with a newer GCC:

```
[root@vsr243 ~]# export CXX=/usr/local/bin/g++
[root@vsr243 ~]# export CC=/usr/local/bin/gcc
```

* Cannot connect to HDFS @sr602

vsr606 and vsr243 are both unable to connect to HDFS @sr602; use `-DskipTests` to generate the jar.

109 changes: 109 additions & 0 deletions docs/OAP-Developer-Guide.md
@@ -0,0 +1,109 @@
# OAP Developer Guide

This document contains the instructions and scripts for installing the necessary dependencies and building OAP.
You can get more detailed information from each OAP module listed below.

* [SQL Index and Data Source Cache](https://github.com/oap-project/sql-ds-cache/blob/master/docs/Developer-Guide.md)
* [PMem Common](https://github.com/oap-project/pmem-common)
* [PMem Shuffle](https://github.com/oap-project/pmem-shuffle#5-install-dependencies-for-shuffle-remote-pmem-extension)
* [Remote Shuffle](https://github.com/oap-project/remote-shuffle)
* [OAP MLlib](https://github.com/oap-project/oap-mllib)
* [Arrow Data Source](https://github.com/oap-project/arrow-data-source)
* [Native SQL Engine](https://github.com/oap-project/native-sql-engine)

## Building OAP

### Prerequisites for Building

OAP is built with [Apache Maven](http://maven.apache.org/) and Oracle Java 8. The main tools required on your cluster are listed below.

- [Cmake](https://help.directadmin.com/item.php?id=494)
- [GCC > 7](https://gcc.gnu.org/wiki/InstallingGCC)
- [Memkind](https://github.com/memkind/memkind/tree/v1.10.1-rc2)
- [Vmemcache](https://github.com/pmem/vmemcache)
- [HPNL](https://github.com/Intel-bigdata/HPNL)
- [PMDK](https://github.com/pmem/pmdk)
- [OneAPI](https://software.intel.com/content/www/us/en/develop/tools/oneapi.html)
- [Arrow](https://github.com/Intel-bigdata/arrow)

- **Requirements for Shuffle Remote PMem Extension**
If you enable the Shuffle Remote PMem extension with RDMA, refer to [PMem Shuffle](https://github.com/oap-project/pmem-shuffle) to configure and validate RDMA in advance.

We provide scripts to help automatically install the dependencies above **except RDMA**. Switch to the **root** account and run:

```
# git clone -b <tag-version> https://github.com/Intel-bigdata/OAP.git
# cd OAP
# sh $OAP_HOME/dev/install-compile-time-dependencies.sh
```

Run the following command to learn more.

```
# sh $OAP_HOME/dev/scripts/prepare_oap_env.sh --help
```

Run the following command to automatically install a specific dependency, such as Maven.

```
# sh $OAP_HOME/dev/scripts/prepare_oap_env.sh --prepare_maven
```


### Building

To build the OAP package, run the command below; you can then find a tarball named `oap-$VERSION-bin-spark-$VERSION.tar.gz` under the directory `$OAP_HOME/dev/release-package`.
```
$ sh $OAP_HOME/dev/compile-oap.sh
```

To build a specific OAP module, such as `oap-cache`, run:
```
$ sh $OAP_HOME/dev/compile-oap.sh --oap-cache
```


### Running OAP Unit Tests

Set up the build environment manually for Intel MLlib; if your default GCC version is older than 7.0, you also need to export `CC` and `CXX` before using `mvn`:

```
$ export CXX=$OAP_HOME/dev/thirdparty/gcc7/bin/g++
$ export CC=$OAP_HOME/dev/thirdparty/gcc7/bin/gcc
$ export ONEAPI_ROOT=/opt/intel/inteloneapi
$ source /opt/intel/inteloneapi/daal/2021.1-beta07/env/vars.sh
$ source /opt/intel/inteloneapi/tbb/2021.1-beta07/env/vars.sh
$ source /tmp/oneCCL/build/_install/env/setvars.sh
```

Run all the tests:

```
$ mvn clean test
```

To run the unit tests of a specific OAP module, such as `oap-cache`:

```
$ mvn clean -pl com.intel.oap:oap-cache -am test
```

### Building SQL Index and Data Source Cache with PMem

#### Prerequisites for building with PMem support

When using SQL Index and Data Source Cache with PMem, complete the steps in [Prerequisites for building](#prerequisites-for-building) to ensure the needed dependencies have been installed.

#### Building package

You can build OAP with PMem support using the command below:

```
$ sh $OAP_HOME/dev/compile-oap.sh
```
Or run:

```
$ mvn clean -q -Ppersistent-memory -Pvmemcache -DskipTests package
```
69 changes: 69 additions & 0 deletions docs/OAP-Installation-Guide.md
@@ -0,0 +1,69 @@
# OAP Installation Guide
This document introduces how to install OAP and its dependencies on your cluster nodes using ***Conda***.
Follow the steps below on ***every node*** of your cluster to set up the right environment on each machine.

## Contents
- [Prerequisites](#prerequisites)
- [Installing OAP](#installing-oap)
- [Configuration](#configuration)

## Prerequisites

- **OS Requirements**
We have tested OAP on Fedora 29 and CentOS 7.6 (kernel-4.18.16). We recommend you use **Fedora 29 or CentOS 7.6 and above**. Besides, for [Memkind](https://github.com/memkind/memkind/tree/v1.10.1-rc2) we recommend you use a **kernel above 3.10**.

- **Conda Requirements**
Install Conda on your cluster nodes with the commands below and follow the prompts on the installer screens:
```bash
$ wget -c https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
$ chmod +x Miniconda2-latest-Linux-x86_64.sh
$ bash Miniconda2-latest-Linux-x86_64.sh
```
For changes to take effect, close and re-open your current shell. To test your installation, run the command `conda list` in your terminal window. A list of installed packages appears if it has been installed correctly.

## Installing OAP

The dependencies below are required by OAP; all of them are included in the OAP Conda package and will be installed automatically on your cluster when you Conda-install OAP. Ensure you have activated the environment created in the previous steps.

- [Arrow](https://github.com/Intel-bigdata/arrow)
- [Plasma](http://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/)
- [Memkind](https://anaconda.org/intel/memkind)
- [Vmemcache](https://anaconda.org/intel/vmemcache)
- [HPNL](https://anaconda.org/intel/hpnl)
- [PMDK](https://github.com/pmem/pmdk)
- [OneAPI](https://software.intel.com/content/www/us/en/develop/tools/oneapi.html)


Create a Conda environment and install the OAP Conda package.
```bash
$ conda create -n oapenv -y python=3.7
$ conda activate oapenv
$ conda install -c conda-forge -c intel -y oap=1.0.0
```

Once you have finished the steps above, the OAP dependencies are installed and OAP is built; you will find the built OAP jars under `$HOME/miniconda2/envs/oapenv/oap_jars`.
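
A quick way to verify (a simple check, not part of the original guide):
```bash
$ ls $HOME/miniconda2/envs/oapenv/oap_jars
```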

#### Extra Steps for Shuffle Remote PMem Extension

If you use one of the OAP features -- [PMem Shuffle](https://github.com/oap-project/pmem-shuffle) with **RDMA** -- you need to configure and validate RDMA; please refer to [PMem Shuffle](https://github.com/oap-project/pmem-shuffle#4-configure-and-validate-rdma) for the details.


## Configuration

Once you have finished the steps above, make sure the libraries installed by Conda can be linked by Spark: add the following configuration settings to `$SPARK_HOME/conf/spark-defaults.conf`.

```
spark.executorEnv.LD_LIBRARY_PATH $HOME/miniconda2/envs/oapenv/lib
spark.executor.extraLibraryPath $HOME/miniconda2/envs/oapenv/lib
spark.driver.extraLibraryPath $HOME/miniconda2/envs/oapenv/lib
spark.executor.extraClassPath $HOME/miniconda2/envs/oapenv/oap_jars/$OAP_FEATURE.jar
spark.driver.extraClassPath $HOME/miniconda2/envs/oapenv/oap_jars/$OAP_FEATURE.jar
```

You can then follow the corresponding feature documents for more details on using them.