[NSE-186] backport to 1.1 (#210)
* [NSE-130] support decimal round and abs (#166)

* support decimal round and abs

* remove duplicate cast in multiply/divide

* [NSE-161] adding format check (#165)

* adding format check

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* formating code

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* adding google format

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* reformat with new style

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* lower clang version to 10

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* adding script to format code

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [NSE-170]improve sort shuffle code (#171)

* improve sort shuffle code

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* fix format

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* pass by ref in builder

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* fix string/decimal builder

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [NSE-62]Fixing issue0062 for package arrow dependencies in jar with refresh2 (#172)

* Add arrow build and dependency support

* Add compress.sh default value

* Fix bug for parameter's default value

* Add CACHE PATH

* fix copy bitmap in InplaceSort (#174)

* [NSE-153] fix window results (#175)

* fix window sort memory

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* remove unused code

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* fix windown w/o avg

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* fix format

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* fix decimal sort

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* Fix issue 179 for arrow include directory (#181)

* Fix issue0191 for .so file copy to tmp. (#192)

* Following NSE-153, optimize fallback conditions for columnar window (#189)

* Fix q14a/b segfault (#193)

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [NSE-194]Turn on several Arrow parameters (#195)

* Turn on several Arrow parameters

* Change SIMD Level Setting

* Hashmap build opt for semi/anti/exists join (#197)

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [NSE-198] support the month() and dayofmonth() functions with DateType (#199)

* [NSE-206] fix doc link, update limitations (#205)

* fix doc link

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* update arrow data source

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* adding limitations

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* mention limits

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [NSE-170] using unsafe appender (#203)

This patch adds an unsafe appender that reserves space before building the array.
The boolean builder is not touched, as it only mallocs small amounts of memory.
The string builder is not touched, as it is difficult to pre-allocate its space.

* using unsafe appender

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* fix format

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* update arrow branch

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

Co-authored-by: Rui Mo <rui.mo@intel.com>
Co-authored-by: Wei-Ting Chen <weiting.chen@intel.com>
Co-authored-by: Hongze Zhang <hongze.zhang@intel.com>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Co-authored-by: JiaKe <ke.a.jia@intel.com>
6 people authored Mar 29, 2021
1 parent 5e89cd3 commit 692d574
Showing 114 changed files with 12,569 additions and 12,418 deletions.
20 changes: 20 additions & 0 deletions .clang-format
@@ -0,0 +1,20 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
---
BasedOnStyle: Google
DerivePointerAlignment: false
ColumnLimit: 90
2 changes: 1 addition & 1 deletion .github/workflows/report_ram_log.yml
@@ -50,7 +50,7 @@ jobs:
- name: Install OAP optimized Arrow
run: |
cd /tmp
git clone -b arrow-3.0.0-oap https://github.com/oap-project/arrow.git
git clone -b arrow-3.0.0-oap-1.1 https://github.com/oap-project/arrow.git
cd arrow/java
mvn clean install -B -P arrow-jni -am -Dorg.slf4j.simpleLogger.log.org.apache.maven.cli.transfer.Slf4jMavenTransferListener=warn -Darrow.cpp.build.dir=/tmp/arrow/cpp/build/release/ -DskipTests -Dcheckstyle.skip
- name: Run Maven tests
2 changes: 1 addition & 1 deletion .github/workflows/tpch.yml
@@ -42,7 +42,7 @@ jobs:
run: |
cd /tmp
git clone https://github.com/oap-project/arrow.git
cd arrow && git checkout arrow-3.0.0-oap && cd cpp
cd arrow && git checkout arrow-3.0.0-oap-1.1 && cd cpp
mkdir build && cd build
cmake .. -DARROW_JNI=ON -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_HDFS=ON -DARROW_FILESYSTEM=ON -DARROW_WITH_SNAPPY=ON -DARROW_JSON=ON -DARROW_DATASET=ON -DARROW_WITH_LZ4=ON -DARROW_JEMALLOC=OFF && make -j2
sudo make install
13 changes: 12 additions & 1 deletion .github/workflows/unittests.yml
@@ -45,7 +45,7 @@ jobs:
run: |
cd /tmp
git clone https://github.com/oap-project/arrow.git
cd arrow && git checkout arrow-3.0.0-oap && cd cpp
cd arrow && git checkout arrow-3.0.0-oap-1.1 && cd cpp
mkdir build && cd build
cmake .. -DARROW_JNI=ON -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_HDFS=ON -DARROW_FILESYSTEM=ON -DARROW_WITH_SNAPPY=ON -DARROW_JSON=ON -DARROW_DATASET=ON -DARROW_WITH_LZ4=ON -DGTEST_ROOT=/usr/src/gtest && make -j2
sudo make install
@@ -59,3 +59,14 @@ jobs:
cd src
ctest -R
formatting-check:
name: Formatting Check
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Run clang-format style check for C/C++ programs.
uses: jidicula/clang-format-action@v3.2.0
with:
clang-format-version: '10'
check-path: 'native-sql-engine/cpp/src'
fallback-style: 'Google' # optional
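
The formatting-check job above enforces the `.clang-format` style added in this commit. A minimal sketch of running the same check locally, assuming clang-format 10 is installed as `clang-format-10` and the command is run from the repository root:

``` shell
# dry-run check that fails on style violations, mirroring the CI job
cd native-sql-engine/cpp/src
find . -name '*.cc' -o -name '*.h' | xargs clang-format-10 --style=file --dry-run --Werror
# to apply the formatting in place instead:
# find . -name '*.cc' -o -name '*.h' | xargs clang-format-10 --style=file -i
```
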
4 changes: 2 additions & 2 deletions README.md
@@ -24,7 +24,7 @@ With [Spark 27396](https://issues.apache.org/jira/browse/SPARK-27396) its possib

![Overview](./docs/image/dataset.png)

A native Parquet reader was developed to speed up data loading. It is based on the Apache Arrow Dataset. For details, please check [Arrow Data Source](https://github.com/oap-project/arrow-data-source)
A native Parquet reader was developed to speed up data loading. It is based on the Apache Arrow Dataset. For details, please check [Arrow Data Source](https://github.com/oap-project/native-sql-engine/tree/master/arrow-data-source)

### Apache Arrow Compute/Gandiva based operators

@@ -101,7 +101,7 @@ orders.createOrReplaceTempView("orders")
spark.sql("select * from orders where o_orderdate > date '1998-07-26'").show(20000, false)
```

The result should show up on the Spark console, and you can check the DAG diagram for the Columnar Processing stages.
The result should show up on the Spark console, and you can check the DAG diagram for the Columnar Processing stages. Native SQL Engine still lacks some features; please check out the [limitations](./docs/limitations.md).


## Performance data
19 changes: 8 additions & 11 deletions arrow-data-source/docs/Installation.md
@@ -11,19 +11,16 @@ yum install gmock

## Build Native SQL Engine

``` shell
git clone -b ${version} https://github.com/oap-project/native-sql-engine.git
cd oap-native-sql
cd cpp/
mkdir build/
cd build/
cmake .. -DTESTS=ON
make -j
```
cmake parameters:
BUILD_ARROW (default ON): build Arrow from source.
STATIC_ARROW (default OFF): when BUILD_ARROW is ON, you can choose between a static and a shared Arrow library; note that currently only the shared Arrow build is supported.
ARROW_ROOT (default /usr/local): when BUILD_ARROW is OFF, set the Arrow library path to link against an existing installation in your environment.
BUILD_PROTOBUF (default ON): build Protobuf from source.

``` shell
cd ../../core/
mvn clean package -DskipTests
git clone -b ${version} https://github.com/oap-project/native-sql-engine.git
cd native-sql-engine
mvn clean package -am -DskipTests -Dcpp_tests=OFF -Dbuild_arrow=ON -Dstatic_arrow=OFF -Darrow_root=/usr/local -Dbuild_protobuf=ON
```

### Additional Notes
70 changes: 70 additions & 0 deletions docs/ApacheArrowInstallation.md
@@ -0,0 +1,70 @@
# llvm-7.0:
Arrow Gandiva depends on LLVM, and the current version strictly requires LLVM 7.0; if any other version is installed, the build will fail.
``` shell
wget http://releases.llvm.org/7.0.1/llvm-7.0.1.src.tar.xz
tar xf llvm-7.0.1.src.tar.xz
cd llvm-7.0.1.src/
cd tools
wget http://releases.llvm.org/7.0.1/cfe-7.0.1.src.tar.xz
tar xf cfe-7.0.1.src.tar.xz
mv cfe-7.0.1.src clang
cd ..
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . -j
cmake --build . --target install
# check whether clang has also been compiled; if not, build it as follows:
cd tools/clang
mkdir build
cd build
cmake ..
make -j
make install
```

# cmake:
Arrow downloads packages during compilation, so CMake needs SSL (curl) support; building CMake from source as shown below is optional if your existing CMake already supports SSL.
``` shell
wget https://github.com/Kitware/CMake/releases/download/v3.15.0-rc4/cmake-3.15.0-rc4.tar.gz
tar xf cmake-3.15.0-rc4.tar.gz
cd cmake-3.15.0-rc4/
./bootstrap --system-curl --parallel=64 #parallel num depends on your server core number
make -j
make install
cmake --version  # should now report: cmake version 3.15.0-rc4
```

# Apache Arrow
``` shell
git clone https://github.com/Intel-bigdata/arrow.git
cd arrow && git checkout branch-0.17.0-oap-1.0
mkdir -p cpp/release-build
cd cpp/release-build
cmake -DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_HDFS=ON -DARROW_BOOST_USE_SHARED=ON -DARROW_JNI=ON -DARROW_DATASET=ON -DARROW_WITH_PROTOBUF=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_LZ4=ON -DARROW_FILESYSTEM=ON -DARROW_JSON=ON ..
make -j
make install

# build java
cd ../../java
# change property 'arrow.cpp.build.dir' to the relative path of cpp build dir in gandiva/pom.xml
mvn clean install -P arrow-jni -am -Darrow.cpp.build.dir=../cpp/release-build/release/ -DskipTests
# if you are behind a proxy, please also add the SOCKS proxy settings
mvn clean install -P arrow-jni -am -Darrow.cpp.build.dir=../cpp/release-build/release/ -DskipTests -DsocksProxyHost=${proxyHost} -DsocksProxyPort=1080
```

Run the tests:
``` shell
mvn test -pl adapter/parquet -P arrow-jni
mvn test -pl gandiva -P arrow-jni
```

# Copy binary files to oap-native-sql resources directory
Because the oap-native-sql plugin builds a stand-alone jar file with the Arrow dependency bundled, if you build Arrow yourself you have to copy the files below to replace the original ones.
You can find these files in the Apache Arrow installation or release directory. The example below assumes Apache Arrow has been installed under /usr/local/lib64.
``` shell
cp /usr/local/lib64/libarrow.so.17 $native-sql-engine-dir/cpp/src/resources
cp /usr/local/lib64/libgandiva.so.17 $native-sql-engine-dir/cpp/src/resources
cp /usr/local/lib64/libparquet.so.17 $native-sql-engine-dir/cpp/src/resources
```
29 changes: 29 additions & 0 deletions docs/Configuration.md
@@ -0,0 +1,29 @@
# Spark Configurations for Native SQL Engine

Add the configuration below to `spark-defaults.conf`.

```
##### Columnar Process Configuration
spark.sql.sources.useV1SourceList avro
spark.sql.join.preferSortMergeJoin false
spark.sql.extensions com.intel.oap.ColumnarPlugin
spark.shuffle.manager org.apache.spark.shuffle.sort.ColumnarShuffleManager
# note native sql engine depends on arrow data source
spark.driver.extraClassPath $HOME/miniconda2/envs/oapenv/oap_jars/spark-columnar-core-<version>-jar-with-dependencies.jar:$HOME/miniconda2/envs/oapenv/oap_jars/spark-arrow-datasource-standard-<version>-jar-with-dependencies.jar
spark.executor.extraClassPath $HOME/miniconda2/envs/oapenv/oap_jars/spark-columnar-core-<version>-jar-with-dependencies.jar:$HOME/miniconda2/envs/oapenv/oap_jars/spark-arrow-datasource-standard-<version>-jar-with-dependencies.jar
spark.executorEnv.LIBARROW_DIR $HOME/miniconda2/envs/oapenv
spark.executorEnv.CC $HOME/miniconda2/envs/oapenv/bin/gcc
######
```

Before you start Spark, use the commands below to set the required environment variables.

```
export CC=$HOME/miniconda2/envs/oapenv/bin/gcc
export LIBARROW_DIR=$HOME/miniconda2/envs/oapenv/
```

For more information about the arrow-data-source jar, refer to [Unified Arrow Data Source](https://oap-project.github.io/arrow-data-source/).
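
Alternatively, the key settings can be passed when launching Spark. A minimal sketch, assuming the jars live under `$HOME/miniconda2/envs/oapenv/oap_jars` and `<version>` is replaced with your actual build version:

```
OAP_JARS=$HOME/miniconda2/envs/oapenv/oap_jars
JARS="$OAP_JARS/spark-columnar-core-<version>-jar-with-dependencies.jar:$OAP_JARS/spark-arrow-datasource-standard-<version>-jar-with-dependencies.jar"
spark-shell \
  --conf spark.sql.extensions=com.intel.oap.ColumnarPlugin \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
  --conf spark.sql.join.preferSortMergeJoin=false \
  --conf spark.driver.extraClassPath="$JARS" \
  --conf spark.executor.extraClassPath="$JARS"
```
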
30 changes: 30 additions & 0 deletions docs/Installation.md
@@ -0,0 +1,30 @@
# Spark Native SQL Engine Installation

For detailed testing scripts, please refer to [solution guide](https://github.com/Intel-bigdata/Solution_navigator/tree/master/nativesql)

## Install Googletest and Googlemock

``` shell
yum install gtest-devel
yum install gmock
```

## Build Native SQL Engine

``` shell
git clone -b ${version} https://github.com/oap-project/native-sql-engine.git
cd native-sql-engine
cd native-sql-engine/cpp/
mkdir build/
cd build/
cmake .. -DTESTS=ON
make -j
```

``` shell
cd ../../core/
mvn clean package -DskipTests
```

### Additional Notes
[Notes for Installation Issues](./InstallationNotes.md)
47 changes: 47 additions & 0 deletions docs/InstallationNotes.md
@@ -0,0 +1,47 @@
### Notes for Installation Issues
* Before the installation, if you have installed another version of oap-native-sql, remove all of its installed libraries and headers from the system path: libarrow*, libgandiva*, libspark-columnar-jni*.
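
For example, a cleanup sketch assuming the previous build was installed under the default `/usr/local` prefix (adjust the paths to wherever the old libraries actually live):

```
rm -f /usr/local/lib64/libarrow* /usr/local/lib64/libgandiva* /usr/local/lib64/libspark-columnar-jni*
rm -rf /usr/local/include/arrow /usr/local/include/gandiva
```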

* libgandiva_jni.so was not found inside JAR

Change the property 'arrow.cpp.build.dir' to $ARROW_DIR/cpp/release-build/release/ in gandiva/pom.xml. If you do not want to change the contents of pom.xml, specify it on the command line like this:

```
mvn clean install -P arrow-jni -am -Darrow.cpp.build.dir=/root/git/t/arrow/cpp/release-build/release/ -DskipTests -Dcheckstyle.skip
```

* No rule to make target '../src/protobuf_ep', needed by `src/proto/Exprs.pb.cc'

Remove the existing libprotobuf installation; the find_package() logic will then be able to download and build protobuf itself.

* can't find the libprotobuf.so.13 in the shared lib

Copy libprotobuf.so.13 from $OAP_DIR/oap-native-sql/cpp/src/resources to /usr/lib64/.
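
For example, assuming `$OAP_DIR` is the directory you cloned the project into:

```
cp $OAP_DIR/oap-native-sql/cpp/src/resources/libprotobuf.so.13 /usr/lib64/
ldconfig
```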

* unable to load libhdfs: libgsasl.so.7: cannot open shared object file

libgsasl is missing; run `yum install libgsasl`.

* CentOS 7.7 does not appear to provide the glibc version we require, so binaries packaged on Fedora 30 won't work.

```
20/04/21 17:46:17 WARN TaskSetManager: Lost task 0.1 in stage 1.0 (TID 2, 10.0.0.143, executor 6): java.lang.UnsatisfiedLinkError: /tmp/libgandiva_jni.sobe729912-3bbe-4bd0-bb96-4c7ce2e62336: /lib64/libm.so.6: version `GLIBC_2.29' not found (required by /tmp/libgandiva_jni.sobe729912-3bbe-4bd0-bb96-4c7ce2e62336)
```

* Missing symbols due to old GCC version.

```
[root@vsr243 release-build]# nm /usr/local/lib64/libparquet.so | grep ZN5boost16re_detail_10710012perl_matcherIN9__gnu_cxx17__normal_iteratorIPKcSsEESaINS_9sub_matchIS6_EEENS_12regex_traitsIcNS_16cpp_regex_traitsIcEEEEE14construct_initERKNS_11basic_regexIcSD_EENS_15regex_constants12_match_flagsE
_ZN5boost16re_detail_10710012perl_matcherIN9__gnu_cxx17__normal_iteratorIPKcSsEESaINS_9sub_matchIS6_EEENS_12regex_traitsIcNS_16cpp_regex_traitsIcEEEEE14construct_initERKNS_11basic_regexIcSD_EENS_15regex_constants12_match_flagsE
```

All packages need to be compiled with a newer GCC:

```
[root@vsr243 ~]# export CXX=/usr/local/bin/g++
[root@vsr243 ~]# export CC=/usr/local/bin/gcc
```

* Can not connect to hdfs @sr602

vsr606 and vsr243 are both unable to connect to HDFS @sr602; skip the tests (-DskipTests) to generate the jar.

109 changes: 109 additions & 0 deletions docs/OAP-Developer-Guide.md
@@ -0,0 +1,109 @@
# OAP Developer Guide

This document contains the instructions and scripts for installing the necessary dependencies and building OAP.
You can get more detailed information from each OAP module below.

* [SQL Index and Data Source Cache](https://github.com/oap-project/sql-ds-cache/blob/master/docs/Developer-Guide.md)
* [PMem Common](https://github.com/oap-project/pmem-common)
* [PMem Shuffle](https://github.com/oap-project/pmem-shuffle#5-install-dependencies-for-shuffle-remote-pmem-extension)
* [Remote Shuffle](https://github.com/oap-project/remote-shuffle)
* [OAP MLlib](https://github.com/oap-project/oap-mllib)
* [Arrow Data Source](https://github.com/oap-project/arrow-data-source)
* [Native SQL Engine](https://github.com/oap-project/native-sql-engine)

## Building OAP

### Prerequisites for Building

OAP is built with [Apache Maven](http://maven.apache.org/) and Oracle Java 8; the main tools required on your cluster are listed below.

- [Cmake](https://help.directadmin.com/item.php?id=494)
- [GCC > 7](https://gcc.gnu.org/wiki/InstallingGCC)
- [Memkind](https://github.com/memkind/memkind/tree/v1.10.1-rc2)
- [Vmemcache](https://github.com/pmem/vmemcache)
- [HPNL](https://github.com/Intel-bigdata/HPNL)
- [PMDK](https://github.com/pmem/pmdk)
- [OneAPI](https://software.intel.com/content/www/us/en/develop/tools/oneapi.html)
- [Arrow](https://github.com/Intel-bigdata/arrow)

- **Requirements for Shuffle Remote PMem Extension**
If you enable the Shuffle Remote PMem extension with RDMA, you can refer to [PMem Shuffle](https://github.com/oap-project/pmem-shuffle) to configure and validate RDMA in advance.

We provide scripts below to help automatically install the dependencies above **except RDMA**. Change to the **root** account and run:

```
# git clone -b <tag-version> https://github.com/Intel-bigdata/OAP.git
# cd OAP
# sh $OAP_HOME/dev/install-compile-time-dependencies.sh
```

Run the following command to learn more.

```
# sh $OAP_HOME/dev/scripts/prepare_oap_env.sh --help
```

Run the following command to automatically install a specific dependency, such as Maven.

```
# sh $OAP_HOME/dev/scripts/prepare_oap_env.sh --prepare_maven
```


### Building

To build the OAP package, run the command below; you can then find a tarball named `oap-$VERSION-bin-spark-$VERSION.tar.gz` under the directory `$OAP_HOME/dev/release-package`.
```
$ sh $OAP_HOME/dev/compile-oap.sh
```

To build a specific OAP module, such as `oap-cache`, run:
```
$ sh $OAP_HOME/dev/compile-oap.sh --oap-cache
```


### Running OAP Unit Tests

Set up the build environment manually for Intel MLlib; if your default GCC version is older than 7.0, you also need to export `CC` and `CXX` before using `mvn`:

```
$ export CXX=$OAP_HOME/dev/thirdparty/gcc7/bin/g++
$ export CC=$OAP_HOME/dev/thirdparty/gcc7/bin/gcc
$ export ONEAPI_ROOT=/opt/intel/inteloneapi
$ source /opt/intel/inteloneapi/daal/2021.1-beta07/env/vars.sh
$ source /opt/intel/inteloneapi/tbb/2021.1-beta07/env/vars.sh
$ source /tmp/oneCCL/build/_install/env/setvars.sh
```

Run all the tests:

```
$ mvn clean test
```

Run the unit tests of a specific OAP module, such as `oap-cache`:

```
$ mvn clean -pl com.intel.oap:oap-cache -am test
```

### Building SQL Index and Data Source Cache with PMem

#### Prerequisites for building with PMem support

When using SQL Index and Data Source Cache with PMem, complete the steps in [Prerequisites for building](#prerequisites-for-building) to ensure the needed dependencies have been installed.

#### Building package

You can build OAP with PMem support with the command below:

```
$ sh $OAP_HOME/dev/compile-oap.sh
```
Or run:

```
$ mvn clean -q -Ppersistent-memory -Pvmemcache -DskipTests package
```