PARQUET-2159: Vectorized BytePacker decoder using Java VectorAPI #1011

Merged on Mar 4, 2023 (25 commits)
Showing changes from 19 commits

Commits
2d8cd72
java17 vector decode opt.
jiangjiguang Nov 7, 2022
3377674
add java17-target profile, comments, junit
jiangjiguang Jan 23, 2023
c0732b4
fix test case fail
jiangjiguang Jan 28, 2023
baed80b
Update parquet-column/src/main/java/org/apache/parquet/column/values/…
jiangjiguang Jan 28, 2023
8c23d51
Update parquet-column/src/main/java/org/apache/parquet/column/values/…
jiangjiguang Jan 28, 2023
624a5b5
Update parquet-column/src/main/java/org/apache/parquet/column/values/…
jiangjiguang Jan 28, 2023
cc58e60
Update parquet-encoding/src/main/java/org/apache/parquet/column/value…
jiangjiguang Jan 28, 2023
bfd7f84
Update parquet-encoding/src/main/java/org/apache/parquet/column/value…
jiangjiguang Jan 28, 2023
3505596
optimizing maven parameters and naming norm
jiangjiguang Jan 29, 2023
b188ecd
Update parquet-column/src/main/java/org/apache/parquet/column/values/…
jiangjiguang Jan 29, 2023
1486fc6
add enum, optimize test case and variable name
jiangjiguang Jan 29, 2023
11ce933
Java Vector API support doc
jiangjiguang Feb 6, 2023
5c2d5ae
update cooments
jiangjiguang Feb 17, 2023
0f5262a
new parquet-encoding-vector module
jiangjiguang Feb 23, 2023
cfba8d3
Update plugins/parquet-encoding-vector/src/main/java/org/apache/parqu…
jiangjiguang Feb 24, 2023
29d8747
parquet-encoding-vector module optimization
jiangjiguang Feb 26, 2023
5fe806e
optimication pom
jiangjiguang Feb 26, 2023
1aeaf38
add profile plugins
jiangjiguang Feb 27, 2023
e7c318a
rename profile vector-plugins
jiangjiguang Feb 28, 2023
45efb3c
this feature is experimental
jiangjiguang Feb 28, 2023
4fc4c7f
add vector-plugins workflows
jiangjiguang Mar 1, 2023
8653923
vector-plugins workflows update
jiangjiguang Mar 1, 2023
2df278b
vector-plugins specify modules
jiangjiguang Mar 2, 2023
ef5d356
Assume the vector junit
jiangjiguang Mar 2, 2023
af01ac0
fix bug
jiangjiguang Mar 2, 2023
14 changes: 14 additions & 0 deletions README.md
Expand Up @@ -83,6 +83,20 @@ Parquet is a very active project, and new features are being added quickly. Here
* Column stats
* Delta encoding
* Index pages
* Java Vector API support

## Java Vector API support
Parquet-MR can use the Java Vector API to speed up reading. To enable this feature:
* Use Java 17+, 64-bit
* Run on a CPU that supports the following instruction sets:
  * avx512vbmi
  * avx512_vbmi2
* Build the jars: `mvn clean package -P vector-plugins`
* To enable this feature for Apache Spark:
  * Build parquet and replace parquet-encoding-{VERSION}.jar in the Spark jars folder
  * Build parquet-encoding-vector and copy parquet-encoding-vector-{VERSION}.jar to the Spark jars folder
  * Edit the Spark class `VectorizedRleValuesReader`, function `readNextGroup`, following the parquet class `ParquetReadRouter`, function `readBatchUsing512Vector`
  * Build Spark with Maven and replace spark-sql_2.12-{VERSION}.jar in the Spark jars folder
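
The JVM and CPU requirements listed above could be checked up front before swapping any jars. The following is a minimal sketch and not part of Parquet: `VectorSupportCheck` and its methods are hypothetical names, and it assumes a Linux host where CPU features appear on the `flags` line of `/proc/cpuinfo`.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of Parquet.
final class VectorSupportCheck {
  // Flag names as listed in the README section above.
  static final List<String> REQUIRED_FLAGS = List.of("avx512vbmi", "avx512_vbmi2");

  // True if a whitespace-separated CPU flags line contains every required flag.
  static boolean hasRequiredFlags(String flagsLine) {
    Set<String> flags = new HashSet<>(Arrays.asList(flagsLine.trim().split("\\s+")));
    return flags.containsAll(REQUIRED_FLAGS);
  }

  // True if the running JVM is Java 17 or newer.
  static boolean isJava17OrNewer() {
    return Runtime.version().feature() >= 17;
  }

  // Assumption: Linux, where /proc/cpuinfo has a line such as "flags : fpu vme ...".
  static boolean cpuSupportsVectorPlugin() throws IOException {
    Path cpuInfo = Path.of("/proc/cpuinfo");
    if (!Files.isReadable(cpuInfo)) {
      return false; // cannot tell on this platform
    }
    for (String line : Files.readAllLines(cpuInfo)) {
      if (line.startsWith("flags")) {
        int colon = line.indexOf(':');
        return colon >= 0 && hasRequiredFlags(line.substring(colon + 1));
      }
    }
    return false;
  }

  public static void main(String[] args) throws IOException {
    System.out.println("Java 17+: " + isJava17OrNewer());
    System.out.println("Required CPU flags present: " + cpuSupportsVectorPlugin());
  }
}
```

The flag test uses whole-token matching rather than substring search, so a CPU that reports only one of the two flags is not accepted by accident.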

## Map/Reduce integration

Expand Up @@ -31,6 +31,13 @@ public abstract class BytePacker {

private final int bitWidth;

/**
* Number of integer values unpacked per call.
* unpackCount is always a multiple of 8.
* With AVX512 the register is 512 bits wide, so the number of values produced per call may differ depending on the bitWidth.
*/
protected int unpackCount;

BytePacker(int bitWidth) {
this.bitWidth = bitWidth;
}
Expand All @@ -42,6 +49,10 @@ public final int getBitWidth() {
return bitWidth;
}

public int getUnpackCount() {
throw new RuntimeException("getUnpackCount must be implemented by subclass!");
}

/**
* pack 8 values from input at inPos into bitWidth bytes in output at outPos.
* nextPosition: inPos += 8; outPos += getBitWidth()
Expand Down Expand Up @@ -105,4 +116,26 @@ public void unpack8Values(final byte[] input, final int inPos, final int[] outpu
public void unpack32Values(byte[] input, int inPos, int[] output, int outPos) {
unpack32Values(ByteBuffer.wrap(input), inPos, output, outPos);
}

/**
* unpack bitWidth bytes from input at inPos into {unpackCount} values in output at outPos using Java Vector API.
* @param input the input bytes
* @param inPos where to read from in input
* @param output the output values
* @param outPos where to write to in output
*/
public void unpackValuesUsingVector(final byte[] input, final int inPos, final int[] output, final int outPos) {
throw new RuntimeException("unpackValuesUsingVector must be implemented by subclass!");
}

/**
* unpack bitWidth bytes from input at inPos into {unpackCount} values in output at outPos using Java Vector API.
* @param input the input bytes
* @param inPos where to read from in input
* @param output the output values
* @param outPos where to write to in output
*/
public void unpackValuesUsingVector(final ByteBuffer input, final int inPos, final int[] output, final int outPos) {
throw new RuntimeException("unpackValuesUsingVector must be implemented by subclass!");
}
}
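
For orientation, the contract of the unpack methods can be illustrated with a scalar reference routine. This is a minimal sketch and not the PR's vectorized code: `ScalarUnpackSketch` is a hypothetical name, and it assumes the little-endian layout in which values are packed starting from the least significant bit of each byte. Since 8 values occupy `bitWidth` bytes, `count` values (a multiple of 8) occupy `bitWidth * count / 8` bytes.

```java
import java.util.Arrays;

// Hypothetical scalar reference for illustration; not part of Parquet.
final class ScalarUnpackSketch {
  // Reads `count` values of `bitWidth` bits each, least significant bit first.
  static void unpack(byte[] input, int inPos, int[] output, int outPos, int bitWidth, int count) {
    for (int i = 0; i < count; i++) {
      int value = 0;
      for (int b = 0; b < bitWidth; b++) {
        int bit = i * bitWidth + b;                      // absolute bit index in the stream
        int byteVal = input[inPos + (bit >>> 3)] & 0xFF; // byte holding that bit
        value |= ((byteVal >>> (bit & 7)) & 1) << b;     // move the bit to position b
      }
      output[outPos + i] = value;
    }
  }

  // Number of input bytes consumed when unpacking `count` values.
  static int bytesConsumed(int bitWidth, int count) {
    return bitWidth * count / 8;
  }

  public static void main(String[] args) {
    byte[] packed = {(byte) 0x88, (byte) 0xC6, (byte) 0xFA};
    int[] out = new int[8];
    unpack(packed, 0, out, 0, 3, 8);
    System.out.println(Arrays.toString(out)); // prints [0, 1, 2, 3, 4, 5, 6, 7]
  }
}
```

A routine like this is slow, but it can serve as an oracle when testing a vectorized implementation against the same input bytes.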
Expand Up @@ -36,6 +36,12 @@ public IntPacker newIntPacker(int width) {
public BytePacker newBytePacker(int width) {
return beBytePackerFactory.newBytePacker(width);
}

@Override
public BytePacker newBytePackerVector(int width) {
throw new RuntimeException("Not currently supported!");
}

@Override
public BytePackerForLong newBytePackerForLong(int width) {
return beBytePackerForLongFactory.newBytePackerForLong(width);
Expand All @@ -55,6 +61,22 @@ public IntPacker newIntPacker(int width) {
public BytePacker newBytePacker(int width) {
return leBytePackerFactory.newBytePacker(width);
}

@Override
public BytePacker newBytePackerVector(int width) {
if (leBytePacker512VectorFactory == null) {
synchronized (Packer.class) {
if (leBytePacker512VectorFactory == null) {
leBytePacker512VectorFactory = getBytePackerFactory("ByteBitPacking512VectorLE");
}
}
}
if (leBytePacker512VectorFactory == null) {
throw new RuntimeException("The Java vector plugin is not enabled on little endian architectures");
}
return leBytePacker512VectorFactory.newBytePacker(width);
}

@Override
public BytePackerForLong newBytePackerForLong(int width) {
return leBytePackerForLongFactory.newBytePackerForLong(width);
Expand Down Expand Up @@ -86,6 +108,8 @@ private static Object getStaticField(String className, String fieldName) {
static IntPackerFactory leIntPackerFactory = getIntPackerFactory("LemireBitPackingLE");
static BytePackerFactory beBytePackerFactory = getBytePackerFactory("ByteBitPackingBE");
static BytePackerFactory leBytePackerFactory = getBytePackerFactory("ByteBitPackingLE");
// ByteBitPacking512VectorLE is not enabled by default, so leBytePacker512VectorFactory cannot be initialized as a static property
static BytePackerFactory leBytePacker512VectorFactory = null;
static BytePackerForLongFactory beBytePackerForLongFactory = getBytePackerForLongFactory("ByteBitPackingForLongBE");
static BytePackerForLongFactory leBytePackerForLongFactory = getBytePackerForLongFactory("ByteBitPackingForLongLE");
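
The lazy lookup in `newBytePackerVector` uses double-checked locking on a plain static field. In Java this pattern is normally paired with a `volatile` field so that other threads observe a fully published object. One possible shape is sketched below under stated assumptions: `LazyRef` is a hypothetical generic holder, not a Parquet class, and the supplier stands in for the reflective factory lookup.

```java
import java.util.function.Supplier;

// Hypothetical holder for illustration; not part of Parquet.
final class LazyRef<T> {
  private final Supplier<T> loader;
  // volatile makes the write inside the lock visible to the unlocked first read.
  private volatile T value;

  LazyRef(Supplier<T> loader) {
    this.loader = loader;
  }

  T get() {
    T local = value;          // single volatile read on the fast path
    if (local == null) {
      synchronized (this) {
        local = value;        // re-check under the lock
        if (local == null) {
          local = loader.get();
          value = local;      // stays null if the loader returned null, so a later call retries
        }
      }
    }
    return local;
  }
}
```

As in the diff, a loader that yields `null` is retried on the next call, so the caller still has to handle the "plugin not available" case.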

Expand All @@ -101,6 +125,10 @@ private static Object getStaticField(String className, String fieldName) {
*/
public abstract BytePacker newBytePacker(int width);

public BytePacker newBytePackerVector(int width) {
throw new RuntimeException("newBytePackerVector must be implemented by subclasses!");
}
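
Because `newBytePackerVector` throws a `RuntimeException` when no vector implementation is available, a caller may want to fall back to the scalar packer. The following is a sketch with hypothetical stand-in types (`PackerSource`, `FallbackSketch`), not Parquet's own classes, and it is only one way a caller might handle the exception.

```java
// Hypothetical stand-in for a packer factory exposing both methods from the diff.
interface PackerSource<P> {
  P newBytePacker(int width);

  P newBytePackerVector(int width);
}

// Hypothetical helper for illustration; not part of Parquet.
final class FallbackSketch {
  // Prefers the vector packer and falls back to the scalar one if it is unavailable.
  static <P> P vectorOrScalar(PackerSource<P> source, int width) {
    try {
      return source.newBytePackerVector(width);
    } catch (RuntimeException e) {
      // Broad catch mirrors the plain RuntimeException thrown in the diff.
      return source.newBytePacker(width);
    }
  }
}
```

Catching `RuntimeException` this broadly can hide unrelated failures, so real code would probably log the cause or probe availability once at startup.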

/**
* @param width the width in bits of the packed values
* @return a byte based packer for INT64
127 changes: 127 additions & 0 deletions parquet-plugins/parquet-encoding-vector/pom.xml
@@ -0,0 +1,127 @@
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<parent>
<groupId>org.apache.parquet</groupId>
<artifactId>parquet</artifactId>
<version>1.13.0-SNAPSHOT</version>
<relativePath>../../pom.xml</relativePath>
</parent>

<modelVersion>4.0.0</modelVersion>

<artifactId>parquet-encoding-vector</artifactId>
<packaging>jar</packaging>

<name>Apache Parquet Encodings Vector</name>
<url>https://parquet.apache.org</url>

<properties>
<extraJavaVectorArgs>
--add-modules=jdk.incubator.vector
</extraJavaVectorArgs>
</properties>

<dependencies>
<dependency>
<groupId>org.apache.parquet</groupId>
<artifactId>parquet-common</artifactId>
<version>${project.version}</version>
</dependency>

<dependency>
<groupId>org.apache.parquet</groupId>
<artifactId>parquet-encoding</artifactId>
<version>${project.version}</version>
</dependency>

<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>${slf4j.version}</version>
</dependency>

<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-simple</artifactId>
<version>${slf4j.version}</version>
<scope>test</scope>
</dependency>
</dependencies>

<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<release>17</release>
<compilerArgs combine.children="append">
<compilerArg>${extraJavaVectorArgs}</compilerArg>
</compilerArgs>
</configuration>
</plugin>

<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-failsafe-plugin</artifactId>
<configuration>
<argLine>${extraJavaVectorArgs}</argLine>
</configuration>
</plugin>

<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<configuration>
<argLine>${extraJavaVectorArgs}</argLine>
</configuration>
</plugin>

<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.4.1</version>
</plugin>

<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-dependency-plugin</artifactId>
<version>3.3.0</version>
<executions>
<execution>
<goals>
<goal>analyze-only</goal>
</goals>
<configuration>
<failOnWarning>true</failOnWarning>
<ignoreNonCompile>true</ignoreNonCompile>
<ignoredNonTestScopedDependencies>
<ignoredNonTestScopedDependency>*</ignoredNonTestScopedDependency>
</ignoredNonTestScopedDependencies>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>

</project>
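
The pom above passes `--add-modules=jdk.incubator.vector` to the compiler and to the test JVMs because incubator modules are not resolved by default. A small runtime check can show whether a given JVM was started with the module. This is a sketch under that assumption: `VectorModuleCheck` is a hypothetical name and not part of this module.

```java
import java.lang.module.ModuleFinder;

// Hypothetical helper for illustration; not part of Parquet.
final class VectorModuleCheck {
  static final String MODULE = "jdk.incubator.vector";

  // True if the running JDK image ships the module at all.
  static boolean shippedWithJdk() {
    return ModuleFinder.ofSystem().find(MODULE).isPresent();
  }

  // True if the module was resolved into the boot layer, for example via --add-modules.
  static boolean resolvedAtStartup() {
    return ModuleLayer.boot().findModule(MODULE).isPresent();
  }

  public static void main(String[] args) {
    System.out.println(MODULE + " shipped with this JDK: " + shippedWithJdk());
    System.out.println(MODULE + " resolved at startup: " + resolvedAtStartup());
  }
}
```

Running it once with and once without `--add-modules=jdk.incubator.vector` makes the effect of the `extraJavaVectorArgs` property visible.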