Revisit Vector Accessors
Drill has a complex stack of code used to write to, and read from, value vectors:
- Application code, which keeps track of the current row position and the number of rows (when writing), and does type-specific access using the Accessor or Mutator class for each vector ({-|Nullable|Repeated}{type}Vector). Or, it uses an accessor to do the writing (Parquet, ScanBatch).
- Accessors (generated in the vector.complex package) that write to, or read from, vectors, also in a type-specific way: {-|Nullable|Repeated}{type}Writer and {-|Nullable|Repeated}{type}Reader.
- Accessor and Mutator classes on each value vector that provide type-specific access to data.
- Various flavors of get/set methods on DrillBuf, often for primitive or byte[] arguments.
- Repeated versions of (mostly) the same methods on the UnsignedDirectLittleEndian (UDLE) class.
- Repeated versions of (mostly) the same methods on Netty's pooled or unpooled byte buffers.
- Netty's PlatformDependent methods that repeat the primitive get/set methods to work with memory addresses.
- Java Unsafe class that implements the above methods.
Each layer has a set (not always the same) of get/set methods.
A basic rule of performance engineering is to optimize inner loops. In Drill, the inner loop often includes reading and/or writing a specific column value. Each access must descend through the six or seven layers identified above.
Further, each layer has roughly the same (but slightly different) versions of the same get/set methods. The result is a very large footprint of code to be kept consistent. For each vector, for each primitive data type, multiple copies of get/set methods are needed.
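To make the duplication concrete, here is a toy sketch (not Drill code) of three layers that each repeat the same int get/set pair and forward to the layer below. The class names ToyMemory, ToyBuf and ToyIntMutator are hypothetical, and java.nio.ByteBuffer stands in for the raw-memory layer purely as an assumption for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Bottom layer: stands in for raw memory access (hypothetical, not Drill's).
class ToyMemory {
    private final ByteBuffer mem;

    ToyMemory(int bytes) {
        mem = ByteBuffer.allocateDirect(bytes).order(ByteOrder.LITTLE_ENDIAN);
    }

    void putInt(int offset, int value) { mem.putInt(offset, value); }
    int getInt(int offset) { return mem.getInt(offset); }
}

// Middle layer: repeats the same methods, adding only a bounds check.
class ToyBuf {
    private final ToyMemory memory;
    private final int capacity;

    ToyBuf(int bytes) {
        memory = new ToyMemory(bytes);
        capacity = bytes;
    }

    void setInt(int offset, int value) {
        check(offset, Integer.BYTES);
        memory.putInt(offset, value);
    }

    int getInt(int offset) {
        check(offset, Integer.BYTES);
        return memory.getInt(offset);
    }

    private void check(int offset, int width) {
        if (offset < 0 || offset + width > capacity) {
            throw new IndexOutOfBoundsException("offset " + offset);
        }
    }
}

// Top layer: repeats them again, translating a row index to a byte offset.
class ToyIntMutator {
    private final ToyBuf buf;

    ToyIntMutator(int rowCount) { buf = new ToyBuf(rowCount * Integer.BYTES); }

    void set(int row, int value) { buf.setInt(row * Integer.BYTES, value); }
    int get(int row) { return buf.getInt(row * Integer.BYTES); }
}
```

Even in this reduced form, one int type needs three get/set pairs that must stay consistent; multiplying by every primitive type and every vector flavor gives the footprint described above.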
The goal of this exercise is to determine whether we can optimize the accessor stack to improve performance and reduce code complexity. Indeed, other pages on this site show that we can double performance by eliminating most of the access layers.
The above stack is redesigned to separate concerns:
- Netty ByteBuf (Drill DrillBuf) holds data.
- A set of encoders maps primitive types to/from Drill's little-endian (LE) storage format, given only an address, offset and data value.
- Accessors (readers, writers) provide a type-independent API to read and write vectors. Generated implementations map the generic API to specific vector types.
- Application code works with the readers and writers.
The call stack to access a value now is:
- Application
- Accessor
- Encoder
- PlatformDependent
In practice, even the encoder can probably be skipped for most simple types; perhaps it is needed only to encode decimal and period types.
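A minimal sketch of the accessor and encoder split might look like the following. This is an illustration under stated assumptions, not the proposed implementation: IntEncoderSketch and IntWriterSketch are hypothetical names, a java.nio.ByteBuffer replaces the raw address plus PlatformDependent call, and having the writer track the row position is an assumption made to keep the example short.

```java
import java.nio.ByteBuffer;

// Hypothetical encoder: converts an int to and from little-endian bytes
// at a given byte offset. It knows nothing about rows or vectors.
final class IntEncoderSketch {
    static final int WIDTH = Integer.BYTES;

    private IntEncoderSketch() {}

    static void write(ByteBuffer buf, int offset, int value) {
        buf.put(offset,     (byte) value);           // lowest byte first
        buf.put(offset + 1, (byte) (value >>> 8));
        buf.put(offset + 2, (byte) (value >>> 16));
        buf.put(offset + 3, (byte) (value >>> 24));
    }

    static int read(ByteBuffer buf, int offset) {
        return (buf.get(offset) & 0xFF)
             | (buf.get(offset + 1) & 0xFF) << 8
             | (buf.get(offset + 2) & 0xFF) << 16
             | (buf.get(offset + 3) & 0xFF) << 24;
    }
}

// Hypothetical accessor: tracks the write position (an assumption of this
// sketch) and delegates the byte layout to the encoder.
final class IntWriterSketch {
    private final ByteBuffer buf;
    private int rowCount;

    IntWriterSketch(int maxRows) {
        buf = ByteBuffer.allocateDirect(maxRows * IntEncoderSketch.WIDTH);
    }

    void setInt(int value) {
        IntEncoderSketch.write(buf, rowCount * IntEncoderSketch.WIDTH, value);
        rowCount++;
    }

    int rowCount() { return rowCount; }
    ByteBuffer buffer() { return buf; }
}
```

Application code would call only `setInt(value)` on the writer. For a plain int the encoder is trivial, which is consistent with the observation that it could be skipped for simple types.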
The (column) accessor is defined as an interface, which is all the application needs. See the recently added ColumnReader and ColumnWriter classes for a prototype. Since the interface itself is vector-type neutral, the hierarchy can be used to represent implementations rather than data types. For example:
- ColumnWriter
  - VectorColumnWriter -- "classic" writers
    - IntVectorColumnWriter
    - NullableIntVectorColumnWriter
  - DirectColumnWriter -- "streamlined" writers for direct memory
    - DirectIntColumnWriter
    - DirectNullableIntColumnWriter
  - HeapColumnWriter -- hypothetical implementation using heap buffers
    - HeapIntColumnWriter
    - HeapNullableIntColumnWriter
Here, we have three different storage formats (classic vectors, direct memory and heap buffers), each with implementations for each vector type. As with the original ByteBuf(fer) concepts, a single interface works for all implementations. (Methods that don't apply to a given vector type just throw exceptions.)
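The idea of one interface with storage-specific implementations can be sketched as below. All names carry a Sketch suffix to mark them as hypothetical; the method set (`setInt`, `setString`) and the row-index parameter are assumptions and are not taken from the ColumnWriter prototype.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Type-neutral interface: the only thing application code sees.
interface ColumnWriterSketch {
    void setInt(int row, int value);
    void setString(int row, String value);
}

// Base class: every setter throws unless a subclass overrides it.
abstract class AbstractColumnWriterSketch implements ColumnWriterSketch {
    @Override
    public void setInt(int row, int value) {
        throw new UnsupportedOperationException("setInt");
    }

    @Override
    public void setString(int row, String value) {
        throw new UnsupportedOperationException("setString");
    }
}

// "Direct" flavor for int columns: little-endian values in direct memory.
final class DirectIntWriterSketch extends AbstractColumnWriterSketch {
    private final ByteBuffer buf;

    DirectIntWriterSketch(int rows) {
        buf = ByteBuffer.allocateDirect(rows * Integer.BYTES)
                        .order(ByteOrder.LITTLE_ENDIAN);
    }

    @Override
    public void setInt(int row, int value) {
        buf.putInt(row * Integer.BYTES, value);
    }

    int getInt(int row) { return buf.getInt(row * Integer.BYTES); }
}

// "Heap" flavor for int columns: same interface, values in a Java array.
final class HeapIntWriterSketch extends AbstractColumnWriterSketch {
    private final int[] values;

    HeapIntWriterSketch(int rows) { values = new int[rows]; }

    @Override
    public void setInt(int row, int value) { values[row] = value; }

    int getInt(int row) { return values[row]; }
}
```

Code written against `ColumnWriterSketch` works unchanged with either storage flavor, and calling `setString` on an int writer fails with `UnsupportedOperationException` rather than at compile time, which is the trade-off the parenthetical remark above accepts.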