[SPARK-3294][SQL] Eliminates boxing costs from in-memory columnar storage #2327
Conversation
- override def update(ordinal: Int, value: Any): Unit = values(ordinal).update(value)
+ override def update(ordinal: Int, value: Any) {
+   if (value == null) setNullAt(ordinal) else values(ordinal).update(value)
+ }
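The null check matters because the typed field holders inside a specialized mutable row cannot represent null themselves, so nulls must be routed to `setNullAt` before they reach a holder. A minimal sketch of the idea (the `MutableInt` and `SketchMutableRow` names here are illustrative, not Spark's actual classes):

```scala
// Illustrative sketch of a specialized mutable row that avoids boxing
// for non-null primitive values; not Spark's actual implementation.
abstract class MutableValue {
  var isNull: Boolean = true
  def update(v: Any): Unit
}

final class MutableInt extends MutableValue {
  var value: Int = 0
  override def update(v: Any): Unit = {
    isNull = false
    value = v.asInstanceOf[Int] // unboxes once at the boundary
  }
}

final class SketchMutableRow(val values: Array[MutableValue]) {
  def setNullAt(ordinal: Int): Unit = values(ordinal).isNull = true

  // Mirrors the patched method: nulls go to setNullAt instead of
  // reaching the typed holder.
  def update(ordinal: Int, value: Any): Unit =
    if (value == null) setNullAt(ordinal) else values(ordinal).update(value)
}
```

Once a value is stored this way, typed getters can read it back as a primitive without allocating a wrapper object.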
This change is submitted separately in #2325, as this PR may take longer to finish.
QA tests have started for PR 2327 at commit
QA tests have finished for PR 2327 at commit
Out of curiosity, does this also eliminate boxing for nested data types?
No, unlike Parquet, currently our in-memory columnar format doesn't support complex nested objects well. They are just serialized by Kryo and stored as opaque byte arrays.
@aarondav to expand on that, as soon as there is any nesting, all of our clever tricks for eliminating allocations go out the window. We can probably improve this in future releases.
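For reference, the fallback for nested types amounts to serializing the whole value into an opaque byte array that the column store cannot look inside. A rough sketch of that storage pattern, using JDK serialization as a stand-in for Kryo (Spark uses Kryo here; the `Nested` and `OpaqueColumn` names below are illustrative):

```scala
import java.io._

// Illustrative only: Spark serializes nested values with Kryo, but JDK
// serialization shows the same "opaque byte array" storage pattern.
case class Nested(name: String, scores: Seq[Int])

object OpaqueColumn {
  def serialize(v: AnyRef): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(v)
    out.close()
    bytes.toByteArray // stored as-is in the column buffer; no per-field access
  }

  def deserialize(bytes: Array[Byte]): AnyRef = {
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try in.readObject() finally in.close()
  }
}
```

Because the bytes are opaque, every read must deserialize the full object, which is why the boxing-elimination tricks for flat primitives do not apply.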
ok to test
test this please
Tests timed out after a configured wait of
@marmbrus Please help review this one.
I still need to look this over, but do you want to remove WIP?
@@ -51,10 +51,12 @@ private[sql] abstract class BasicColumnAccessor[T <: DataType, JvmType](
  def hasNext = buffer.hasRemaining

  def extractTo(row: MutableRow, ordinal: Int) {
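`extractTo` is the row-based shape of the accessor API: the accessor decodes a value straight into the target row's typed setter, so a primitive can travel from the `ByteBuffer` to the row without ever becoming an `AnyRef`. A simplified sketch of that shape (hypothetical names, not the real Spark class hierarchy):

```scala
import java.nio.ByteBuffer

// Hypothetical, simplified shapes illustrating the row-based accessor idea.
trait SimpleMutableRow {
  def setInt(ordinal: Int, value: Int): Unit
}

final class IntArrayRow(n: Int) extends SimpleMutableRow {
  val ints = new Array[Int](n)
  def setInt(ordinal: Int, value: Int): Unit = ints(ordinal) = value
}

final class IntColumnAccessor(buffer: ByteBuffer) {
  def hasNext: Boolean = buffer.hasRemaining

  // The key idea: decode directly into the row via a primitive setter,
  // instead of returning a boxed Any for the caller to store.
  def extractTo(row: SimpleMutableRow, ordinal: Int): Unit =
    row.setInt(ordinal, buffer.getInt())
}
```

Compare this with a `next(): Any`-style API, where every primitive read would allocate a wrapper object before being written into the row.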
This style is going to go away in 2.12 or 2.13, I think. Should be `: Unit =`.
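For context, the comment refers to Scala's "procedure syntax", which has been deprecated in favor of an explicit `: Unit =` result type; the two forms compile to the same thing, but the former was later removed from the language:

```scala
class Example {
  var last = 0

  // Procedure syntax: body in braces with no '=' or result type.
  // Deprecated in Scala 2.13 and removed in Scala 3.
  def oldStyle(x: Int) { last = x }

  // Preferred explicit form, identical semantics.
  def newStyle(x: Int): Unit = { last = x }
}
```

The explicit form also avoids a classic pitfall: accidentally writing `def f() { ... }` when an `=` (and a non-Unit result) was intended.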
Nice speedups. I think they might be even more pronounced when there are multiple threads fighting for the GC. Minor comments only. Will merge after they are addressed.
Thanks! I've merged this to master.
This is a major refactoring of the in-memory columnar storage implementation that aims to eliminate boxing costs from the critical paths (building and accessing column buffers) as much as possible. The basic idea is to refactor all major interfaces into a row-based form and use them together with `SpecificMutableRow`. The difficult part is how to adapt all the compression schemes, especially `RunLengthEncoding` and `DictionaryEncoding`, to this design. Since in-memory compression is disabled by default for now, and this PR should be strictly better than before whether or not in-memory compression is enabled, maybe I'll finish that part in another PR.

UPDATE: This PR also takes the chance to optimize `HiveTableScan` by using `SpecificMutableRow` to avoid boxing costs, and by building `Writable` unwrapper functions ahead of time to avoid per-row pattern matching and branching costs.

TODO (left to future PRs):

- Eliminate boxing costs in `RunLengthEncoding`
- Eliminate boxing costs in `DictionaryEncoding` (this seems hard to do without specializing `DictionaryEncoding` for every supported column type)

Micro benchmark

The benchmark uses a 10 million line CSV table consisting of bytes, shorts, integers, longs, floats and doubles. It measures the time to build the in-memory version of this table and the time to scan the whole in-memory table.

Benchmark code can be found here. The script used to generate the input table can be found here.
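The "unwrapper functions ahead of time" idea is that the match on the column's data type happens once per column when the scan is set up, rather than once per row; scanning then just applies the pre-resolved function. A hedged sketch with stand-in types (the real code dispatches on Hadoop `Writable` classes; the `Fake*` names below are illustrative):

```scala
// Stand-ins for Hadoop's IntWritable / LongWritable wrappers.
sealed trait FakeWritable
final case class FakeIntWritable(get: Int) extends FakeWritable
final case class FakeLongWritable(get: Long) extends FakeWritable

object Unwrappers {
  // Resolved once per column, not once per row: the pattern match on the
  // column type runs ahead of time, so the per-row work is a single
  // function application with no branching on the type.
  def forType(dataType: String): FakeWritable => Any = dataType match {
    case "int"  => { case FakeIntWritable(v) => v }
    case "long" => { case FakeLongWritable(v) => v }
  }
}
```

A per-row `value match { case i: FakeIntWritable => ...; case l: FakeLongWritable => ... }` would redo this dispatch for every row and column, which is exactly the cost the PR description says it avoids.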
Speedup:

- Hive table scanning + column buffer building: 18.74%
  - The original benchmark uses 1K as the in-memory batch size; when increased to 10K, the speedup reaches 28.32%.
- In-memory table scanning: 7.95%
Before:
After: