
[SPARK-17187][SQL] Supports using arbitrary Java object as internal aggregation buffer object #14753

Closed

Conversation

@clockfly (Contributor) commented Aug 22, 2016

What changes were proposed in this pull request?

This PR introduces an abstract class TypedImperativeAggregate so that aggregation functions extending it can use an arbitrary user-defined Java object as the intermediate aggregation buffer.

This has advantages like:

  1. It can support a larger category of aggregation functions. For example, it becomes much easier to implement an aggregation function like percentile_approx, which has a complex aggregation buffer definition.
  2. It avoids serialization/deserialization on every call of update or merge when converting a domain-specific aggregation object to the internal Spark SQL storage format.
  3. It is easier to integrate with existing monoid libraries like Algebird, supporting more aggregation functions with high performance.

Please see org.apache.spark.sql.TypedImperativeAggregateSuite.TypedMaxAggregate for an example of how to define a TypedImperativeAggregate aggregation function.
Please see the Javadoc of TypedImperativeAggregate and the JIRA ticket SPARK-17187 for more information.
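For illustration only, here is a rough sketch of what a function extending TypedImperativeAggregate could look like, loosely modeled on the typed-max example in the test suite. The exact set of abstract members and their signatures are assumptions of this sketch, not the PR's verbatim code:

import java.nio.ByteBuffer

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate
import org.apache.spark.sql.types._

// Arbitrary user-defined Java object used as the aggregation buffer.
class MaxValue(var value: Int, var isValueSet: Boolean = false)

case class TypedMax(
    child: Expression,
    mutableAggBufferOffset: Int = 0,
    inputAggBufferOffset: Int = 0) extends TypedImperativeAggregate[MaxValue] {

  // A fresh buffer object is created for each new group.
  override def createAggregationBuffer(): MaxValue = new MaxValue(Int.MinValue)

  // Fold one input row into the buffer, mutating it in place.
  override def update(buffer: MaxValue, input: InternalRow): Unit = {
    child.eval(input) match {
      case v: Int =>
        if (v > buffer.value) { buffer.value = v; buffer.isValueSet = true }
      case null => // ignore nulls
    }
  }

  // Merge a deserialized partial-aggregation buffer into this one.
  override def merge(buffer: MaxValue, input: MaxValue): Unit = {
    if (input.isValueSet && input.value > buffer.value) {
      buffer.value = input.value
      buffer.isValueSet = true
    }
  }

  // Produce the final result for the group.
  override def eval(buffer: MaxValue): Any =
    if (buffer.isValueSet) buffer.value else null

  // The buffer travels between stages as Array[Byte] (stored as BinaryType).
  override def serialize(buffer: MaxValue): Array[Byte] = {
    val bb = ByteBuffer.allocate(5)
    bb.put(if (buffer.isValueSet) 1.toByte else 0.toByte).putInt(buffer.value)
    bb.array()
  }

  override def deserialize(bytes: Array[Byte]): MaxValue = {
    val bb = ByteBuffer.wrap(bytes)
    val isSet = bb.get() == 1
    new MaxValue(bb.getInt, isSet)
  }

  override def children: Seq[Expression] = Seq(child)
  override def nullable: Boolean = true
  override def dataType: DataType = IntegerType
  override def withNewMutableAggBufferOffset(newOffset: Int): TypedMax =
    copy(mutableAggBufferOffset = newOffset)
  override def withNewInputAggBufferOffset(newOffset: Int): TypedMax =
    copy(inputAggBufferOffset = newOffset)
}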

How was this patch tested?

Unit tests.

@SparkQA commented Aug 22, 2016

Test build #64211 has finished for PR 14753 at commit 6efddad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class TypedImperativeAggregate[T >: Null] extends ImperativeAggregate

@SparkQA commented Aug 22, 2016

Test build #64213 has finished for PR 14753 at commit 10861b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class TypedImperativeAggregate[T >: Null] extends ImperativeAggregate

* calls method `eval(buffer: T)` to generate the final output for this group.
* 5. The framework moves on to next group, until all groups have been processed.
*/
abstract class TypedImperativeAggregate[T >: Null] extends ImperativeAggregate {
Contributor:

Does it work in Java?

Contributor Author (clockfly):

I believe so, but I will double-check.

* @param buffer The aggregation buffer object.
* @param input an input row
*/
def update(buffer: T, input: InternalRow): Unit
Contributor:

This assumes the buffer object type T supports in-place updates, which is not always true, e.g. percentile_approx. How about def update(buffer: T, input: InternalRow): T?

Contributor Author (clockfly):

@cloud-fan The user can define a wrapper to do in-place updates.
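A hypothetical sketch of the wrapper idea: the buffer type handed to update stays mutable, while the domain object it wraps remains immutable and is swapped out wholesale on each call. Digest and its insert method are made-up stand-ins for something like a quantile sketch:

import org.apache.spark.sql.catalyst.InternalRow

object InPlaceWrapperSketch {
  // Made-up immutable aggregation state, e.g. a quantile digest.
  case class Digest(count: Long) {
    def insert(v: Double): Digest = Digest(count + 1) // returns a new Digest
  }

  // The actual buffer type: a trivially mutable holder around the immutable state.
  class DigestWrapper(var digest: Digest)

  // update keeps the Unit return type, yet still "replaces" the buffer contents.
  def update(buffer: DigestWrapper, input: InternalRow): Unit = {
    buffer.digest = buffer.digest.insert(input.getDouble(0))
  }
}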

Contributor:

It seems update needs to evaluate the input. We need to document that.

@clockfly force-pushed the object_aggregation_buffer_try_2 branch from 7d88b20 to 0173d2c on August 23, 2016 03:20
@clockfly force-pushed the object_aggregation_buffer_try_2 branch from 0173d2c to d3108ab on August 23, 2016 03:24
def aggregationBufferClass: Class[T]

/** Serializes the aggregation buffer object T to Array[Byte] */
def serialize(buffer: T): Array[Byte]
Contributor Author (clockfly):

Here we limit the serialized format to Array[Byte].

The reason is that SpecificMutableRow does a type check on atomic types for each update call on the aggregation buffer. If we declared the storage format to be IntegerType but actually stored an arbitrary object in the aggregation buffer, SpecificMutableRow would catch the mismatch and throw an exception.
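To make the contract concrete, here is one hedged way a serialize/deserialize pair could satisfy the Array[Byte] requirement, using plain Java serialization. The Serializable bound and the class name are assumptions of this sketch; a real implementation would typically prefer a compact hand-rolled encoding:

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

abstract class SerializableBufferSketch[T <: Serializable] {
  // Flatten the per-group buffer object to bytes, so the declared
  // storage type can uniformly be BinaryType.
  def serialize(buffer: T): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    try oos.writeObject(buffer) finally oos.close()
    bos.toByteArray
  }

  // Rebuild the buffer object from the bytes written above.
  def deserialize(bytes: Array[Byte]): T = {
    val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try ois.readObject().asInstanceOf[T] finally ois.close()
  }
}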

Contributor:

This detail deserves a comment in the code.

@SparkQA commented Aug 23, 2016

Test build #64255 has finished for PR 14753 at commit 0fdc1ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class TypedImperativeAggregate[T] extends ImperativeAggregate

@SparkQA commented Aug 23, 2016

Test build #64256 has finished for PR 14753 at commit 7d88b20.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 23, 2016

Test build #64264 has finished for PR 14753 at commit d3108ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 23, 2016

Test build #64262 has finished for PR 14753 at commit 0173d2c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
}

private def field[U](input: InternalRow, fieldIndex: Int): U = {
Contributor:

Do you have a better name?

Contributor Author (clockfly) commented Aug 23, 2016:

Do you think the name is not clear enough? Or maybe getField?

@SparkQA commented Aug 24, 2016

Test build #64316 has finished for PR 14753 at commit 2873765.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@clockfly force-pushed the object_aggregation_buffer_try_2 branch from b843f2f to 7190eb0 on August 24, 2016 03:50
// For TypedImperativeAggregate with a generic aggregation buffer object, we need to call
// serializeAggregateBufferInPlace(...) explicitly to convert the aggregation buffer object
// to Spark SQL's internally supported serializable storage format.
private def serializeTypedAggregateBuffer(aggregationBuffer: MutableRow): Unit = {
Member:

Unused parameter aggregationBuffer. Or should the following sortBasedAggregationBuffer be replaced with aggregationBuffer?

@SparkQA commented Aug 24, 2016

Test build #64344 has finished for PR 14753 at commit 8c8bd9a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  if (inputValue > buffer.value) {
    buffer.value = inputValue
  }
case null => buffer
Contributor:

nit: just case null =>; we don't need to return anything here, since the return type is Unit.
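That is, the arm can simply drop the trailing expression. A sketch of the suggested shape, with the surrounding match assumed from the quoted diff:

child.eval(input) match {
  case inputValue: Int =>
    if (inputValue > buffer.value) {
      buffer.value = inputValue
    }
  case null => // nothing to return; the enclosing update has result type Unit
}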

@clockfly force-pushed the object_aggregation_buffer_try_2 branch from 5086847 to 7e7cb85 on August 24, 2016 13:30

/**
* In-place replaces the aggregation buffer object stored at buffer's index
* `mutableAggBufferOffset`, with SparkSQL internally supported underlying storage format.
Contributor:

with SparkSQL internally supported underlying storage format. It can only be BinaryType now.

@cloud-fan (Contributor):
LGTM except one comment

@SparkQA commented Aug 24, 2016

Test build #64350 has finished for PR 14753 at commit 7e7cb85.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 24, 2016

Test build #64351 has finished for PR 14753 at commit 86166a1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

final override def merge(buffer: MutableRow, inputBuffer: InternalRow): Unit = {
  val bufferObject = getField[T](buffer, mutableAggBufferOffset)
  // The inputBuffer stores the serialized aggregation buffer object produced by the partial aggregate
  val inputObject = deserialize(getField[Array[Byte]](inputBuffer, inputAggBufferOffset))
Contributor:

nit: we should use inputBuffer.getBinary(inputAggBufferOffset) instead of getField[Array[Byte]](inputBuffer, inputAggBufferOffset), as the data type is BinaryType, not ObjectType(classOf[Any])

Contributor Author (clockfly):

The inputBuffer is a safe row in SortAggregateExec:

processRow(sortBasedAggregationBuffer, safeProj(currentRow))

so inputBuffer.getBinary(inputAggBufferOffset) and getField[Array[Byte]](inputBuffer, inputAggBufferOffset) are equivalent here. That said, yes, it is better to use inputBuffer.getBinary(inputAggBufferOffset) directly.
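A minimal sketch of the agreed-upon form inside merge, with the surrounding context assumed from the quoted diff:

// Read the serialized partial buffer straight off the BinaryType column,
// then deserialize it back into the user-defined buffer object.
val inputObject = deserialize(inputBuffer.getBinary(inputAggBufferOffset))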

* ^
* |
* Aggregation buffer object for `TypedImperativeAggregate` aggregation function
* }}}
Contributor:

Let's also add a normal agg buffer after the generic one, so readers will not assume that generic ones are always put at the end.
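An illustrative layout along those lines, with ordinary buffer fields on both sides of the generic one (the field names here are made up):

{{{
| sum: LongType | aggregation buffer object | count: LongType |
                             ^
                             |
       Generic buffer object for `TypedImperativeAggregate`,
       no longer the last field in the row
}}}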

@SparkQA commented Aug 25, 2016

Test build #64381 has finished for PR 14753 at commit e060d21.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@clockfly force-pushed the object_aggregation_buffer_try_2 branch from cfc22ed to ac8e36a on August 25, 2016 03:00
// Serializes the generic object stored in aggregation buffer
var i = 0
while (i < typedImperativeAggregates.length) {
i += 1
Contributor:

???
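Presumably the per-element work should come before the increment. A sketch of the intended loop, reusing the serializeAggregateBufferInPlace name from the comment quoted earlier (the exact call shape is an assumption):

var i = 0
while (i < typedImperativeAggregates.length) {
  typedImperativeAggregates(i).serializeAggregateBufferInPlace(aggregationBuffer)
  i += 1
}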


@SparkQA commented Aug 25, 2016

Test build #64393 has finished for PR 14753 at commit ac8e36a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* aggregation only support aggregation buffer of mutable types (like LongType, IntType that have
* fixed length and can be mutated in place in UnsafeRow)
*/
abstract class TypedImperativeAggregate[T] extends ImperativeAggregate {
Contributor:

Isn't this the wrong way around? Isn't ImperativeAggregate the untyped version of a TypedImperativeAggregate, much like Dataset and DataFrame?

I know this has been done for engineering purposes, but I still wonder if we shouldn't reverse the hierarchy here.

Contributor:

ImperativeAggregate only defines the interface. It does not specify which buffer types are accepted, right?

@hvanhovell (Contributor):

@clockfly is this supposed to work with window functions?

@yhuai (Contributor) commented Aug 25, 2016

@hvanhovell This is supposed to work with window functions.

@SparkQA commented Aug 25, 2016

Test build #64426 has finished for PR 14753 at commit ca574e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai (Contributor) commented Aug 25, 2016

Thanks. Overall looks good. I am merging this to master. Let me tweak the interface later.

@asfgit closed this in d96d151 on Aug 26, 2016