
[SPARK-17187][SQL] Supports using arbitrary Java object as internal aggregation buffer object #14753

Closed

Conversation

@clockfly (Contributor) commented Aug 22, 2016

What changes were proposed in this pull request?

This PR introduces an abstract class TypedImperativeAggregate so that aggregation functions extending it can use an arbitrary user-defined Java object as the intermediate aggregation buffer.

This has advantages like:

  1. It can support a larger category of aggregation functions. For example, it becomes much easier to implement an aggregation function like percentile_approx, which has a complex aggregation buffer definition.
  2. It avoids serialization/deserialization on every call of update or merge when converting a domain-specific aggregation object to the internal Spark SQL storage format.
  3. It is easier to integrate with existing monoid libraries like Algebird, supporting more aggregation functions with high performance.

Please see org.apache.spark.sql.TypedImperativeAggregateSuite.TypedMaxAggregate for an example of how to define a TypedImperativeAggregate aggregation function.
Please see the Javadoc of TypedImperativeAggregate and the JIRA ticket SPARK-17187 for more information.
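For illustration only, here is a rough sketch of what a function extending TypedImperativeAggregate could look like, loosely modeled on the typed-max example in the test suite. The exact set of abstract members and their signatures are assumptions of this sketch, not the PR's verbatim code:

import java.nio.ByteBuffer

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate
import org.apache.spark.sql.types._

// Arbitrary user-defined Java object used as the aggregation buffer.
class MaxValue(var value: Int, var isValueSet: Boolean = false)

case class TypedMax(
    child: Expression,
    mutableAggBufferOffset: Int = 0,
    inputAggBufferOffset: Int = 0) extends TypedImperativeAggregate[MaxValue] {

  // A fresh buffer object is created for each new group.
  override def createAggregationBuffer(): MaxValue = new MaxValue(Int.MinValue)

  // Fold one input row into the buffer, mutating it in place.
  override def update(buffer: MaxValue, input: InternalRow): Unit = {
    child.eval(input) match {
      case v: Int =>
        if (v > buffer.value) { buffer.value = v; buffer.isValueSet = true }
      case null => // ignore nulls
    }
  }

  // Merge a deserialized partial-aggregation buffer into this one.
  override def merge(buffer: MaxValue, input: MaxValue): Unit = {
    if (input.isValueSet && input.value > buffer.value) {
      buffer.value = input.value
      buffer.isValueSet = true
    }
  }

  // Produce the final result for the group.
  override def eval(buffer: MaxValue): Any =
    if (buffer.isValueSet) buffer.value else null

  // The buffer travels between stages as Array[Byte] (stored as BinaryType).
  override def serialize(buffer: MaxValue): Array[Byte] = {
    val bb = ByteBuffer.allocate(5)
    bb.put(if (buffer.isValueSet) 1.toByte else 0.toByte).putInt(buffer.value)
    bb.array()
  }

  override def deserialize(bytes: Array[Byte]): MaxValue = {
    val bb = ByteBuffer.wrap(bytes)
    val isSet = bb.get() == 1
    new MaxValue(bb.getInt, isSet)
  }

  override def children: Seq[Expression] = Seq(child)
  override def nullable: Boolean = true
  override def dataType: DataType = IntegerType
  override def withNewMutableAggBufferOffset(newOffset: Int): TypedMax =
    copy(mutableAggBufferOffset = newOffset)
  override def withNewInputAggBufferOffset(newOffset: Int): TypedMax =
    copy(inputAggBufferOffset = newOffset)
}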

How was this patch tested?

Unit tests.

@SparkQA commented Aug 22, 2016

Test build #64211 has finished for PR 14753 at commit 6efddad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class TypedImperativeAggregate[T >: Null] extends ImperativeAggregate

@SparkQA commented Aug 22, 2016

Test build #64213 has finished for PR 14753 at commit 10861b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class TypedImperativeAggregate[T >: Null] extends ImperativeAggregate

* calls method `eval(buffer: T)` to generate the final output for this group.
* 5. The framework moves on to next group, until all groups have been processed.
*/
abstract class TypedImperativeAggregate[T >: Null] extends ImperativeAggregate {
Contributor:

Does it work in Java?

Contributor Author (clockfly):

I believe so, but I will double-check.

* @param buffer The aggregation buffer object.
* @param input an input row
*/
def update(buffer: T, input: InternalRow): Unit
Contributor:

This assumes the buffer object type T supports in-place updates, which is not always true, e.g. percentile_approx. How about def update(buffer: T, input: InternalRow): T?

Contributor Author (clockfly):

@cloud-fan The user can define a wrapper to do in-place updates.
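A hypothetical sketch of the wrapper idea: the buffer type handed to update stays mutable, while the domain object it wraps remains immutable and is swapped out wholesale on each call. Digest and its insert method are made-up stand-ins for something like a quantile sketch:

import org.apache.spark.sql.catalyst.InternalRow

object InPlaceWrapperSketch {
  // Made-up immutable aggregation state, e.g. a quantile digest.
  case class Digest(count: Long) {
    def insert(v: Double): Digest = Digest(count + 1) // returns a new Digest
  }

  // The actual buffer type: a trivially mutable holder around the immutable state.
  class DigestWrapper(var digest: Digest)

  // update keeps the Unit return type, yet still "replaces" the buffer contents.
  def update(buffer: DigestWrapper, input: InternalRow): Unit = {
    buffer.digest = buffer.digest.insert(input.getDouble(0))
  }
}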

Contributor:

It seems update needs to evaluate the input. We need to document that.

@clockfly force-pushed the object_aggregation_buffer_try_2 branch from 7d88b20 to 0173d2c on August 23, 2016 03:20
@clockfly force-pushed the object_aggregation_buffer_try_2 branch from 0173d2c to d3108ab on August 23, 2016 03:24
def aggregationBufferClass: Class[T]

/** Serializes the aggregation buffer object T to Array[Byte] */
def serialize(buffer: T): Array[Byte]
Contributor Author (clockfly):

Here we limit the serialized format to Array[Byte].

The reason is that SpecificMutableRow does a type check on atomic types for each update call on the aggregation buffer. If we declared the storage format to be IntegerType but actually stored an arbitrary object in the aggregation buffer, SpecificMutableRow would catch the mismatch and throw an exception.
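To make the contract concrete, here is one hedged way a serialize/deserialize pair could satisfy the Array[Byte] requirement, using plain Java serialization. The Serializable bound and the class name are assumptions of this sketch; a real implementation would typically prefer a compact hand-rolled encoding:

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

abstract class SerializableBufferSketch[T <: Serializable] {
  // Flatten the per-group buffer object to bytes, so the declared
  // storage type can uniformly be BinaryType.
  def serialize(buffer: T): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    try oos.writeObject(buffer) finally oos.close()
    bos.toByteArray
  }

  // Rebuild the buffer object from the bytes written above.
  def deserialize(bytes: Array[Byte]): T = {
    val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try ois.readObject().asInstanceOf[T] finally ois.close()
  }
}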

Contributor:

This detail deserves a comment in the code.

@SparkQA commented Aug 23, 2016

Test build #64255 has finished for PR 14753 at commit 0fdc1ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class TypedImperativeAggregate[T] extends ImperativeAggregate

@SparkQA commented Aug 23, 2016

Test build #64256 has finished for PR 14753 at commit 7d88b20.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 23, 2016

Test build #64264 has finished for PR 14753 at commit d3108ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 23, 2016

Test build #64262 has finished for PR 14753 at commit 0173d2c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
}

private def field[U](input: InternalRow, fieldIndex: Int): U = {
Contributor:

Do you have a better name?

Contributor Author (clockfly) commented Aug 23, 2016:

Do you think the name is not clear enough? Or maybe getField?

@SparkQA commented Aug 24, 2016

Test build #64316 has finished for PR 14753 at commit 2873765.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@clockfly force-pushed the object_aggregation_buffer_try_2 branch from b843f2f to 7190eb0 on August 24, 2016 03:50
// For TypedImperativeAggregate with a generic aggregation buffer object, we need to call
// serializeAggregateBufferInPlace(...) explicitly to convert the aggregation buffer object
// to Spark SQL's internally supported serializable storage format.
private def serializeTypedAggregateBuffer(aggregationBuffer: MutableRow): Unit = {
Member:

Unused parameter aggregationBuffer. Or should the following sortBasedAggregationBuffer be replaced with aggregationBuffer?

@SparkQA commented Aug 24, 2016

Test build #64344 has finished for PR 14753 at commit 8c8bd9a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  if (inputValue > buffer.value) {
    buffer.value = inputValue
  }
case null => buffer
Contributor:

nit: just case null =>; we don't need to return anything here, since the return type is Unit.
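That is, the arm can simply drop the trailing expression. A sketch of the suggested shape, with the surrounding match assumed from the quoted diff:

child.eval(input) match {
  case inputValue: Int =>
    if (inputValue > buffer.value) {
      buffer.value = inputValue
    }
  case null => // nothing to return; the enclosing update has result type Unit
}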

@clockfly force-pushed the object_aggregation_buffer_try_2 branch from 5086847 to 7e7cb85 on August 24, 2016 13:30

/**
* In-place replaces the aggregation buffer object stored at buffer's index
* `mutableAggBufferOffset`, with SparkSQL internally supported underlying storage format.
Contributor:

with SparkSQL internally supported underlying storage format. It can only be BinaryType now.

@cloud-fan (Contributor):
LGTM except one comment

@SparkQA commented Aug 24, 2016

Test build #64350 has finished for PR 14753 at commit 7e7cb85.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 24, 2016

Test build #64351 has finished for PR 14753 at commit 86166a1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

final override def merge(buffer: MutableRow, inputBuffer: InternalRow): Unit = {
  val bufferObject = getField[T](buffer, mutableAggBufferOffset)
  // The inputBuffer stores the serialized aggregation buffer object produced by the partial aggregate
  val inputObject = deserialize(getField[Array[Byte]](inputBuffer, inputAggBufferOffset))
Contributor:

nit: we should use inputBuffer.getBinary(inputAggBufferOffset) instead of getField[Array[Byte]](inputBuffer, inputAggBufferOffset), as the data type is BinaryType, not ObjectType(classOf[Any])

Contributor Author (clockfly):

The inputBuffer is a safe row in SortAggregateExec:

processRow(sortBasedAggregationBuffer, safeProj(currentRow))

so inputBuffer.getBinary(inputAggBufferOffset) and getField[Array[Byte]](inputBuffer, inputAggBufferOffset) are equivalent here. That said, yes, it is better to use inputBuffer.getBinary(inputAggBufferOffset) directly.
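A minimal sketch of the agreed-upon form inside merge, with the surrounding context assumed from the quoted diff:

// Read the serialized partial buffer straight off the BinaryType column,
// then deserialize it back into the user-defined buffer object.
val inputObject = deserialize(inputBuffer.getBinary(inputAggBufferOffset))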

* ^
* |
* Aggregation buffer object for `TypedImperativeAggregate` aggregation function
* }}}
Contributor:

Let's also add a normal agg buffer after the generic one, so readers will not assume that generic ones are always put at the end.
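An illustrative layout along those lines, with ordinary buffer fields on both sides of the generic one (the field names here are made up):

{{{
| sum: LongType | aggregation buffer object | count: LongType |
                             ^
                             |
       Generic buffer object for `TypedImperativeAggregate`,
       no longer the last field in the row
}}}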

@SparkQA commented Aug 25, 2016

Test build #64381 has finished for PR 14753 at commit e060d21.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@clockfly force-pushed the object_aggregation_buffer_try_2 branch from cfc22ed to ac8e36a on August 25, 2016 03:00
// Serializes the generic object stored in aggregation buffer
var i = 0
while (i < typedImperativeAggregates.length) {
i += 1
Contributor:

???
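Presumably the per-element work should come before the increment. A sketch of the intended loop, reusing the serializeAggregateBufferInPlace name from the comment quoted earlier (the exact call shape is an assumption):

var i = 0
while (i < typedImperativeAggregates.length) {
  typedImperativeAggregates(i).serializeAggregateBufferInPlace(aggregationBuffer)
  i += 1
}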


@SparkQA commented Aug 25, 2016

Test build #64393 has finished for PR 14753 at commit ac8e36a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* aggregation only support aggregation buffer of mutable types (like LongType, IntType that have
* fixed length and can be mutated in place in UnsafeRow)
*/
abstract class TypedImperativeAggregate[T] extends ImperativeAggregate {
Contributor:

Isn't this the wrong way around? Isn't ImperativeAggregate the untyped version of a TypedImperativeAggregate, much like Dataset and DataFrame?

I know this has been done for engineering purposes, but I still wonder if we shouldn't reverse the hierarchy here.

Contributor:

ImperativeAggregate only defines the interface. It does not specify which buffer types are accepted, right?

@hvanhovell (Contributor):

@clockfly is this supposed to work with window functions?

@yhuai (Contributor) commented Aug 25, 2016

@hvanhovell This is supposed to work with window functions.

@SparkQA commented Aug 25, 2016

Test build #64426 has finished for PR 14753 at commit ca574e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai (Contributor) commented Aug 25, 2016

Thanks. Overall looks good. I am merging this to master. Let me tweak the interface later.

@asfgit closed this in d96d151 on Aug 26, 2016