SPARK-2047: Introduce an in-mem Sorter, and use it to reduce mem usage #1502

aarondav · 2014-07-20T20:39:22Z

Why and what?

Currently, the AppendOnlyMap performs an "in-place" sort by converting its array of [key, value, key, value] pairs into an array of [(key, value), (key, value)] pairs. However, this causes us to allocate many Tuple2 objects, which come at a nontrivial overhead.

This patch adds a Sorter API, intended for in-memory sorts, which simply ports the Android Timsort implementation (available under Apache v2) and abstracts the interface in a way which introduces no more than 1 virtual function invocation of overhead at each abstraction point.

Please compare our port of the Android Timsort with the original implementation: http://www.diffchecker.com/wiwrykcl
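
To illustrate the idea of sorting the flat [key, value, key, value] array without allocating tuples, here is a minimal sketch. All names in it (FlatKVSortSketch, Format, KVArrayFormat) are hypothetical and are not the classes added by this patch, and a plain insertion sort stands in for Timsort to keep the example short. The only point it makes is that the sort touches the buffer exclusively through a small format interface, so each record is moved as a key/value pair of slots in place.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Illustrative only: hypothetical names, not the API introduced by this patch. */
final class FlatKVSortSketch {

    /** Tells the sort how to read a record's key and how to move records in a buffer. */
    interface Format<K, B> {
        K getKey(B data, int pos);
        void swap(B data, int pos0, int pos1);
    }

    /** Records are laid out as [key0, value0, key1, value1, ...]. */
    static final class KVArrayFormat<K> implements Format<K, Object[]> {
        @Override
        @SuppressWarnings("unchecked")
        public K getKey(Object[] data, int pos) {
            return (K) data[2 * pos];
        }

        @Override
        public void swap(Object[] data, int pos0, int pos1) {
            Object k = data[2 * pos0];
            Object v = data[2 * pos0 + 1];
            data[2 * pos0] = data[2 * pos1];
            data[2 * pos0 + 1] = data[2 * pos1 + 1];
            data[2 * pos1] = k;
            data[2 * pos1 + 1] = v;
        }
    }

    /** Insertion sort over records [lo, hi); a stand-in for Timsort, chosen for brevity. */
    static <K, B> void sort(B data, int lo, int hi, Format<K, B> f, Comparator<? super K> c) {
        for (int i = lo + 1; i < hi; i++) {
            for (int j = i; j > lo && c.compare(f.getKey(data, j - 1), f.getKey(data, j)) > 0; j--) {
                f.swap(data, j - 1, j);
            }
        }
    }

    public static void main(String[] args) {
        Object[] kv = {3, "c", 1, "a", 2, "b"};
        // Three records, so the record range is [0, 3).
        sort(kv, 0, kv.length / 2, new KVArrayFormat<Integer>(), Comparator.<Integer>naturalOrder());
        System.out.println(Arrays.toString(kv)); // prints [1, a, 2, b, 3, c]
    }
}
```

Note that the comparator only ever sees keys, and no per-record wrapper object is created during the sort.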

Memory implications

An AppendOnlyMap contains N kv pairs, which results in roughly 2N elements within its underlying array. Each of these elements is 4 bytes wide in a compressed OOPs system, which is the default.

Today's approach immediately allocates N Tuple2 objects, which take up 24N bytes in total (exposed via YourKit), and undergoes a Java sort. The Java 6 version immediately copies the entire array (4N bytes here), while the Java 7 version has a worst-case allocation of half the array (2N bytes).
This results in a worst-case sorting overhead of 24N + 2N = 26N bytes (for Java 7).

The Sorter does not require allocating any tuples, but since it uses Timsort, it may copy up to half the entire array in the worst case.
This results in a worst-case sorting overhead of 4N bytes.

Thus, we have reduced the worst-case overhead of the sort by roughly 22 bytes times the number of elements.
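
As a worked instance of the arithmetic above, the sketch below plugs in N = 25 million, the size used in the microbenchmark. The constants are the figures quoted in this description (24 bytes per Tuple2, 4 bytes per reference with compressed OOPs); the class and method names are hypothetical.

```java
/** Illustrative arithmetic only; constants are the figures quoted in the description. */
final class SortOverheadSketch {
    static final long TUPLE_BYTES = 24; // per Tuple2, as reported via YourKit
    static final long REF_BYTES = 4;    // per array element with compressed OOPs

    /** Tuple-based sort on Java 7: N tuples, plus a copy of up to half the N-element tuple array. */
    static long tupleSortWorstCase(long n) {
        return TUPLE_BYTES * n + REF_BYTES * n / 2;
    }

    /** Sorter: no tuples, but up to half of the 2N-element key/value array may be copied. */
    static long sorterWorstCase(long n) {
        return REF_BYTES * (2 * n) / 2;
    }

    public static void main(String[] args) {
        long n = 25_000_000L;
        System.out.println("tuple sort:  " + tupleSortWorstCase(n) + " bytes");
        System.out.println("sorter:      " + sorterWorstCase(n) + " bytes");
        System.out.println("difference:  " + (tupleSortWorstCase(n) - sorterWorstCase(n)) + " bytes");
    }
}
```

For 25 million pairs this works out to 650,000,000 bytes against 100,000,000 bytes, a difference of 22 bytes per pair.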

Performance implications

As the destructiveSortedIterator is used for spilling in an ExternalAppendOnlyMap, the purpose of this patch is to provide stability by reducing memory usage rather than improve performance. However, because it implements Timsort, it also brings a substantial performance boost over our prior implementation.

Here are the results of a microbenchmark that sorted 25 million, randomly distributed (Float, Int) pairs. The Java Arrays.sort() tests were run only on the keys, and thus moved less data. Our current implementation is called "Tuple-sort using Arrays.sort()" while the new implementation is "KV-array using Sorter".

Test	First run (JDK6)	Average of 10 (JDK6)	First run (JDK7)	Average of 10 (JDK7)
primitive Arrays.sort()	3216 ms	1190 ms	2724 ms	131 ms (!!)
Arrays.sort()	18564 ms	2006 ms	13201 ms	878 ms
Tuple-sort using Arrays.sort()	31813 ms	3550 ms	20990 ms	1919 ms
KV-array using Sorter			15020 ms	834 ms

The results show that this Sorter performs exactly as expected (after the first run) -- it is as fast as the Java 7 Arrays.sort() (which shares the same algorithm), but is significantly faster than the Tuple-sort on Java 6 or 7.

In short, this patch should significantly improve performance for users running either Java 6 or 7.
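
For readers who want to reproduce the shape of such a measurement, the following is a rough sketch of a harness that reports a first (cold) run and the average of the subsequent runs. It is not the benchmark used for the table above: the class name and structure are hypothetical, it sorts only boxed Float keys with Arrays.sort(), and it assumes the "average" excludes the first run, which the description does not specify.

```java
import java.util.Arrays;
import java.util.Random;

/** Illustrative harness only; not the benchmark that produced the numbers above. */
final class SortBenchSketch {

    /**
     * Sorts a fresh random array on every run.
     * Returns {first run in ms, average of the remaining runs in ms}.
     */
    static double[] time(int n, int runs, long seed) {
        Random rnd = new Random(seed);
        double first = 0.0;
        double rest = 0.0;
        for (int r = 0; r < runs; r++) {
            Float[] keys = new Float[n];
            for (int i = 0; i < n; i++) {
                keys[i] = rnd.nextFloat();
            }
            long start = System.nanoTime();
            Arrays.sort(keys);
            double ms = (System.nanoTime() - start) / 1e6;
            if (r == 0) {
                first = ms; // includes JIT warm-up, so it is reported separately
            } else {
                rest += ms;
            }
        }
        return new double[] {first, runs > 1 ? rest / (runs - 1) : 0.0};
    }

    public static void main(String[] args) {
        // The description used 25 million elements; a smaller size keeps this quick to run.
        double[] t = time(1_000_000, 11, 42L);
        System.out.printf("first run: %.0f ms, average of 10: %.0f ms%n", t[0], t[1]);
    }
}
```

Reporting the first run separately matters here because, as the table shows, the cold run can be several times slower than the steady state.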

Currently, the AppendOnlyMap performs an "in-place" sort by converting its array of [key, value, key, value] pairs into an array of [(key, value), (key, value)] pairs. However, this causes us to allocate many Tuple2 objects, which come at a nontrivial overhead.

This patch adds a Sorter API, intended for in-memory sorts, which simply ports the Java OpenJDK 6 implementation of Arrays.sort() (which uses a merge sort) and abstracts the interface in a way which introduces no more than 1 virtual function invocation of overhead at each abstraction point.

Please compare our port of the Java 6 sort with the original implementation: http://www.diffchecker.com/kh9ufcqo

Memory implications

An AppendOnlyMap contains N kv pairs, which results in roughly 2N elements within its underlying array. Each of these elements is 4 bytes wide in a compressed OOPs system (https://wikis.oracle.com/display/HotSpotInternals/CompressedOops), which is the default.

Today's approach immediately allocates N Tuple2 objects, which take up 24N bytes in total (exposed via YourKit), and undergoes a Java sort. The Java 6 version immediately copies the entire array (4N bytes here), while the Java 7 version has a worst-case allocation of half the array (2N bytes).
This results in a sorting overhead of 24N + 4N = 28N bytes (for Java 6).

The Sorter does not require allocating any tuples, but since it uses the Java 6 merge sort algorithm, it does copy the entire array (and that is the entire array, not just the half needed for Tuples).
This results in a sorting overhead of 8N bytes.

Thus, we have reduced the overhead of the sort by roughly 20 bytes times the number of elements.

Performance implications

As the destructiveSortedIterator is used for spilling in an ExternalAppendOnlyMap, the purpose of this patch is to provide stability by reducing memory usage rather than improve performance. Indeed, this PR implements Java 6's merge sort rather than the Java 7 Timsort, which is much more performant. A future optimization is to port the Timsort over, which the SortDataFormat API should support with minimal changes.

Nevertheless, here are the results of a microbenchmark that sorted 25 million, randomly distributed (Float, Int) pairs. The Java Arrays.sort() tests were run only on the keys, and thus moved less data. Our current implementation is called "Tuple-sort using Arrays.sort()".

Java version	Test	First run	Average of 10
6	primitive Arrays.sort()	3216 ms	1190 ms
6	Arrays.sort()	18564 ms	2006 ms
6	Tuple-sort using Arrays.sort()	31813 ms	3550 ms
7	primitive Arrays.sort()	2724 ms	131 ms (!!)
7	Arrays.sort()	13201 ms	878 ms
7	Tuple-sort using Arrays.sort()	20990 ms	1919 ms
7	KV-sort using Sorter	18232 ms	2030 ms
7	Microbenchmarks are stupid	25708 ms	2400 ms

Note that the final test was the same as KV-sort using Sorter, but with a second implementation of SortDataFormat loaded in the JVM (presumably causing a de-opt for the virtual function call shortcircuit).

The results show that this Sorter performs exactly as expected -- it is about as fast as the Java 6 Arrays.sort() (which shares the same algorithm), but is significantly faster than the Tuple-sort on Java 6. The Java 7 Timsort provided a huge speedup in this benchmark, suggesting that using it instead would result in roughly a 2x speedup. However, the Tuple-based approach is still not significantly faster despite the much better algorithm.

In short, this patch should significantly improve performance for users running Java 6, and cause a minor performance degradation for users running Java 7.

SparkQA · 2014-07-20T20:43:12Z

QA tests have started for PR 1502. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16881/consoleFull

rxin · 2014-07-20T20:50:48Z

Cool. What about P^3 sort? :)

SparkQA · 2014-07-20T22:27:27Z

QA results for PR 1502:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
* This trait extends Any to ensure it is universal (and thus compiled to a Java interface).
trait SortDataFormat[K, Buffer] extends Any {
class KVArraySortDataFormat[K, T <: AnyRef : ClassTag] extends SortDataFormat[K, Array[T]] {
class Sorter<K, Buffer> {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16881/consoleFull

SparkQA · 2014-07-20T22:48:10Z

QA tests have started for PR 1502. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16883/consoleFull

SparkQA · 2014-07-21T00:26:47Z

QA results for PR 1502:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
* This trait extends Any to ensure it is universal (and thus compiled to a Java interface).
class KVArraySortDataFormat[K, T <: AnyRef : ClassTag] extends SortDataFormat[K, Array[T]] {
class Sorter<K, Buffer> {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16883/consoleFull

aarondav · 2014-07-21T01:24:03Z

The "added public classes" are private[spark] and package-private to org.apache.spark.util, respectively. Also, "* This trait extends Any to ensure it is universal (and thus compiled to a Java interface)." is a comment, so I wouldn't consider it public API.

@mateiz Please take a brief look at the API to make sure it's still suitable for your SizeTrackingCollection's destructiveSortedIterator. The only major change to that API was that the Comparator now only takes the key instead of a (K, C) pair.

pwendell · 2014-07-21T06:26:01Z

I spoke with @aarondav, but I'm not sure we can borrow this code from Java if it is LGPL licensed.

rxin · 2014-07-21T06:32:09Z

He did it!

SparkQA · 2014-07-21T06:33:15Z

QA tests have started for PR 1502. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16898/consoleFull

OpenJDK is GPLv2, not compatible. This Timsort is available under Apache v2 from the Android repo. Not to be confused with the identical code in OpenJDK7 which is under GPLv2. Go figure.

aarondav · 2014-07-21T06:50:14Z

In light of that minor issue, I have ported an Apache v2 Timsort (from the Android repos). It's a bit longer, but far more performant (roughly twice as fast on 25 million elements!)

SparkQA · 2014-07-21T06:53:41Z

QA tests have started for PR 1502. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16902/consoleFull

mateiz · 2014-07-21T07:02:32Z

Hey Aaron, make sure you add a note in the LICENSE file saying this part of the code is from Android (similar to the other notes there).

mateiz · 2014-07-21T07:08:04Z

BTW the API looks good to me! Actually it might allow more efficient ways to keep track of the partition ID for each key in my case too.

SparkQA · 2014-07-21T07:23:11Z

QA tests have started for PR 1502. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16903/consoleFull

SparkQA · 2014-07-21T07:33:39Z

QA tests have started for PR 1502. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16905/consoleFull

aarondav · 2014-07-21T07:50:05Z

License updated, as well as performance numbers. Our memory overhead is now very low on certain workloads (4*N bytes worst case scenario, but experimental results have shown remarkably little scratch space allocated), and our performance is now 2-4 times better than the previous implementation.

SparkQA · 2014-07-21T07:52:34Z

QA tests have started for PR 1502. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16909/consoleFull

SparkQA · 2014-07-21T08:18:45Z

QA results for PR 1502:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
* This trait extends Any to ensure it is universal (and thus compiled to a Java interface).
class KVArraySortDataFormat[K, T <: AnyRef : ClassTag] extends SortDataFormat[K, Array[T]] {
class Sorter<K, Buffer> {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16898/consoleFull

SparkQA · 2014-07-21T08:45:21Z

QA results for PR 1502:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
* This trait extends Any to ensure it is universal (and thus compiled to a Java interface).
class KVArraySortDataFormat[K, T <: AnyRef : ClassTag] extends SortDataFormat[K, Array[T]] {
class Sorter<K, Buffer> {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16902/consoleFull

SparkQA · 2014-07-21T09:01:28Z

QA results for PR 1502:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
* This trait extends Any to ensure it is universal (and thus compiled to a Java interface).
class KVArraySortDataFormat[K, T <: AnyRef : ClassTag] extends SortDataFormat[K, Array[T]] {
class Sorter<K, Buffer> {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16903/consoleFull

SparkQA · 2014-07-21T09:21:35Z

QA results for PR 1502:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
* This trait extends Any to ensure it is universal (and thus compiled to a Java interface).
class KVArraySortDataFormat[K, T <: AnyRef : ClassTag] extends SortDataFormat[K, Array[T]] {
class Sorter<K, Buffer> {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16905/consoleFull

SparkQA · 2014-07-21T09:32:06Z

QA results for PR 1502:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
* This trait extends Any to ensure it is universal (and thus compiled to a Java interface).
class KVArraySortDataFormat[K, T <: AnyRef : ClassTag] extends SortDataFormat[K, Array[T]] {
class Sorter<K, Buffer> {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16909/consoleFull

aarondav · 2014-07-21T17:01:56Z

(This PR passed Jenkins 3 times and then failed inside HiveContext -- it's probably OK. I submitted #1514 to fix the flakey test.)

pwendell · 2014-07-22T07:42:04Z

Jenkins, retest this please.

mateiz · 2014-07-22T07:45:38Z

core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala

@@ -252,7 +251,7 @@ class ExternalAppendOnlyMap[K, V, C](
      if (it.hasNext) {
        var kc = it.next()
        kcPairs += kc
-        val minHash = getKeyHashCode(kc)
+        val minHash = hashKey(kc)


Just curious, is this more efficient than calling ExternalAppendOnlyMap.hash directly as we did before? It was kind of weird that we were doing it in an inner class, maybe it created another pointer dereference.

aarondav · 2014-07-22

This isn't for efficiency, it's actually for type safety. I changed this because when I changed the meaning of getKeyHashCode to just hash(whatever), these calls were not compile errors, though they no longer did the right thing. Now we actually ensure you pass in a (K, C) pair rather than just a Tuple2 (or, now, any object).

mateiz · 2014-07-22

Ah, got it, that makes a lot of sense. I actually ran into this problem before (calling == or hashCode on the (K, C) pair instead of the K).
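
The pitfall discussed in this exchange can be sketched as follows. This is an illustration rather than the code in ExternalAppendOnlyMap: the class and method bodies are hypothetical, and java.util.Map.Entry stands in for the Scala (K, C) pair. A helper that accepts any object will silently hash the whole pair, whereas a helper typed to take a pair can only be called with one and always hashes the key.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

/** Illustrative only: Map.Entry stands in for the Scala (K, C) pair. */
final class HashKeySketch {

    /** Accepts anything, so passing a whole pair by mistake still compiles. */
    static int hash(Object o) {
        return o == null ? 0 : o.hashCode();
    }

    /** Accepts only a pair and always hashes its key; hashKey("spark") would not compile. */
    static <K, C> int hashKey(Map.Entry<K, C> kc) {
        return hash(kc.getKey());
    }

    public static void main(String[] args) {
        Map.Entry<String, Integer> kc = new SimpleEntry<>("spark", 7);
        System.out.println(hashKey(kc) == "spark".hashCode()); // hash of the key
        System.out.println(hash(kc) == "spark".hashCode());    // hash of the pair, not the key
    }
}
```

The design choice is to push the "which part do I hash?" decision into the method signature, so that a later change to the untyped helper cannot silently alter what callers compute.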

SparkQA · 2014-07-22T07:48:37Z

QA tests have started for PR 1502. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16953/consoleFull

mateiz · 2014-07-22T08:04:30Z

This looks good to me other than the question above. This will be pretty cool for sorting multiple columnar data structures.

mateiz · 2014-07-22T08:29:44Z

LICENSE

@@ -483,6 +483,24 @@ SUCH DAMAGE.


 ========================================================================
+For Timsort:


Say which source file this is in (core/src/main/java/...)

SparkQA · 2014-07-22T09:40:57Z

QA results for PR 1502:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
* This trait extends Any to ensure it is universal (and thus compiled to a Java interface).
class KVArraySortDataFormat[K, T <: AnyRef : ClassTag] extends SortDataFormat[K, Array[T]] {
class Sorter<K, Buffer> {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16953/consoleFull

SparkQA · 2014-07-22T17:08:19Z

QA tests have started for PR 1502. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16974/consoleFull

SparkQA · 2014-07-22T18:48:00Z

QA results for PR 1502:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class Sorter<K, Buffer> {
* This trait extends Any to ensure it is universal (and thus compiled to a Java interface).
class KVArraySortDataFormat[K, T <: AnyRef : ClassTag] extends SortDataFormat[K, Array[T]] {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16974/consoleFull

mateiz · 2014-07-22T18:57:03Z

Looks like this actually passed unit tests but there's a binary compatibility thing introduced by a previous commit to MLlib.

mateiz · 2014-07-22T18:58:19Z

Actually Xiangrui said he fixed that, so I'm going to merge this.

Author: Aaron Davidson <aaron@databricks.com>

Closes apache#1502 from aarondav/sort and squashes the following commits:

652d936 [Aaron Davidson] Update license, move Sorter to java src
a7b5b1c [Aaron Davidson] fix licenses
5c0efaf [Aaron Davidson] Update tmpLength
ec395c8 [Aaron Davidson] Ignore benchmark (again) and fix docs
034bf10 [Aaron Davidson] Change to Apache v2 Timsort
b97296c [Aaron Davidson] Don't try to run benchmark on Jenkins + private[spark]
6307338 [Aaron Davidson] SPARK-2047: Introduce an in-mem Sorter, and use it to reduce mem usage

Don't try to run benchmark on Jenkins + private[spark]

b97296c

Change to Apache v2 Timsort

034bf10

Ignore benchmark (again) and fix docs

ec395c8

Update tmpLength

5c0efaf

fix licenses

a7b5b1c

mateiz reviewed Jul 22, 2014

Update license, move Sorter to java src

652d936

asfgit closed this in 85d3596 Jul 22, 2014
