-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SPARK-2047: Introduce an in-mem Sorter, and use it to reduce mem usage #1502
Conversation
Currently, the AppendOnlyMap performs an "in-place" sort by converting its array of [key, value, key, value] pairs into a an array of [(key, value), (key, value)] pairs. However, this causes us to allocate many Tuple2 objects, which come at a nontrivial overhead. This patch adds a Sorter API, intended for in memory sorts, which simply ports the Java OpenJDK 6 implementation of Arrays.sort() (which uses a merge sort) and abstracts the interface in a way which introduces no more than 1 virtual function invocation of overhead at each abstraction point. Please compare our port of the Java 6 sort with the original implementation: http://www.diffchecker.com/kh9ufcqo === Memory implications === An AppendOnlyMap contains N kv pairs, which results in roughly 2N elements within its underlying array. Each of these elements is 4 bytes wide in a [compressed OOP](https://wikis.oracle.com/display/HotSpotInternals/CompressedOops) system, which is the default. Today's approach immediately allocates N Tuple2 objects, which take up 24N bytes in total (exposed via YourKit), and undergoes a Java sort. The Java 6 version immediately copies the entire array (4N bytes here), while the Java 7 version has a worst-case allocation of half the array (2N bytes). This results in a sorting overhead of 24N + 4N = 28N bytes (for Java 6). The Sorter does not require allocating any tuples, but since it uses the Java 6 merge sort algorithm, it does copy the entire array (and that is the entire array, not just the half needed for Tuples). This results in a sorting overhead of 8N bytes. Thus, we have reduced the overhead of the sort by roughly 20 bytes times the number of elements. === Performance implications === As the destructiveSortedIterator is used for spilling in an ExternalAppendOnlyMap, the purpose of this patch is to provide stability by reducing memory usage rather than improve performance. Indeed, this PR implements Java 6's merge sort rather than the Java 7 Timsort, which is much more performant. A future optimization is to port the Timsort over, which the SortDataFormat API should support with minimal changes. Nevertheless, here are the results of a microbenchmark that sorted 25 million, randomly distributed (Float, Int) pairs. The Java Arrays.sort() tests were run **only on the keys**, and thus moved less data. Our current implementation is called "Tuple-sort using Arrays.sort()". <table> <tr><th>Java version</th><th>Test</th><th>First run</th><th>Average of 10</th></tr> <tr><td>6</td><td>primitive Arrays.sort()</td><td>3216 ms</td><td>1190 ms</td></tr> <tr><td>6</td><td>Arrays.sort()</td><td>18564 ms</td><td>2006 ms</td></tr> <tr><td>6</td><td>Tuple-sort using Arrays.sort()</td><td>31813 ms</td><td>3550 ms</td></tr> <tr><td>7</td><td>primitive Arrays.sort()</td><td>2724 ms</td><td>131 ms (!!)</td></tr> <tr><td>7</td><td>Arrays.sort()</td><td>13201 ms</td><td>878 ms</td></tr> <tr><td>7</td><td>Tuple-sort using Arrays.sort()</td><td>20990 ms</td><td>1919 ms</td></tr> <tr><td>7</td><td>**KV-sort using Sorter**</td><td>**18232 ms**</td><td>**2030 ms**</td></tr> <tr><td>7</td><td>Microbenchmarks are stupid</td><td>25708 ms</td><td>2400 ms</td></tr> </table> Note that the final test was the same as KV-sort using Sorter, but with a second impelementation of SortDataFormat loaded in the JVM (presumably causing a de-opt for the virtual function call shortcircuit). The results show that this Sorter performs exactly as expected -- it is about as fast as the Java 6 Arrays.sort() (which shares the same algorithm), but is significantly faster than the Tuple-sort on Java 6. The Java 7 Timsort provided a huge speedup in this benchmark, suggesting that using it instead would result in roughly a 2x speedup. However, the Tuple-based approach is still not significantly faster despite the much better algorithm. In short, this patch should significantly improve performance on users running Java 6, and provide a minor performance degradation for users running Java 7.
QA tests have started for PR 1502. This patch merges cleanly. |
Cool. What about P^3 sort? :) |
QA results for PR 1502: |
QA tests have started for PR 1502. This patch merges cleanly. |
QA results for PR 1502: |
The "added public classes" are private[spark] and package-private to org.apache.spark.util, repsectively. Also @mateiz Please take a brief look at the API to make sure it's still suitable for your SizeTrackingCollection's destructiveSortedIterator. The only major change to that API was that the Comparator now only takes the key instead of a (K, C) pair. |
I spoke with @aarondav, but I'm not sure we can borrow this code from Java if it is LGPL licensed. |
He did it! |
QA tests have started for PR 1502. This patch merges cleanly. |
OpenJDK is GPLv2, not compatible. This Timsort is available under Apache v2 from the Android repo. Not to be confused with the identical code in OpenJDK7 which is under GPLv2. Go figure.
In light of that minor issue, I have ported an Apache v2 Timsort (from the Android repos). It's a bit longer, but far more performant (roughly twice as fast on 25 million elements!) |
QA tests have started for PR 1502. This patch merges cleanly. |
Hey Aaron, make sure you add a note in the LICENSE file saying this part of the code is from Android (similar to the other notes there). |
BTW the API looks good to me! Actually it might allow more efficient ways to keep track of the partition ID for each key in my case too. |
QA tests have started for PR 1502. This patch merges cleanly. |
QA tests have started for PR 1502. This patch merges cleanly. |
License updated, as well as performance numbers. Our memory overhead is now very low on certain workloads (4*N bytes worst case scenario, but experimental results have shown remarkably little scratch space allocated), and our performance is now 2-4 times better than the previous implementation. |
QA tests have started for PR 1502. This patch merges cleanly. |
QA results for PR 1502: |
QA results for PR 1502: |
QA results for PR 1502: |
QA results for PR 1502: |
QA results for PR 1502: |
(This PR passed Jenkins 3 times and then failed inside HiveContext -- it's probably OK. I submitted #1514 to fix the flakey test.) |
Jenkins, retest this please. |
@@ -252,7 +251,7 @@ class ExternalAppendOnlyMap[K, V, C]( | |||
if (it.hasNext) { | |||
var kc = it.next() | |||
kcPairs += kc | |||
val minHash = getKeyHashCode(kc) | |||
val minHash = hashKey(kc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious, is this more efficient than calling ExternalAppendOnlyMap.hash directly as we did before? It was kind of weird that we were doing it in an inner class, maybe it created another pointer dereference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't for efficiency, it's actually for type safety. I changed this because when I changed the meaning of getKeyHashCode to just hash(whatever), these calls were not compile errors, though they no longer did the right thing. Now we actually ensure you pass in a (K, C) pair rather than just a Tuple2 (or, now, any object).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, got it, that makes a lot of sense. I actually ran into this problem before (calling == or hashCode on the (K, C) pair instead of the K).
QA tests have started for PR 1502. This patch merges cleanly. |
This looks good to me other than the question above. This will be pretty cool for sorting multiple columnar data structures. |
@@ -483,6 +483,24 @@ SUCH DAMAGE. | |||
|
|||
|
|||
======================================================================== | |||
For Timsort: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Say which source file this is in (core/src/main/java/...)
QA results for PR 1502: |
QA tests have started for PR 1502. This patch merges cleanly. |
QA results for PR 1502: |
Looks like this actually passed unit tests but there's a binary compatibility thing introduced by a previous commit to MLlib. |
Actually Xiangrui said he fixed that, so I'm going to merge this. |
### Why and what? Currently, the AppendOnlyMap performs an "in-place" sort by converting its array of [key, value, key, value] pairs into a an array of [(key, value), (key, value)] pairs. However, this causes us to allocate many Tuple2 objects, which come at a nontrivial overhead. This patch adds a Sorter API, intended for in memory sorts, which simply ports the Android Timsort implementation (available under Apache v2) and abstracts the interface in a way which introduces no more than 1 virtual function invocation of overhead at each abstraction point. Please compare our port of the Android Timsort sort with the original implementation: http://www.diffchecker.com/wiwrykcl ### Memory implications An AppendOnlyMap contains N kv pairs, which results in roughly 2N elements within its underlying array. Each of these elements is 4 bytes wide in a [compressed OOPS](https://wikis.oracle.com/display/HotSpotInternals/CompressedOops) system, which is the default. Today's approach immediately allocates N Tuple2 objects, which take up 24N bytes in total (exposed via YourKit), and undergoes a Java sort. The Java 6 version immediately copies the entire array (4N bytes here), while the Java 7 version has a worst-case allocation of half the array (2N bytes). This results in a worst-case sorting overhead of 24N + 2N = 26N bytes (for Java 7). The Sorter does not require allocating any tuples, but since it uses Timsort, it may copy up to half the entire array in the worst case. This results in a worst-case sorting overhead of 4N bytes. Thus, we have reduced the worst-case overhead of the sort by roughly 22 bytes times the number of elements. ### Performance implications As the destructiveSortedIterator is used for spilling in an ExternalAppendOnlyMap, the purpose of this patch is to provide stability by reducing memory usage rather than improve performance. However, because it implements Timsort, it also brings a substantial performance boost over our prior implementation. Here are the results of a microbenchmark that sorted 25 million, randomly distributed (Float, Int) pairs. The Java Arrays.sort() tests were run **only on the keys**, and thus moved less data. Our current implementation is called "Tuple-sort using Arrays.sort()" while the new implementation is "KV-array using Sorter". <table> <tr><th>Test</th><th>First run (JDK6)</th><th>Average of 10 (JDK6)</th><th>First run (JDK7)</th><th>Average of 10 (JDK7)</th></tr> <tr><td>primitive Arrays.sort()</td><td>3216 ms</td><td>1190 ms</td><td>2724 ms</td><td>131 ms (!!)</td></tr> <tr><td>Arrays.sort()</td><td>18564 ms</td><td>2006 ms</td><td>13201 ms</td><td>878 ms</td></tr> <tr><td>Tuple-sort using Arrays.sort()</td><td>31813 ms</td><td>3550 ms</td><td>20990 ms</td><td>1919 ms</td></tr> <tr><td><b>KV-array using Sorter</b></td><td></td><td></td><td><b>15020 ms</b></td><td><b>834 ms</b></td></tr> </table> The results show that this Sorter performs exactly as expected (after the first run) -- it is as fast as the Java 7 Arrays.sort() (which shares the same algorithm), but is significantly faster than the Tuple-sort on Java 6 or 7. In short, this patch should significantly improve performance for users running either Java 6 or 7. Author: Aaron Davidson <aaron@databricks.com> Closes apache#1502 from aarondav/sort and squashes the following commits: 652d936 [Aaron Davidson] Update license, move Sorter to java src a7b5b1c [Aaron Davidson] fix licenses 5c0efaf [Aaron Davidson] Update tmpLength ec395c8 [Aaron Davidson] Ignore benchmark (again) and fix docs 034bf10 [Aaron Davidson] Change to Apache v2 Timsort b97296c [Aaron Davidson] Don't try to run benchmark on Jenkins + private[spark] 6307338 [Aaron Davidson] SPARK-2047: Introduce an in-mem Sorter, and use it to reduce mem usage
Why and what?
Currently, the AppendOnlyMap performs an "in-place" sort by converting its array of [key, value, key, value] pairs into a an array of [(key, value), (key, value)] pairs. However, this causes us to allocate many Tuple2 objects, which come at a nontrivial overhead.
This patch adds a Sorter API, intended for in memory sorts, which simply ports the Android Timsort implementation (available under Apache v2) and abstracts the interface in a way which introduces no more than 1 virtual function invocation of overhead at each abstraction point.
Please compare our port of the Android Timsort sort with the original implementation: http://www.diffchecker.com/wiwrykcl
Memory implications
An AppendOnlyMap contains N kv pairs, which results in roughly 2N elements within its underlying array. Each of these elements is 4 bytes wide in a compressed OOPS system, which is the default.
Today's approach immediately allocates N Tuple2 objects, which take up 24N bytes in total (exposed via YourKit), and undergoes a Java sort. The Java 6 version immediately copies the entire array (4N bytes here), while the Java 7 version has a worst-case allocation of half the array (2N bytes).
This results in a worst-case sorting overhead of 24N + 2N = 26N bytes (for Java 7).
The Sorter does not require allocating any tuples, but since it uses Timsort, it may copy up to half the entire array in the worst case.
This results in a worst-case sorting overhead of 4N bytes.
Thus, we have reduced the worst-case overhead of the sort by roughly 22 bytes times the number of elements.
Performance implications
As the destructiveSortedIterator is used for spilling in an ExternalAppendOnlyMap, the purpose of this patch is to provide stability by reducing memory usage rather than improve performance. However, because it implements Timsort, it also brings a substantial performance boost over our prior implementation.
Here are the results of a microbenchmark that sorted 25 million, randomly distributed (Float, Int) pairs. The Java Arrays.sort() tests were run only on the keys, and thus moved less data. Our current implementation is called "Tuple-sort using Arrays.sort()" while the new implementation is "KV-array using Sorter".
The results show that this Sorter performs exactly as expected (after the first run) -- it is as fast as the Java 7 Arrays.sort() (which shares the same algorithm), but is significantly faster than the Tuple-sort on Java 6 or 7.
In short, this patch should significantly improve performance for users running either Java 6 or 7.