SPARK-1321 Use Guava's top k implementation rather than our BoundedPriorityQueue based implementation #229

rxin · 2014-03-25T21:58:37Z

Also updated the documentation for top and takeOrdered.

On my simple test of sorting 100 million (Int, Int) tuples using Spark, Guava's top k implementation (in Ordering) is much faster than the BoundedPriorityQueue implementation for roughly sorted input (10 - 20X faster), and still faster for purely random input (2 - 5X).
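For readers unfamiliar with the approach being replaced, here is a minimal sketch of a bounded-heap top k in plain Scala. It is not Spark's actual `BoundedPriorityQueue` code; the object name `BoundedHeapTopK` and the method shape are assumptions made for illustration, and it uses only the standard library's `mutable.PriorityQueue`.

```scala
import scala.collection.mutable

// Illustrative only: a stand-in for the bounded-priority-queue strategy,
// not the implementation that lives in Spark.
object BoundedHeapTopK {
  // Returns the `num` smallest elements of `input` in ascending order.
  def takeOrdered[T](input: Iterator[T], num: Int)(implicit ord: Ordering[T]): List[T] = {
    if (num <= 0) {
      Nil
    } else {
      // mutable.PriorityQueue keeps the largest element (under `ord`) at its head,
      // so the head is always the worst of the candidates retained so far.
      val heap = mutable.PriorityQueue.empty[T](ord)
      input.foreach { x =>
        if (heap.size < num) {
          heap.enqueue(x)
        } else if (ord.lt(x, heap.head)) {
          // x beats the current worst candidate: evict it and keep x instead.
          heap.dequeue()
          heap.enqueue(x)
        }
      }
      // Dequeue yields largest first; prepending restores ascending order.
      var result = List.empty[T]
      while (heap.nonEmpty) result = heap.dequeue() :: result
      result
    }
  }
}
```

Each element costs at most one heap eviction and insertion, so the heap never holds more than `num` items. The PR swaps this style of selection for Guava's `Ordering.leastOf`, which the benchmark above found faster on both input shapes.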

AmplabJenkins · 2014-03-25T22:11:39Z

Merged build triggered.

AmplabJenkins · 2014-03-25T22:11:39Z

Merged build started.

AmplabJenkins · 2014-03-25T23:11:30Z

Merged build finished.

AmplabJenkins · 2014-03-25T23:11:30Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13445/

pwendell · 2014-03-26T04:25:28Z

core/src/main/scala/org/apache/spark/util/collection/Utils.scala

+      override def compare(l: T, r: T) = ord.compare(l, r)
+    }
+    collectionAsScalaIterable(
+    ordering.leastOf(asJavaIterator(input), num)).iterator


Is this proper indentation? Should this be indented from the previous line?

pwendell · 2014-03-26T04:25:37Z

Looks good to me. One small style comment.

rxin · 2014-03-26T04:27:17Z

weird i missed that. fixed.

aarondav · 2014-03-26T04:43:20Z

core/src/main/scala/org/apache/spark/util/collection/Utils.scala

+    val ordering = new GuavaOrdering[T] {
+      override def compare(l: T, r: T) = ord.compare(l, r)
+    }
+    collectionAsScalaIterable(ordering.leastOf(asJavaIterator(input), num)).iterator


Why can't this just be
ordering.leastOf(input, num).iterator
?

rxin · 2014-03-26

It could, but I'm trying to be more explicit here. I'm really not a fan of implicits, especially in critical paths.
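The trade-off in this thread can be sketched as follows. To keep the example runnable without Guava on the classpath, `leastOf` below is a hypothetical stand-in shaped like Guava's `Ordering.leastOf(Iterator, int)`; it simply sorts and is not Guava's algorithm. The sketch also uses the `asJava`/`asScala` decorators from `scala.collection.JavaConverters` (deprecated on Scala 2.13 and later, but still available) rather than the `asJavaIterator` and `collectionAsScalaIterable` functions used in the diff. The point is only that both conversions are visible at the call site.

```scala
import java.util.{ArrayList => JArrayList, Comparator, Iterator => JIterator, List => JList}
import scala.collection.JavaConverters._

object ExplicitConversions {
  // Hypothetical stand-in for a Java API shaped like Guava's
  // Ordering.leastOf(Iterator, int). It sorts everything and keeps a prefix,
  // which is NOT how Guava does it; it exists only to make the sketch runnable.
  def leastOf[T](elements: JIterator[T], k: Int, cmp: Comparator[T]): JList[T] = {
    val all = new JArrayList[T]()
    while (elements.hasNext) all.add(elements.next())
    all.sort(cmp)
    val n = math.min(math.max(k, 0), all.size)
    new JArrayList[T](all.subList(0, n))
  }

  def takeOrdered[T](input: Iterator[T], num: Int)(implicit ord: Ordering[T]): Iterator[T] = {
    // Scala -> Java conversion, spelled out rather than applied implicitly.
    val javaInput: JIterator[T] = input.asJava
    // scala.math.Ordering extends java.util.Comparator, so `ord` can be passed directly.
    val smallest: JList[T] = leastOf(javaInput, num, ord)
    // Java -> Scala conversion, likewise explicit.
    smallest.asScala.iterator
  }
}
```

With implicit conversions in scope, the same body could shrink to a single call, at the cost of hiding where wrapper objects are created.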

AmplabJenkins · 2014-03-26T05:14:29Z

Merged build triggered.

AmplabJenkins · 2014-03-26T05:14:29Z

Merged build started.

AmplabJenkins · 2014-03-26T06:11:51Z

Merged build finished.

AmplabJenkins · 2014-03-26T06:11:51Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13456/

rxin · 2014-03-26T07:09:37Z

Ok I've merged this.

…iorityQueue based implementation

Also updated the documentation for top and takeOrdered.

On my simple test of sorting 100 million (Int, Int) tuples using Spark, Guava's top k implementation (in Ordering) is much faster than the BoundedPriorityQueue implementation for roughly sorted input (10 - 20X faster), and still faster for purely random input (2 - 5X).

Author: Reynold Xin <rxin@apache.org>

Closes apache#229 from rxin/takeOrdered and squashes the following commits:

0d11844 [Reynold Xin] Use Guava's top k implementation rather than our BoundedPriorityQueue based implementation. Also updated the documentation for top and takeOrdered.

rxin changed the title ~~Use Guava's top k implementation rather than our BoundedPriorityQueue based implementation~~ SPARK-1321 Use Guava's top k implementation rather than our BoundedPriorityQueue based implementation Mar 25, 2014

asfgit closed this in b859853 Mar 26, 2014

rxin deleted the takeOrdered branch June 24, 2014 00:23

futurely mentioned this pull request Apr 16, 2015

Google Guava MinMaxPriorityQueue is a faster bounded priority queue with excellent Javadoc MKLab-ITI/multimedia-indexing#3

Closed
