SPARK-1321 Use Guava's top k implementation rather than our BoundedPriorityQueue based implementation #229

Closed
wants to merge 1 commit into from

@rxin (Contributor) commented Mar 25, 2014

Also updated the documentation for top and takeOrdered.

On my simple test of sorting 100 million (Int, Int) tuples using Spark, Guava's top k implementation (in Ordering) is much faster than the BoundedPriorityQueue implementation for roughly sorted input (10 - 20X faster), and still faster for purely random input (2 - 5X).
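
For readers unfamiliar with the approach being replaced, a bounded priority queue top k can be sketched with only the Scala standard library. This is a minimal illustration under stated assumptions, not Spark's BoundedPriorityQueue and not Guava's algorithm; the names `TopKSketch` and `leastK` are hypothetical and do not appear in the PR.

```scala
import scala.collection.mutable

// Hypothetical sketch: keep the k smallest elements of an iterator using a
// max-heap that never grows beyond k entries.
object TopKSketch {
  def leastK[T](input: Iterator[T], k: Int)(implicit ord: Ordering[T]): Seq[T] = {
    if (k <= 0) {
      Seq.empty[T]
    } else {
      // PriorityQueue exposes the largest element under `ord` as its head,
      // i.e. the worst of the candidates kept so far.
      val heap = new mutable.PriorityQueue[T]()(ord)
      input.foreach { x =>
        if (heap.size < k) {
          heap.enqueue(x)
        } else if (ord.lt(x, heap.head)) {
          // x beats the current worst candidate, so swap it in.
          heap.dequeue()
          heap.enqueue(x)
        }
      }
      // Heap iteration order is unspecified, so sort the survivors.
      heap.toList.sorted(ord)
    }
  }
}
```

Calling `TopKSketch.leastK(Iterator(5, 1, 4, 2, 3), 2)` yields the two smallest values in ascending order; passing `Ordering[Int].reverse` as the ordering yields the two largest instead. Every element that beats the current worst candidate costs a heap removal and insertion, which is the per-element work the benchmark above compares against Guava's `Ordering.leastOf`.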

@rxin changed the title from "Use Guava's top k implementation rather than our BoundedPriorityQueue based implementation" to "SPARK-1321 Use Guava's top k implementation rather than our BoundedPriorityQueue based implementation" on Mar 25, 2014
@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13445/

override def compare(l: T, r: T) = ord.compare(l, r)
}
collectionAsScalaIterable(
ordering.leastOf(asJavaIterator(input), num)).iterator

Is this proper indentation? Should this be indented from the previous line?

@pwendell (Contributor)

Looks good to me. One small style comment.

@rxin (Contributor, Author) commented Mar 26, 2014

Weird, I missed that. Fixed.

val ordering = new GuavaOrdering[T] {
  override def compare(l: T, r: T) = ord.compare(l, r)
}
collectionAsScalaIterable(ordering.leastOf(asJavaIterator(input), num)).iterator

Why can't this just be
ordering.leastOf(input, num).iterator
?

@rxin (Contributor, Author)

It could, but I'm trying to be more explicit here. I'm really not a fan of implicits, especially in critical paths.
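
The trade-off in this exchange can be illustrated with a small standard-library sketch. It assumes a Java API that returns a `java.util.List`, standing in for Guava's `Ordering.leastOf`; `ConversionSketch`, `javaResult`, and `explicitIterator` are hypothetical names, and the sketch uses the `.asScala` decorator from `scala.collection.JavaConverters` rather than the conversion functions the PR calls.

```scala
import java.util.{Arrays, List => JList}
// Deprecated in Scala 2.13 in favour of scala.jdk.CollectionConverters,
// but still available; used here so the sketch also compiles on older versions.
import scala.collection.JavaConverters._

object ConversionSketch {
  // Hypothetical stand-in for a Java API returning a java.util.List.
  def javaResult(): JList[String] = Arrays.asList("a", "b", "c")

  // The Java-to-Scala boundary is visible at the call site: a reader can see
  // exactly where the java.util.List is wrapped as a Scala collection.
  def explicitIterator(): Iterator[String] = javaResult().asScala.iterator
}
```

With implicit conversions in scope, the shorter form suggested in the review compiles too, but the wrapping happens without any marker in the source, which is the concern raised about critical paths.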

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13456/

@rxin (Contributor, Author) commented Mar 26, 2014

OK, I've merged this.

@asfgit asfgit closed this in b859853 Mar 26, 2014
@rxin rxin deleted the takeOrdered branch June 24, 2014 00:23
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
…iorityQueue based implementation

Author: Reynold Xin <rxin@apache.org>

Closes apache#229 from rxin/takeOrdered and squashes the following commits:

0d11844 [Reynold Xin] Use Guava's top k implementation rather than our BoundedPriorityQueue based implementation. Also updated the documentation for top and takeOrdered.
jamesrgrinter pushed a commit to jamesrgrinter/spark that referenced this pull request Apr 22, 2018
Igosuki pushed a commit to Adikteev/spark that referenced this pull request Jul 31, 2018
* In hdfs soak test, updated hdfsclient image and specified hdfs service name.

* Made the HDFS service name configurable.

* Use the "app name" (begins with a slash) instead of service name.

* Simplified the "hdfs-kerberos-delete-terasort-files" job

* Updated service name and image version in 'hdfs-kerberos-delete-terasort-files' job

* Make the executors run as root (for centos)

* Templated the principal in the 'hdfs-kerberos-delete-terasort-files' job
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019
arjunshroff pushed a commit to arjunshroff/spark that referenced this pull request Nov 24, 2020
fishcus pushed a commit to fishcus/spark that referenced this pull request Nov 18, 2021
AL-1864 upgrade jackson-databind to 2.9.10.8
4 participants