Add method to return the k most common items and speed up most_common_*() methods #25
Conversation
Functionally this method is equivalent to the following:

```rust
let mut items = counter.most_common_ordered();
items.truncate(k);
items
```

However, by using a binary heap to keep only the `k` most common items, the implementation can be more efficient, often much more so, than sorting all the items.
There is no reason to preserve the relative order of items that compare equal, since that order is already unspecified due to the hashing used by the `HashMap`. The unstable sorting algorithm (quicksort) is typically faster than the stable sorting algorithm (timsort) and does not require the temporary storage used by the latter.

Benchmarks comparing `Counter::most_common_ordered()` and `Counter::k_most_common_ordered(n)`, where `n` is the length of the counter, found that the heapsort used by `k_most_common_ordered()` was typically faster than the stable sort used by `most_common_ordered()`. However, this is no longer the case after switching to the unstable sort; the quicksort implementation is typically a little faster than the heapsort, so now it is better for `k_most_common_ordered()` to call `most_common_ordered()` when `k >= n`.
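To make the technique concrete, here is a minimal standalone sketch of heap-based top-k selection, assuming a plain `HashMap<String, usize>` of counts. The function name and the tie handling are illustrative only, not the crate's actual implementation (in particular, this sketch breaks ties arbitrarily rather than by key order):

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashMap};

// Illustrative sketch (not the crate's code): keep a min-heap of the k pairs
// with the highest counts, evicting the smallest candidate whenever a better
// one appears, then sort just those k survivors.
fn k_most_common(counts: &HashMap<String, usize>, k: usize) -> Vec<(String, usize)> {
    let mut heap: BinaryHeap<Reverse<(usize, String)>> = BinaryHeap::with_capacity(k);
    for (item, &count) in counts {
        if heap.len() < k {
            heap.push(Reverse((count, item.clone())));
        } else if let Some(mut root) = heap.peek_mut() {
            // The root of the min-heap is the smallest of the current top-k
            // candidates. Overwriting it through PeekMut sifts the new value
            // into place when the guard is dropped, so no candidate is lost.
            if count > (root.0).0 {
                *root = Reverse((count, item.clone()));
            }
        }
    }
    // Order the k survivors most-common-first. Ties are left arbitrary here.
    let mut topk: Vec<(String, usize)> = heap
        .into_iter()
        .map(|Reverse((count, item))| (item, count))
        .collect();
    topk.sort_unstable_by(|a, b| b.1.cmp(&a.1));
    topk
}
```

For fixed k this does one heap push or root replacement per item, so it scales roughly linearly in n, matching the analysis above.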
Wow! I don't have time right now to look at this, but I will attempt to get a proper review in within one week. If I bust that deadline, please feel free to drop a note here to remind me.
I hadn't previously used BinaryHeap::peek_mut, and wasn't sure about its semantics when changing the value, so I wanted to ensure that we really weren't going to lose any data. With this test, I am pretty confident in this implementation.
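For reference, the `PeekMut` semantics being tested can be shown in a tiny standalone example (the helper and its values are arbitrary, for illustration only):

```rust
use std::collections::BinaryHeap;

// Replace the maximum element of a max-heap in place via peek_mut().
// Assigning through the PeekMut guard overwrites the root; the heap restores
// its invariant (sifting the new value down) when the guard is dropped, so no
// element is lost.
fn replace_max(mut heap: BinaryHeap<i32>, new_value: i32) -> Vec<i32> {
    if let Some(mut top) = heap.peek_mut() {
        *top = new_value;
    }
    heap.into_sorted_vec() // ascending order
}
```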
This is really good work! Thank you for contributing it, and in particular for the thorough documentation.
```rust
// Step 3. Sort the items in the heap with the second phase of heapsort. The number of
// comparisons is 2 * k * log2(k) + O(k).
heap.into_sorted_vec()
```
Too bad about rust-lang/rust#76234, huh?
Yeah, exactly. That would be useful here.
Incidentally, I originally used `.into_vec()` followed by `slice::sort_unstable()` instead of `.into_sorted_vec()`, as suggested in the discussion on that PR:

```rust
let mut topk = heap.into_vec();
topk.sort_unstable();
```

In my benchmarks, for small heaps `.into_sorted_vec()` is just slightly faster, but as the heap size grows it becomes faster to take the vector from the heap and use `slice::sort_unstable()`. I was on the fence for a while about what to do here. In the end I thought it was a little more direct to just use `.into_sorted_vec()`. Plus, I then realized that as k approaches n the algorithm becomes a heapsort, so if nothing else it is a little easier to think about what is happening, at least for me.

Anyway, it would certainly be an option to take the vector from the heap and sort it instead of finishing the heapsort in step 3, if you like that approach better. But as long as k is small compared to n it doesn't matter which way we do it; either is much better than sorting the entire counter.
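As a self-contained version of that alternative (an illustrative helper over a heap of integers, not the crate's code):

```rust
use std::collections::BinaryHeap;

// Alternative to BinaryHeap::into_sorted_vec(): take the heap's backing
// vector (in heap order) and sort it with the unstable slice sort instead of
// finishing the heapsort.
fn heap_to_sorted(heap: BinaryHeap<i32>) -> Vec<i32> {
    let mut topk = heap.into_vec();
    topk.sort_unstable(); // ascending, same result as into_sorted_vec()
    topk
}
```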
Thanks for reviewing it and for the feedback, as well as for adding the better test.
Summary

This PR proposes the following changes to improve the flexibility and efficiency of `Counter`'s API for obtaining the most common items in the counter:

- Add a new method `k_most_common_ordered()` to return a vector of the k most common items.
- Switch `most_common_tiebreaker()` to the unstable sort, speeding up the `most_common_*()` methods.

Motivation
At present `Counter` offers no way for a user to specifically ask for the k most common items, where k may be less than the length of the counter, n. It is easy enough to call `most_common_ordered()` and truncate the result to the first k items, but that approach would be quite inefficient if k is small compared to n: it sorts the entire list of n items, which has complexity O(n * log n), when all we need to do is select the top k most common items and then sort just those k items.

Selecting just the top k items can be accomplished efficiently by using a binary heap. The resulting algorithm can be much more efficient than sorting all n items. For a fixed value of k, this algorithm scales with increasing n as n + O(log n), instead of O(n * log n) when sorting all n items. For the extreme case of k = 1, it requires only n - 1 comparisons. In the limit as k approaches n, the algorithm approaches a heapsort of the n items.
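The naive truncate-based approach can be sketched as a standalone function over plain (item, count) pairs. The function name here is illustrative, not the crate's API:

```rust
// Naive top-k: sort all n pairs most-common-first, then keep the first k.
// This is O(n log n) no matter how small k is.
fn top_k_naive(mut items: Vec<(&str, usize)>, k: usize) -> Vec<(&str, usize)> {
    // Sort by count descending, breaking ties by key ascending.
    items.sort_unstable_by(|(a_key, a_count), (b_key, b_count)| {
        b_count.cmp(a_count).then(a_key.cmp(b_key))
    });
    items.truncate(k);
    items
}
```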
This is precisely what Python's implementation in `collections.Counter.most_common()` does, calling `heapq.nlargest()`. For a more detailed analysis of the complexity of the algorithm, see these notes in the Python source code. The implementation here is fairly similar.

For `counter-rs` there is an additional advantage to providing a dedicated API for the k most common items. Since the existing `most_common_*()` functions return owned values of the key type `T` and therefore have to clone the keys, a method for obtaining only the top k most common items could get away with cloning only k keys instead of all n.

It would be nice to add a method which implements this more efficient algorithm when only the top k most common items are desired. A check of some dependents of `counter-rs` shows 2 of the 9 listed crates using `Counter::most_common()` for the purpose of finding the single most common item: see examples here and here.

Finally, `most_common_tiebreaker()` uses the stable `slice::sort_by()` algorithm (based on timsort) instead of `slice::sort_unstable_by()` (based on quicksort). There is a definite performance improvement to be had by switching to the unstable quicksort algorithm. This is recommended in the documentation in cases where the stability guarantee is not needed. In the case of `Counter`, there is no reason to preserve the relative pre-sort order of items which compare as equal, since this order is already unspecified due to the hashing function.

Benchmarks
I made some simple benchmarks to compare the times of `most_common_ordered()` and `k_most_common_ordered()` for various values of `k`, as well as to compare `slice::sort_by()` and `slice::sort_unstable_by()` in `most_common_ordered()` with counters of various lengths.

Some of the benchmarks use counters where the counts were just `0..n`, so all of the counts are distinct and there are no ties. These cases used two different key types, `usize` and `String`, so that we test both with a `Copy` type and a type where there is a nontrivial cost to cloning the keys. The string keys all had length 16.

Another set of benchmarks used a more realistic example by making counts of the frequencies of all the words appearing in The Complete Works of William Shakespeare. (Here "words" are contiguous ASCII alphabetic characters separated by non-alphabetic characters.) There are a total of 988,852 words, of which 25,469 are distinct. There are of course many ties, that is, words having the same counts, and ties are much more common among low-frequency words than among high-frequency words. This situation is another advantage for `k_most_common_ordered()`: it has to break ties less often than does `most_common_ordered()`, and tiebreaking is a more expensive comparison operation. This set of benchmarks used counters with both borrowed (`&str`) and owned (`String`) keys.

All times are in units of microseconds. All counts are of type `usize`.

stable vs unstable sorting
Comparison of `Counter::<usize>::most_common_ordered()` when using `slice::sort_by()` and `slice::sort_unstable_by()` for counters of size `n`:

[benchmark table not captured in this extract]
Similar tests except the keys are of type `String` and all keys have length 16:

[benchmark table not captured in this extract]
Comparison using the counters containing the frequencies of words in Shakespeare's complete works, with key type `T` of `&str` and `String`:

[benchmark table not captured in this extract]
k most common items
These benchmarks use counters of size 10000 with `usize` and `String` keys. We compare the times of `k_most_common_ordered()` for various values of `k` against those of `most_common_ordered()` when using both the stable and unstable sorts.

[benchmark tables not captured in this extract]
Similar comparisons using counters of the frequencies of words in Shakespeare's complete works (25,469 keys), with key type `T` of `&str` and `String`:

[benchmark tables not captured in this extract]