Add method to return the `k` most common items and speed up `most_common_*()` methods #25

clint-white · 2022-07-09T21:53:48Z

Summary

This PR proposes the following changes to improve the flexibility and efficiency of Counter's API for obtaining the most common items in the counter.

Add dedicated method k_most_common_ordered() to return a vector of the k most common items.
Switch from a stable to an unstable sorting algorithm in most_common_tiebreaker().

Motivation

At present Counter offers no way for a user to specifically ask for the k most common items, where k may be less than the length of the counter n. While it is easy enough to do something like the following, such code would be quite inefficient if k is small compared to n:

let mut most_common = counter.most_common();
most_common.truncate(k);

The above code sorts the entire list of n items, which has complexity O(n * log n), when all we need to do is to select the top k most common items and then sort just those k items. Selecting just the top k items can be accomplished efficiently by using a binary heap. The resulting algorithm can be much more efficient than sorting all n items. For a fixed value of k, this algorithm scales with increasing n as n + O(log n), instead of O(n * log n) when sorting all n items. For the extreme case of k = 1, this requires only n - 1 comparisons. In the limit as k approaches n, this algorithm approaches a heapsort of the n items.

This is precisely what Python's implementation in collections.Counter.most_common() does, calling heapq.nlargest(). For a more detailed analysis of the complexity of the algorithm, see these notes in the Python source code. The implementation here is fairly similar.

For counter-rs there is an additional advantage to providing a dedicated API for the k most common items. Since the existing most_common_*() functions return owned values of the key type T and therefore have to clone the keys, a method for obtaining only the top k most common items could get away with cloning only k keys instead of all n.

It would be nice to add a method which implements this more efficient algorithm when only the top k most common items are desired. A check of some dependents of counter-rs shows 2 of the 9 listed crates using Counter::most_common() for the purpose of finding the single most common item: see examples here and here.

Finally, most_common_tiebreaker() uses the stable slice::sort_by() algorithm (based on timsort) instead of slice::sort_unstable_by() (based on quicksort). There is a definite performance improvement to be had by switching to the unstable quicksort algorithm. This is recommended in the documentation in cases where the stability guarantee is not needed. In the case of Counter, there is no reason to preserve the relative pre-sort order of items which compare as equal, since this order is already unspecified due to the hashing function.

Benchmarks

I made some simple benchmarks to compare the times of most_common_ordered() and k_most_common_ordered() for various values of k, as well as to compare slice::sort_by() and slice::sort_unstable_by() in most_common_ordered() with counters of various lengths.

Some of the benchmarks use counters where the counts were just 0..n, so all of the counts are distinct and there are no ties. These cases used two different key types, usize and String, so that we test both with a Copy type and a type where there is a nontrivial cost to cloning the keys. The strings keys all had length 16.

Another set of benchmarks used a more realistic example by making counts of the frequencies of all the words appearing in The Complete Works of William Shakespeare. (Here "words" are continguous ASCII alphabetic characters seperated by non-alphabetic characters.) There are a total of 988,852 words, of which 25,469 are distinct. There are of course many ties, that is words having the same counts, and ties are much more common among low frequency words than among high frequency words. This situation is another advantage for k_most_common_ordered(): it has to break ties less often than does most_common_ordered() and tiebreaking is a more expensive comparision operation. This set of benchmarks used counters with both borrowed (&str) and owned (String) keys.

All times are in units of microseconds. All counts are of type usize.

stable vs unstable sorting

Comparison of Counter::<usize>::most_common_ordered() when using slice::sort_by() and slice::sort_unstable_by() for counters of size n:

`n`	stable	unstable	% change
100	2.8533	2.1542	-25.070
1000	42.303	25.604	-40.206
10000	559.32	294.52	-48.002
100000	6668.8	3426.1	-48.638

Similar tests except the keys are of type String and all keys have length 16:

`n`	stable	unstable	% change
100	5.1853	4.8483	-11.408
1000	64.252	58.435	-11.118
10000	745.59	657.32	-16.144
100000	9100.4	6965.4	-23.460

Comparison using the counters containing the frequencies of words in Shakespeare's complete works:

`T`	stable	unstable	% change
`&str`	4889.7	2814.5	-42.393
`String`	5489.4	4083.6	-25.610

k most common items

These benchmarks use counters of size 10000 with usize and String keys. We compare the times of k_most_common_ordered() for various values of k against those of most_common_ordered() when using both the stable and unstable sorts.

`T`	1	10	100	1000	5000	stable	unstable
`usize`	31.341	32.882	45.239	136.90	366.11	559.32	294.52
`String`	20.259	24.331	37.139	153.75	473.45	745.59	657.32

Similar comparisons using counters of the frequencies of words in Shakespeare's complete works (25469 keys):

`T`	1	10	100	1000	10000	stable	unstable
`&str`	44.445	45.209	64.230	436.42	3929.9	4889.7	2814.5
`String`	44.347	45.695	66.491	475.45	4146.6	5489.4	4083.6

Functionally this method is equivalent to the following: let mut items = counter.most_common_ordered(); items.truncate(k); items However, by using a binary heap to keep only the `k` most common items, the implementation can be more efficient, often much more so, than sorting all the items.

There is no reason to preserve the relative order of items that compare equal, since that order is already unspecified due to the hashing used by the `HashMap`. The unstable sorting algorithm (quicksort) is typically faster than the stable sorting algorithm (timsort) and does not require the temporary storage used by the latter. Benchmarks comparing `Counter::most_common_ordered()` and `Counter::k_most_common_ordered(n)`, where `n` is the length of the counter, found that the heapsort used by `k_most_common_ordered()` was typically faster than the stable sort used by `most_common_ordered()`. However, this is no longer the case when switching to the unstable sort; the quicksort implementation is typically a little faster than the heapsort, so now it is better for `k_most_common_ordered()` to call `most_common_ordered()` with `k >= n`.

coriolinus · 2022-07-09T22:21:26Z

Wow! I don't have time right now to look at this, but I will attempt to get a proper review in within one week. If I bust that deadline, please feel free to drop a note here to remind me.

I hadn't previously used BinaryHeap::peek_mut, and wasn't sure about its semantics when changing the value, so I wanted to ensure that we really weren't going to lose any data. With this test, I am pretty confident in this implementation.

coriolinus

This is really good work! Thank you for contributing it, and in particular for the thorough documentation.

coriolinus · 2022-07-16T10:36:35Z

src/lib.rs

+
+        // Step 3. Sort the items in the heap with the second phases of heapsort.  The number of
+        // comparisons is 2 * k * log2(k) + O(k).
+        heap.into_sorted_vec()


Too bad about rust-lang/rust#76234, huh?

Yeah, exactly. That would be useful here.

Incidentally, I orginally used .into_vec() followed by slice::sort_unstable() instead of .into_sorted_vec() as suggested in the discussion on that PR:

let mut topk = heap.into_vec(); topk.sort_unstable();

In my benchmarks, for small heaps .into_sorted_vec() is just slightly faster, but as the heap size grows it becomes faster to take the vector from the heap and use slice::sort_unstable(). I was on the fence for a while about what to do here. In the end I thought it was a little more direct to just use .into_sorted_vec(). Plus, then I realized that as k approaches n the algorithm becomes a heapsort. So if nothing else it is a little easier to think about what is happening, at least for me.

Anyway, it would certainly be an option to take the vector from the heap and sort it instead of finishing the heapsort in step 3 if you like that approach better. But as long as k is small compared to n it doesn't matter which way we do it, it is much better than sorting the entire counter.

clint-white · 2022-07-16T18:04:09Z

Thanks for reviewing it and for the feedback, as well as for adding the better test.

clint-white added 2 commits July 9, 2022 13:59

coriolinus approved these changes Jul 16, 2022

View reviewed changes

coriolinus merged commit dab9b14 into coriolinus:master Jul 16, 2022

clint-white deleted the k-most-common-ordered branch October 1, 2022 15:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add method to return the `k` most common items and speed up `most_common_*()` methods #25

Add method to return the `k` most common items and speed up `most_common_*()` methods #25

clint-white commented Jul 9, 2022 •

edited

Loading

coriolinus commented Jul 9, 2022

coriolinus left a comment

coriolinus Jul 16, 2022

clint-white Jul 16, 2022

clint-white commented Jul 16, 2022

Add method to return the k most common items and speed up most_common_*() methods #25

Add method to return the k most common items and speed up most_common_*() methods #25

Conversation

clint-white commented Jul 9, 2022 • edited Loading

Summary

Motivation

Benchmarks

stable vs unstable sorting

k most common items

coriolinus commented Jul 9, 2022

coriolinus left a comment

Choose a reason for hiding this comment

coriolinus Jul 16, 2022

Choose a reason for hiding this comment

clint-white Jul 16, 2022

Choose a reason for hiding this comment

clint-white commented Jul 16, 2022

Add method to return the `k` most common items and speed up `most_common_*()` methods #25

Add method to return the `k` most common items and speed up `most_common_*()` methods #25

clint-white commented Jul 9, 2022 •

edited

Loading