Improvements to CCHeap #457
Conversation
Committing to these complexities in the documentation is not a constraint on the representation of heaps, because they are achieved by every well-known representation (for some of them, in amortized time): https://en.wikipedia.org/wiki/Template:Heap_Running_Times

- `find_min`: O(1)
- `take`: O(log n)
- `insert`: O(log n)
- `merge`: O(log(m+n)) (except binary heaps, which only achieve O(m+n))
- `add_seq`: O(n log(m+n)) (trivially, by repeated insertion)
  + this can be improved to O(log(m) + n), regardless of the representation of heaps (to be done in a later commit)
- `of_seq`: O(n log n) (ditto: can be improved to O(n))

Less trivial:

- `filter`, `delete_{one,all}`:
  + O(n) can be achieved for any reasonable representation of heaps, by using `of_seq` and `to_seq` which, as said, can always be made O(n).
  + With the current implementation, it is not obvious, but the complexity of `filter` and `delete_all` is Θ(n log n); the complexity of `delete_one` is O(n). Indeed, node rebuilding with `_make_node` is in O(1), merging is in Θ(log n), and every element deletion induces one merge; there are heap instances that achieve the worst case Ω(n log n), for instance:

            x
           / \
          x   y
         / \
       ...   y
        /
       x
      / \
     h   y

    with n/3 occurrences of x, n/3 occurrences of y, a sub-heap h of n/3 elements, and when y is greater than all elements of h; then, deleting all occurrences of x performs the following computation:

        merge (merge (merge (merge h y) ...) y) y

    where each `merge` takes time Θ(log n).
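To make the Θ(n log n) argument concrete, here is a minimal leftist min-heap sketch (illustrative only, not CCHeap's actual code; the names `_make_node` and `merge` just mirror the ones mentioned above). A naive `delete_all` written this way performs one `merge` per deleted element, each costing Θ(log n) on instances like the one pictured:

```ocaml
(* Minimal leftist min-heap over ints -- an illustrative sketch, not CCHeap. *)
type t =
  | E
  | N of int * int * t * t  (* rank, value, left, right *)

let rank = function E -> 0 | N (r, _, _, _) -> r

(* O(1): rebuild a node, keeping the higher-rank child on the left. *)
let _make_node x a b =
  if rank a >= rank b then N (rank b + 1, x, a, b)
  else N (rank a + 1, x, b, a)

(* Θ(log(m+n)): merge two leftist heaps along their right spines. *)
let rec merge h1 h2 =
  match h1, h2 with
  | E, h | h, E -> h
  | N (_, x, l1, r1), N (_, y, l2, r2) ->
    if x <= y then _make_node x l1 (merge r1 h2)
    else _make_node y l2 (merge h1 r2)

(* Each deleted node is replaced by the merge of its two children, so every
   deletion induces one Θ(log n) merge: Θ(n log n) in the worst case. *)
let rec delete_all eq x = function
  | E -> E
  | N (_, y, l, r) ->
    let l' = delete_all eq x l and r' = delete_all eq x r in
    if eq x y then merge l' r' else _make_node y l' r'
```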
- for `delete_all` this is a bugfix (physical equality was documented but not implemented)
- `delete_one` is unchanged, it already had complexity O(n) and ensured physical equality
- label all tests
- decouple tests about different heap functions
- random instances now have better coverage of possible cases:
  + more variability in size (previously, some tests were limited to a fixed size)
  + high probability of duplicates (previously, the probability of duplicates was negligible, because elements were drawn uniformly from the full `int` range)
- the test for `of_list, take_exn` is now more precise (added a duplicate element)
- the test for `to_list_sorted` is now more precise (checks that the resulting list is what we want, instead of just checking that it is sorted)
- the test for `filter` is now more precise (also checks that no element has been spuriously dropped)
- more uniform style for easier reading, using `|>`
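As a small illustration of the points about random instances (a hypothetical helper using the stdlib `Random`, not the actual test code): drawing a variable length and sampling from a small range makes duplicates very likely, unlike sampling uniformly from the whole `int` range.

```ocaml
(* Hypothetical generator: variable-size lists with frequent duplicates. *)
let random_list_with_duplicates ?(max_len = 100) () =
  let len = Random.int (max_len + 1) in              (* variable size *)
  (* sampling from a range about half the length forces collisions *)
  List.init len (fun _ -> Random.int (1 + len / 2))
```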
Building a heap by repeated insertions has one advantage, though: when the input sequence is sorted, inserting its elements in reverse order builds a list-like heap whose elements are entirely sorted. It does run in O(n) rather than the worst-case O(n log n), and then taking elements from the heap runs in O(1) rather than O(log n), so converting to a sorted sequence runs in O(n) rather than O(n log n).

This does depend on the actual heap representation; it is true of leftist heaps, but also of pairing heaps and Brodal-Okasaki heaps, at least.

We can take advantage of this for building a heap with an efficient structure from an almost-sorted sequence, which I think is a fairly common situation. For one, it improves heap-sorting.

When the input is almost-but-not-entirely sorted, I'm not sure how well repeated insertions fare. However, we can combine this idea with the generic technique that I introduced earlier in this PR for building a heap: split the input sequence into already-sorted chunks, convert each chunk to an efficient list-like heap, then merge all heaps with the generic technique of this PR. Then, building a heap from any sequence is in O(n), and the resulting heap records every greater-than relation between successive elements of the input sequence. I haven't checked the math, but I think that converting the heap to a sorted sequence then runs in something like …

Implementing that, I just pushed a commit that exports new functions …
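A rough sketch of the chunking step described above, on plain lists and with hypothetical helper names (the actual commit may well differ): split the input into maximal non-decreasing runs; each run can then be turned into a fully sorted, list-like heap by reverse-order insertion, and the per-run heaps merged with the generic O(n) technique of this PR.

```ocaml
(* Split a list into maximal non-decreasing runs (hypothetical helper). *)
let sorted_runs (l : int list) : int list list =
  let rec go acc run = function
    | [] -> List.rev (List.rev run :: acc)
    | x :: rest ->
      (match run with
       | y :: _ when x < y ->
         (* the run ends here: flush it and start a new one with [x] *)
         go (List.rev run :: acc) [x] rest
       | _ -> go acc (x :: run) rest)
  in
  match l with
  | [] -> []
  | x :: rest -> go [] [x] rest

(* sorted_runs [1;3;5; 2;4; 0;6]  =  [[1;3;5]; [2;4]; [0;6]] *)
```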
hmm I think the …
Under the hood, it builds a list whose length is that of the longest sorted subsequence, which may be much less than the total length. But it is not a very strong argument, and I agree that, if we must choose, it sounds better for the interface to focus on the most idiomatic OCaml containers, i.e. lists. However, I am more and more convinced that we should merge …
Sorry it took me so long. Thank you for the contribution! :)
Several improvements to module `CCHeap`:

- reduce the complexity of heap building (`{add,of}_{list,iter,seq,gen}`), filtering (`delete_all`, `filter`)
- ensure physical equality when `delete_all` or `filter` doesn't remove any element
  - for `delete_all` it is a bugfix, as it was specified in doc but not enforced
- simplify `delete_one` (by using physical equality on the results of recursive calls rather than pairing them with a boolean)
  - `delete_one` was already O(n) and already ensured physical equality

More details in commit messages.
The main change is the complexity reduction. Naive iteration of insertions is O(n log n), but there is a better generic algorithm (due to Tarjan I believe) that doesn’t depend on the actual heap representation, and that achieves O(n) as long as merge is in O(log n). The improvement of filtering functions is also a corollary of that.
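A sketch of that generic construction, with assumed parameter names (not the PR's actual code, which works on a concrete representation): keep a queue of heaps, starting from singletons, and repeatedly merge the two front heaps. With a merge in O(log n), the per-round costs telescope to O(n) overall.

```ocaml
(* Generic bottom-up heap construction (sketch with assumed parameters):
   works for any heap type given [empty], [singleton] and [merge]. *)
let heap_of_list ~empty ~singleton ~merge elems =
  let q = Queue.create () in
  List.iter (fun x -> Queue.add (singleton x) q) elems;
  if Queue.is_empty q then empty
  else begin
    (* FIFO order merges pairwise, round by round: ~n/2 merges of size 2,
       ~n/4 of size 4, ...  With merge in O(log n) the total cost is
       sum_i (n / 2^i) * O(i) = O(n). *)
    while Queue.length q >= 2 do
      let h1 = Queue.pop q in
      let h2 = Queue.pop q in
      Queue.add (merge h1 h2) q
    done;
    Queue.pop q
  end
```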
Ideas for future work:
- there are duplicated doc comments in `CCHeap.ml`, I thus kept them in sync with `CCHeap.mli`, but I think they should go away, shouldn't they?
- similarly to `delete_all`, it would be easy to implement a function such as `take_bounded : predicate:(elt -> bool) -> bound:(elt -> bool) -> t -> (elt list * t)`, which would delete and return (as an unordered list) every element satisfying the `predicate` and the `bound`, where the `bound` function is monotonic (that is, if `bound x = false`, then for any y ≥ x, `bound y = false`). For instance, `take_bounded ~predicate:(fun x -> x=x0) ~bound:(fun x -> x<=x0)` would delete every occurrence of `x0`, and `take_bounded ~predicate:(fun _ -> true) ~bound:(fun x -> x<x0)` would delete every element less than `x0`. The interest is that it is done in time O(k + ℓ log(n/ℓ)), where n is the size of the heap, k ≤ n is the number of elements satisfying the bound, and ℓ ≤ k is the number of deleted elements; this complexity is an O(n), and it is also an O(k log n), which is much smaller than O(n) when k is asymptotically smaller than n. Thus it is never worse, and sometimes much better, than an unbounded `filter`. And it cannot be implemented on the user side, because an iteration of `take_min` would achieve O(k log n) but not O(n); the worst case would be O(n log n). It remains to be seen how useful that function is, and what would be the best interface (a rough sketch is given after this list).
- `CCMutHeap`: binary heaps do support a linear-time `add_list` / `of_list`, albeit with a different algorithm than for leftist heaps.
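A rough sketch of the proposed `take_bounded` (hypothetical, not part of this PR), reusing the minimal leftist-heap definitions (`E`/`N`, `merge`, `_make_node`) from the sketch given after the first commit message above: because `bound` is monotonic and every descendant is ≥ its parent, the traversal can stop as soon as `bound` fails, so only in-bound nodes are visited, and each deleted node is replaced by merging its children.

```ocaml
(* Sketch of the proposed [take_bounded], on the minimal leftist heap above.
   Returns the deleted elements (unordered) together with the remaining heap. *)
let take_bounded ~predicate ~bound h =
  let rec go acc h =
    match h with
    | E -> acc, E
    | N (_, x, l, r) ->
      if not (bound x) then acc, h  (* monotonic bound: prune the whole subtree *)
      else
        let acc, l' = go acc l in
        let acc, r' = go acc r in
        if predicate x then x :: acc, merge l' r'
        else acc, _make_node x l' r'
  in
  go [] h

(* e.g. take_bounded ~predicate:(fun x -> x = x0) ~bound:(fun x -> x <= x0) h
   deletes every occurrence of x0, as in the example above. *)
```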