
Improvements to CCHeap #457

Merged
merged 15 commits into from
Oct 2, 2024
Conversation


@gmevel gmevel commented Jul 31, 2024

Several improvements to module CCHeap:

  • improve documentation: better ordering, more consistent wording, English fixes, more information, and indications of time complexity
  • reduce time complexity from O(n log n) to O(n) for several operations: building from a sequence of elements ({add,of}_{list,iter,seq,gen}) and filtering (delete_all, filter)
  • enforce physical equality when delete_all or filter does not remove any element
    • for delete_all this is a bugfix, as it was specified in the doc but not enforced
  • avoid some boxing in delete_one (by using physical equality on the results of recursive calls rather than pairing them with a boolean)
    • note: delete_one was already O(n) and already ensured physical equality
  • improve existing tests and add more

More details in commit messages.

The main change is the complexity reduction. Naively repeating insertions is O(n log n), but there is a better generic algorithm (due to Tarjan, I believe) that does not depend on the actual heap representation and achieves O(n) as long as merge is O(log n). The improvement to the filtering functions is also a corollary of that.
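That generic technique can be sketched as follows on a minimal, monomorphic leftist heap (illustrative only; CCHeap's actual code is polymorphic and structured differently): put one singleton heap per element in a FIFO queue, then repeatedly merge the two front heaps and push the result to the back.

```ocaml
(* Illustrative leftist min-heap of ints; not CCHeap's actual representation. *)
type t = E | N of int * int * t * t  (* rank, value, left, right *)

let rank = function E -> 0 | N (r, _, _, _) -> r

(* Rebuild a node, swapping children to restore the leftist rank invariant. *)
let make_node x a b =
  if rank a >= rank b then N (rank b + 1, x, a, b)
  else N (rank a + 1, x, b, a)

let rec merge h1 h2 =
  match h1, h2 with
  | E, h | h, E -> h
  | N (_, x, a1, b1), N (_, y, _, _) ->
    if x <= y then make_node x a1 (merge b1 h2) else merge h2 h1

(* O(n) bottom-up construction: because the queue is FIFO, the heaps being
   merged grow geometrically, so the total cost of all merges sums to O(n). *)
let of_list xs =
  match xs with
  | [] -> E
  | _ ->
    let q = Queue.create () in
    List.iter (fun x -> Queue.add (N (1, x, E, E)) q) xs;
    while Queue.length q > 1 do
      let a = Queue.pop q in
      let b = Queue.pop q in
      Queue.add (merge a b) q
    done;
    Queue.pop q
```

By contrast, folding `merge` over the list left to right keeps merging each new singleton into one ever-growing heap, which is exactly the O(n log n) repeated-insertion pattern the queue avoids.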

Ideas for future work:

  • There are doc-strings in CCHeap.ml, so I kept them in sync with CCHeap.mli, but I think they should go away, shouldn’t they?
  • By generalizing the code of delete_all, it would be easy to implement a function such as take_bounded : predicate:(elt -> bool) -> bound:(elt -> bool) -> t -> elt list * t, which would delete and return (as an unordered list) every element satisfying both the predicate and the bound, where the bound function is monotonic (that is, if bound x = false, then bound y = false for any y ≥ x). For instance, take_bounded ~predicate:(fun x -> x=x0) ~bound:(fun x -> x<=x0) would delete every occurrence of x0, and take_bounded ~predicate:(fun _ -> true) ~bound:(fun x -> x<x0) would delete every element less than x0. The interest is that it runs in time O(k + ℓ log(n/ℓ)) ⊆ O(k + k log(n/k)), where n is the size of the heap, k ≤ n is the number of elements satisfying the bound, and ℓ ≤ k is the number of deleted elements; this complexity is O(n), and it is also O(k log n), which is much smaller than O(n) when k is asymptotically smaller than n. Thus it is never worse, and sometimes much better, than an unbounded filter. And it cannot be implemented on the user side: iterating take_min would achieve O(k log n) but not O(n), with a worst case of O(n log n). It remains to be seen how useful that function is and what the best interface would be.
  • I haven’t looked at it thoroughly, but similar work may be done for CCMutHeap: binary heaps do support a linear-time add_list / of_list, albeit with a different algorithm than for leftist heaps.
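For what it’s worth, the take_bounded idea above could be sketched roughly as follows, again on an illustrative monomorphic leftist heap (the function, its signature and this representation are all hypothetical, not CCHeap code). Because bound is monotonic and the heap is min-ordered, any subtree whose root fails the bound is returned untouched, which both limits the traversal to the k bounded nodes and preserves physical equality for free.

```ocaml
(* Hypothetical sketch of the proposed [take_bounded]; not CCHeap code. *)
type t = E | N of int * int * t * t  (* rank, value, left, right *)

let rank = function E -> 0 | N (r, _, _, _) -> r

let make_node x a b =
  if rank a >= rank b then N (rank b + 1, x, a, b)
  else N (rank a + 1, x, b, a)

let rec merge h1 h2 =
  match h1, h2 with
  | E, h | h, E -> h
  | N (_, x, a1, b1), N (_, y, _, _) ->
    if x <= y then make_node x a1 (merge b1 h2) else merge h2 h1

let take_bounded ~predicate ~bound h0 =
  let deleted = ref [] in
  let rec aux h =
    match h with
    | N (_, x, a, b) when bound x ->
      let a' = aux a in
      let b' = aux b in
      if predicate x then begin
        deleted := x :: !deleted;
        merge a' b'
      end
      else if a' == a && b' == b then h  (* nothing removed below: share *)
      else make_node x a' b'
    | _ -> h  (* empty, or root fails the (monotonic) bound: whole subtree kept *)
  in
  let h' = aux h0 in
  (!deleted, h')
```

For example, ~predicate:(fun x -> x = x0) ~bound:(fun x -> x <= x0) deletes every occurrence of x0 while visiting only the nodes whose value is at most x0.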

gmevel added 12 commits July 26, 2024 22:17
Committing to these complexities in the documentation is not a constraint on the representation of heaps, because they are achieved by every well-known representation (for some of them, in amortized time):

https://en.wikipedia.org/wiki/Template:Heap_Running_Times

- `find_min`: O(1)
- `take`: O(log n)
- `insert`: O(log n)
- `merge`: O(log(m+n)) (except for binary heaps, which only achieve O(m+n))
- `add_seq`: O(n log(m+n)) (trivially, by repeated insertion)
  + this can be improved to O(log(m) + n), regardless of the
    representation of heaps (to be done in a later commit)
- `of_seq`: O(n log n) (ditto: can be improved to O(n))

Less trivial:

- `filter`, `delete_{one,all}`:

  + O(n) can be achieved for any reasonable representation of heaps, by
    using `of_seq` and `to_seq` which, as said, can always be made O(n).

  + With the current implementation, it is not obvious, but the
    complexity of `filter` and `delete_all` is Θ(n log n); the
    complexity of `delete_one` is O(n). Indeed, node rebuilding with
    `_make_node` is in O(1), merging is in Θ(log n), and every element
    deletion induces one merge; there are heap instances that achieve
    the worst case Ω(n log n), for instance:

                     x
                    / \
                   x   y
                  / \
                ...  y
                /
               x
              / \
             h   y

    with n/3 occurrences of x, n/3 occurrences of y, a sub-heap h of n/3
    elements, and when y is greater than all elements of h; then,
    deleting all occurrences of x performs the following computation:

        merge (merge (merge (merge h y) …) y) y

    where each `merge` takes time Θ(log n).
- for `delete_all` this is a bugfix
  (physical equality was documented but not implemented)
- `delete_one` is unchanged, it already had complexity O(n)
  and ensured physical equality
- label all tests
- decouple tests about different heap functions
- random instances now have better coverage of possible cases:
  + more variability in size
    (previously, some tests were limited to a fixed size)
  + high probability of duplicates
    (previously, the probability of duplicates was negligible,
    because elements were drawn uniformly from the full `int` range)
- the test for `of_list, take_exn` is now more precise
  (added a duplicate element)
- the test for `to_list_sorted` is now more precise
  (checks that the resulting list is what we want,
  instead of just checking that it is sorted)
- the test for `filter` is now more precise
  (also checks that no element has been spuriously dropped)
- more uniform style for easier reading, using `|>`

gmevel commented Aug 9, 2024

Building a heap by repeated insertions has one advantage, though: when the input sequence is sorted, inserting the elements in reverse order builds a list-like heap whose elements are entirely sorted. And it runs in O(n) rather than the worst-case O(n log n). Then, taking elements from the heap runs in O(1) rather than O(log n), and converting to a sorted sequence runs in O(n) rather than O(n log n).

This is dependent on the actual heap representation. It is true of leftist heaps, but also of pairing heaps and Brodal-Okasaki heaps, at least.

We can take advantage of this for building a heap with an efficient structure from an almost-sorted sequence, which I think is a fairly common situation. For one, it improves heap-sorting.

When the input is almost but not entirely sorted, I’m not sure how well repeated insertions fare. However, we can combine this idea with the generic technique that I introduced earlier in this PR for building a heap: split the input sequence into already-sorted chunks, convert each chunk to an efficient list-like heap, then merge all heaps with the generic technique of this PR. Then, building a heap from any sequence is O(n), and the resulting heap records every greater-than relation between successive elements of the input sequence. I haven’t checked the math, but I think that converting the heap to a sorted sequence then runs in something like O(n log(1 + Inv(x1,…,xn)/n)), where Inv(x1,…,xn) = #{ (i, j) | i < j, xi > xj } is the number of inversions of the input sequence x1,…,xn. At worst, Inv(x1,…,xn) = O(n²), so the worst-case bound is of course O(n log n). But the fewer inversions, the better the complexity; at best, if only a constant number of elements are misplaced, then Inv(x1,…,xn) = O(n) and the complexity reduces to O(n).

Implementing that, I just pushed a commit that exports two new functions, of_iter_almost_sorted and add_iter_almost_sorted. I did not want to multiply the number of functions with variants taking lists, Seq.t or gen, so I only provided the iter variants, but I admit this interface is not satisfying. Alternatively, if you agree with the change, we can simplify the interface by just changing the implementation of the existing functions of_iter / add_iter (hence also impacting the existing variants for lists, Seq.t and gen). It is almost a strict improvement: it still works for arbitrary sequences, it still has complexity O(n), but when the input happens to be almost sorted it builds a better-shaped heap that allows more efficient sorting. On the other hand, it has some overhead, and it allocates a temporary list of length O(length of the longest sorted chunk) = O(n) (because it needs to reverse it) [this might be avoided with a TRMC-like hack?].
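For concreteness, the run-splitting construction described above might look like this on an illustrative monomorphic leftist heap (a sketch only; the actual of_iter_almost_sorted consumes an iter, is polymorphic, and differs in detail):

```ocaml
(* Illustrative leftist min-heap of ints; not CCHeap's actual code. *)
type t = E | N of int * int * t * t  (* rank, value, left, right *)

let rank = function E -> 0 | N (r, _, _, _) -> r

let make_node x a b =
  if rank a >= rank b then N (rank b + 1, x, a, b)
  else N (rank a + 1, x, b, a)

let rec merge h1 h2 =
  match h1, h2 with
  | E, h | h, E -> h
  | N (_, x, a1, b1), N (_, y, _, _) ->
    if x <= y then make_node x a1 (merge b1 h2) else merge h2 h1

(* [cons x h] assumes x <= every element of h: O(1), keeps the heap list-like. *)
let cons x h = N (1, x, h, E)

let of_list_almost_sorted xs =
  (* 1. Split the input into maximal ascending runs; each run is kept in
        reverse order (hence descending) as it is accumulated. *)
  let runs =
    List.fold_left
      (fun runs x ->
        match runs with
        | (y :: _ as run) :: rest when x >= y -> (x :: run) :: rest
        | _ -> [ x ] :: runs)
      [] xs
  in
  (* 2. Fold each descending run into a fully sorted, list-like heap:
        every element consed is <= the current root. *)
  let heaps = List.map (List.fold_left (fun h x -> cons x h) E) runs in
  (* 3. Merge the per-run heaps pairwise with a FIFO queue: O(n) total. *)
  match heaps with
  | [] -> E
  | _ ->
    let q = Queue.create () in
    List.iter (fun h -> Queue.add h q) heaps;
    while Queue.length q > 1 do
      let a = Queue.pop q in
      let b = Queue.pop q in
      Queue.add (merge a b) q
    done;
    Queue.pop q
```

On a fully sorted input there is a single run, so the result is one entirely sorted list-like heap; on an arbitrary input the construction degrades gracefully to the generic O(n) pairwise merging.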


c-cube commented Aug 26, 2024

hmm I think the from_almost_sorted variants should take a list as input, since they build one under the hood anyway? This way it's up to the user to judge whether they can get a list cheaply in the first place, or whether to use the more generic version.


gmevel commented Sep 4, 2024

Under the hood, it builds a list whose length is that of the longest sorted subsequence, which may be much less than the total length. But it is not a very strong argument, and I agree that, if we must choose, it sounds better for the interface to focus on the most idiomatic OCaml containers, i.e. lists. However, I am more and more convinced that we should merge from_*_almost_sorted into from_* and just provide the latter, which makes the question disappear.

@c-cube c-cube merged commit afb93cf into c-cube:main Oct 2, 2024
1 check passed

c-cube commented Oct 2, 2024

Sorry it took me so long. Thank you for the contribution! :)
