
Improvements to CCHeap #457

Merged
merged 15 commits into from
Oct 2, 2024
Conversation


@gmevel gmevel commented Jul 31, 2024

Several improvements to module CCHeap:

  • improve documentation: better ordering, more consistent wording, English fixes, more information, and indications of time complexity
  • reduce time complexity from O(n log n) to O(n) for several operations: building from a sequence of elements ({add,of}_{list,iter,seq,gen}) and filtering (delete_all, filter)
  • enforce physical equality when delete_all or filter does not remove any element
    • for delete_all this is a bugfix, as it was specified in the doc but not enforced
  • avoid some boxing in delete_one (by using physical equality on the results of recursive calls rather than pairing them with a boolean)
    • note: delete_one was already O(n) and already ensured physical equality
  • improve existing tests and add more

More details in commit messages.

The main change is the complexity reduction. Naively repeating insertions is O(n log n), but there is a better generic algorithm (due to Tarjan, I believe) that does not depend on the actual heap representation and achieves O(n) as long as merge is O(log n). The improvement to the filtering functions is also a corollary of that.
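That generic technique can be sketched as follows on a minimal, monomorphic leftist heap (illustrative only; CCHeap's actual code is polymorphic and structured differently): put one singleton heap per element in a FIFO queue, then repeatedly merge the two front heaps and push the result to the back.

```ocaml
(* Illustrative leftist min-heap of ints; not CCHeap's actual representation. *)
type t = E | N of int * int * t * t  (* rank, value, left, right *)

let rank = function E -> 0 | N (r, _, _, _) -> r

(* Rebuild a node, swapping children to restore the leftist rank invariant. *)
let make_node x a b =
  if rank a >= rank b then N (rank b + 1, x, a, b)
  else N (rank a + 1, x, b, a)

let rec merge h1 h2 =
  match h1, h2 with
  | E, h | h, E -> h
  | N (_, x, a1, b1), N (_, y, _, _) ->
    if x <= y then make_node x a1 (merge b1 h2) else merge h2 h1

(* O(n) bottom-up construction: because the queue is FIFO, the heaps being
   merged grow geometrically, so the total cost of all merges sums to O(n). *)
let of_list xs =
  match xs with
  | [] -> E
  | _ ->
    let q = Queue.create () in
    List.iter (fun x -> Queue.add (N (1, x, E, E)) q) xs;
    while Queue.length q > 1 do
      let a = Queue.pop q in
      let b = Queue.pop q in
      Queue.add (merge a b) q
    done;
    Queue.pop q
```

By contrast, folding `merge` over the list left to right keeps merging each new singleton into one ever-growing heap, which is exactly the O(n log n) repeated-insertion pattern the queue avoids.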

Ideas for future work:

  • There are doc-strings in CCHeap.ml, so I kept them in sync with CCHeap.mli, but I think they should go away, shouldn’t they?
  • By generalizing the code of delete_all, it would be easy to implement a function such as take_bounded : predicate:(elt -> bool) -> bound:(elt -> bool) -> t -> elt list * t, which would delete and return (as an unordered list) every element satisfying both the predicate and the bound, where the bound function is monotonic (that is, if bound x = false, then bound y = false for any y ≥ x). For instance, take_bounded ~predicate:(fun x -> x=x0) ~bound:(fun x -> x<=x0) would delete every occurrence of x0, and take_bounded ~predicate:(fun _ -> true) ~bound:(fun x -> x<x0) would delete every element less than x0. The interest is that it runs in time O(k + ℓ log(n/ℓ)) ⊆ O(k + k log(n/k)), where n is the size of the heap, k ≤ n is the number of elements satisfying the bound, and ℓ ≤ k is the number of deleted elements; this complexity is O(n), and it is also O(k log n), which is much smaller than O(n) when k is asymptotically smaller than n. Thus it is never worse, and sometimes much better, than an unbounded filter. And it cannot be implemented on the user side: iterating take_min would achieve O(k log n) but not O(n), with a worst case of O(n log n). It remains to be seen how useful that function is and what the best interface would be.
  • I haven’t looked at it thoroughly, but similar work may be done for CCMutHeap: binary heaps do support a linear-time add_list / of_list, albeit with a different algorithm than for leftist heaps.
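For what it’s worth, the take_bounded idea above could be sketched roughly as follows, again on an illustrative monomorphic leftist heap (the function, its signature and this representation are all hypothetical, not CCHeap code). Because bound is monotonic and the heap is min-ordered, any subtree whose root fails the bound is returned untouched, which both limits the traversal to the k bounded nodes and preserves physical equality for free.

```ocaml
(* Hypothetical sketch of the proposed [take_bounded]; not CCHeap code. *)
type t = E | N of int * int * t * t  (* rank, value, left, right *)

let rank = function E -> 0 | N (r, _, _, _) -> r

let make_node x a b =
  if rank a >= rank b then N (rank b + 1, x, a, b)
  else N (rank a + 1, x, b, a)

let rec merge h1 h2 =
  match h1, h2 with
  | E, h | h, E -> h
  | N (_, x, a1, b1), N (_, y, _, _) ->
    if x <= y then make_node x a1 (merge b1 h2) else merge h2 h1

let take_bounded ~predicate ~bound h0 =
  let deleted = ref [] in
  let rec aux h =
    match h with
    | N (_, x, a, b) when bound x ->
      let a' = aux a in
      let b' = aux b in
      if predicate x then begin
        deleted := x :: !deleted;
        merge a' b'
      end
      else if a' == a && b' == b then h  (* nothing removed below: share *)
      else make_node x a' b'
    | _ -> h  (* empty, or root fails the (monotonic) bound: whole subtree kept *)
  in
  let h' = aux h0 in
  (!deleted, h')
```

For example, ~predicate:(fun x -> x = x0) ~bound:(fun x -> x <= x0) deletes every occurrence of x0 while visiting only the nodes whose value is at most x0.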

gmevel added 12 commits July 26, 2024 22:17
Committing to these complexities in the documentation is not a constraint on the representation of heaps, because they are achieved by every well-known representation (for some of them, in amortized time):

https://en.wikipedia.org/wiki/Template:Heap_Running_Times

- `find_min`: O(1)
- `take`: O(log n)
- `insert`: O(log n)
- `merge`: O(log(m+n)) (except for binary heaps, which only achieve O(m+n))
- `add_seq`: O(n log(m+n)) (trivially, by repeated insertion)
  + this can be improved to O(log(m) + n), regardless of the
    representation of heaps (to be done in a later commit)
- `of_seq`: O(n log n) (ditto: can be improved to O(n))

Less trivial:

- `filter`, `delete_{one,all}`:

  + O(n) can be achieved for any reasonable representation of heaps, by
    using `of_seq` and `to_seq` which, as said, can always be made O(n).

  + With the current implementation, it is not obvious, but the
    complexity of `filter` and `delete_all` is Θ(n log n); the
    complexity of `delete_one` is O(n). Indeed, node rebuilding with
    `_make_node` is in O(1), merging is in Θ(log n), and every element
    deletion induces one merge; there are heap instances that achieve
    the worst case Ω(n log n), for instance:

                     x
                    / \
                   x   y
                  / \
                ...  y
                /
               x
              / \
             h   y

    with n/3 occurrences of x, n/3 occurrences of y, a sub-heap h of n/3
    elements, and when y is greater than all elements of h; then,
    deleting all occurrences of x performs the following computation:

        merge (merge (merge (merge h y) …) y) y

    where each `merge` takes time Θ(log n).
- for `delete_all` this is a bugfix
  (physical equality was documented but not implemented)
- `delete_one` is unchanged, it already had complexity O(n)
  and ensured physical equality
- label all tests
- decouple tests about different heap functions
- random instances now have better coverage of possible cases:
  + more variability in size
    (previously, some tests were limited to a fixed size)
  + high probability of duplicates
    (previously, the probability of duplicates was negligible,
    because elements were drawn uniformly from the full `int` range)
- the test for `of_list, take_exn` is now more precise
  (added a duplicate element)
- the test for `to_list_sorted` is now more precise
  (checks that the resulting list is what we want,
  instead of just checking that it is sorted)
- the test for `filter` is now more precise
  (also checks that no element has been spuriously dropped)
- more uniform style for easier reading, using `|>`

gmevel commented Aug 9, 2024

Building a heap by repeated insertions has one advantage, though: when the input sequence is sorted, inserting the elements in reverse order builds a list-like heap whose elements are entirely sorted. And it runs in O(n) rather than the worst-case O(n log n). Then, taking elements from the heap runs in O(1) rather than O(log n), and converting to a sorted sequence runs in O(n) rather than O(n log n).

This is dependent on the actual heap representation. It is true of leftist heaps, but also of pairing heaps and Brodal-Okasaki heaps, at least.

We can take advantage of this for building a heap with an efficient structure from an almost-sorted sequence, which I think is a fairly common situation. For one, it improves heap-sorting.

When the input is almost but not entirely sorted, I’m not sure how well repeated insertions fare. However, we can combine this idea with the generic technique that I introduced earlier in this PR for building a heap: split the input sequence into already-sorted chunks, convert each chunk to an efficient list-like heap, then merge all heaps with the generic technique of this PR. Then, building a heap from any sequence is O(n), and the resulting heap records every greater-than relation between successive elements of the input sequence. I haven’t checked the math, but I think that converting the heap to a sorted sequence then runs in something like O(n log(1 + Inv(x1,…,xn)/n)), where Inv(x1,…,xn) = #{ (i, j) | i < j, xi > xj } is the number of inversions of the input sequence x1,…,xn. At worst, Inv(x1,…,xn) = O(n²), so the worst-case bound is of course O(n log n). But the fewer inversions, the better the complexity; at best, if only a constant number of elements are misplaced, then Inv(x1,…,xn) = O(n) and the complexity reduces to O(n).

Implementing that, I just pushed a commit that exports two new functions, of_iter_almost_sorted and add_iter_almost_sorted. I did not want to multiply the number of functions with variants taking lists, Seq.t or gen, so I only provided the iter variants, but I admit this interface is not satisfying. Alternatively, if you agree with the change, we can simplify the interface by just changing the implementation of the existing functions of_iter / add_iter (hence also impacting the existing variants for lists, Seq.t and gen). It is almost a strict improvement: it still works for arbitrary sequences, it still has complexity O(n), but when the input happens to be almost sorted it builds a better-shaped heap that allows more efficient sorting. On the other hand, it has some overhead, and it allocates a temporary list of length O(length of the longest sorted chunk) = O(n) (because it needs to reverse it) [this might be avoided with a TRMC-like hack?].
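For concreteness, the run-splitting construction described above might look like this on an illustrative monomorphic leftist heap (a sketch only; the actual of_iter_almost_sorted consumes an iter, is polymorphic, and differs in detail):

```ocaml
(* Illustrative leftist min-heap of ints; not CCHeap's actual code. *)
type t = E | N of int * int * t * t  (* rank, value, left, right *)

let rank = function E -> 0 | N (r, _, _, _) -> r

let make_node x a b =
  if rank a >= rank b then N (rank b + 1, x, a, b)
  else N (rank a + 1, x, b, a)

let rec merge h1 h2 =
  match h1, h2 with
  | E, h | h, E -> h
  | N (_, x, a1, b1), N (_, y, _, _) ->
    if x <= y then make_node x a1 (merge b1 h2) else merge h2 h1

(* [cons x h] assumes x <= every element of h: O(1), keeps the heap list-like. *)
let cons x h = N (1, x, h, E)

let of_list_almost_sorted xs =
  (* 1. Split the input into maximal ascending runs; each run is kept in
        reverse order (hence descending) as it is accumulated. *)
  let runs =
    List.fold_left
      (fun runs x ->
        match runs with
        | (y :: _ as run) :: rest when x >= y -> (x :: run) :: rest
        | _ -> [ x ] :: runs)
      [] xs
  in
  (* 2. Fold each descending run into a fully sorted, list-like heap:
        every element consed is <= the current root. *)
  let heaps = List.map (List.fold_left (fun h x -> cons x h) E) runs in
  (* 3. Merge the per-run heaps pairwise with a FIFO queue: O(n) total. *)
  match heaps with
  | [] -> E
  | _ ->
    let q = Queue.create () in
    List.iter (fun h -> Queue.add h q) heaps;
    while Queue.length q > 1 do
      let a = Queue.pop q in
      let b = Queue.pop q in
      Queue.add (merge a b) q
    done;
    Queue.pop q
```

On a fully sorted input there is a single run, so the result is one entirely sorted list-like heap; on an arbitrary input the construction degrades gracefully to the generic O(n) pairwise merging.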


c-cube commented Aug 26, 2024

hmm I think the from_almost_sorted variants should take a list as input, since they build one under the hood anyway? This way it's up to the user to judge whether they can get a list cheaply in the first place, or whether to use the more generic version.


gmevel commented Sep 4, 2024

Under the hood, it builds a list whose length is that of the longest sorted subsequence, which may be much less than the total length. But it is not a very strong argument, and I agree that, if we must choose, it sounds better for the interface to focus on the most idiomatic OCaml containers, i.e. lists. However, I am more and more convinced that we should merge from_*_almost_sorted into from_* and just provide the latter, which makes the question disappear.

@c-cube c-cube merged commit afb93cf into c-cube:main Oct 2, 2024
1 check passed

c-cube commented Oct 2, 2024

Sorry it took me so long. Thank you for the contribution! :)
