Elaborate on asymptotics of IntMap #957

Merged
merged 1 commit into from
Mar 9, 2024

Conversation

Bodigrim
Contributor

@Bodigrim Bodigrim commented Jul 6, 2023

Recently I had a long discussion elsewhere comparing asymptotics of IntMap to balanced binary trees. I think it's worth writing down in the documentation.

Comment on lines 60 to 64
-- * even for extremely unbalanced tree the depth cannot be larger than
-- the number of elements \(n\),
-- * each level of a Patricia tree determines at least one more bit
-- shared by all subelements, so there could not be more
-- than \(W\) levels.
Contributor

I'm not sure this information is useful for someone looking to use an IntMap. The next paragraph might be useful however.

Contributor Author

I find it helpful to bring a quick exposition without forcing people to look into the paper.

Contributor Author

@Bodigrim Bodigrim Jul 7, 2023

The way people often read $O(\min(n, W))$ is "it grows linearly while $n < W$, then remains constant", which is quite a natural reading honestly. It's important to explain that it's more like "it's normally more like $\log N$, which is capped by $W$, but in unbalanced edge cases can grow as fast as $n$, but again capped by $W$", and this is because the depth of the tree is determined by these factors.
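
To make the two regimes concrete, here is a minimal sketch that computes the depth a big-endian Patricia tree would have for a given set of keys. It models only the tree shape (each branch splits on the highest bit at which its keys differ) and does not touch the real `IntMap` internals; `patriciaDepth` is a hypothetical helper, not part of containers.

```haskell
import Data.Bits (countLeadingZeros, finiteBitSize, testBit, xor, (.|.))
import Data.List (partition)

-- | Hypothetical helper (not part of containers): number of branch nodes on
-- the longest root-to-leaf path of a big-endian Patricia tree holding the
-- given keys. Each branch splits on the highest bit at which its keys differ.
patriciaDepth :: [Word] -> Int
patriciaDepth [] = 0
patriciaDepth ks@(k : _)
  | diff == 0 = 0 -- all keys equal: a single leaf
  | otherwise = 1 + max (patriciaDepth zeros) (patriciaDepth ones)
  where
    -- Bits set in diff are exactly the bits on which some key differs from k.
    diff = foldr (\x acc -> (x `xor` k) .|. acc) 0 ks
    -- Index of the highest such bit.
    hi = finiteBitSize diff - 1 - countLeadingZeros diff
    (ones, zeros) = partition (`testBit` hi) ks

main :: IO ()
main = do
  -- Dense keys: depth is log2 of the number of keys.
  print (patriciaDepth [0 .. 1023]) -- 10
  -- Powers of two: every branch peels off a single key, so depth is n - 1.
  print (patriciaDepth [2 ^ i | i <- [0 .. 9 :: Int]]) -- 9
```

The dense case shows the "normally logarithmic" behaviour, while the powers-of-two case is the unbalanced edge case where depth grows linearly with $n$ until it hits the word size $W$.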

Contributor

> I find it helpful to bring a quick exposition without forcing people to look into the paper.

I disagree... I think documentation should either explain well or point to something which does. The two points about the tree only raise more questions. But I'll leave it to treeowl.

> It's important to explain that it's more like "it's normally more like log N, which is capped by W, but in unbalanced edge cases can grow as fast as n, but again capped by W"

This is a fair summary that could be documented.

-- shared by all subelements, so there could not be more
-- than \(W\) levels.
--
-- If all \(n\) keys in the tree are less than \(N\),
Contributor

This is true, but can be expanded as "If all n keys in the tree are in a contiguous range of size N...", since it is not about the absolute value.

But this is also not the full picture: there are other ways to get O(log N), where N is the number of potential keys. For instance, if we only store elements that are k*i for i in [0..N], the statement above suggests it's O(log kN), but it's still O(log N).
The key to getting O(log N) is that for every branch in the tree that divides the range, half of the potential elements should fall on either side. But this probably can't be presented nicely to the user without explaining the structure of the tree...
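
As a rough illustration of the k*i case, the sketch below computes the depth a big-endian Patricia tree would have for scaled keys. It models only the tree shape, not the real `IntMap` internals; `patriciaDepth` is a hypothetical helper, and the values of `n` and `k` are arbitrary choices for the example.

```haskell
import Data.Bits (countLeadingZeros, finiteBitSize, testBit, xor, (.|.))
import Data.List (partition)

-- | Hypothetical helper (not part of containers): depth of the big-endian
-- Patricia tree holding the given keys, where each branch splits on the
-- highest bit at which its keys differ.
patriciaDepth :: [Word] -> Int
patriciaDepth [] = 0
patriciaDepth ks@(k : _)
  | diff == 0 = 0 -- all keys equal: a single leaf
  | otherwise = 1 + max (patriciaDepth zeros) (patriciaDepth ones)
  where
    diff = foldr (\x acc -> (x `xor` k) .|. acc) 0 ks
    hi = finiteBitSize diff - 1 - countLeadingZeros diff
    (ones, zeros) = partition (`testBit` hi) ks

main :: IO ()
main = do
  let n = 1023 :: Word
  -- Contiguous keys 0..n: depth 10.
  print (patriciaDepth [0 .. n])
  -- Keys scaled by k = 1024: the low 10 bits are shared, so depth is still 10.
  print (patriciaDepth [1024 * i | i <- [0 .. n]])
  -- Keys scaled by k = 3: all keys are below 2^12, so depth is at most 12.
  print (patriciaDepth [3 * i | i <- [0 .. n]])
```

When k is a power of two the scaling costs nothing, because bits shared by all keys never produce a branch; for other constant k the depth is bounded by the bit length of k*N, which differs from log N only by a constant.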

Contributor Author

O(log kN) and O(log N) are the same as long as k does not grow with the growth of N.

The point is not about being precise, but to explain that in a typical scenario IntMap is roughly logarithmic.

Contributor

@meooow25 meooow25 Jul 7, 2023

Fair enough, I'm only seeing if the situation can be generalized to give the reader more information. But it might be good enough to address only the typical scenario of [0..N].

@Bodigrim
Contributor Author

@treeowl any opinion on this?

@treeowl
Contributor

treeowl commented Jul 11, 2023

I don't really think there's anything very logarithmic going on here in general. The bounds we give are conservative approximations. If we can give tighter, but still conservative, approximations, all the better. But vague "probably this good unless your data aren't what we like" doesn't seem so valuable.

@Lysxia
Contributor

Lysxia commented Jul 14, 2023

The case where your keys are all in a small interval seems like a common enough situation that it's worth pointing out the logarithmic behavior in that case. min(n,64) doesn't actually tell you much, because in practice maps contain fewer than 2^64 elements, so log(n) < 64. There's still an order of magnitude between a balanced tree of depth 10 and a Patricia tree of depth 64. So it's good to point out that cases where an IntMap has depth 64 are actually quite specific.

@treeowl
Contributor

treeowl commented Jul 14, 2023

Agreed. In many cases, we can give bounds in terms of the maximum minus the minimum, right? But that is also pretty conservative, I think, since a map with keys like [minBound .. minBound + 10000] ++ [maxBound - 10000 .. maxBound] should be relatively efficient too, if I'm not mistaken. Is there a way we can talk about tree depth that's tighter while still being tractable?
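
The two-cluster example can be checked with a small sketch that computes the depth a big-endian Patricia tree would have for those keys. It models only the tree shape, under the assumption that branching depends solely on the two's-complement bit pattern of each key; `patriciaDepth` and `toBits` are hypothetical helpers, not part of containers.

```haskell
import Data.Bits (countLeadingZeros, finiteBitSize, testBit, xor, (.|.))
import Data.List (partition)

-- | Hypothetical helper (not part of containers): depth of the big-endian
-- Patricia tree holding the given keys, where each branch splits on the
-- highest bit at which its keys differ.
patriciaDepth :: [Word] -> Int
patriciaDepth [] = 0
patriciaDepth ks@(k : _)
  | diff == 0 = 0 -- all keys equal: a single leaf
  | otherwise = 1 + max (patriciaDepth zeros) (patriciaDepth ones)
  where
    diff = foldr (\x acc -> (x `xor` k) .|. acc) 0 ks
    hi = finiteBitSize diff - 1 - countLeadingZeros diff
    (ones, zeros) = partition (`testBit` hi) ks

-- | Reinterpret an Int key as its two's-complement bit pattern.
toBits :: Int -> Word
toBits = fromIntegral

main :: IO ()
main = do
  let low = [minBound .. minBound + 10000] :: [Int]
      high = [maxBound - 10000 .. maxBound] :: [Int]
  -- One branch on the sign bit, then 14 levels inside each cluster.
  print (patriciaDepth (map toBits (low ++ high))) -- 15
```

Here the maximum minus the minimum spans the whole key space, yet the modelled depth is 15 for 20002 keys, which supports the point that a bound based on the key range alone is quite conservative.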

@Bodigrim
Contributor Author

> Is there a way we can talk about tree depth that's tighter while still being tractable?

@treeowl I'll gladly take specific suggestions, but I'm almost over my time budget for this. I'd say that "each level of a Patricia tree determines at least one more bit shared by all subelements" is descriptive enough for an alert reader.

@Bodigrim
Contributor Author

@treeowl I rebased and extended the description, how does it look now? It's purely a documentation change, I'd love to get it decided on one way or another.

@Lysxia
Contributor

Lysxia commented Jan 30, 2024

That looks pretty good to me!

@Bodigrim
Contributor Author

Bodigrim commented Feb 1, 2024

@treeowl just another reminder to take a look at the proposed documentation change.

@Lysxia Lysxia requested a review from treeowl February 1, 2024 22:12
@treeowl
Contributor

treeowl commented Feb 1, 2024

I'll try to have a look tonight!

@Bodigrim
Contributor Author

@treeowl one more ping.

@Bodigrim
Contributor Author

@treeowl just a gentle reminder to review this PR.

Contributor

@treeowl treeowl left a comment

I'm sorry it took so long to review this. I have just a couple questions below.

-- If all \(n\) keys in the tree are between 0 and \(N\),
-- the estimate can be refined to \(O(\min(n, \log N))\). If the set of keys
-- is sufficiently "dense", this becomes \(O(\min(n, \log n))\) or simply
-- the familiar \(O(\log n)\), matching balanced binary trees.
Contributor

What if there are negative keys? Can we give a similarly concise refinement in that case?

Contributor

I think we can probably improve further, but we don't have to do it now.

-- is sufficiently "dense", this becomes \(O(\min(n, \log n))\) or simply
-- the familiar \(O(\log n)\), matching balanced binary trees.
--
-- The most performant scenario for 'IntMap' are keys from a continuous subset,
Contributor

Is "continuous" the right word here? Maybe "contiguous"?

@Bodigrim
Contributor Author

Bodigrim commented Mar 9, 2024

@treeowl thanks, updated and rebased. Good to go?

@treeowl treeowl merged commit 8f6ef9a into haskell:master Mar 9, 2024
10 checks passed
@treeowl
Contributor

treeowl commented Mar 9, 2024

Thanks, and thanks for your patience.

@Bodigrim Bodigrim deleted the intmap-docs branch March 9, 2024 19:36