proposal: ordered tree-structured maps #1374
Comments
Keeping the comment for edit history; mostly irrelevant with the revised proposal.
I've updated the proposal to work a bit differently, to provide consistency in the slice operator. The revision makes it so that the changes are defined only for maps where the key types are orderable, per the Go spec. There's still an open question, which seems like something that must have already been solved in the literature, but I'll defer the research to another day :)
Hmm, about ordered keys, I think we should not do that (we can argue that it can be considered a language change: even if the syntax is the same, the behaviour is not). We must be deterministic, so we can enforce the same ordering for the same set of keys, but not necessarily an alphabetical one. We could use an in-order traversal when traversing the tree in for loops. On the other hand, I don't think binary trees are the correct structure for that. Maybe we could use B-trees (self-balancing trees): they don't need to be rebalanced periodically, and they are good for disk storage.
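To illustrate the in-order traversal point, a minimal, self-contained sketch (a toy unbalanced BST, not the structure the gnovm would actually use): an in-order walk visits keys in the order given by the comparison function, deterministically, regardless of insertion order.

```go
package main

import "fmt"

// node is a toy binary search tree, used only to show that an in-order
// walk yields a deterministic key order.
type node struct {
	key         string
	left, right *node
}

func insert(n *node, key string) *node {
	if n == nil {
		return &node{key: key}
	}
	if key < n.key {
		n.left = insert(n.left, key)
	} else if key > n.key {
		n.right = insert(n.right, key)
	}
	return n
}

// inorder appends keys left-root-right, i.e. in ascending order of the
// comparison used at insertion time.
func inorder(n *node, out []string) []string {
	if n == nil {
		return out
	}
	out = inorder(n.left, out)
	out = append(out, n.key)
	return inorder(n.right, out)
}

func main() {
	var root *node
	for _, k := range []string{"banana", "apple", "cherry"} {
		root = insert(root, k)
	}
	fmt.Println(inorder(root, nil)) // [apple banana cherry], regardless of insertion order
}
```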
I've revised the title of the proposal a second time, as well as its description, in the hope of drawing attention to the fact that the underlying implementation is not important; the proposal is concerned with defining an ordering for maps with common, orderable keys, and consequently allowing a "subset" operation on them. I'll try to give my thoughts on the concerns expressed by @mvertes and @ajnavarro for now.

Concern: we should not have this as the implementation for maps; instead we should create a new keyword so people know to expect different behaviour from Go

Let me clarify here that we currently do not allow persistence of maps in realm data in the first place; the map as such is currently a "half-supported" data structure. While I haven't done full research on how map storage could actually work as a hash map, I think it's safe to assume that if we were to store maps as-is, as hash maps, loading a map object would entail loading at least all of its keys and references to the ObjectIDs of their values. The advantage of tree-like structures in this case is that we can leverage their "tree" property and only load the objects we are interested in, effectively allowing us to build structures that can hold millions of keys while still performing relatively efficiently when running realms. My concerns with creating a new data structure with a different keyword (say,
An alternative idea, which could solve a majority of concerns, would be implementing generics, and as such allowing us to create typed avl trees.

I think we need a data structure for storing large amounts of data where all operations of the same kind (access/write/replace/delete...) cost the same amount of gas, even though this might not perfectly reflect the "real" CPU cycles. This is to avoid strategic delaying of transactions (doing them when rebalancing is not necessary), and it allows us to scale these data structures to work, and cost the same, with millions of keys, without developers thinking of forking a contract into a reset version of itself just to avoid increasing map insertion costs. (See "On gas cost and performance" and the following questions before raising any concerns about this.)

Another reason why I believe this language change makes sense is that the behaviour explained in this proposal would be purely additive to the behaviour that a Go developer (and the Go parser) might expect.
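On the typed avl tree idea, a rough sketch of what such a wrapper could look like, assuming Gno gains generics; the `avl` method signatures below are written from memory and should be treated as approximate:

```go
package typedavl

import "gno.land/p/demo/avl"

// Tree is a sketch of a generic wrapper over the string-keyed avl.Tree,
// restoring type safety for values at the API boundary.
type Tree[V any] struct {
	inner avl.Tree
}

func (t *Tree[V]) Set(key string, value V) {
	t.inner.Set(key, value)
}

func (t *Tree[V]) Get(key string) (V, bool) {
	v, ok := t.inner.Get(key)
	if !ok {
		var zero V
		return zero, false
	}
	return v.(V), true // assertion hidden inside the wrapper
}
```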
On this one, I have one final note: with the current proposal for the implementation, there actually is a way for someone to fall back to the normal hashmap behaviour (non-deterministic ordering in for loops, and no subset feature). You simply need to wrap your key type in a struct:

```go
package main

import "fmt"

type MyKey struct{ N int }

func main() {
	// struct keys are not orderable, so this map keeps the usual behaviour
	a := make(map[MyKey]int)
	a[MyKey{8}] = 11
	fmt.Println(a)
}
```
@thehowl can we bring this point up again? Would be great for the UX if we can use a native map type.
@leohhhn I won't have time for a while to deep-dive on this before test4, most likely. This is very significant, and coming up with a good PR will take at least a week of focused work; I want to close some of the existing "threads" of work I'm following before going after this.
Interesting: https://github.com/rsc/omap
The following is a proposal to change the underlying implementation of the native `map` data structure to provide sorted key ordering, a map "slicing" operator, and a good way to efficiently store maps in realm storage. This would also entail, as such, the deprecation of the `avl` package in favour of using the type-safe and language-native data structure providing the same functionality, all the while supporting non-string keys.

This proposal has been decoupled from a larger proposal, codenamed Sogno, on request by @moul. If all outstanding problems are solved, I'll try to work on this separately from the overarching Sogno project in order to hopefully expedite its development and, if deemed useful, guarantee its inclusion in our mainnet launch.
Context and Reasoning
Currently in Gno code, we see the following two data structures used very often:

The `avl.Tree`, from package `p/demo/avl`, is an implementation of a self-balancing binary tree, and is one of the earliest packages added to Gno. It has already proved undoubtedly useful in Gno code, as it provides two very useful features compared to maps: deterministic, sorted iteration over its keys, and efficient loading from realm storage. (Instead of loading the whole array, as is the case for slices, we load a log N number of entries.)
However, the way I see it, the implementation suffers from the following problems:

- Values are stored as `interface{}`. This means that when retrieving a value from an avl tree, we necessarily have to do a type assertion.
- The `avl.Tree` needs to do rebalancing, which makes the cost of operations uneven.

On the other hand, the `map[K]T`, directly borrowed from Go, solves some of the problems that we have with Gno (it is type-safe over its key and value types, `K` and `T`), but its semantics (defined in the Go spec) don't make it very useful for storing values as if in a "key-value database": iteration order over its keys is non-deterministic, and there is currently no good way to persist it in realm storage.
)The proposal
I propose to make a language change to Gno, marking one significant change from the standard Go specification, while still retaining parse- and AST-level compatibility with Go:

- Make iteration over maps whose key type is orderable happen in ascending key order, deterministically. This allows proper sorting over maps as data structures, and is the main point of this proposal. Iteration on non-ordered keys is still pseudo-random.
- Allow the slice operator (only with two parameters) on a map: the expression `m[x:y]` (with `x`, `y` being of the same type as the key and `x <= y`) returns a new map, with the values shallow-copied, containing all key-value pairs where `x <= key < y`.
- Add a builtin function `func key(m map[K]T, n int) K`, where `K` is an ordered type. This function returns the key at the given index in the map's ordering. It can be used, for instance, to get the first 50 values of a map: `m[key(m, 0):key(m, 50)]` (see the sketch below).
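To make the intended semantics concrete, here is a small emulation in today's Go; the helpers below (`sortedKeys`, `key`, `submap`) are stand-ins for behaviour the proposal would build into the language, not part of any existing API:

```go
package main

import (
	"fmt"
	"sort"
)

// sortedKeys returns the map's keys in ascending order; under the proposal,
// `for k := range m` would already iterate in this order for orderable key types.
func sortedKeys(m map[string]int) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

// key emulates the proposed builtin key(m, n): the n-th key in sorted order.
func key(m map[string]int, n int) string {
	return sortedKeys(m)[n]
}

// submap emulates the proposed m[x:y]: a shallow copy of all pairs with x <= k < y.
func submap(m map[string]int, x, y string) map[string]int {
	out := make(map[string]int)
	for _, k := range sortedKeys(m) {
		if k >= x && k < y {
			out[k] = m[k]
		}
	}
	return out
}

func main() {
	m := map[string]int{"cherry": 3, "apple": 1, "banana": 2}
	fmt.Println(sortedKeys(m))                // [apple banana cherry]
	fmt.Println(submap(m, "apple", "cherry")) // map[apple:1 banana:2]
	fmt.Println(key(m, 0))                    // apple
}
```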
The natural way to implement the above language changes would be a self-balancing, tree-like structure (see the section below regarding the implementation). This can still guarantee reasonable performance while giving us the ability to implement the behaviour described above.
Some deal-breaker implementation requirements:
Thanks to generics in Go, any implementation of maps as binary trees in Gno can still be ported to Go code by making it use a custom data structure instead of Go's maps; that is to say, we can still precompile code.

One thing to note is that this proposal only covers maps which have orderable keys (ie. you must be able to use the `<` operator between two key values). This means that the behaviour for maps with non-orderable keys remains unchanged, and the same as in Go: iteration is pseudo-random, they cannot be "submapped", and the keys don't have an associated "index" (usable with the `key` function).

This may mean that the underlying implementation for non-orderable keys can remain the current one, with all its semantics and O(1) access times (if moving to a binary-like structure would degrade performance).

Before/after
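As an illustrative sketch of the before/after (the `avl` signatures are approximate, and the "after" half relies on the ordered-iteration semantics proposed above):

```go
package example

import "gno.land/p/demo/avl"

type Post struct{ Title string }

// Before: avl.Tree forces string keys and interface{} values, so every read
// needs a type assertion. (Method signatures approximate.)
func before() {
	var board avl.Tree
	board.Set("post-1", Post{Title: "hello"})
	if v, ok := board.Get("post-1"); ok {
		post := v.(Post) // type assertion required
		println(post.Title)
	}
}

// After (under this proposal): a plain typed map, with for-range iterating
// in ascending key order because string keys are orderable.
func after() {
	posts := make(map[string]Post)
	posts["post-1"] = Post{Title: "hello"}
	for id, post := range posts { // deterministic, sorted by id
		println(id, post.Title)
	}
}
```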
The underlying implementation
As part of the proposal, I think we want to test out and play with different map implementations, and potentially switch between them automatically, or even manually from the program. (I dislike the latter, but it may prove useful to users to be able to change it.)
This proposal was born with the idea of making the implementation of the `avl.Tree` the one used for storing map data; however, it is important to understand that, for the design of this proposal, the implementation is not that important. Rather, the idea is to extend the behaviour of Gno maps in a way that is purely additive to the Go counterpart, adding sorted iteration in for loops and map "subset" operations.

For this reason, I want to have several implementations we can benchmark and test for usability, so we can have a choice and make conscious decisions. I see the map interface as follows:
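As a sketch only, something along these lines, with keys already marshaled into an order-preserving `[]byte` and values as the gnovm's TypedValue; the exact method set below is illustrative, not a settled API:

```go
package maps

import "github.com/gnolang/gno/gnovm/pkg/gnolang"

// Map is a pluggable backing store for Gno maps with orderable keys.
// Keys arrive pre-marshaled into an order-preserving []byte encoding.
type Map interface {
	Get(key []byte) (value gnolang.TypedValue, ok bool)
	Set(key []byte, value gnolang.TypedValue)
	Delete(key []byte)
	Len() int
	// Iterate visits all pairs with start <= key < end in ascending key
	// order, stopping early if cb returns true.
	Iterate(start, end []byte, cb func(key []byte, value gnolang.TypedValue) bool)
	// Key returns the key at index n in ascending order
	// (to back the proposed key(m, n) builtin).
	Key(n int) []byte
}
```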
Note that there is no explicit requirement for the implementation to be a binary tree or anything in particular: it could even be a simple `[]TypedValue`. Most implementations we'll test out, though, are likely to be self-balancing tree structures (like the existing AVL tree, or B-trees), because they are the most efficient data structures which guarantee both ordered traversal of the keys and logarithmic access, insertion and deletion.

On keys
As you might have noticed from the above code, I'd like the types implementing the `Map` interface to have keys passed as simple `[]byte`. The underlying idea here is that the implementation should not be concerned with the underlying type of the key, but rather just with having a key and sorting it correctly in its internal data representation.
Furthermore, as already explained in depth, the values that can be used as keys must be of an orderable type; this way we don't create arbitrary definitions for sorting structs, interfaces or array types. We are thus restricted to three classes of elemental types [2]: integer types, floating-point types, and strings, all of which are ordered with the `<` operator.

All of these types must be able to be marshaled into a `[]byte` representation which still sorts correctly, as described above, when using lexical (byte-wise) ordering:

- Signed integers should have `1<<(bits-1)` added to them when marshaling, and vice-versa when unmarshaling (this way, we can correctly sort `-1 < 0 < 1`). After that, integers can be encoded into a `[]byte` of adequate size (u/int8 -> 1 byte, u/int16 -> 2 bytes, ...). This needs to be big-endian (see the sketch below).
- As for floating points, there's always `fmt.Sprintf("%0308.0308f", f)` 😉
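A small sketch of the integer encoding described above (offset the sign bit by `1<<(bits-1)`, then write big-endian bytes), so that byte-wise comparison agrees with the numeric order:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// encodeInt64 marshals a signed integer into 8 big-endian bytes whose
// lexical (byte-wise) order matches the numeric order. Adding 1<<63 modulo
// 2^64 is the same as flipping the sign bit, so we XOR with 1<<63.
func encodeInt64(v int64) []byte {
	buf := make([]byte, 8)
	binary.BigEndian.PutUint64(buf, uint64(v)^(1<<63))
	return buf
}

func main() {
	a, b, c := encodeInt64(-1), encodeInt64(0), encodeInt64(1)
	// bytes.Compare now agrees with the numeric order: -1 < 0 < 1.
	fmt.Println(bytes.Compare(a, b) < 0, bytes.Compare(b, c) < 0) // true true
}
```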
On gas cost and performance
One other issue I want to briefly talk about is how map operations should be priced in terms of gas. This is part of a much larger issue we're tracking in #1281. Naturally, as said in that issue, the gas cost for operations should be put in proportion to other VM operations; however, I just want to make clear that even if we don't have a time complexity of O(1) [1], we should still have map operations cost a constant amount of gas. More on this in the following section.
This also entails that map operations should cost the same regardless of whether, under the hood, we're rebalancing the tree or not. If the underlying implementation we're using requires rebalancing only some of the time, then its gas cost should take that into account and be based on the average over a very large number N of insertion operations.
Ideally, also, the operation cost for maps with keys of ordered types should be less than or equal to that of maps with unordered key types. Otherwise, there is a real economic incentive not to use them in many situations, and this would probably generate a plethora of "clever hacks" to make map usage less expensive.
One operation which might cost differently, however, is overwriting an existing key, as it is bound to be much less expensive than inserting a new one. So, all in all, I think this is a good model for pricing map operations in gas (for maps with ordered keys; a rough sketch of such a flat schedule follows the list):
- `make(map[k]v, int)`
- `m[x]`
- `_, ok := m[x]` (only if significantly different from retrieval)
- `m[x] = y` (`m[x]` exists)
- `m[x] = y` (`m[x]` does not exist)
- `delete(m, x)` (`m[x]` exists; otherwise it should cost like an existence check)
- `m[x:y]`. As this is a copy, this might be variable depending on the number of elements between `x` and `y`.
- `for x, y := range m`, `for x := range m`, `for range m` -- they can be simplified to some variation of `for x := 0; x < len(m); x++` and `key(m, x)`.
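As a sketch of what such a flat schedule could look like in the VM (operation names and numbers below are placeholders, not proposed values):

```go
package gas

// MapOp identifies a class of map operations priced at a flat rate,
// following the list above. Names and values are illustrative placeholders.
type MapOp int

const (
	OpMake MapOp = iota
	OpGet
	OpHas
	OpOverwrite // m[x] = y where m[x] already exists
	OpInsert    // m[x] = y where m[x] does not exist
	OpDelete
	OpRangeElem // charged per element visited in a range loop
)

// flatCost is charged regardless of whether the underlying tree rebalances;
// rebalancing is amortized into the per-operation price.
var flatCost = map[MapOp]int64{
	OpMake:      100,
	OpGet:       40,
	OpHas:       40,
	OpOverwrite: 60,
	OpInsert:    80,
	OpDelete:    60,
	OpRangeElem: 10,
}

// SubmapCost is the one variable-cost case: m[x:y] copies its elements,
// so it scales with the number of elements in the range.
func SubmapCost(elements int64) int64 { return 50 + 10*elements }
```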
Time complexity considerations
A math-savvy reader might comment that we should not consider all map accesses to have the same cost; for even if binary search trees guarantee a time complexity of O(log n), $\lim\limits_{x\rightarrow+\infty}\log x=+\infty$ [3]: the cost still grows without bound as the number of keys grows.
While this is true, in practice we're working on computers which have limited space. When assessing gas cost, we can make a worst-case calculation which assumes we're working with 2^40 keys (for context, assuming 1 key = 1 byte, which is unrealistically small, this would mean 1 TB just for storing the keys); the worst case would then be some multiple of 27.7 ($\ln 2^{40}$), which is not tremendously larger than a more real-life scenario of 1024 keys, which gives some multiple of 6.9 ($\ln 2^{10}$).
In other words, we can treat binary search trees as almost O(1) by simply basing our calculations on a number of inputs many orders of magnitude higher than what we can reasonably expect people to use, given the constraints on disk space.
Open problems
- How to marshal floating-point values into a `[]byte` while keeping their native Go ordering is unclear; likely a matter for some research into what others have done in the past.

Status
I'm opening this issue to kick-start discussion and feedback, as well as to listen for ideas to solve the open problems. If all goes well, I'm likely to start work on this in January/February 2024 and try to quickly come up with at least a proof of concept to benchmark performance between the current gnovm and one with binary search trees (especially in some known "bad" scenarios for binary search trees, such as intensive writes and small fire-and-forget maps).
Footnotes
1. If you're unaware, this is Big-O notation. Wikipedia articles on maths are notoriously bad for non-experts, so here is an alternative 100-second video explanation. O(1) means that the algorithm runs in "constant time": no matter how many inputs you give it, it will always finish within a known, "maximum" time.
2. This means, in practice, integers, floating points and strings. See comparison operators in the Go spec.
3. Sorry for the math/LaTeX flex.