Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: ordered tree-structured maps #1374

Open
thehowl opened this issue Nov 14, 2023 · 7 comments
Open

proposal: ordered tree-structured maps #1374

thehowl opened this issue Nov 14, 2023 · 7 comments
Assignees

Comments

@thehowl
Copy link
Member

thehowl commented Nov 14, 2023

The following is a proposal to change the underlying implementation of the native map data structure to provide sorted key ordering, a map "slicing" operator, and a good way to efficiently store maps in realm storage.

This would also entail, as such, the deprecation of the avl package in favour of using the type-safe and language native data structure providing the same functionality, all the while supporting non-string keys.

This proposal has been decoupled from a larger proposal, codenamed Sogno, on request by @moul. If all outstanding problems are solved, I'll try to work on this separately from the overarching Sogno project in order to hopefully expedite its development and, if deemed useful, guarantee its inclusion in our mainnet launch.

Context and Reasoning

Currently in gno code, we see the two described data structures used very often:

The avl.Tree, from package p/demo/avl, is an implementation of a self-balacing binary tree, and is one of the earliest packages added to Gno. It has already proved to be undoubtedly useful in Gno code, as it provides two very useful features which improve its usefulness compared to maps:

  • Alphabetical ordering of keys (side-effect of binary tree)
  • Capability of iterating through indexes (allowing for nice things like pagination)
  • Efficient capability of performing look ups with the underlying realm storage.
    (Instead of loading the whole array, as is the case for slices, we load a logN number of entries.)

However, the way I see it, the implementation suffers of the current problems:

  • It is not type safe. The implementation requires all keys to be strings, and values are stored as interface{}. This means that retrieving a value from an avl tree, we necessarily have to do a type assertion.
    • Of course, this could be solved by an implementation that uses generics; but I think it is safe to assume that if we'll have generics it will be a long and winding road to get there, requiring several core changes to the GnoVM.
  • Its usage is ubiquitous in Gno code, and we recommend its idiomatic use by any Gno learners, even if it is not even a part of the standard library.
  • The implementation is directly in Gno code, and as such entails that a single transaction can end up costing more or less in gas depending on whether the avl.Tree needs to do rebalancing.
    • This could have side effects further down the line, whereby smart gnomes might end up strategically deferring their transactions until they are not required to rebalance the tree to perform it.

On the other hand, the map[K]T, directly borrowed from Go, solves some of the problems that we have with Gno, but its semantics (defined in the Go spec) don't make it very useful for storing values as if in a "key-value database":

  • As a core language construct and type, it is obviously type safe and its operations are definitely typed with the types with which it is constructed (K and T)
    • As such, it allows non-string key types, etc.
  • Its design and underlying implementation is designed to be a hash-map.
    • This is good because it guarantees that look-ups are O(1)1; however it disallows us to run through its key-value pairs in alphabetical order efficiently (unless getting all the keys and sorting them first -- very costly!)
  • Because of the above, you also cannot perform lookups on numerical subsets of its values: you can either run through all key-value pairs (and maybe filter some, but only after you've gotten them all), or get a single value, knowing its key.

The proposal

I propose to make a language change to Gno, marking one significant change from the standard Go specification, while still retaining parse and ast-level compatibility with Go:

  1. Making for-range operations alphabetical for ordered key types2
    This allows proper sorting over maps as data structures, and is the main point of this proposal. Iteration on non-ordered keys is still pseudo-random.
  2. Slice operator on maps with ordered key types, known as a submap
    Allow the unary slice operator (only with two parameters) on a map: the expression m[x:y] (with x, y being of the same types as the key and x <= y) returns a new map, with the values shallow-copied, with all key-value pairs where x <= key < y.
  3. Introducing a new builtin (uverse) function, func key(m map[K]T, n int) K, where K is a ordered type
    This function returns the corresponding key at the given index of the function. It can be used, for instance, to get the first 50 values of a map: m[key(m, 0):key(m, 50)]

The natural implementation to implement the above language changes would be a self-balacing tree-ish structre (see section below regarding the implementation). This can still guarantee reasonable performance while giving us the ability to implement the above language changes.

Some deal-breaker implementation requirements:

  1. The implementation must still be super-efficient for small maps. (Maps can be used in many contexts which are write-heavy and then discarded; it must work reasonably well there, too)
  2. Map values must be perfectly and deterministically merkle-izable and storable in realm data, allowing for the same efficiency as avl.Trees in lookups.

Thanks to generics in Go, any implementation of maps as binary trees in Gno can still be ported to Go code by making it use a custom data strucutre instead of Go's maps; that is to say we can still precompile code.

One thing to note is that this proposal only covers map which have orderable keys (ie. you must be able to use the < operator between two key values). This means that the behaviour for maps that have non-orderable keys remains unchanged, and the same as Go: iteration is pseudo-random, they cannot be "submapped", and the keys don't have an associated "index" (usable on the key function).

This may mean that the underlying implementation for non-orderable keys may remain the current one, with all its semantics and O(1) access times (if moving to a binary-like structure would degrade performance).

Before/after

// Example 1: significant code reduction
// Source: GnoChess

// before:
var user2Games avl.Tree // std.Address -> []*Game
func addToUser2Games(addr std.Address, game *Game) {
	var games []*Game
	v, ok := user2Games.Get(string(addr))
	if ok {
		games = v.([]*Game)
	}
	// game must be at top, because it is the latest ID
	games = append([]*Game{game}, games...)
	user2Games.Set(string(addr), games)
}

// after:
var user2Games map[std.Address][]*Game
func addToUser2Games(addr std.Address, game *Game) {
	// game must be at top, because it is the latest ID
	user2Games[addr] = append([]*Game{game}, user2Games[addr]...)
}
// Example 2: type safety out of the box
// Source: GnoChess

// before:
var playerStore avl.Tree // std.Address -> *Player
func GetPlayer(player string) string {
	addr := parsePlayer(player)
	v, ok := playerStore.Get(addr.String())
	if !ok {
		panic("player not found")
	}
	b, err := v.(*Player).MarshalJSON()
	checkErr(err)
	return string(b)
}

// after:
var playerStore map[std.Address]*Player
func GetPlayer(player string) string {
	addr := parsePlayer(player)
	p, ok := playerStore[addr]
	if !ok {
		panic("player not found")
	}
	b, err := p.MarshalJSON()
	checkErr(err)
	return string(b)
}
// Example 3: iteration
// Source: r/demo/boards

type PostID uint64

// before:
type Board struct {
	threads   avl.Tree // Post.id -> *Post
	// ...
}
func (board *Board) RenderBoard() string {
	str := ""
	str += "\\[[post](" + board.GetPostFormURL() + ")]\n\n"
	if board.threads.Size() > 0 {
		board.threads.Iterate("", "", func(key string, value interface{}) bool {
			if str != "" {
				str += "----------------------------------------\n"
			}
			str += value.(*Post).RenderSummary() + "\n"
			return false
		})
	}
	return str
}

// after:
type Board struct {
	threads   map[PostID]*Post
	// ...
}
func (board *Board) RenderBoard() string {
	str := ""
	str += "\\[[post](" + board.GetPostFormURL() + ")]\n\n"
	for _, thread := range board.threads {
		if str != "" {
			str += "----------------------------------------\n"
		}
		str += thread.RenderSummary() + "\n"
	}
	return str
}
// Example 4: "numerical" iteration
// Source: own example (no uses of IterateByOffset in the wild!)

const perPage = 50

// before:
var posts avl.Tree // post id (string) -> string
func List(page int) string {
	s := ""
	offset := (page-1) * perPage
	if offset > posts.Size() {
		panic("invalid page")
	}
	posts.IterateByOffset(offset, perPage, func (k string, v interface{}) bool {
		s += "## " + k + "\n\n" + v.(string) + "\n\n"
		return false
	})
	return s
}

// after:
var posts map[string]string
func List(page int) string {
	s := ""
	offset := (page-1) * perPage
	max := offset + perPage
	switch {
	case offset > len(posts):
		panic("invalid page")
	case max > len(posts):
		max = len(posts)
	}
	for k, v := range posts[key(posts, offset):key(posts, max)] {
		s += "## " + k + "\n\n" + v + "\n\n"
	}
	return s
}

The underlying implementation

As part of the proposal, I think we want to test out and play with different map implementations, and potentially switch between them automatically or potentially even manually by the program. (I dislike the latter, but it may prove useful to users to be able to change it.)

This proposal was born with the idea of making the implementation of the avl.Tree the one which is used for storing map data; however, it is important to understand that for the design of this proposal, the implementation is not that important. Rather, the idea is that of extending the behaviour of Gno maps in a way that is purely additive to the Go couterpart, while adding sorted iteration in for loops, and map "subset" operations.

For this reason, I want to have several implementations we can benchmark and test out in usability, so we can have a choice and make conscious decisions. I see the map interface as follows:

// This will be the initialiser, used to register a map implementation within
// the gnoVM. size refers to the second parameter in `make`:
// make(map[string]string, 64). If not passed, it's 0.
func NewMap(size int) Map

type Map interface {
	// What you'd expect from Go:
	Get(key []byte) (value *TypedValue, ok bool)
	Has(key []byte) (ok bool) // Optimization shortcut for _, ok := m[v]
	Len() int

	// What is added:
	// Key returns the key at the given index,
	// and is called for the uverse function `key`.
	Key(i int) []byte
	// Slice returns a subset of the map, or a submap.
	// i and j are keys, and must be i <= j.
	// The returned submap contains all values from the parent
	// where the key is `i <= key < j`.
	// The returned map is a (shallow) copy: modifications
	// to the original don't affect the submap.
	// i or j may be nil, which means the key was omitted;
	// m[:] creates a shallow copy of the entire map,
	// m[i:] creates a shallow copy of the map for keys `i <= key`
	// m[:j] creates a shallow copy of the map for keys `key < j`.
	Slice(i, j []byte) (submap Map)
}

Note that there is no explicit requirement for the implementation to be a binary tree or anything in particular: it could even be a simple []TypedValue. Most implementations we'll test out, though, are likely to be from this list, because they are the most efficient data structures in computing which efficiently guarantee:

  • uniqueness of keys
  • sorting of keys
  • O(log n)1 (or better) access

On keys

As you might have noticed from the above code, I'd like to have the types implementing the Map interface have keys passed as simple []byte.

The underlying idea here is that the implementation should not be concerned in the underlying type of the key, but rather just in having a key and sorting it correctly in its internal data representation.

Furthermore, as already explained in depth, the values that can be used as keys must be of a orderable type; this way we don't create arbitrary defintions for sorting structs, interfaces or array types.

We are thus restricted to three classes of elemental types:

All of these types must be able to be marshaled into a []byte representation, which still sorts perfectly as described above when using lexical ordering.

  • Strings are compared with lexical ordering by default, so not much to do here.
  • Signed integers will need to be converted to unsigned and have 1<<(bits-1) added to them when marshaling; vice-versa when unmarshalling (this way, we can correctly sort -1 < 0 < 1).
    After that, integers can be encoded into a []byte of adequate size (u/int8 -> 1 byte, u/int16 -> 2 bytes...). This needs to be big-endian.
  • Floats: I don't have an immediate answer here, and it is an open problem, though I'm sure with a bit of reading a solution can be found. At worst, fmt.Sprintf("%0308.0308f", f) 😉

On gas cost and performance

One other issue I want to briefly talk about is on how map operations should cost in terms of gas. This is part of a much larger issue we're tracking in #1281. Naturally, as said in the issue, the gas cost for operations should be put into proportion to other VM operations; however, I just want to make clear that even if we don't have a time complexity of O(1)1, we should still have map operations cost a constant number of gas. More on this in the following section.

This also entails that map access should cost the same regardless of whether under the hood we're rebalancing the tree or not. If the underlying implementation we're using requires rebalancing only some of the time, then its gas cost should take it into account and take the average of a very large N of insertion operations.

Ideally, also, operation cost for maps with keys of ordered types should be <= than that of keys with unordered types. Otherwise, there is a real economic encouragement not to use them in many situations, and this would probably generate a plethora of "clever hacks" to make map usage less expensive.

One operation which might cost differently, however, is overwriting an existing key, as it is bound to really be much less expensive than inserting a new one. So, all in all, I think this is a good model we can price map operations in gas (maps with ordered keys):

  • Initialization: make(map[k]v, int)
  • Access: m[x]
  • Existance: _, ok := m[x] (only if significantly different from retrieval)
  • Overwrite: m[x] = y (m[x] exists)
  • Insert: m[x] = y (m[x] does not exist)
  • Delete: delete(m, x) (m[x] exists; otherwise it should cost like an existance check)
  • Submap: m[x:y]. As this is a copy, this might be variable depending on the number of elements between x and y
  • For-range single iteration: for x, y := range m
    • Probably different: for x := range m, for range m -- they can be simplified to some variation of for x := 0; x < len(m); x++
  • Key function: key(m, x).

Time complexity considerations

This section delves into a bit of maths.

The summary for those rolling their eyes right now is that while it is true that binary search tree access does not generally run in constant time (O(1)1), the scenario for a trillion keys will only be 4 times worse than that for a thousand keys; a trillion keys is also only 10 times worse than for only 16 keys.

A math-savvy reader might comment that we should not consider access to maps as the same cost; for even if binary search trees guarantee a time complexity of O(log n), the $\lim\limits_{x\rightarrow+\infty}\log x=+\infty$.3

While this is true, in practice we're working on computers which have limited space. When assessing gas cost, we can make a worst-case scenario calculation which asserts we're working with 2^40 keys (for context, assuming 1 key = 1 byte, which is impossible, this would mean 1 TB for just storing the values of the keys); the worst-case scenario of this would be some multiple of 27.7 ($\log 2^{40}$), which is not tremendously larger than a more real-life scenario of 1024 keys, resulting in some multiple 6.9 ($\log 2^{10}$).

In other words, we can treat binary search trees as almost O(1), by just making our calculations based on a number of inputs many orders of magnitude higher than what we can reasonably expect people to use; because of the constraints on disk space.

Open problems

  1. How to marshal floats into []byte while keeping their native Go ordering is unclear; likely a matter for some research into what others have done in the past.

Status

I'm opening up this issue to kick-start discussion and feedback, as well as listening for ideas to solve the open problems. If all goes well, I'm likely to start work on this in January/February 2024 and try to expeditely come up at least with a proof of concept to benchmark performance between the current gnovm and one with binary search trees (especially on some known "bad" scenarios for binary search trees, such as intensive writes, and fire-and-forget small maps.)

Footnotes

  1. If you're unaware, this is Big-O notation. Wikipedia article on maths are notoriously bad for non-experts; so here is an alternative 100-second video explanation. O(1) means that the algorithm runs in "constant time": no matter how many inputs you give it, it will always finish within a known, "maximum" time. 2 3 4

  2. This means, in practice, integers, floating points and strings. See comparison operators on the Go spec.

  3. Sorry for the math/latex flex.

@thehowl thehowl added 📦 🤖 gnovm Issues or PRs gnovm related gno-proposal Gnolang change proposals labels Nov 14, 2023
@thehowl thehowl self-assigned this Nov 14, 2023
@thehowl thehowl moved this to 🌟 Wanted for Launch in 🚀 The Launch [DEPRECATED] Nov 14, 2023
@thehowl
Copy link
Member Author

thehowl commented Nov 14, 2023

Keeping the comment for edit history; mostly irrelevant with the revised proposal.

@thehowl
Copy link
Member Author

thehowl commented Nov 15, 2023

I've updated the proposal to work a bit differently, to provide consistency in the slice operator (ie. if m[a] is map access by key, then m[a:b] should work on keys, not indexes) and to rewrite some sections to maintain 100% backwards compatibility with Go (so removing all the part about disallowing pointer/interface keys) and to avoid accidentally defining a canonical order for non-orderable types (in Go, the only orderable types are strings, integer types and float types). Orderable means that you can use the < operator on them.

The revision makes it so that the changes are defined only for maps where the key types are orderable, per the Go spec. There's still an open question, which seems like something which must have already been solved in the literature, but I'll defer the research to another day :)

@ajnavarro
Copy link
Contributor

Hmm, about ordered keys, I think we should not do that (we can argue that it can be considered a language change, even if the syntax is the same, the behavior is not).

We must be deterministic, so we can force always the same ordering for the same set of keys, but not an alphabetical one. We could use the Inorder Traversal when traversing the tree on for loops.

On the other side, I don't think binary trees are the correct structure for that. Maybe we could use B-Trees (self-balancing Trees). They don't need to be balanced periodically, and they are good for disk storage.

@thehowl thehowl changed the title proposal: maps as binary trees proposal: sortable maps Nov 22, 2023
@thehowl thehowl changed the title proposal: sortable maps proposal: ordered tree-structured maps Nov 22, 2023
@thehowl
Copy link
Member Author

thehowl commented Nov 22, 2023

I've revised the title of the proposal a second time, as well as the description of the proposal, in a hope to draw attention to the fact that the underlying implementation is not important; the proposal is interested with defining an ordering for maps with common, orderable keys and allowing as a consequence a "subset" operation on them.

I'll try to give my thoughts with regards to the concerns expressed by @mvertes and @ajnavarro for now.

Concern: we should not have this as the implementation for maps; instead we should create a new keyword so people know to expect different behaviour from Go

Let me clarify here that currently we do not allow for persistance of maps in realm data in the first place. The map as such is currently a "half-supported" data structure. While I haven't done full research on how map storage could actually work as a hash map, I think it's safe to assume that if we were to store maps as is, as hash maps, loading a map object would entail loading all of its keys and reference to the ObjectIDs of their values at least.

The advantage of tree-like structures in this case is that we can leverage their "tree" property and only load the objects we are interested in, and allow for effectively building structures that can hold millions of keys while still performing relatively efficiently when running realms.

My concerns with creating a new data structure with a different keyword (say, dict[K]V), are the following:

  • We have two "duplicate" data structures that mostly do the same thing usage-wise, the map and the dict, except the dict allows fewer key types and the map cannot be saved in a global variable for realm storage.
    • This will definitely be confusing for newcomers, and slightly confusing for seasoned Gophers.
  • We lose compatibility with Go tooling, like Go's native go/parser, for parsing Gno source code. This is a step that we might want to do at one point, but I think that step will need one where the benefits clearly outweigh the drawbacks. The major drawbacks is that we would have to modify all Go tooling to use our own parser instead of Go's: think even of a simple gofmt.

An alternative idea, which could solve a majority of concerns, would be that of implementing generics, and as such allowing us to create typed avl trees: avl.Tree[std.Address, *User]. However, I have one more concern about this which I specified in the OP:

I think we need to have a data structure for storing large amounts of data where all operations (of the same type, ie. access/write/replace/delete...) cost the same amount of gas, even though this might not perfectly reflect the "real" CPU cycles. This is to avoid strategic delaying of transactions (to do them when rebalancing is not necessary), and allows us to scale these data structures to be able to work and cost the same with millions of keys, without developers thinking of forking a contract into a reset version of itself just to avoid increasing map insertion costs. (See the "On gas cost and performance" and the following questions before raising any concerns about this).

Another reason why I believe this language change makes sense is that the behaviour as explained by this proposal would be purely additive to the behaviour that a Go developer (and the go parser) might expect.

  • Map ordering is random specifically because you cannot rely on their order for anything. I'm guessing some people might have created ways to generate random data based off the iteration of maps, but I don't see that as a practical usage of a map for which the random ordering should be kept.
    • As such, lexical ordering actually happens to be a case which the Go programmer needs to expect (!). This is because "random ordering" might actually entail that by pure chance, the ordering with which the map is iterated is actually the lexical ordering of its keys. Of course, this is highly improbable as maps grow larger, but it proves my point.
  • The other features, such as extending the slicing operator and the key function, are also purely additive and can be supported with no modifications to the Go parser (and as such will also be natively understood by go tooling). Proof for the slicing operator.

On this one, I have a final last note: with the current proposal for the implementation, there actually is a way for someone to fallback to the normal hashmap behaviour (non-deterministic ordering in for loops, and no subset feature). You simply need to wrap your type around a struct:

func main() {
	a := make(map[MyKey]int)
	a[MyKey{8}] = 11
	fmt.Println(a)
}

type MyKey struct{ N int }

https://go.dev/play/p/ZCi3fVm8hr2

@leohhhn
Copy link
Contributor

leohhhn commented Apr 2, 2024

@thehowl can we bring this point up again? Would be great for the UX if we can use a native map type.

@thehowl
Copy link
Member Author

thehowl commented May 8, 2024

@leohhhn I won't have time for a while to deep-dive on this before test4, most likely. This is very significant, and coming up with a good PR will at least take a week of focused work; I want to close some of the existing "threads" of work I'm following before going after this

@thehowl
Copy link
Member Author

thehowl commented Oct 30, 2024

Interesting: https://github.com/rsc/omap

@salmad3 salmad3 mentioned this issue Nov 17, 2024
21 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: 🌟 Wanted for Launch
Status: Coffee Slot
Development

No branches or pull requests

3 participants