feat: add adr-001 for node key refactoring #608

Merged on Feb 21, 2023 (29 commits)
Changes from 6 commits
4f9a004
add adr
cool-develope Nov 1, 2022
8646a9a
small fix
cool-develope Nov 1, 2022
6e5b081
remove child hashes
cool-develope Nov 1, 2022
4ee7804
small fix
cool-develope Nov 1, 2022
9468ee2
add migration
cool-develope Nov 1, 2022
0c2d610
add pruning
cool-develope Nov 1, 2022
1459a18
Update docs/architecture/adr-001-node-key-refactoring.md
cool-develope Nov 1, 2022
d8d0bf9
Update docs/architecture/adr-001-node-key-refactoring.md
cool-develope Nov 1, 2022
88885f1
suggestions
cool-develope Nov 1, 2022
5272de4
suggestions
cool-develope Nov 3, 2022
006d76c
update the struct
cool-develope Nov 4, 2022
7cc7280
Update docs/architecture/adr-001-node-key-refactoring.md
cool-develope Nov 8, 2022
4e6044f
Update adr-001-node-key-refactoring.md
cool-develope Nov 8, 2022
0bf7486
orphans
cool-develope Nov 9, 2022
a0dcc0e
Merge branch 'master' into 592/adr
cool-develope Nov 9, 2022
ee11dff
revert removing root store
cool-develope Nov 10, 2022
d9f7d2e
path update
cool-develope Nov 30, 2022
10184ac
small fix
cool-develope Nov 30, 2022
5f94844
Update adr-001-node-key-refactoring.md
cool-develope Nov 30, 2022
85a90e7
Merge branch 'master' into 592/adr
cool-develope Nov 30, 2022
8fb87b6
small fix
cool-develope Dec 2, 2022
6bbf7f9
add prune method
cool-develope Dec 2, 2022
204d881
Update docs/architecture/adr-001-node-key-refactoring.md
cool-develope Jan 18, 2023
f743311
Merge branch 'master' into 592/adr
cool-develope Jan 18, 2023
bf85d92
resolve conflicts
cool-develope Feb 17, 2023
b064ede
Update adr-001-node-key-refactoring.md
cool-develope Feb 17, 2023
6095bc4
Merge branch 'master' into 592/adr
cool-develope Feb 17, 2023
7873267
Merge branch 'master' into 592/adr
cool-develope Feb 21, 2023
fb0f182
comments
cool-develope Feb 21, 2023
2 changes: 2 additions & 0 deletions docs/architecture/README.md
@@ -20,3 +20,5 @@ If recorded decisions turned out to be lacking, convene a discussion, record the
and then modify the code to match.

## ADR Table of Contents

- [ADR 001: Node Key Refactoring](./adr-001-node-key-refactoring.md)
96 changes: 96 additions & 0 deletions docs/architecture/adr-001-node-key-refactoring.md
@@ -0,0 +1,96 @@
# ADR-001: Node Key Refactoring

## Changelog

- 2022-10-31: First draft

## Status

Proposed

## Context

IAVL currently uses the hash of a node as its key, which does not take advantage of data locality on disk. Because hash values are effectively random, nodes are stored at random locations on disk, and finding a node requires a random disk read.

The `orphans` track the nodes removed in the current version so that `DeleteVersion` can delete the nodes removed in a specific version from the disk. They must be tracked on every tree update and require extra storage, yet `DeleteVersion` has only two use cases: rolling back the tree and removing unnecessary old nodes.

## Decision

- Use a sequenced integer ID as the node key, stored in `bigendian(nodeKey)` format.
Collaborator

How is nodeKey computed? I remember it's something like "version+path".

Collaborator Author

No, it is just a sequenced integer.
For example, we assign tree.nonce + 1 as the nodeKey of the new node every time we create a node.

Collaborator @yihuang, Nov 1, 2022

It seems better to use "version/seq", where seq is only unique inside the version.

  • no need to store the nonces at all? Can iterate versions directly, rollback would be trivial.
  • prefix compression of low level db will help to reduce key size.
  • make root node identified with a well known seq number, so we can find it by version directly?

Collaborator Author

here, the nodeKey is unique globally.

Collaborator @yihuang, Nov 1, 2022

"version/seq" is globally unique as well, seq itself is locally unique within the version. For each version, the nodes are written in a batch anyway, we only need to maintain the seq in memory.

Member

> I know there are some advantages of version|seq, I am just worried it means we need to update leftNodeKey and rightNodeKey following this update and it leads to extra encode/decode executions and finally requires more storage to store the node data

Why do we need to update the left and right node keys?

Collaborator Author

here, leftNodeKey and rightNodeKey refer to the children node key in the db. If we update the node key as version|seq, we should also update left and right node keys, right?

Member

Meaning that every time we have a new version we update all referenced node keys to refer to that version rather than the original version where they were created? That seems really complex. I feel like I'm maybe not understanding something really basic here

Collaborator Author

@yihuang , I agree with you. If we keep the local nonce as int32, there is not much increase in storage size because we can remove the version from node body writes.

Collaborator @yihuang, Nov 3, 2022

Yeah, I think the size difference should be minimal, assuming version is int64:

  • node key, 8bytes vs 8+4bytes, but with prefix compression, the amortized difference should be smaller, local nonce only use 2 bytes in most cases, so the 10 bytes prefix is shared and compressed away.

  • node body, we can remove the version field, save a varint(version) + 1 tag byte, usually 4 to 5bytes for production chains.

  • leftNodeKey and rightNodeKey, varint(global nonce) vs varint(version << 32 + local nonce), or varint(version) + varint(local nonce) + 1 more byte to tag the extra field.
    Say there's 1000000 blocks, and 2000 new nodes for each block, the new global nonce would be 1000000 * 2000, the two encodings:

    • global nonce: len(varint(1000000 * 2000)) == 5
    • two fields: len(varint(1000000)) + len(varint(2000)) + 1 == 6
    • one field: len(varint('uint64', (1000000<<32)+2000)) == 8

    I think two fields wins, the difference is only 1 byte, and no bitwise operations:

    leftNodeVersion: uint64
    leftNodeNonce: uint32
    rightNodeVersion: uint64
    rightNodeNonce: uint32
    

- Remove the `leftHash` and `rightHash` fields, and store the node's own `hash` field instead.
- Remove the `version` field from the node structure.
- Remove the `orphans` from the tree.
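
As a minimal sketch of the `bigendian(nodeKey)` format, the following Go snippet encodes an `int64` key as 8 big-endian bytes. The helper names `encodeNodeKey`, `decodeNodeKey`, and `ordered` are hypothetical, and the sketch assumes node keys are non-negative.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// encodeNodeKey serializes a sequenced node key as 8 big-endian bytes,
// so byte-wise key order in the database matches numeric order.
// Assumption: nodeKey is non-negative.
func encodeNodeKey(nodeKey int64) []byte {
	buf := make([]byte, 8)
	binary.BigEndian.PutUint64(buf, uint64(nodeKey))
	return buf
}

// decodeNodeKey reverses encodeNodeKey.
func decodeNodeKey(b []byte) int64 {
	return int64(binary.BigEndian.Uint64(b))
}

// ordered reports whether the encoding of a sorts before the encoding of b.
func ordered(a, b int64) bool {
	return bytes.Compare(encodeNodeKey(a), encodeNodeKey(b)) < 0
}

func main() {
	for _, k := range []int64{1, 255, 256, 65536} {
		fmt.Printf("%d -> %x\n", k, encodeNodeKey(k))
	}
	fmt.Println(ordered(255, 256)) // prints: true
}
```

Big-endian matters here: with a little-endian layout, 255 would sort after 256 byte-wise, and sequentially created nodes would no longer be adjacent in an ordered key-value store.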

New node structure

```go
type Node struct {
	key   []byte
	value []byte
	hash  []byte // keep this field in the storage
Collaborator @yihuang, Nov 4, 2022

I guess one consequence is that proof generation is slower: previously we could get the child hash directly, now we need to load the child node?

Collaborator Author

We are using the ics23 proof, which requires size, version, and height so I believe it doesn't lead to any delays.

	leftHash     []byte // will remove
	rightHash    []byte // will remove
	nodeKey      int64  // new field, use as a node key
	leftNodeKey  int64  // new field, need to store in the storage
Member

If we are using nonce as nodeKey, how are we getting data locality when traversing or writing the tree to disk?

I thought that the whole idea with the path encoded as a key was that writes would be sequential, contributing to lower frequency of compactions.

Collaborator @yihuang, Nov 3, 2022

The nodes are all immutable, the nodeKey is sequential, so all the new nodes are written sequentially.

Collaborator Author

I know the tree path describes the data locality well. The problem is how to keep the path when rotating the tree.

Collaborator

Don't we need to split the leftNodeKey into leftNodeVersion and leftNodeNonce, and likewise for rightNodeKey?

Collaborator Author

right, I updated the node struct

	rightNodeKey int64  // new field, need to store in the storage
Contributor

Why do we need these fields, if we already have leftNode and rightNode?

Collaborator Author

leftNode and rightNode are only meaningful in memory; we need these fields to load the children from storage.

	version       int64 // will remove
	size          int64
	leftNode      *Node
	rightNode     *Node
	subtreeHeight int8
	persisted     bool
}
```

New tree structure

```go
type MutableTree struct {
	*ImmutableTree
	lastSaved                *ImmutableTree
	nonce                    int64           // new field to track the current ID
	orphans                  map[int64]int64 // will remove
	versions                 map[int64]bool
	allRootLoaded            bool
	unsavedFastNodeAdditions map[string]*fastnode.Node
	unsavedFastNodeRemovals  map[string]interface{}
	ndb                      *nodeDB
	skipFastStorageUpgrade   bool

	mtx sync.Mutex
}
```
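
To illustrate how the `nonce` in the tree and the `leftNodeKey`/`rightNodeKey` fields in the node could work together, here is a minimal sketch. The trimmed `node` and `tree` types, the in-memory `store` map, and the convention that key `0` means "no child" are all assumptions made for this example, not the actual IAVL implementation.

```go
package main

import "fmt"

// node is a trimmed, hypothetical version of the ADR's Node.
type node struct {
	key          string
	nodeKey      int64
	leftNodeKey  int64 // persisted reference; 0 means "no child" in this sketch
	rightNodeKey int64
	leftNode     *node // in-memory cache only
}

// tree keeps the running nonce, as MutableTree does in the ADR.
type tree struct {
	nonce int64
	store map[int64]*node // stand-in for the node database
}

// newNode assigns the next sequential nodeKey and persists the node.
func (t *tree) newNode(key string, left, right int64) *node {
	t.nonce++
	n := &node{key: key, nodeKey: t.nonce, leftNodeKey: left, rightNodeKey: right}
	t.store[n.nodeKey] = n
	return n
}

// left returns the left child, loading it via leftNodeKey when it is not cached.
func (t *tree) left(n *node) *node {
	if n.leftNode == nil && n.leftNodeKey != 0 {
		n.leftNode = t.store[n.leftNodeKey]
	}
	return n.leftNode
}

func main() {
	t := &tree{store: map[int64]*node{}}
	a := t.newNode("a", 0, 0)
	b := t.newNode("b", 0, 0)
	parent := t.newNode("p", a.nodeKey, b.nodeKey)
	fmt.Println(parent.nodeKey, t.left(parent).key) // prints: 3 a
}
```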

## Consequences

### Migration
Collaborator @yihuang, Nov 4, 2022

An alternative way to migrate is to use the state streamer: resync the chain, recreate the new IAVL db from the state changes, then switch.
Iterating the IAVL tree version by version sounds pretty slow, with lots of random db access.

Collaborator

This is a breaking change, so validators need to update at once (at least 67% of them). So we need a migration process anyway.

Collaborator

I think it's not consensus breaking, since the root hashes don't change, validators don't need to update at the same time?

Collaborator

Probably depends on the migration mechanism. I think the one described below won't guarantee the same IAVL tree.

Collaborator

So, today IAVL key is a hash, so this update will change the keys stored in the IAVL, hence it will change an order in the tree, and the tree Merkle Hash.

Collaborator @yihuang, Nov 9, 2022

No, it can migrate IAVL nodes one to one and change the references between them, but the hash of each node does not change.


We can migrate nodes one by one by iterating over the versions.

- Iterate over the versions in order and get the root node of each version.
- Traverse the tree and assign a `nodeKey` to each node whose version equals the version being migrated.
Collaborator

Migrating an archive node and a pruned node will end up with different node keys, but that's probably OK.

Collaborator @yihuang, Nov 4, 2022

Inconsistent node nonces could make the state-sync snapshot inconsistent as well?
Then I think using the path as identity is more deterministic than sequential nonce.

Member

On the global nonce they would be different but version | local nonce should be same, right?

Collaborator @yihuang, Nov 4, 2022

Hmm, in migration, it depends on the traversal order, this is easy to specify, like preorder+ascending. but when handling the insertion, we should also specify the precise logic of nonce assignments, to allow alternative implementations be compatible with each other.

Collaborator Author

I think the inconsistency of nodeKey is not a big problem, the only thing we need is to match the proof.

Member

We can test this on a live network with state sync and a simple migration to see if it works to be sure

Collaborator

> I think the inconsistency of nodeKey is not a big problem, the only thing we need is to match the proof.

It depends on how we export the state sync snapshot, I think. If we re-map the nodeKey, then the stored one doesn't matter.

Collaborator @yihuang, Nov 7, 2022

> Hmm, in migration, it depends on the traversal order, this is easy to specify, like preorder+ascending. but when handling the insertion, we should also specify the precise logic of nonce assignments, to allow alternative implementations be compatible with each other.

I think the nonce assignment should happen in SaveBranch, which should do a preorder ascending traversal and prune the historical branches. So the logic is pretty precise and deterministic here.

Collaborator Author

you are right


Collaborator

  1. Shall we remove the old node?
  2. I don't think we should update past versions. If "past" versions will be recalculated then valid proofs issued for a past version will not work any more. This could be an issue for IBC and relayers. cc: @ebuchman

Collaborator Author

  1. yes, we will provide the pruning functionality, it is useful to remove the old unnecessary nodes and reduce the storage.
  2. We won't update the version itself, and just assign the nonce.

We will implement the `Import` functionality for the original version.
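
The migration steps above can be sketched as follows. This is an illustration under stated assumptions, not the planned implementation: `legacyNode` and `migrate` are hypothetical names, the traversal is preorder (as suggested in the review discussion), versions are assumed to start at 1, and a node older than the version being migrated is assumed to have only older descendants, so its subtree is skipped.

```go
package main

import "fmt"

// legacyNode is a hypothetical stand-in for a node in the old format.
type legacyNode struct {
	version     int64
	left, right *legacyNode
	nodeKey     int64 // 0 until a key is assigned
}

// migrate walks the roots in ascending version order (roots[i] is the root
// of version i+1) and assigns sequential keys, in preorder, to the nodes
// created in that version. It returns the last nonce used.
func migrate(roots []*legacyNode) int64 {
	var nonce int64
	var visit func(n *legacyNode, version int64)
	visit = func(n *legacyNode, version int64) {
		// Nodes from older versions already have a key, and (by assumption)
		// so does everything below them.
		if n == nil || n.version != version {
			return
		}
		nonce++
		n.nodeKey = nonce
		visit(n.left, version)
		visit(n.right, version)
	}
	for i, root := range roots {
		visit(root, int64(i+1))
	}
	return nonce
}

func main() {
	// Version 1: root1 with leaves a and b. Version 2 replaces b with b2.
	a := &legacyNode{version: 1}
	b := &legacyNode{version: 1}
	root1 := &legacyNode{version: 1, left: a, right: b}
	b2 := &legacyNode{version: 2}
	root2 := &legacyNode{version: 2, left: a, right: b2}

	last := migrate([]*legacyNode{root1, root2})
	fmt.Println(last, root1.nodeKey, a.nodeKey, b.nodeKey, root2.nodeKey, b2.nodeKey)
	// prints: 5 1 2 3 4 5
}
```

Because node hashes are not touched, only the references between nodes change, which matches the point made in the discussion that root hashes stay the same.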

### Positive

Using a sequenced integer ID takes advantage of data locality in the bTree, which improves performance. It can also reduce the size of each node in storage.
Collaborator

Nowadays, all the "mainstream" db backends are lsm tree, but it should help there too.

Member

I'm guessing that by bTree we refer to the IAVL tree? However, I agree that "bTree" seems confusing in this context because the alternative to LSM is the first thing that comes to mind

Member

One of the major benefits might also be reduced compactions since we always commit semi-sorted data (by nonce) to the underlying LSM.

Since we commit versioned data, there wouldn't be a need for as many merge operations as today.

Collaborator Author

OK, I will update this part and add more description of LSM


Removing orphans also improves performance and saves memory and storage. It also makes it easy to roll back the tree: because we keep the sequenced segment IDs for each version, we can remove all nodes whose `nodeKey` is greater than a specified integer value.
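
A rollback along these lines might look like the sketch below. The `store` type, its `lastNonce` bookkeeping (version to last `nodeKey` written), and `rollbackTo` are hypothetical names chosen for illustration; a real ordered key-value store would more likely issue a single range delete starting at `bigendian(last+1)`.

```go
package main

import "fmt"

// store is a hypothetical in-memory stand-in for the node database.
type store struct {
	nodes     map[int64][]byte // nodeKey -> serialized node
	lastNonce map[int64]int64  // version -> last nodeKey written in that version
}

// rollbackTo removes every node created after the given version and
// returns how many nodes were deleted.
func (s *store) rollbackTo(version int64) int {
	last, ok := s.lastNonce[version]
	if !ok {
		return 0 // unknown version: nothing to do in this sketch
	}
	removed := 0
	for k := range s.nodes {
		if k > last {
			delete(s.nodes, k)
			removed++
		}
	}
	for v := range s.lastNonce {
		if v > version {
			delete(s.lastNonce, v)
		}
	}
	return removed
}

func main() {
	s := &store{
		nodes:     map[int64][]byte{1: nil, 2: nil, 3: nil, 4: nil, 5: nil},
		lastNonce: map[int64]int64{1: 3, 2: 5},
	}
	fmt.Println(s.rollbackTo(1), len(s.nodes)) // prints: 2 3
}
```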

### Negative

Each node requires extra storage because it must keep `leftNodeKey` and `rightNodeKey` to traverse the tree. In exchange, we can remove the `version`, `leftHash`, and `rightHash` fields from the node and reduce the key size.

Without orphans, we can no longer delete the old nodes of a specific version. We introduce a new way to prune old versions.

For example, when a user wants to prune the previous 500 versions every 1000 blocks:

- Assume pruning is complete up to the `n`th version and the last nonce of the `n`th version is `x`.
- Traverse the tree from the `n+501`th root node and pick only the nodes whose `nodeKey` is in `[first nonce of the (n+1)th version, last nonce of the (n+500)th version]`.
- For those nodes, re-assign the `nodeKey` starting from `x+1`, in order.
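
The pruning steps above can be sketched as follows. This is a simplified illustration: `node` and `prune` are hypothetical names, nodes are linked by pointers rather than by stored keys, "in order" is interpreted as ascending order of the old `nodeKey`, and a node whose key is below the range is assumed to have only older descendants. Updating the parents' `leftNodeKey`/`rightNodeKey` and deleting unreachable nodes in the range are left out.

```go
package main

import (
	"fmt"
	"sort"
)

// node is a trimmed, hypothetical node that links children by pointer.
type node struct {
	nodeKey     int64
	left, right *node
}

// prune walks the tree from root (the first version kept in full), picks the
// reachable nodes with lo <= nodeKey <= hi, and re-keys them from x+1 in
// ascending order of their old keys. It returns the last key assigned.
func prune(root *node, lo, hi, x int64) int64 {
	var picked []*node
	var visit func(n *node)
	visit = func(n *node) {
		// Keys below lo belong to already-pruned versions; skip the subtree.
		if n == nil || n.nodeKey < lo {
			return
		}
		if n.nodeKey <= hi {
			picked = append(picked, n)
		}
		visit(n.left)
		visit(n.right)
	}
	visit(root)

	sort.Slice(picked, func(i, j int) bool { return picked[i].nodeKey < picked[j].nodeKey })
	next := x
	for _, n := range picked {
		next++
		n.nodeKey = next
	}
	return next
}

func main() {
	old := &node{nodeKey: 3}             // from an already-pruned version
	l := &node{nodeKey: 12, left: old}   // inside the pruned range
	r := &node{nodeKey: 18}              // inside the pruned range
	root := &node{nodeKey: 30, left: l, right: r}

	fmt.Println(prune(root, 10, 20, 5), l.nodeKey, r.nodeKey) // prints: 7 6 7
}
```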

## References

- https://github.com/cosmos/iavl/issues/548
- https://github.com/cosmos/iavl/issues/137
- https://github.com/cosmos/iavl/issues/571