Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix data race related to VersionExists #36

Merged
merged 2 commits into from
Mar 22, 2022

Conversation

p0mvn
Copy link
Member

@p0mvn p0mvn commented Mar 22, 2022

Background

@czarcas7ic reminded me of the race tests in the SDK. So after checking the new snapshot/pruning PR - osmosis-labs/cosmos-sdk#140, I found one failing. Interestingly, it was never failing prior to that PR. Apparently, the refactor allowed us to detect it.

The problem was in VersionExists where we were using an RLock while modifying a common resource. The only other method where using RLock still made sense was AvailableVersions(). However, since it is the only one and is not used frequently, I made a guess that performance overhead from having an RWLock would be greater than from having a regular Lock everywhere:

name                                                                  old time/op    new time/op    delta
Medium/goleveldb-100000-100-16-40/query-no-in-tree-guarantee-fast-16    3.07µs ± 8%    2.97µs ± 3%    ~     (p=0.421 n=5+5)
Medium/goleveldb-100000-100-16-40/query-no-in-tree-guarantee-slow-16    10.0µs ± 3%    10.1µs ± 5%    ~     (p=0.222 n=5+5)
Medium/goleveldb-100000-100-16-40/query-hits-fast-16                     355ns ± 6%     338ns ± 2%  -4.82%  (p=0.016 n=5+5)
Medium/goleveldb-100000-100-16-40/query-hits-slow-16                    13.6µs ± 3%    13.4µs ± 3%    ~     (p=0.421 n=5+5)
Medium/goleveldb-100000-100-16-40/iteration-fast-16                     63.1ms ± 5%    66.9ms ± 6%    ~     (p=0.151 n=5+5)
Medium/goleveldb-100000-100-16-40/iteration-slow-16                      1.28s ±10%     1.23s ± 1%    ~     (p=0.190 n=5+4)
Medium/goleveldb-100000-100-16-40/update-16                              178µs ±21%     172µs ±19%    ~     (p=0.690 n=5+5)
Medium/goleveldb-100000-100-16-40/block-16                              23.6ms ± 4%    23.1ms ± 5%    ~     (p=0.421 n=5+5)

name                                                                  old alloc/op   new alloc/op   delta
Medium/goleveldb-100000-100-16-40/query-no-in-tree-guarantee-fast-16      814B ± 0%      814B ± 0%    ~     (all equal)
Medium/goleveldb-100000-100-16-40/query-no-in-tree-guarantee-slow-16    1.41kB ± 1%    1.41kB ± 1%    ~     (p=0.222 n=5+5)
Medium/goleveldb-100000-100-16-40/query-hits-fast-16                     0.00B          0.00B         ~     (all equal)
Medium/goleveldb-100000-100-16-40/query-hits-slow-16                    1.99kB ± 0%    1.99kB ± 0%    ~     (p=0.714 n=5+5)
Medium/goleveldb-100000-100-16-40/iteration-fast-16                     29.3MB ± 0%    29.3MB ± 0%    ~     (p=0.333 n=5+5)
Medium/goleveldb-100000-100-16-40/iteration-slow-16                      276MB ± 0%     276MB ± 0%  +0.02%  (p=0.029 n=4+4)
Medium/goleveldb-100000-100-16-40/update-16                             51.9kB ± 8%    51.7kB ± 9%    ~     (p=1.000 n=5+5)
Medium/goleveldb-100000-100-16-40/block-16                              6.46MB ± 3%    6.33MB ± 1%    ~     (p=0.151 n=5+5)

name                                                                  old allocs/op  new allocs/op  delta
Medium/goleveldb-100000-100-16-40/query-no-in-tree-guarantee-fast-16      16.0 ± 0%      16.0 ± 0%    ~     (all equal)
Medium/goleveldb-100000-100-16-40/query-no-in-tree-guarantee-slow-16      24.0 ± 0%      24.0 ± 0%    ~     (all equal)
Medium/goleveldb-100000-100-16-40/query-hits-fast-16                      0.00           0.00         ~     (all equal)
Medium/goleveldb-100000-100-16-40/query-hits-slow-16                      34.0 ± 0%      33.6 ± 2%    ~     (p=0.333 n=4+5)
Medium/goleveldb-100000-100-16-40/iteration-fast-16                       523k ± 0%      523k ± 0%    ~     (p=0.238 n=4+5)
Medium/goleveldb-100000-100-16-40/iteration-slow-16                      4.71M ± 0%     4.71M ± 0%    ~     (p=0.200 n=4+4)
Medium/goleveldb-100000-100-16-40/update-16                                559 ±15%       562 ±17%    ~     (p=0.841 n=5+5)
Medium/goleveldb-100000-100-16-40/block-16                               72.5k ± 3%     71.9k ± 2%    ~     (p=0.548 n=5+5)

Confirmed that this change in conjunction with another SDK side change fixes the race.

This might be related to the app hash bug because this VersionExists method is used during commit, some queries, and proofs. However, it is impossible to tell for sure if this was the true cause.

The failing race test:

roman@akhtariev:~/projects/cosmos-sdk (roman/snap-prune)$ go test -timeout 30s -run ^TestLoadVersionPruning$ github.com/cosmos/cosmos-sdk/baseapp -race
==================
WARNING: DATA RACE
Read at 0x00c000e81ce0 by goroutine 49:
  runtime.mapaccess2_fast64()
      /usr/local/go/src/runtime/map_fast64.go:52 +0x0
  github.com/cosmos/iavl.(*MutableTree).VersionExists()
      /home/roman/projects/cosmos-sdk/vendor/github.com/cosmos/iavl/mutable_tree.go:78 +0x104
  github.com/cosmos/cosmos-sdk/store/iavl.(*Store).VersionExists()
      /home/roman/projects/cosmos-sdk/store/iavl/store.go:160 +0x51
  github.com/cosmos/cosmos-sdk/store/iavl.(*Store).GetImmutable()
      /home/roman/projects/cosmos-sdk/store/iavl/store.go:108 +0x54
  github.com/cosmos/cosmos-sdk/store/iavl.(*Store).Export()
      /home/roman/projects/cosmos-sdk/store/iavl/store.go:239 +0x4b
  github.com/cosmos/cosmos-sdk/store/rootmulti.(*Store).Snapshot.func2()
      /home/roman/projects/cosmos-sdk/store/rootmulti/store.go:729 +0x68e

Previous write at 0x00c000e81ce0 by goroutine 36:
  runtime.mapassign_fast64()
      /usr/local/go/src/runtime/map_fast64.go:92 +0x0
  github.com/cosmos/iavl.(*MutableTree).VersionExists()
      /home/roman/projects/cosmos-sdk/vendor/github.com/cosmos/iavl/mutable_tree.go:83 +0x1b2
  github.com/cosmos/iavl.(*MutableTree).SaveVersion()
      /home/roman/projects/cosmos-sdk/vendor/github.com/cosmos/iavl/mutable_tree.go:701 +0x14a
  github.com/cosmos/cosmos-sdk/store/iavl.(*Store).Commit()
      /home/roman/projects/cosmos-sdk/store/iavl/store.go:127 +0x139
  github.com/cosmos/cosmos-sdk/store/rootmulti.(*Store).commitStores()
      /home/roman/projects/cosmos-sdk/store/rootmulti/store.go:968 +0x18b
  github.com/cosmos/cosmos-sdk/store/rootmulti.(*Store).Commit()
      /home/roman/projects/cosmos-sdk/store/rootmulti/store.go:406 +0x153
  github.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).Commit()
      /home/roman/projects/cosmos-sdk/baseapp/abci.go:308 +0x303
  github.com/cosmos/cosmos-sdk/baseapp.TestLoadVersionPruning()
      /home/roman/projects/cosmos-sdk/baseapp/baseapp_test.go:437 +0xa14
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1259 +0x22f
  testing.(*T).Run·dwrap·21()
      /usr/local/go/src/testing/testing.go:1306 +0x47

Goroutine 49 (running) created at:
  github.com/cosmos/cosmos-sdk/store/rootmulti.(*Store).Snapshot()
      /home/roman/projects/cosmos-sdk/store/rootmulti/store.go:696 +0x729
  github.com/cosmos/cosmos-sdk/snapshots.(*Manager).Create()
      /home/roman/projects/cosmos-sdk/snapshots/manager.go:158 +0x217
  github.com/cosmos/cosmos-sdk/snapshots.(*Manager).snapshot()
      /home/roman/projects/cosmos-sdk/snapshots/manager.go:317 +0x168
  github.com/cosmos/cosmos-sdk/snapshots.(*Manager).SnapshotIfApplicable·dwrap·9()
      /home/roman/projects/cosmos-sdk/snapshots/manager.go:306 +0x47

Goroutine 36 (running) created at:
  testing.(*T).Run()
      /usr/local/go/src/testing/testing.go:1306 +0x726
  testing.runTests.func1()
      /usr/local/go/src/testing/testing.go:1598 +0x99
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1259 +0x22f
  testing.runTests()
      /usr/local/go/src/testing/testing.go:1596 +0x7ca
  testing.(*M).Run()
      /usr/local/go/src/testing/testing.go:1504 +0x9d1
  main.main()
      _testmain.go:147 +0x22b
==================
--- FAIL: TestLoadVersionPruning (0.02s)
    testing.go:1152: race detected during execution of test
FAIL
FAIL    github.com/cosmos/cosmos-sdk/baseapp    0.523s
FAIL

@ValarDragon
Copy link
Member

Oh wow great catch, version Exists was writing during the read lock! Ok sweet, LGTM, could fix the problem! Lets try to get into a release with the snapshot manager ASAP?

@p0mvn p0mvn merged commit 2b3c847 into dev/iavl_data_locality Mar 22, 2022
@p0mvn
Copy link
Member Author

p0mvn commented Mar 22, 2022

Yes, updating the SDK side right now. The rest of the comments are addressed

p0mvn added a commit that referenced this pull request Mar 28, 2022
* fix data race related to VersionExists

* use regular lock instead of RW in mutable_tree.go
faddat pushed a commit to faddat/iavl that referenced this pull request Apr 11, 2022
* Add unbounded key string to KeyFormat

* Add test vectors for unbounded length keys

* Add some notes

* update .gitignore

* Add FastNode struct

* WIP: make Get work with new FastNode

* when retrieving fastnode fails, return errors vs. panic

* add comments clarifying what index represents

* make the linter happy

* Add small tweaks to fast_node.go

* add TODO & small linter tweaks

* Update comment

* update fast node cache in set

* add debugging output when falling back to original logic

* return error instead of panic

* WIP: refactor set ops to work with fast store

* update Set of mutable tree, begin unit testing

* update GetVersioned to check fast nodes before trying the immutable

* check fast node version before nil value check in get of immutable tree

* fix small bugs and typos, continue writing unit tests for Set

* unit test saveFastNodeVersion

* simplify storing unsaved fast nodes

* resolve a bug with not writing prefix for fast node to disk

* remove fast nodes from disk on save and clear fast cache when version is deleted, fix all tests but random and with index

* resolve an issue with randomized tests caused by the fast node cache not being cleared when latest version is saved

* split unsaved fast node changes into additions and removals

* save fast node removals

* move fast node cache clearing to nodedb

* use regular logic only when fast node version is greater than immutable tree'

* clean up tree_random_test.go

* clear unsaved fast node removals on rollback

* fix randomized test failures caused by a typo in ndb DeleteVersion for loop

* implement GetFast method to preserve Get with correct index return, convert unit tests from Get to GetFast where applicable

* ensure Get and GetFast return the same values in tree_random_test.go

* test fast node cache is live in random tree tests

* improve mutable tree unit tests related to new functionality

* clean up tree_test.go

* implement GetVersionedFast to preserve the index in GetVersioned

* restore accidentally deleted test in mutable tree test

* spurious whitespace

* refactor mutable tree

* fix comment in mutable tree

* add benchmark results

* avoid redundant full tree search in GetFast of immutable tree when fast node is nil and tree is latest

* fix naming for bench test get on random keys

* use method for get latestversio in get fast

* optimize GetFast, perform a refactor to ensure that fast nodes on disk are matched and better test

* add latest bench

* Fast Node Iteration (osmosis-labs#7)

* propagate errors from nodedb and avoid panicking

* begin implementing fast node iteration

* resolve rebase issue in random tests

* fix iteration to deserialize fast node for extracting the correct value

* finalzie iteration and let all unit tests pass

* add benchmarks

* merge GetVersioned and GetVersionedFast

* remove fast node deletion from DeleteVersion and DeleteVersionRange and benchmark

* fix and unit test iteration on mutable and immutable trees

* implement tests for iterator and iterate, begin testing fast iterator

* fix and unit test fast iterator

* refactor iterate methods of mutable and immutable trees

* resolve some warnings

* remove old bench results

* refactor bench tests for iteration

* Fast Cache Migration (osmosis-labs#9)

* implement nodedb changes to set and get chain version from the database

* implement and unit test upgrade to fast storage in mutable tree

* refactor for auto upgrade to fast version in mutable tree contructor, load version and lazy load version

* use proper functionality for getting latest version

* remove unused error

* fix comment

* use fast version value in tests

* spurious tab

* fix style problems and remove redundant code in tests

* Rename Get to GetWithIndex and GetFast to Get

* refactor and clean up bench tests

* remove extra byte from fast node length

* clean up immutable tree

* refactor nil tree or ndb error for the iterators and their tests

* avoid exporting methods for getting unsaved additions and removals

* refactor fast upgrade to read from tree instead of traversing disk nodes and orphans, update unit tests

* remove unneeded comment

* refer to storage version consistently across the project

* fix more warnings

* optimize removal of fast nodes from cache on deletion

* small changes in teh mutable tree

* correct storage version key

* auto set fast version in SaveVersion

* avoid exporting new methods of the immutable tree

* go fmt

* Fix comment in fast iterator

Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>

* add extra comment for domain in fast iterator

Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>

* add comments for moving the iterator before the first element

* add comment for describing what mirror is in assertIterator of testutils

* fix checking the mirror for descending iterator in tests

* Update testutils_test.go with a comment

Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>

* Update benchmarks/bench_test.go with a comment for runKnownQueriesFast

Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>

* Update benchmarks/bench_test.go with a comment for runQueriesFast

Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>

* export IsFastCacheEnabled and add an assert in bench tests

* Update comment immutable_tree.go

Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>

* Update comment for migration in mutable_tree.go

Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>

* simlify Iterate in mutable tree, add debug log for

* Fast Cache - Downgrade - reupgrade protection and other improvements (osmosis-labs#12)

* add leaf hash to fast node and unit test

* refactor get with index and get by index, fix migration in load version and lazy load version

* use Get in GetVersioned of mutable tree

* refactor non membership proof to use fast storage if available

* bench non-membership proof

* fix bench tests to work with the new changes

* add downgrade-reupgrade protection and unit test

* remove leaf hash from fast node

* resolve multithreading bug related to iterators not being closed

* clean up

* use correct tree in bench tests

* add cache to tree used to bench non membership proofs

* add benc tests for GetWithIndex and GetByIndex

* revert GetWithIndex and GetByIndex

* remove unused import

* unit test re-upgrade protection and fix small issues

* remove redundant setStorageVersion method

* fix bug with appending to live stage version to storage version and nit test

* add comment for setFastStorageVersionToBatch

* refactor and improve unit tests for reupgrade protection

* rename ndb's isFastStorageEnabled to hasUpgradedToFastStorage and add comments

* comment out new implementation for GetNonMembershipProof

* update comments in nodedb to reflect the difference between hasUpgradedToFastStorage and shouldForceFastStorageUpdate

* refactor nodedb tests

* downgrade tendermint to 0.34.14 - osmosis's latest cosmos sdk does not support 0.35.0

* fix bug where fast storage was not enabled when version 0 was attempted to be loaded

* implement unsaved fast iterator to be used in mutable tree (osmosis-labs#16)

* address comments from unsaved fast iterator PR

* expose isUpgradeable method on mutable tree and unit test (osmosis-labs#17)

* expose isUpgradeable method on mutable tree and unit test

* go fmt

* resolve problems with rebasing

* update CHANGELOG.md

* tendermint v0.35.0

* fix String() bench

* fix duplication lint in iterator_test.go

* fix lint for tree.ndb.DeleteFastNode error return not checked

* fix Error return value of `ndb.resetBatch` is not checked

* fix Error return value of `ndb.traversePrefix` is not checked

* fix Error: struct of size 64 bytes could be of size 56 bytes

* fix Error: `comitted` is a misspelling of `committed`

* fix issues in basic_test.go

* address comments in fast_iterator.go

* address comments in immutable tree

* address comments in iterator.go

* address comments in key_format.go

* address remaining comments

* fix Error: Error return value of `ndb.batch.Write` is not checked

* fix Error: receiver name t should be consistent with previous receiver name tree for MutableTree

* fix Error: struct of size 48 bytes could be of size 40 bytes

* go fmt

* more linter fixes

* fix remaining linter problems

* upgrade tm-db and comment out broken bencher databases

* skip iterations for BenchmarkLevelDBLargeData bench

* attempt to fix linter

* force GC, no cache during migration, auto heap profile (osmosis-labs#19)

* force GC, no cache during migration, auto heap profile

* resolve a potential deadlock from racing between reset and stop

* fix small lint issue

* remove logs and pprof logic

* remove unused libraries

* add comment explaining the reason for RAM optimizations

* sync access to fast node cache to avoid concurrent write fatal error (osmosis-labs#23)

* update go.mod and go.sum to match master versions

* revert osmosis-labs#23 (sync access to fast node cache), fix bug related to old height export (osmosis-labs#33)

* Revert "sync access to fast node cache to avoid concurrent write fatal error (osmosis-labs#23)"

This reverts commit 2a1daf4.

* return correct iterator in mutable tree

* fix concurrent map panic when querying and comittting concurrently

* avoid clearing fast node cache during pruning (osmosis-labs#35)

* fix data race related to VersionExists (osmosis-labs#36)

* fix data race related to VersionExists

* use regular lock instead of RW in mutable_tree.go

* hardcode fast node cache size to  100k

* go fmt

* restore proof_ics23.go

* fix linter

Co-authored-by: ValarDragon <dojha12@gmail.com>
Co-authored-by: jtieri <37750742+jtieri@users.noreply.github.com>
Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>
Co-authored-by: Roman Akhtariev <34196718+akhtariev@users.noreply.github.com>
Co-authored-by: Roman <34196718+r0mvn@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants