fix(baseapp): fix race condition in state #11102

troian · 2022-02-02T19:19:42Z

The state is not thread-safe, so a race condition occurs when RPC and gRPC are both active:

WARNING: DATA RACE
Write at 0x00c002a6cb50 by goroutine 125:
  github.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).BeginBlock()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/cosmos/cosmos-sdk/baseapp/abci.go:187 +0x9b4
  github.com/ovrclk/akash/app.(*AkashApp).BeginBlock()
      <autogenerated>:1 +0x90
  github.com/tendermint/tendermint/abci/client.(*localClient).BeginBlockSync()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/tendermint/tendermint/abci/client/local_client.go:280 +0x120
  github.com/tendermint/tendermint/proxy.(*appConnConsensus).BeginBlockSync()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/tendermint/tendermint/proxy/app_conn.go:81 +0x8c
  github.com/tendermint/tendermint/state.execBlockOnProxyApp()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/tendermint/tendermint/state/execution.go:307 +0x480
  github.com/tendermint/tendermint/state.(*BlockExecutor).ApplyBlock()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/tendermint/tendermint/state/execution.go:140 +0x180
  github.com/tendermint/tendermint/consensus.(*State).finalizeCommit()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/tendermint/tendermint/consensus/state.go:1635 +0xda8
  github.com/tendermint/tendermint/consensus.(*State).tryFinalizeCommit()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/tendermint/tendermint/consensus/state.go:1546 +0x468
  github.com/tendermint/tendermint/consensus.(*State).enterCommit.func1()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/tendermint/tendermint/consensus/state.go:1481 +0x11c
  github.com/tendermint/tendermint/consensus.(*State).enterCommit()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/tendermint/tendermint/consensus/state.go:1519 +0x1264
  github.com/tendermint/tendermint/consensus.(*State).addVote()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/tendermint/tendermint/consensus/state.go:2132 +0x11fc
  github.com/tendermint/tendermint/consensus.(*State).tryAddVote()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/tendermint/tendermint/consensus/state.go:1930 +0x48
  github.com/tendermint/tendermint/consensus.(*State).handleMsg()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/tendermint/tendermint/consensus/state.go:838 +0x51c
  github.com/tendermint/tendermint/consensus.(*State).receiveRoutine()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/tendermint/tendermint/consensus/state.go:782 +0x5a0

Previous read at 0x00c002a6cb50 by goroutine 145:
  github.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).createQueryContext()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/cosmos/cosmos-sdk/baseapp/abci.go:648 +0x3c4
  github.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).RegisterGRPCServer.func1()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/cosmos/cosmos-sdk/baseapp/grpcserver.go:50 +0x2c0
  github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25 +0x78
  github.com/grpc-ecosystem/go-grpc-middleware/recovery.UnaryServerInterceptor.func1()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/grpc-ecosystem/go-grpc-middleware/recovery/interceptors.go:33 +0xb8
  github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25 +0x78
  github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:34 +0xf4
  github.com/cosmos/cosmos-sdk/x/bank/types._Query_AllBalances_Handler()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/cosmos/cosmos-sdk/x/bank/types/query.pb.go:943 +0x1bc
  github.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).RegisterGRPCServer.func2()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/cosmos/cosmos-sdk/baseapp/grpcserver.go:80 +0x124
  google.golang.org/grpc.(*Server).processUnaryRPC()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/google.golang.org/grpc/server.go:1210 +0x11b4
  google.golang.org/grpc.(*Server).handleStream()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/google.golang.org/grpc/server.go:1533 +0xed8
  google.golang.org/grpc.(*Server).serveStreams.func1.2()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/google.golang.org/grpc/server.go:871 +0xa4

Signed-off-by: Artur Troian troian.ap@gmail.com

troian · 2022-02-02T19:20:54Z

@AmauryM continuing the discussion from #10997.

amaury1093

This LGTM. Did you try concurrent gRPC queries along with the other PR?

tac0turtle · 2022-02-03T11:52:45Z

Is this issue present on master? I think it's present in the other PR because we added gRPC routing to the client.

amaury1093

I can't replicate this data race on 0.44.5, 0.45 or master.

Here's the simple script I used (run a node in another terminal):

#!/usr/bin/env bash
set -e

# Set variables
CFG_DIR=~/.simapp
BUILD_CMD=./build/simd
ALICE=alice
CHAIN_ID=my-chain

ALICE_ADDRESS=$($BUILD_CMD keys show $ALICE -a --keyring-backend test)

for i in {1..10000}
do
   echo $i
   grpcurl -plaintext -d "{\"address\":\"$ALICE_ADDRESS\"}" localhost:9090 cosmos.bank.v1beta1.Query/AllBalances &
done

And with some additional logging in the app, I can see that BeginBlock and grpc queries are handled concurrently.

Let's investigate a bit more before making this R4R. @troian Are you seeing data races on v0.44.5, or only after your PR?

troian · 2022-02-03T13:05:13Z

@AmauryM yes. we see it on 0.44.5.
sorry, I don't have much time for now to detail the issue on master

this PR with #10997 fixes the race issue for us.

I'm not sure this runs in parallel. Isn't it just 10k sequential requests?

for i in {1..10000}
do
   echo $i
   grpcurl -plaintext -d "{\"address\":\"$ALICE_ADDRESS\"}" localhost:9090 cosmos.bank.v1beta1.Query/AllBalances &
done

amaury1093 · 2022-02-03T15:43:10Z

yes. we see it on 0.44.5.

OK. Is there a way you can make a repro? e.g. a script or some code to make the data race happen

I'm not sure this runs in parallel,

It's in parallel, because of the & at the end.

troian · 2022-02-03T15:51:36Z

It's in parallel, because of the & at the end.

It just detaches from the parent process; it may finish before the next test starts.

Also, I'm not sure this test will reproduce the issue. Take a look at my comment with the stack trace in #10997: RPC and gRPC are trying to access a non-thread-safe object.

amaury1093 · 2022-02-03T17:23:41Z

OK. Could you explain/paste a script on how you reproduced the data race?

What I'm trying to understand is how we didn't find this data race in previous 0.44.x testing phases.

amaury1093 · 2022-02-04T14:57:47Z

Note: It seems that there's maybe even another data race in v0.44.5: #11114 :(

amaury1093

After #11117 I got convinced there are data race issues in v0.44+, so overall I am okay to merge this PR 👍. It seems Query is indeed potentially accessing checkState at the same time as BeginBlock is modifying it.

I never personally reproduced the data race though. @troian Do you think you can provide a small test like https://github.com/cosmos/cosmos-sdk/pull/11117/files#diff-9ab8b6b1ae348e450f51d4a110e504d9aee67d848a997128a39f914a0acfa7f7R807?

amaury1093 · 2022-02-08T11:26:32Z

baseapp/state.go

+// WithContext update context of the state
+func (st *state) WithContext(ctx sdk.Context) {
+	defer st.lock.Unlock()
+	st.lock.Lock()


Let's also add a changelog entry

peterbourgon · 2022-02-09T01:31:29Z

baseapp/state.go

+	defer st.lock.RUnlock()
+	st.lock.RLock()


Suggested change

-	defer st.lock.RUnlock()
-	st.lock.RLock()
+	st.lock.RLock()
+	defer st.lock.RUnlock()

comment why?

You have to lock the mutex before you can unlock it :)
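The conventional order can be sketched as follows. This is a minimal illustration on a hypothetical `counter` type (not the PR's `state` type): acquire the lock first, then defer the release. Because a deferred call only runs when the function returns, either order releases the lock at the same point in normal flow, but lock-then-defer reads in the order things actually happen.

```go
package main

import (
	"fmt"
	"sync"
)

// counter is a hypothetical stand-in for a type guarded by a RWMutex.
type counter struct {
	mu sync.RWMutex
	n  int
}

// Get acquires the read lock first, then defers its release.
func (c *counter) Get() int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.n
}

// Set takes the exclusive lock for the write.
func (c *counter) Set(n int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.n = n
}

func main() {
	c := &counter{}
	c.Set(42)
	fmt.Println(c.Get()) // prints: 42
}
```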

peterbourgon · 2022-02-09T01:32:23Z

baseapp/state.go

@@ -17,5 +20,15 @@ func (st *state) CacheMultiStore() sdk.CacheMultiStore {

 // Context returns the Context of the state.
 func (st *state) Context() sdk.Context {
+	defer st.lock.RUnlock()
+	st.lock.RLock()
+
 	return st.ctx


This is not actually safe, as the Context type has fields which have reference semantics.
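A minimal sketch of the concern, using a hypothetical `fakeCtx` struct rather than the real sdk.Context (whose fields differ): returning the struct under a read lock only protects the copy itself. Any map, slice, or pointer field inside the copy still refers to the same underlying data, which can then be touched outside the lock.

```go
package main

import (
	"fmt"
	"sync"
)

// fakeCtx is a hypothetical stand-in for a context-like struct.
type fakeCtx struct {
	height int64             // plain value: duplicated by a struct copy
	attrs  map[string]string // reference semantics: shared after a copy
}

type state struct {
	mu  sync.RWMutex
	ctx fakeCtx
}

// Context returns a shallow copy of the struct under the read lock.
func (s *state) Context() fakeCtx {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.ctx
}

func main() {
	s := &state{ctx: fakeCtx{height: 1, attrs: map[string]string{"k": "v1"}}}
	c := s.Context()

	// Both writes happen outside any lock, through the returned copy.
	c.height = 2        // changes only the copy
	c.attrs["k"] = "v2" // changes the map the original still points to

	fmt.Println(s.ctx.height, s.ctx.attrs["k"]) // prints: 1 v2
}
```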

@peterbourgon it seems the Context type needs some sort of deep copy.
Thoughts?

The more I dig into it, the more I am convinced this is a dirty hack that will eventually blow up.
Take BeginBlock, for example:
if there are two servers listening (socket and gRPC), they may (and will) call BeginBlock simultaneously.
Correct me if I'm getting it wrong.

Yep! I think you're right.

As far as I can see, BaseApp's exported methods (including but not limited to BeginBlock) can absolutely be called by concurrent goroutines. This means that they must ensure that anything they read or write is synchronized. But that's not happening. BaseApp, and many, many, many other components in the SDK, permit unsynchronized reads and writes on their encapsulated values, and consequently violate Go's memory model. Many of these soundness errors remain undetected or overlooked because the current, specific execution paths happen to not trigger them most of the time.

There are a lot of pathological issues in the Cosmos SDK, as well as the sdk.Context type specifically, which make this kind of bug hard to fix.

In this case, at a high level: contexts are supposed to be request-scoped, but here — and in many other places, too — the context value is long-lived. I might be missing something, but that seems to be a clear design error. Neither a state nor a BaseApp nor anything else with a lifetime beyond an individual request should maintain a context value.

More concretely: as with most types in the SDK, methods on the sdk.Context are — probably incorrectly — defined on a value receiver. That means every method call creates and operates on a (shallow) copy of the original value. This thrashes the GC, but more importantly it makes synchronization of any field with reference semantics more or less impossible. Even if you have a mutex in the type — which the context doesn't — those mutexes would get copied, and so wouldn't actually provide mutual exclusion on the values they'd be meant to protect.
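The point about copied mutexes can be illustrated with a small sketch. The `guarded` type is hypothetical (sdk.Context itself has no mutex), and `TryLock` assumes Go 1.18 or later. A method on a value receiver locks the mutex inside its own copy, so the original is never actually locked.

```go
package main

import (
	"fmt"
	"sync"
)

// guarded is a hypothetical type used only to show the copy semantics.
type guarded struct {
	mu sync.Mutex
	n  int
}

// lockByValue has a value receiver, so it locks the mutex inside a copy.
// This is a deliberate anti-pattern; go vet's copylocks check reports it.
func (g guarded) lockByValue() {
	g.mu.Lock() // locks the copy; the caller's mutex is untouched
}

// lockByPointer has a pointer receiver, so it locks the caller's mutex.
func (g *guarded) lockByPointer() {
	g.mu.Lock()
}

func main() {
	var g guarded

	g.lockByValue()
	fmt.Println(g.mu.TryLock()) // prints: true (the original was never locked)
	g.mu.Unlock()

	g.lockByPointer()
	fmt.Println(g.mu.TryLock()) // prints: false (the original is locked)
	g.mu.Unlock()
}
```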

And then there's all of the basically unsolvable problems created by the SDK's endemic misuse of panic as an error handling mechanism. But that's an entirely different discussion.

I don't see how to fix the data race without addressing these problems. Probably at least a few more, too.

--

I certainly haven't done a deep-dive on this code, and so I'm not speaking from an informed place. But based on what I do understand, it seems that the right approach to fixing this problem is to eliminate the state type altogether, including the deliverState and checkState fields in the BaseApp, in order to eliminate the long-lived context value. Then, look at whatever was writing-to and reading-from those state values to figure out what stuff they actually needed from the contexts. Capture that information specifically, in a separate and synchronized type, in the BaseApp struct.
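One possible shape for such a "separate and synchronized type" is sketched below. Everything here is an assumption for illustration: the `blockInfo` fields are guesses at what a query might need, `infoStore` is a hypothetical name, and `atomic.Pointer` requires Go 1.19 or later. The idea is to publish an immutable snapshot of plain values, so readers never share mutable data with the writer.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// blockInfo holds only plain values, so copying it shares nothing.
// The fields are illustrative, not what BaseApp actually needs.
type blockInfo struct {
	Height  int64
	ChainID string
}

// infoStore is a hypothetical holder for the latest published snapshot.
type infoStore struct {
	cur atomic.Pointer[blockInfo]
}

// Set publishes a new snapshot; bi is a private copy that is never
// mutated after it has been stored.
func (s *infoStore) Set(bi blockInfo) {
	s.cur.Store(&bi)
}

// Get returns a copy of the latest snapshot, or false if none exists yet.
func (s *infoStore) Get() (blockInfo, bool) {
	p := s.cur.Load()
	if p == nil {
		return blockInfo{}, false
	}
	return *p, true
}

func main() {
	store := &infoStore{}
	var wg sync.WaitGroup

	// One writer publishes snapshots, as a block-processing path might.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for h := int64(1); h <= 100; h++ {
			store.Set(blockInfo{Height: h, ChainID: "my-chain"})
		}
	}()

	// Several readers load snapshots concurrently, as queries might.
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 100; j++ {
				_, _ = store.Get()
			}
		}()
	}

	wg.Wait()
	info, _ := store.Get()
	fmt.Println(info.Height) // prints: 100
}
```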

Or, I dunno. Maybe I got it all wrong.

thanks @peterbourgon for looking over and confirming
I'll give it some thinking

As the SDK exists today, it's not meant to or designed to be executed concurrently -- we assume Tendermint places the relevant locks in its reactors prior to executing ABCI calls, which it does. The issue arises here, at least as far as I can tell, from direct client gRPC queries being executed while the state machine is executing ABCI call(s) that can contain various writes, which is really outside the scope or domain of Tendermint.

So while I see the idea proposed here with Context(), I don't think it's the correct approach, although I do appreciate the efforts @troian!

We need to take a step back and think of a different approach to allowing direct gRPC queries while the state machine is executing ABCI calls. For simplicity's sake, forget Tendermint even exists at this point. I think there are two ways we can protect reads and writes:

Either by taking a revised approach to the use of the state context as you attempted (maybe it just needs a bit more thought), OR

Using a RW mutex on BaseApp itself, where most ABCI calls use a write lock and we only obtain a read lock upon Query.
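The second option can be sketched on a toy type. The `app` struct below is hypothetical and much simpler than BaseApp; it only shows the locking shape: state-mutating calls take the write lock, queries take the read lock and return plain values. One consequence worth keeping in mind is that a long-running query holding the read lock delays any writer waiting on `Lock`.

```go
package main

import (
	"fmt"
	"sync"
)

// app is a hypothetical miniature of a state machine; it is not BaseApp.
type app struct {
	mu       sync.RWMutex
	height   int64
	balances map[string]int64
}

// BeginBlock stands in for an ABCI call that mutates state: write lock.
func (a *app) BeginBlock(height int64) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.height = height
	a.balances["alice"]++ // pretend each block credits alice by one
}

// Query stands in for a gRPC query: read lock, plain values returned.
func (a *app) Query(addr string) (height, balance int64) {
	a.mu.RLock()
	defer a.mu.RUnlock()
	return a.height, a.balances[addr]
}

func main() {
	a := &app{balances: map[string]int64{"alice": 0}}
	var wg sync.WaitGroup
	wg.Add(2)

	go func() {
		defer wg.Done()
		for h := int64(1); h <= 1000; h++ {
			a.BeginBlock(h)
		}
	}()

	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			// Under the lock, height and balance always move together.
			if h, b := a.Query("alice"); h != b {
				panic("query observed a half-applied block")
			}
		}
	}()

	wg.Wait()
	h, b := a.Query("alice")
	fmt.Println(h, b) // prints: 1000 1000
}
```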

Construct a standalone app instance for querying which only shares the low-level db handler?

That also might be an option, but I'm not sure it'll be an app. Rather, we might have to refactor BaseApp#Query.

@alexanderbez is this issue still a thing on later Cosmos SDK versions (0.47+)? Or has someone had the opportunity to address it?

troian · 2022-02-09T01:38:27Z

After #11117 I got convinced there are data race issues in v0.44+, so overall I am okay to merge this PR 👍. It seems Query is indeed potentially accessing checkState at the same time as BeginBlock is modifying it.

I never personally reproduced the data race though. @troian Do you think you can provide a small test like https://github.com/cosmos/cosmos-sdk/pull/11117/files#diff-9ab8b6b1ae348e450f51d4a110e504d9aee67d848a997128a39f914a0acfa7f7R807?

sure, i'll try to make one, cannot promise it very soon tho, quite a busy schedule :(

robert-zaremba · 2022-02-09T09:52:39Z

Is this PR still in draft? We want to make a release today (if possible) and we need this PR merged

Co-authored-by: Peter Bourgon <peterbourgon@users.noreply.github.com>

troian · 2022-02-09T12:33:11Z

@robert-zaremba, as @peterbourgon mentioned, the mutex in Context() does not guard sdk.Context, as it has quite a few values provided by reference. So the PR needs more work.

Without digging in too much, a deep copy seems to be an option, though there might be logic relying on the references.

github-actions · 2022-03-27T00:04:35Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

peterbourgon · 2022-03-29T18:45:48Z

baseapp/state.go

 	return st.ctx
 }
+
+// WithContext update context of the state
+func (st *state) WithContext(ctx sdk.Context) {


WithX methods typically leave the receiver unmodified and return a copy with the requested changes. Should this be e.g. SetContext?
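The naming convention mentioned here can be shown with a short sketch on a hypothetical `settings` type: a `WithX` method returns a modified copy and leaves the receiver alone, while a `SetX` method mutates the receiver in place.

```go
package main

import "fmt"

// settings is a hypothetical type used only to contrast the two styles.
type settings struct {
	name string
}

// WithName leaves the receiver untouched and returns a modified copy.
func (s settings) WithName(name string) settings {
	s.name = name
	return s
}

// SetName mutates the receiver in place.
func (s *settings) SetName(name string) {
	s.name = name
}

func main() {
	a := settings{name: "a"}

	b := a.WithName("b")
	fmt.Println(a.name, b.name) // prints: a b

	a.SetName("c")
	fmt.Println(a.name, b.name) // prints: c b
}
```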

robert-zaremba · 2022-03-30T11:55:56Z

@troian any hope you can continue working on this PR?

troian · 2022-03-30T12:10:01Z

@robert-zaremba I don't think this PR will be merged, as some serious thread-safety issues have been revealed.
Check out my conversation with @peterbourgon above.

tac0turtle · 2022-05-09T10:34:42Z

@AmauryM could you turn this PR into an issue then we close it for the time being?

github-actions · 2022-07-24T00:06:13Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

fix(baseapp): fix race condition in state (60d40fa)

amaury1093 reviewed Feb 3, 2022
amaury1093 suggested changes Feb 3, 2022
amaury1093 approved these changes Feb 8, 2022
amaury1093 reviewed Feb 8, 2022
peterbourgon reviewed Feb 9, 2022 (baseapp/state.go, outdated)
peterbourgon reviewed Feb 9, 2022

comment why? (56b1d4f)
Co-authored-by: Peter Bourgon <peterbourgon@users.noreply.github.com>

github-actions bot added the stale label Mar 27, 2022
troian removed the stale label Mar 27, 2022
peterbourgon reviewed Mar 29, 2022
troian mentioned this pull request Apr 5, 2022: feat: use gRPC for queries #10997 (Closed, 3 tasks)
adu-crypto mentioned this pull request Jun 3, 2022: Problem: slow queries can slow down consensus state machine evmos/ethermint#1007 (Closed)
github-actions bot added the stale label Jul 24, 2022
github-actions bot closed this Jul 31, 2022
tac0turtle deleted the fix-race-queries branch February 16, 2023 00:50

amaury1093 Feb 8, 2022

peterbourgon Feb 9, 2022

troian Feb 9, 2022

peterbourgon Feb 9, 2022

peterbourgon Feb 9, 2022

troian Feb 9, 2022

troian Mar 30, 2022

troian Mar 30, 2022

peterbourgon Mar 30, 2022 •

edited

Loading

troian Mar 30, 2022

alexanderbez Jun 7, 2022

yihuang Jun 8, 2022

alexanderbez Jun 8, 2022

troian Aug 11, 2023
