-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(baseapp): fix race condition in state #11102
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,12 +1,15 @@ | ||
package baseapp | ||
|
||
import ( | ||
"sync" | ||
|
||
sdk "github.com/cosmos/cosmos-sdk/types" | ||
) | ||
|
||
type state struct { | ||
ms sdk.CacheMultiStore | ||
ctx sdk.Context | ||
lock sync.RWMutex | ||
ms sdk.CacheMultiStore | ||
ctx sdk.Context | ||
} | ||
|
||
// CacheMultiStore calls and returns a CacheMultiStore on the state's underling | ||
|
@@ -17,5 +20,15 @@ func (st *state) CacheMultiStore() sdk.CacheMultiStore { | |
|
||
// Context returns the Context of the state. | ||
func (st *state) Context() sdk.Context { | ||
defer st.lock.RUnlock() | ||
st.lock.RLock() | ||
|
||
return st.ctx | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is not actually safe, as the Context type has fields which have reference semantics. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ouch There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @peterbourgon it seems the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. tho more I dig into it - the more I am convinced this is a dirty hack and eventually it will blow There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yep! I think you're right. As far as I can see, BaseApp's exported methods — including but not limited to BeginBlock — can absolutely be called by concurrent goroutines. This means that they must ensure that anything they read or write is synchronized. But that's not happening. BaseApp —and many, many, many other components in the SDK — permit unsynchronized reads and writes on their encapsulated values, and consequently violate Go's memory model. Many of these soundness errors remain undetected or overlooked because the current, specific execution paths happen to not trigger them most of the time. There are a lot of pathological issues in the Cosmos SDK, as well as the sdk.Context type specifically, which make this kind of bug hard to fix. In this case, at a high level: contexts are supposed to be request-scoped, but here — and in many other places, too — the context value is long-lived. I might be missing something, but that seems to be a clear design error. Neither a state nor a BaseApp nor anything else with a lifetime beyond an individual request should maintain a context value. More concretely: as with most types in the SDK, methods on the sdk.Context are — probably incorrectly — defined on a value receiver. That means every method call creates and operates on a (shallow) copy of the original value. This thrashes the GC, but more importantly it makes synchronization of any field with reference semantics more or less impossible. Even if you have a mutex in the type — which the context doesn't — those mutexes would get copied, and so wouldn't actually provide mutual exclusion on the values they'd be meant to protect. And then there's all of the basically unsolvable problems created by the SDK's endemic misuse of I don't see how to fix the data race without addressing these problems. Probably at least a few more, too. -- I certainly haven't done a deep-dive on this code, and so I'm not speaking from an informed place. But based on what I do understand, it seems that the right approach to fixing this problem is to eliminate the Or, I dunno. Maybe I got it all wrong. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. thanks @peterbourgon for looking over and confirming There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As the SDK exists today, it's not meant to or designed to be executed concurrently -- we assume Tendermint places the relevant locks in it's reactors prior to executing ABCI calls, which it does. The issue arrises here, at least as far as I can tell, from direct client gRPC queries being executed while the state machine is executing ABCI call(s) that can contain various writes, which is really outside the scope or domain of Tendermint. So while I see the idea proposed here with We need to take a step back and think of a different approach to allowing direct gRPC queries while the state machine is executing ABCI calls. For simplicity's sake, forget Tendermint even exists at this point. I think there are two ways we can protect reads and writes:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That also might be an option, but im not sure it'll be an app. Rather We might have to refactor There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @alexanderbez is this issue still a thing on later cosmos sdk 0.47+? or someone had opportunity to address it? |
||
} | ||
|
||
// WithContext update context of the state | ||
func (st *state) WithContext(ctx sdk.Context) { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 💯 |
||
st.lock.Lock() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Let's also add a changelog entry |
||
defer st.lock.Unlock() | ||
st.ctx = ctx | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have to lock the mutex before you can unlock it :)