memdb: prevent iterator invalidation #1563
base: master
Signed-off-by: ekexium <eke@fastmail.com>
8ac9c7a to 1a6d100
faf4853 to 4c16a14
916238e to e1a3b5a
74a0617 to 29fc98e
7faf0e1 to 941d94b
internal/unionstore/memdb_rbt.go
Outdated
@@ -175,3 +201,12 @@ func (db *rbtDBWithContext) SnapshotIterReverse(upper, lower []byte) Iterator {
func (db *rbtDBWithContext) SnapshotGetter() Getter {
	return db.RBT.SnapshotGetter()
}

func (db *rbtDBWithContext) BatchedSnapshotIter(lower, upper []byte, reverse bool) Iterator {
	// TODO: implement this
Should this comment be removed?
The comment means that the batched iterator still has to be implemented. Currently it is only an alias for the original snapshot iterator. I've modified the comment to make that clear.
	checkReverse(newArtDBWithContext(), 64)
}

func TestBatchedSnapshotIterEdgeCase(t *testing.T) {
Do we have a case that covers the memdb being changed between two fillbatch operations, where an error should be reported?
internal/unionstore/memdb_art.go
Outdated
	if it.db.GetSnapshot() != it.snapshot {
		return errors.Errorf(
			"snapshot changed between batches, expected=%v, actual=%v",
			it.snapshot,
			it.db.GetSnapshot(),
		)
	}

	it.db.RLock()
	defer it.db.RUnlock()
There seems to be a possible race condition between the GetSnapshot check and RLock; it seems we need CAS-like semantics.
I reconsidered this, and it looks like we don't need to store and compare snapshots at all: by its definition, SnapshotSeqNo is enough to guarantee that the "snapshot" doesn't change.
A race on SnapshotSeqNo should be considered a separate problem, though.
We presume there cannot be concurrent accesses to SnapshotSeqNo, as it should only be written in staging methods, which are performed when finishing statements.
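The race raised in this thread comes from checking a version outside the lock and reading inside it. A minimal sketch of one way to close that gap, assuming a hypothetical `versionedDB` type (none of these names come from the PR): the check and the read happen under the same read lock, so no writer can slip in between them.

```go
package main

import (
	"errors"
	"sync"
)

// versionedDB is a hypothetical stand-in for a memdb guarded by an RWMutex.
// seqNo is bumped on every write that changes what readers may observe.
type versionedDB struct {
	mu    sync.RWMutex
	seqNo int
	data  map[string]string
}

func newVersionedDB() *versionedDB {
	return &versionedDB{data: map[string]string{}}
}

// version returns the current sequence number.
func (d *versionedDB) version() int {
	d.mu.RLock()
	defer d.mu.RUnlock()
	return d.seqNo
}

// write stores a value and bumps the sequence number under the write lock.
func (d *versionedDB) write(key, val string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.data[key] = val
	d.seqNo++
}

// readAt validates the expected sequence number and reads the value
// inside one critical section, so the check cannot go stale before the read.
func (d *versionedDB) readAt(expected int, key string) (string, error) {
	d.mu.RLock()
	defer d.mu.RUnlock()
	if d.seqNo != expected {
		return "", errors.New("version changed since the reader was created")
	}
	return d.data[key], nil
}
```

This is a lock-based alternative to a CAS loop, shown only to illustrate the ordering problem; the PR itself ends up relying on SnapshotSeqNo instead.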
@@ -52,6 +52,13 @@ type ART struct {
	lastTraversedNode atomic.Uint64
	hitCount          atomic.Uint64
	missCount         atomic.Uint64

	// The counter of every write operation, used to invalidate iterators that were created before the write operation.
	WriteSeqNo int
WriteSeqNo does not seem to have the concurrency invariant that SnapshotSeqNo below has. Should an atomic variable be used for it?
The purpose of the sequence number is to detect interleaved read and write operations, not concurrent access. Even if we change it to an atomic variable, concurrent reads and writes can still corrupt the iterator.
If a data race exists on this variable, there must be a bug on the caller side.
I've updated the comment to explain the rationale.
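The interleaving check described here can be sketched as follows. This is a simplified illustration, not the ART implementation: `store`, `iter`, and the sorted-slice layout are hypothetical, and only the idea of comparing a write counter captured at iterator creation is taken from the discussion.

```go
package main

import "sort"

// store is a hypothetical single-threaded ordered key-value store.
type store struct {
	keys       []string          // kept sorted
	vals       map[string]string // key -> value
	writeSeqNo int               // bumped by every write operation
}

func newStore() *store {
	return &store{vals: map[string]string{}}
}

// Set inserts or updates a key and bumps the write counter.
func (s *store) Set(k, v string) {
	if _, ok := s.vals[k]; !ok {
		i := sort.SearchStrings(s.keys, k)
		s.keys = append(s.keys, "")
		copy(s.keys[i+1:], s.keys[i:])
		s.keys[i] = k
	}
	s.vals[k] = v
	s.writeSeqNo++
}

// iter remembers the write counter it saw at creation time.
type iter struct {
	s     *store
	pos   int
	seqNo int
}

func (s *store) Iter() *iter {
	return &iter{s: s, seqNo: s.writeSeqNo}
}

// check panics if any write happened after the iterator was created.
func (it *iter) check() {
	if it.seqNo != it.s.writeSeqNo {
		panic("iterator invalidated by a write after its creation")
	}
}

func (it *iter) Valid() bool { it.check(); return it.pos < len(it.s.keys) }
func (it *iter) Key() string { it.check(); return it.s.keys[it.pos] }
func (it *iter) Next()       { it.check(); it.pos++ }
```

The counter is a plain `int` on purpose: it catches a write that happens between two iterator calls on the same goroutine. It does not make concurrent reads and writes safe, which matches the point made above.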
Trying to sort out the current issues:
- memdb has concurrent usage scenarios, such as a pessimistic lock setting key flags while a membuf read is performing a memdb snapshot read.
- The underlying implementation of memdb does not support concurrency and must be used by the upper layer with proper mutex protection. This is the issue being addressed by the PR "executor: in-txn statement read from MemBuffer's snapshot" (pingcap/tidb#59219).
- A memdb iterator can be invalidated by interleaving writes, which is why this PR introduces a mechanism like the write sequence number.
Implementation level:
lastTraversedNode, hitCount, and missCount, several internal states of ART, are implemented as atomic variables. These states are modified by write operations, and the WriteSeqNo below will also be modified by write operations, requiring the upper layer to handle concurrency correctly.
This part of the code is confusing. Should we make the internal states of ART single-threaded, and let concurrent operations and protection be handled at the upper layer of MemDB? Then there would be no need to consider data races or introduce atomic variables in the ART code.
/cc @you06
One difference between WriteSeqNo and lastTraversedNode is that the latter is updated during read operations. Concurrent read operations are allowed (by design?). So even if we acquire an RLock on memdb for every read operation, lastTraversedNode still needs to be an atomic variable.
If the traversal in snapshot-get does not cache the last-traversed node (maybe it is not worth caching for a snapshot-get), I think art.lastTraversedNode can also be a non-atomic variable, but hitCount and missCount should be atomic.
	addr, lf := snap.tree.traverse(key, false)
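To illustrate why state updated on the read path needs atomics even when every read holds an RLock, here is a minimal sketch. The `cache` type and `readConcurrently` helper are hypothetical; only the field names hitCount and missCount echo the ART struct, and `atomic.Uint64` requires Go 1.19 or later.

```go
package main

import (
	"sync"
	"sync/atomic"
)

// cache is a hypothetical read-mostly structure. Readers share an RLock,
// so they may run in parallel and must not touch plain counters.
type cache struct {
	mu        sync.RWMutex
	data      map[string]string
	hitCount  atomic.Uint64
	missCount atomic.Uint64
}

func newCache(data map[string]string) *cache {
	return &cache{data: data}
}

// Get looks up a key under the read lock and records a hit or a miss.
// The counters are written while other readers may hold the same RLock,
// which is why they are atomic.
func (c *cache) Get(k string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.data[k]
	if ok {
		c.hitCount.Add(1)
	} else {
		c.missCount.Add(1)
	}
	return v, ok
}

// readConcurrently issues readsEach lookups of key from each goroutine
// and waits for all of them to finish.
func readConcurrently(c *cache, goroutines, readsEach int, key string) {
	var wg sync.WaitGroup
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < readsEach; i++ {
				c.Get(key)
			}
		}()
	}
	wg.Wait()
}
```

A field that is only written under the exclusive lock, as WriteSeqNo is assumed to be, does not need this treatment.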
a5e313d to b15c57a
ref pingcap/tidb#59153
To prevent potential misuse and iterator invalidation, modify the iterators provided by ART memdb as follows:
1. Iter and IterReverse now come with an extra check: the iterator is invalidated immediately by any write operation to the memdb after its creation. Attempting to use such an invalidated iterator will result in a panic.
2. SnapshotIter and SnapshotIterReverse will be replaced by BatchedSnapshotIter.
   2.1. SnapshotIter differs from Iter in that it can remain valid after write operations, and only becomes invalid if a write operation modifies the "snapshot".
   2.2. We need to introduce BatchedSnapshotIter instead of directly modifying SnapshotIter because SnapshotIter maintains internal states and pointers. Consider a situation where a write operation changes the internal data structure and makes the pointers invalid, while the snapshot should remain valid.
   2.3. SnapshotIter and SnapshotIterReverse are not removed for now, for compatibility.

RBT is unchanged as it is no longer used.
Pipelined MemDB still doesn't support iterators, as before.
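The batching idea in item 2.2 can be sketched as follows. This is a simplified illustration under stated assumptions, not the PR's BatchedSnapshotIter: `memdb`, `batchedIter`, and the sorted-slice storage are hypothetical, and the sketch does not model snapshot visibility or staging. It only shows how copying a batch of entries under a read lock and resuming by key, rather than holding pointers into the structure, lets iteration survive structural changes between batches.

```go
package main

import (
	"sort"
	"sync"
)

type kv struct{ k, v string }

// memdb is a hypothetical ordered store guarded by an RWMutex.
type memdb struct {
	mu   sync.RWMutex
	keys []string // kept sorted
	vals map[string]string
}

func newMemdb() *memdb { return &memdb{vals: map[string]string{}} }

func (d *memdb) Set(k, v string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if _, ok := d.vals[k]; !ok {
		i := sort.SearchStrings(d.keys, k)
		d.keys = append(d.keys, "")
		copy(d.keys[i+1:], d.keys[i:])
		d.keys[i] = k
	}
	d.vals[k] = v
}

// batchedIter holds copies of entries, never positions inside memdb.
type batchedIter struct {
	db        *memdb
	upper     string // exclusive upper bound
	nextKey   string // inclusive lower bound of the next batch
	batch     []kv
	pos       int
	batchSize int // must be > 0
}

func (d *memdb) BatchedIter(lower, upper string, batchSize int) *batchedIter {
	it := &batchedIter{db: d, upper: upper, nextKey: lower, batchSize: batchSize}
	it.fillBatch()
	return it
}

// fillBatch copies up to batchSize entries starting at nextKey while
// holding the read lock, then releases the lock.
func (it *batchedIter) fillBatch() {
	it.db.mu.RLock()
	defer it.db.mu.RUnlock()
	it.batch, it.pos = it.batch[:0], 0
	i := sort.SearchStrings(it.db.keys, it.nextKey)
	for ; i < len(it.db.keys) && len(it.batch) < it.batchSize; i++ {
		k := it.db.keys[i]
		if k >= it.upper {
			break
		}
		it.batch = append(it.batch, kv{k, it.db.vals[k]})
	}
	if n := len(it.batch); n > 0 {
		// Resume strictly after the last copied key.
		it.nextKey = it.batch[n-1].k + "\x00"
	}
}

func (it *batchedIter) Valid() bool   { return it.pos < len(it.batch) }
func (it *batchedIter) Key() string   { return it.batch[it.pos].k }
func (it *batchedIter) Value() string { return it.batch[it.pos].v }

// Next advances within the batch and refills only if the batch was full.
func (it *batchedIter) Next() {
	it.pos++
	if it.pos >= len(it.batch) && len(it.batch) == it.batchSize {
		it.fillBatch()
	}
}
```

In this sketch a write that shifts the underlying slice between two batches does not break the iterator, because the next batch is located again by key.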
Performance
Iterator microbenchmark
TiDB union scan executor
BatchedSnapshotIter
SnapshotIter