memdb: prevent iterator invalidation #1563

Open · ekexium wants to merge 11 commits into master from memdb-prevent-iter-invalidation

Conversation

ekexium (Contributor) commented Jan 23, 2025

ref pingcap/tidb#59153

To prevent potential misuse and iterator invalidation, this PR modifies the iterators provided by the ART memdb as follows:

  1. Iter and IterReverse now come with an extra check: an iterator is invalidated immediately by any write operation performed on the memdb after the iterator's creation. Attempting to use an invalidated iterator results in a panic (see the sketch after this description).
  2. SnapshotIter and SnapshotIterReverse will be replaced by BatchedSnapshotIter.
    2.1. SnapshotIter differs from Iter in that it can remain valid after write operations; it only becomes invalid when a write operation modifies the "snapshot".
    2.2. We introduce BatchedSnapshotIter instead of directly modifying SnapshotIter because SnapshotIter maintains internal state and pointers. Consider a situation where a write operation changes the internal data structure and makes those pointers invalid, even though the snapshot itself should remain valid.
    2.3. SnapshotIter and SnapshotIterReverse are kept for now for compatibility.

RBT is unchanged as it is no longer used.
Pipelined MemDB still does not support iterators, as before.
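
For point 1, the check can be pictured as a per-memdb write counter that every iterator captures at creation time. The sketch below is illustrative only (the names memdb, iterator, and writeSeqNo are stand-ins; the PR's actual counter is the WriteSeqNo field on the ART struct shown later in the review):

// Minimal sketch, not the PR's actual code.
package sketch

type memdb struct {
	writeSeqNo int // bumped by every write operation
	// ... tree data omitted
}

type iterator struct {
	db    *memdb
	seqNo int // the memdb's writeSeqNo when the iterator was created
}

// Set stands in for any write operation; bumping the counter invalidates
// every iterator created before this call.
func (db *memdb) Set(key, value []byte) {
	db.writeSeqNo++
	// ... actual insertion omitted
}

// Iter binds the new iterator to the current write sequence number.
func (db *memdb) Iter() *iterator {
	return &iterator{db: db, seqNo: db.writeSeqNo}
}

// Next panics if the memdb has been written to since the iterator was created,
// instead of silently traversing a structure that may have changed underneath it.
func (it *iterator) Next() {
	if it.seqNo != it.db.writeSeqNo {
		panic("iterator invalidated: memdb was written to after the iterator was created")
	}
	// ... advance to the next key
}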

Performance

Iterator microbenchmark

go test -run=^$ -bench=BenchmarkSnapshotIter -benchtime=3s
goos: linux
goarch: amd64
pkg: github.com/tikv/client-go/v2/internal/unionstore
cpu: AMD Ryzen 9 9900X 12-Core Processor            
BenchmarkSnapshotIter/RBT-SnapshotIter-24                     1783           1826587 ns/op             144 B/op               2 allocs/op
BenchmarkSnapshotIter/ART-SnapshotIter-24                     1870           1718771 ns/op             496 B/op              11 allocs/op
BenchmarkSnapshotIter/ART-BatchedSnapshotIter-24              1120           2964170 ns/op          417461 B/op             370 allocs/op
BenchmarkSnapshotIter/ART-ForEachInSnapshot-24                1771           1832084 ns/op             496 B/op              11 allocs/op
PASS
ok          github.com/tikv/client-go/v2/internal/unionstore        14.598s

TiDB union scan executor
BatchedSnapshotIter

go test -run=^$ -bench=BenchmarkUnionScanRead -benchtime=10s
    3999           3063035 ns/op          793197 B/op           17578 allocs/op

SnapshotIter

go test -run=^$ -bench=BenchmarkUnionScanRead -benchtime=10s
    3862           2990516 ns/op          386942 B/op           17434 allocs/op

ti-chi-bot bot commented Jan 23, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from ekexium, ensuring that each of them provides their approval before proceeding. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. dco-signoff: yes Indicates the PR's author has signed the dco. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 23, 2025
Signed-off-by: ekexium <eke@fastmail.com>
@ekexium ekexium force-pushed the memdb-prevent-iter-invalidation branch from 8ac9c7a to 1a6d100 Compare January 23, 2025 08:02
Signed-off-by: ekexium <eke@fastmail.com>
Signed-off-by: ekexium <eke@fastmail.com>
@ekexium ekexium force-pushed the memdb-prevent-iter-invalidation branch from faf4853 to 4c16a14 Compare January 23, 2025 11:30
Signed-off-by: ekexium <eke@fastmail.com>
@ekexium ekexium force-pushed the memdb-prevent-iter-invalidation branch from 916238e to e1a3b5a Compare January 23, 2025 12:27
Signed-off-by: ekexium <eke@fastmail.com>
@ekexium ekexium force-pushed the memdb-prevent-iter-invalidation branch from 74a0617 to 29fc98e Compare January 24, 2025 05:54
@ekexium ekexium marked this pull request as ready for review January 24, 2025 05:54
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 24, 2025
@ekexium ekexium requested review from you06 and cfzjywxk January 24, 2025 05:54
Signed-off-by: ekexium <eke@fastmail.com>
internal/unionstore/arena/arena.go (resolved)
internal/unionstore/art/art.go (outdated, resolved)
internal/unionstore/memdb_art.go (resolved)
internal/unionstore/arena/arena.go (outdated, resolved)
internal/unionstore/memdb_art.go (outdated, resolved)
internal/unionstore/union_store.go (resolved)
Signed-off-by: ekexium <eke@fastmail.com>
Signed-off-by: ekexium <eke@fastmail.com>
@ekexium ekexium force-pushed the memdb-prevent-iter-invalidation branch from 7faf0e1 to 941d94b Compare February 10, 2025 06:56
@@ -175,3 +201,12 @@ func (db *rbtDBWithContext) SnapshotIterReverse(upper, lower []byte) Iterator {
func (db *rbtDBWithContext) SnapshotGetter() Getter {
	return db.RBT.SnapshotGetter()
}

func (db *rbtDBWithContext) BatchedSnapshotIter(lower, upper []byte, reverse bool) Iterator {
	// TODO: implement this
Contributor:

Should this comment be removed?

Contributor Author (ekexium):

The comment means that the batched iterator still needs to be implemented; currently it is only an alias for the original snapshot iterator. I've modified the comment to make this clearer.

	checkReverse(newArtDBWithContext(), 64)
}

func TestBatchedSnapshotIterEdgeCase(t *testing.T) {
Contributor:

Do we have a test case covering the situation where the memdb is changed between two fillBatch operations and an error should be reported?
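
One hypothetical shape for such a case (not code from the PR; it assumes the package's existing test helpers newArtDBWithContext and testify's require, that releasing a staging handle is what advances the snapshot sequence number, and that the iterator refills at least one more batch over this key range):

func TestBatchedSnapshotIterChangedBetweenBatches(t *testing.T) {
	db := newArtDBWithContext()
	for i := byte(0); i < 100; i++ {
		require.Nil(t, db.Set([]byte{i}, []byte{i}))
	}

	iter := db.BatchedSnapshotIter([]byte{0}, []byte{100}, false)
	require.True(t, iter.Valid())
	require.NoError(t, iter.Next()) // consume part of the first batch

	// Change the snapshot between two fillBatch calls (assumption: releasing
	// a staging handle bumps the snapshot sequence number).
	h := db.Staging()
	require.Nil(t, db.Set([]byte{200}, []byte{200}))
	db.Release(h)

	// The next refill should report an error rather than silently switching
	// to a different snapshot.
	var err error
	for iter.Valid() {
		if err = iter.Next(); err != nil {
			break
		}
	}
	require.Error(t, err)
}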

internal/unionstore/memdb_art.go (outdated, resolved)
Comment on lines 246 to 255
	if it.db.GetSnapshot() != it.snapshot {
		return errors.Errorf(
			"snapshot changed between batches, expected=%v, actual=%v",
			it.snapshot,
			it.db.GetSnapshot(),
		)
	}

	it.db.RLock()
	defer it.db.RUnlock()
Contributor:

There seems to be a possible race condition between the GetSnapshot check and RLock; it seems we need CAS-like semantics.

Contributor Author (ekexium), Feb 11, 2025:

I reconsidered this, and it looks like we don't need to store and compare snapshots at all: SnapshotSeqNo is enough to guarantee that the "snapshot" doesn't change, by its definition.

The potential race on SnapshotSeqNo should be considered a separate problem. We presume there cannot be concurrent accesses to SnapshotSeqNo, as it is only written in staging methods, which are performed when finishing statements.
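
A rough sketch of that simplification, using assumed field names rather than the PR's actual code:

// Illustrative only: type and field names are assumptions.
package sketch

import (
	"sync"

	"github.com/pkg/errors"
)

type artDB struct {
	sync.RWMutex
	SnapshotSeqNo uint64 // advanced only by staging methods
}

type batchedSnapshotIter struct {
	db            *artDB
	snapshotSeqNo uint64 // captured when the iterator was created
	keys, values  [][]byte
}

// fillBatch re-checks the sequence number instead of comparing snapshot
// objects: if any staging operation advanced SnapshotSeqNo after the iterator
// was created, the refill fails instead of reading from a different snapshot.
func (it *batchedSnapshotIter) fillBatch() error {
	if it.snapshotSeqNo != it.db.SnapshotSeqNo {
		return errors.Errorf(
			"snapshot changed between batches, expected=%d, actual=%d",
			it.snapshotSeqNo, it.db.SnapshotSeqNo,
		)
	}
	it.db.RLock()
	defer it.db.RUnlock()
	// ... refill it.keys / it.values from the snapshot while holding the read lock
	return nil
}

Under the stated assumption that SnapshotSeqNo is only written when finishing statements, never concurrently with iteration, the sequence-number check does not need to be atomic with taking the read lock.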

Signed-off-by: ekexium <eke@fastmail.com>
@@ -52,6 +52,13 @@ type ART struct {
	lastTraversedNode atomic.Uint64
	hitCount          atomic.Uint64
	missCount         atomic.Uint64

	// The counter of every write operation, used to invalidate iterators that were created before the write operation.
	WriteSeqNo int
Contributor:

WriteSeqNo does not seem to have the same concurrency guarantees as SnapshotSeqNo below; should an atomic variable be used for it?

Contributor Author (ekexium), Feb 12, 2025:

The purpose of the sequence number is to detect interleaved read and write operations, not concurrent access. Even if we changed it to an atomic variable, concurrent reads and writes could still corrupt the iterator. If there is a data race on this variable, there must be a bug on the caller side. I've updated the comment to explain the rationale.
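
Concretely, the misuse the counter is designed to catch is sequential and single-goroutine, e.g. (illustrative usage, assuming db is an ART-backed memdb with the usual MemBuffer-style Iter/Set signatures; error handling omitted):

// Hypothetical misuse example; an atomic WriteSeqNo would not help here
// because nothing runs concurrently.
func misuse(db *artDBWithContext) {
	iter, _ := db.Iter(nil, nil)           // the iterator captures the current WriteSeqNo
	_ = db.Set([]byte("k1"), []byte("v1")) // the write bumps WriteSeqNo
	_ = iter.Next()                        // panics: the iterator predates the write
}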

Contributor:

Trying to sort out the current issues:

  1. memdb has concurrent usage scenarios, such as a pessimistic lock setting key flags while a membuf read is performing a memdb snapshot read.
  2. The underlying implementation structure of memdb does not support concurrency and needs to be used correctly by the upper layer with mutex protection. This is the issue being addressed by the PR "executor: in-txn statement read from MemBuffer's snapshot" (pingcap/tidb#59219).
  3. The memdb iterator can be invalidated by interleaving writes, which is what the write-sequence mechanism introduced in this PR addresses.

Implementation level:

  1. lastTraversedNode, hitCount, and missCount, several internal states of ART, are implemented as atomic variables. These states are modified by write operations, and the WriteSeqNo below will also be modified by write operations, requiring the upper layer to handle concurrency correctly.

This part of the implementation is confusing. Should we make the internal state of ART single-threaded, and leave concurrent operations and protection to be handled at the upper MemDB layer (see the sketch below)? Then there would be no need to consider data races or introduce atomic variables in the ART code.
/cc @you06
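
One possible shape of that suggestion, sketched with hypothetical type names (the real wrapper type is artDBWithContext, whose actual layout may differ):

// Hypothetical sketch: the ART core keeps plain single-threaded fields (no
// atomics), and the wrapping memdb type serializes all access to it.
package sketch

import "sync"

type artCore struct {
	lastTraversedNode uint64 // plain fields: only touched under the wrapper's lock
	hitCount          uint64
	missCount         uint64
	writeSeqNo        int
}

func (t *artCore) set(key, value []byte) error    { return nil }      // ... real insertion omitted
func (t *artCore) get(key []byte) ([]byte, error) { return nil, nil } // ... real lookup omitted

type memdbWrapper struct {
	sync.RWMutex
	tree artCore
}

// Writes take the exclusive lock, so the core never sees concurrent mutation.
func (db *memdbWrapper) Set(key, value []byte) error {
	db.Lock()
	defer db.Unlock()
	return db.tree.set(key, value)
}

// Reads take the shared lock. Note that with this layout a read could not
// update lastTraversedNode without a write lock, which is exactly the tension
// around the traversal cache discussed in the reply below.
func (db *memdbWrapper) Get(key []byte) ([]byte, error) {
	db.RLock()
	defer db.RUnlock()
	return db.tree.get(key)
}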

Contributor Author (ekexium), Feb 12, 2025:

One difference between WriteSeqNo and lastTraversedNode is that the latter is updated during read operations. Concurrent read operations are allowed (by design?), so even if we acquire an RLock on the memdb for every read operation, lastTraversedNode still needs to be an atomic variable.

Contributor:

If the traversal in snapshot-get does not cache the last-traversed node (maybe it's not worth caching for a snapshot-get), I think art.lastTraversedNode can also be a non-atomic variable, but hitCount and missCount should remain atomic.

addr, lf := snap.tree.traverse(key, false)

Signed-off-by: ekexium <eke@fastmail.com>
@ekexium ekexium force-pushed the memdb-prevent-iter-invalidation branch from a5e313d to b15c57a Compare February 12, 2025 08:14
Signed-off-by: ekexium <eke@fastmail.com>
@ti-chi-bot ti-chi-bot bot requested a review from you06 February 12, 2025 08:50