-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kv: batch evaluation should operate on a consistent storage snapshot #55461
Labels
A-kv-transactions
Relating to MVCC and the transactional model.
A-storage
Relating to our storage engine (Pebble) on-disk storage.
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
T-kv
KV Team
Comments
nvanbenschoten
added
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
A-storage
Relating to our storage engine (Pebble) on-disk storage.
A-kv-transactions
Relating to MVCC and the transactional model.
labels
Oct 12, 2020
sumeerbhola
added a commit
to sumeerbhola/pebble
that referenced
this issue
Dec 23, 2020
sumeerbhola
added a commit
to sumeerbhola/pebble
that referenced
this issue
Dec 23, 2020
sumeerbhola
added a commit
to cockroachdb/pebble
that referenced
this issue
Dec 23, 2020
sumeerbhola
added a commit
to sumeerbhola/cockroach
that referenced
this issue
Jan 6, 2021
For pebbleBatch and pebbleReadOnly, all iterators without timestamp hints see the same underlying engine state, with no interface change for the caller. This is done by cloning the first created pebble.Iterator. intentInterleavingIter explicitly requests a clone, when it creates two iterators, which ensures the consistency guarantee applies to all Reader implementations. Informs cockroachdb#55461 Informs cockroachdb#41720 Release note: None
sumeerbhola
added a commit
to sumeerbhola/cockroach
that referenced
this issue
Jan 7, 2021
For pebbleBatch and pebbleReadOnly, all iterators without timestamp hints see the same underlying engine state, with no interface change for the caller. This is done by cloning the first created pebble.Iterator. intentInterleavingIter explicitly requests a clone, when it creates two iterators, which ensures the consistency guarantee applies to all Reader implementations. Informs cockroachdb#55461 Informs cockroachdb#41720 Release note: None
sumeerbhola
added a commit
to sumeerbhola/cockroach
that referenced
this issue
Jan 8, 2021
For pebbleBatch and pebbleReadOnly, all iterators without timestamp hints see the same underlying engine state, with no interface change for the caller. This is done by cloning the first created pebble.Iterator. intentInterleavingIter explicitly requests a clone, when it creates two iterators, which ensures the consistency guarantee applies to all Reader implementations. Informs cockroachdb#55461 Informs cockroachdb#41720 Release note: None
craig bot
pushed a commit
that referenced
this issue
Jan 10, 2021
58515: storage: use consistent iterators when needed, and when possible r=sumeerbhola a=sumeerbhola For pebbleBatch and pebbleReadOnly, all iterators without timestamp hints see the same underlying engine state, with no interface change for the caller. This is done by cloning the first created pebble.Iterator. intentInterleavingIter explicitly requests a clone, when it creates two iterators, which ensures the consistency guarantee applies to all Reader implementations. Informs #55461 Informs #41720 Release note: None 58623: sql: implement SQL syntax for REGIONAL BY ROW ON col r=ajstorm a=otan Implement the syntax and aliases for `REGIONAL BY ROW ON col`, which return an unimplemented error as they are not yet implemented. Release note: None Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com> Co-authored-by: Oliver Tan <otan@cockroachlabs.com>
Note to self: evaluation now does use a snapshot, but the snapshot is captured lazily. This issue is asking for the snapshot to be captured eagerly so, for example, if we were to drop latches earlier, we'd be guaranteed to have captured a snapshot before that point. |
sumeerbhola
added a commit
to sumeerbhola/cockroach
that referenced
this issue
Jun 24, 2021
This is relevant for read evaluation cases where we want to release latches before evaluation. The new Reader.PinEngineStateForIterators method would be called before releasing latches. This pinning does not apply to iterators with timestamp hints, similar to how ConsistentIterators does not apply to them. So this does not help ExportRequest evaluation for incremental backups. To address this we would want a guarantee from the caller that the timestamp hints will not change for the lifetime of the Reader that is making a promise of ConsistentIterators. Informs cockroachdb#55461,cockroachdb#66485 Release note: None
sumeerbhola
added a commit
to sumeerbhola/cockroach
that referenced
this issue
Jun 25, 2021
This is relevant for read evaluation cases where we want to release latches before evaluation. The new Reader.PinEngineStateForIterators method would be called before releasing latches. This pinning does not apply to iterators with timestamp hints, similar to how ConsistentIterators does not apply to them. So this does not help ExportRequest evaluation for incremental backups. To address this we would want a guarantee from the caller that the timestamp hints will not change for the lifetime of the Reader that is making a promise of ConsistentIterators. Informs cockroachdb#55461,cockroachdb#66485 Release note: None
craig bot
pushed a commit
that referenced
this issue
Jun 28, 2021
66845: storage: add Reader method to pin iterators at current engine state r=sumeerbhola a=sumeerbhola This is relevant for read evaluation cases where we want to release latches before evaluation. The new Reader.PinEngineStateForIterators method would be called before releasing latches. This pinning does not apply to iterators with timestamp hints, similar to how ConsistentIterators does not apply to them. So this does not help ExportRequest evaluation for incremental backups. To address this we would want a guarantee from the caller that the timestamp hints will not change for the lifetime of the Reader that is making a promise of ConsistentIterators. Informs #55461,#66485 Release note: None 66885: sql: add ReType to resolveAndRequireType to fix vector engine panic r=cucaroach a=cucaroach Fixes #66708 The vector engine needs exact type coercion, specifically when piping computed column values into downstream operators. Without this fix the computed column was left as an int64 instead of cast back to the required int16 type. Exposed by sqlsmith, kudos to @michae2 for the reduce help Release note: None Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com> Co-authored-by: Tommy Reilly <treilly@cockroachlabs.com>
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Jun 30, 2021
This was originally introduced to work around limitations in the `storage.Reader` interface, where only a `RocksDB` instance could be passed to `engine.ExportToSst`. Since then, a lot has changed, and this is no longer needed. Removing this is important, as it appears to undermine cockroachdb#55461 and make cockroachdb#66485 difficult.
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Jul 1, 2021
This change removes the Engine method from EvalContext. Removing this is important, as its existence appears to undermine cockroachdb#55461 and make cockroachdb#66485 difficult. The first place where this was used was in EndTxn's evaluation function. I don't see any reason for this. In fact, we had a TODO to fix this, which we could have addressed years ago. The second place where this was used was in RecomputeStats's evaluation function. There, it was used for two reasons. First, it was used because `storage.Batch` used to not provide a consistent view of data. They now do. It was also used to evade spanset assertions, which this commit addresses in a better way.
craig bot
pushed a commit
that referenced
this issue
Jul 1, 2021
66917: kvcoord: assert sanity when tracking in-flight write r=andreimatei a=andreimatei Release note: None 67093: kv: remove spanset.GetDBEngine r=nvanbenschoten a=nvanbenschoten This was originally introduced to work around limitations in the `storage.Reader` interface, where only a `RocksDB` instance could be passed to `engine.ExportToSst`. Since then, a lot has changed, and this is no longer needed. Removing this is important, as it appears to undermine #55461 and make #66485 difficult. Co-authored-by: Andrei Matei <andrei@cockroachlabs.com> Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
craig bot
pushed a commit
that referenced
this issue
Jul 6, 2021
67121: kv: remove EvalContext.Engine r=nvanbenschoten a=nvanbenschoten This change removes the Engine method from EvalContext. Removing this is important, as its existence appears to undermine #55461 and make #66485 difficult. The first place where this was used was in EndTxn's evaluation function. I don't see any reason for this. In fact, we had a TODO to fix this, which we could have addressed years ago. The second place where this was used was in RecomputeStats's evaluation function. There, it was used for two reasons. First, it was used because `storage.Batch` used to not provide a consistent view of data. They now do. It was also used to evade spanset assertions, which this commit addresses in a better way. 67146: c-deps: remove protobuf r=dt a=dt Release note: none. Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com> Co-authored-by: David Taylor <tinystatemachine@gmail.com>
aayushshah15
added a commit
to aayushshah15/cockroach
that referenced
this issue
Feb 9, 2022
…tion This commit makes it such that we eagerly pin the engine state of the `Reader` created during the evaluation of read-only requests. Generally, reads will hold latches throughout the course of their evaluation (particularly, while they do their `MVCCScan`). Mainly, this commit paves the way for us to move to a world where we avoid holding latches during the MVCCScan. Additionally it also lets us make MVCC garbage collection latchless as described in cockroachdb#55293. There are a few notable changes in this patch: 1. Pinning the engine state eagerly runs into cockroachdb#70974. To resolve this, the closed timestamp of the `Replica` is now captured at the time the `EvalContext` is created, and not during the command evaluation of `QueryResolvedTimestampRequest`. 2. `EvalContext` now has a `ImmutableEvalContext` embedded into it. The `ImmutableEvalContext` is supposed to encapsulate state that must not change after the `EvalContext` is created. The closed timestamp of the replica is part of the `ImmutableEvalContext`. 3. `Replica` no longer fully implements the `EvalContext` interface. Instead, it implements everything but `GetClosedTimestamp()` (which is implemented by `ImmutableEvalContext` instead). Relates to cockroachdb#55293 Resolves cockroachdb#55461 Resolves cockroachdb#70974 Release note: None
aayushshah15
added a commit
to aayushshah15/cockroach
that referenced
this issue
Feb 14, 2022
…tion This commit makes it such that we eagerly pin the engine state of the `Reader` created during the evaluation of read-only requests. Generally, reads will hold latches throughout the course of their evaluation (particularly, while they do their `MVCCScan`). Mainly, this commit paves the way for us to move to a world where we avoid holding latches during the MVCCScan. Additionally it also lets us make MVCC garbage collection latchless as described in cockroachdb#55293. There are a few notable changes in this patch: 1. Pinning the engine state eagerly runs into cockroachdb#70974. To resolve this, the closed timestamp of the `Replica` is now captured at the time the `EvalContext` is created, and not during the command evaluation of `QueryResolvedTimestampRequest`. 2. `EvalContext` now has a `ImmutableEvalContext` embedded into it. The `ImmutableEvalContext` is supposed to encapsulate state that must not change after the `EvalContext` is created. The closed timestamp of the replica is part of the `ImmutableEvalContext`. 3. `Replica` no longer fully implements the `EvalContext` interface. Instead, it implements everything but `GetClosedTimestamp()` (which is implemented by `ImmutableEvalContext` instead). Relates to cockroachdb#55293 Resolves cockroachdb#55461 Resolves cockroachdb#70974 Release note: None
aayushshah15
added a commit
to aayushshah15/cockroach
that referenced
this issue
Mar 23, 2022
…tion This commit makes it such that we eagerly pin the engine state of the `Reader` created during the evaluation of read-only requests. Today, reads will hold latches throughout the course of their evaluation (particularly, while they do their `MVCCScan`). This commit paves the way for us to move to a world where we avoid holding latches during the MVCCScan. Additionally it also lets us make MVCC garbage collection latchless as described in cockroachdb#55293. There are a few notable changes in this patch: 1. Pinning the engine state eagerly runs into cockroachdb#70974. To resolve this, the closed timestamp of the `Replica` is now captured at the time the `EvalContext` is created, and not during the command evaluation of `QueryResolvedTimestampRequest`. 2. `EvalContext` now has a `ImmutableEvalContext` embedded into it. The `ImmutableEvalContext` is supposed to encapsulate state that must not change after the `EvalContext` is created. The closed timestamp of the replica is part of the `ImmutableEvalContext`. 3. `Replica` no longer fully implements the `EvalContext` interface. Instead, it implements everything but `GetClosedTimestamp()` (which is implemented by `ImmutableEvalContext` instead). Relates to cockroachdb#55293 Resolves cockroachdb#55461 Resolves cockroachdb#70974 Release note: None
aayushshah15
added a commit
to aayushshah15/cockroach
that referenced
this issue
Mar 26, 2022
…tion This commit makes it such that we eagerly pin the engine state of the `Reader` created during the evaluation of read-only requests. Today, reads will hold latches throughout the course of their evaluation (particularly, while they do their `MVCCScan`). This commit paves the way for us to move to a world where we avoid holding latches during the MVCCScan. Additionally it also lets us make MVCC garbage collection latchless as described in cockroachdb#55293. There are a few notable changes in this patch: 1. Pinning the engine state eagerly runs into cockroachdb#70974. To resolve this, the closed timestamp of the `Replica` is now captured at the time the `EvalContext` is created, and not during the command evaluation of `QueryResolvedTimestampRequest`. 2. `EvalContext` now has a `ImmutableEvalContext` embedded into it. The `ImmutableEvalContext` is supposed to encapsulate state that must not change after the `EvalContext` is created. The closed timestamp of the replica is part of the `ImmutableEvalContext`. 3. `Replica` no longer fully implements the `EvalContext` interface. Instead, it implements everything but `GetClosedTimestamp()` (which is implemented by `ImmutableEvalContext` instead). Relates to cockroachdb#55293 Resolves cockroachdb#55461 Resolves cockroachdb#70974 Release note: None
aayushshah15
added a commit
to aayushshah15/cockroach
that referenced
this issue
Mar 29, 2022
…tion This commit makes it such that we eagerly pin the engine state of the `Reader` created during the evaluation of read-only requests. Today, reads will hold latches throughout the course of their evaluation (particularly, while they do their `MVCCScan`). This commit paves the way for us to move to a world where we avoid holding latches during the MVCCScan. Additionally it also lets us make MVCC garbage collection latchless as described in cockroachdb#55293. There are a few notable changes in this patch: 1. Pinning the engine state eagerly runs into cockroachdb#70974. To resolve this, the closed timestamp of the `Replica` is now captured at the time the `EvalContext` is created, and not during the command evaluation of `QueryResolvedTimestampRequest`. 2. `EvalContext` now has a `ImmutableEvalContext` embedded into it. The `ImmutableEvalContext` is supposed to encapsulate state that must not change after the `EvalContext` is created. The closed timestamp of the replica is part of the `ImmutableEvalContext`. 3. `Replica` no longer fully implements the `EvalContext` interface. Instead, it implements everything but `GetClosedTimestamp()` (which is implemented by `ImmutableEvalContext` instead). Relates to cockroachdb#55293 Resolves cockroachdb#55461 Resolves cockroachdb#70974 Release note: None
aayushshah15
added a commit
to aayushshah15/cockroach
that referenced
this issue
Mar 31, 2022
…tion This commit makes it such that we eagerly pin the engine state of the `Reader` created during the evaluation of read-only requests. Today, reads will hold latches throughout the course of their evaluation (particularly, while they do their `MVCCScan`). This commit paves the way for us to move to a world where we avoid holding latches during the MVCCScan. Additionally it also lets us make MVCC garbage collection latchless as described in cockroachdb#55293. There are a few notable changes in this patch: 1. Pinning the engine state eagerly runs into cockroachdb#70974. To resolve this, the closed timestamp of the `Replica` is now captured at the time the `EvalContext` is created, and not during the command evaluation of `QueryResolvedTimestampRequest`. 2. `EvalContext` now has a `ImmutableEvalContext` embedded into it. The `ImmutableEvalContext` is supposed to encapsulate state that must not change after the `EvalContext` is created. The closed timestamp of the replica is part of the `ImmutableEvalContext`. 3. `Replica` no longer fully implements the `EvalContext` interface. Instead, it implements everything but `GetClosedTimestamp()` (which is implemented by `ImmutableEvalContext` instead). Relates to cockroachdb#55293 Resolves cockroachdb#55461 Resolves cockroachdb#70974 Release note: None
craig bot
pushed a commit
that referenced
this issue
Apr 5, 2022
…79409 #79427 #79428 #79433 #79444 76312: kvserver, batcheval: pin Engine state during read-only command evaluation r=aayushshah15 a=aayushshah15 This commit makes it such that we eagerly pin the engine state of the `Reader` created during the evaluation of read-only requests. Generally, reads will hold latches throughout the course of their evaluation (particularly, while they do their `MVCCScan`). Mainly, this commit paves the way for us to move to a world where we avoid holding latches during the MVCCScan. Additionally, it also lets us make MVCC garbage collection latchless as described in #55293. There are a few notable changes in this patch: 1. Pinning the engine state eagerly runs into #70974. To resolve this, the closed timestamp of the `Replica` is now captured at the time the `EvalContext` is created, and not during the command evaluation of `QueryResolvedTimestampRequest`. 2. `EvalContext` now has a `ImmutableEvalContext` embedded into it. The `ImmutableEvalContext` is supposed to encapsulate state that must not change after the `EvalContext` is created. The closed timestamp of the replica is part of the `ImmutableEvalContext`. 3. `Replica` no longer fully implements the `EvalContext` interface. Instead, it implements everything but `GetClosedTimestamp()` (which is implemented by `ImmutableEvalContext` instead). Relates to #55293 Resolves #55461 Resolves #70974 Release note: None 78652: sql: implement to_reg* builtins r=otan a=e-mbrown Resolves #77838 This commit implements the `to_regclass`, `to_regnamespace`, `to_regproc`, `to_regprocedure`, `to_regrole`, and `to_regtype` builtins. Release note (<category, see below>): The `to_regclass`, `to_regnamespace`, `to_regproc`, `to_regprocedure`, `to_regrole`, and `to_regtype` builtin functions are now supported, improving compatibility with PostgreSQL. 79022: server/status: add running non-idle jobs metric r=darinpp a=darinpp Previously serverless was using the sql jobs running metric to determine if a tenant process is idle and can be shut down. With the introduction of continiously running jobs this isn't a good indicator anymore. A recent addition is a per job metrics that show running or idle. The auto scaler doesn't care about the individual jobs and only cares about the total number of jobs that a running but haven't reported as being idle. The pull rate is also very high so the retriving all the individual running/idle metrics for each job type isn't optimal. So this PR adds a single metric that just aggregates and tracks the total count of jobs running and not idle. Release justification: Bug fixes and low-risk updates to new functionality Release note: None Will be re-based once #79021 is merged 79157: cli: tweak slow decommission message r=knz a=cameronnunez Release note: None 79313: opt: do not push LIMIT into the scan of a virtual table r=msirek a=msirek Fixes #78578 Previously, a LIMIT operation could be pushed into the scan of a virtual table with an ORDER BY clause. This was inadequate because in-order scans of virtual indexes aren't supported. When an index that should provide the order requested by a query is used, a sort is actually produced under the covers: ``` EXPLAIN(vec) SELECT oid, typname FROM pg_type ORDER BY OID; info ---------------------------------- │ └ Node 1 └ *colexec.sortOp └ *sql.planNodeToRowSource ``` Functions `CanLimitFilteredScan` and `GenerateLimitedScans` are modified to avoid pushing LIMIT operations into ordered scans of virtual indexes. Release justification: Low risk fix for incorrect results in queries involving virtual system tables. Release note (bug fix): LIMIT queries with an ORDER BY clause which scan the index of a virtual system tables, such as `pg_type`, could previously return incorrect results. This is corrected by teaching the optimizer that LIMIT operations cannot be pushed into ordered scans of virtual indexes. 79346: ccl/sqlproxyccl: add rebalancer queue for connection rebalancing r=JeffSwenson a=jaylim-crl #### ccl/sqlproxyccl: add rebalancer queue for rebalance requests This commit adds a rebalancer queue implementation to the balancer component. The queue will be used for rebalance requests for the connection migration work. This is done to ensure a centralized location that invokes the TransferConnection method on the connection handles. Doing this also enables us to limit the number of concurrent transfers within the proxy. Release note: None #### ccl/sqlproxyccl: run rebalancer queue processor in the background The previous commit added a rebalancer queue. This commit connects the queue to the balancer, and runs the queue processor in the background. By the default, we limit up to 100 concurrent transfers at any point in time, and each transfer will be retried up to 3 times. Release note: None Jira issue: CRDB-14727 79362: kv: remove stale comment in processOneChange r=nvanbenschoten a=nvanbenschoten The comment was added in 2fb56bd and hasn't been accurate since 5178559. Jira issue: CRDB-14753 79368: ccl/sqlproxyccl: include DRAINING pods in the directory cache r=JeffSwenson a=jaylim-crl Previously, #67452 removed DRAINING pods from the directory cache. This commit adds that back. The connector will now need to filter for RUNNING pods manually before invoking the balancer. This is needed so that we could track DRAINING pods, and wait until 60 seconds has elapsed before transferring connections away from them. To support that, we also update the Pod's proto definition to include a StateTimestamp field to reprevent that timestamp that the state field was last updated. The plan is to have a polling mechanism every X seconds to check DRAINING pods, and use that information to start migrating connections. Release note: None Jira issue: CRDB-14759 79386: colexec: remove redundant benchmarks r=yuzefovich a=yuzefovich This commit finishes the transition of some of the benchmarks in the colexec package started in 22.1 cycle. Fixes: #75106. Release note: None Jira issue: CRDB-14783 79409: sql: refactor deps tests to use bazel r=yuzefovich a=yuzefovich This commit refactors most `VerifyNoImports` dependency tests in the sql folder to use the newly introduced bazel test utilities. Release note: None Jira issue: CRDB-14814 79427: backupccl: allow cluster restore from different tenant r=dt a=stevendanna This removes a prohibition for cluster restores with mismatched tenant IDs since we believe they are now correct as of #73831 This allows users to take a cluster backup in a tenant and restore it into another tenant. The new tenant_settings table needs special care since it may exist in the source tenant but not the target tenant when the source tenant is the system tenant. In this change, we throw an error in the case of a non-empty tenant_settings table being restored into a non-system tenant. This is a bit user-unfriendly since we detect this error rather late in the restore process. Release note: None Jira issue: CRDB-14844 79428: backupccl: Refactor encryption utility functions into their own file. r=benbardin a=benbardin Release note: None Jira issue: CRDB-14845 79433: sql: use new ALTER TENANT syntax in tests r=stetvendanna a=rafiss Release note: None 79444: roachtest: warmup follower-reads for fixed duration, not fixed number of ops r=nvanbenschoten a=nvanbenschoten Fixes #78596. This change switches the warmup phase of the follower-read roachtest suite from running a fixed number of operations (100) to running for a fixed duration (15s). This should ensure that the single-region variant of the test is given sufficient time to warm up follower reads immediately after one of its nodes is restarted. Before this change, the single-region variant was only being given about 500ms after startup to catch up on the closed timestamp, which made the test flaky. Release justification: testing only Co-authored-by: Aayush Shah <aayush.shah15@gmail.com> Co-authored-by: e-mbrown <ebsonari@gmail.com> Co-authored-by: Darin Peshev <darinp@gmail.com> Co-authored-by: Cameron Nunez <cameron@cockroachlabs.com> Co-authored-by: Mark Sirek <sirek@cockroachlabs.com> Co-authored-by: Jay <jay@cockroachlabs.com> Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com> Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com> Co-authored-by: Steven Danna <danna@cockroachlabs.com> Co-authored-by: Ben Bardin <bardin@cockroachlabs.com> Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
sumeerbhola
added a commit
to sumeerbhola/cockroach
that referenced
this issue
May 13, 2022
We allow a key (with multiple versions) to have multiple intents, under the condition that at most one of the intents is uncommitted. To aid this behavior we introduce a commitMap, that maintains a logical set of TxnIDs of transactions that were "simple" in their behavior, and have committed, where simple is defined as all the following conditions: - No savepoint rollbacks: len(intent.IgnoredSeqNums)==0 - Single epoch, i.e., TxnMeta.Epoch==0 - Never pushed, i.e., TxnMeta.MinTimestamp==TxnMeta.WriteTimestamp For such transactions, their provisional values can be considered committed with the current version and local timestamp, i.e., we need no additional information about the txn other than the TxnID. Adding to commitMap: The earliest a txn can be added to the commitMap is when the transaction is in STAGING and has verified all the conditions for commit. That is, this can be done before the transition to COMMITTED is replicated. For the node that has the leaseholder of the range containing the txn record, this requires no external communication. For other nodes with intents for the txn, one could piggyback this information on the RPCs for async intent resolution, and add to the the commitMap before doing the intent resolution -- this piggybacking would incur 2 consensus rounds of contention. If we are willing to send the RPC earlier, it will be contention for 1 consensus round only. Note that these RPCs should also remove locks from the non-persistent lock table data-structure so should send information about the keys (like in a LockUpdate but won't remove the durable lock state). Removal from commitMap: The commitMap can be considered a cache of TxnIDs. It is helpful to have a txn in the cache until its intents have been resolved. Additionally, latchless intent resolution must pin a txn in the map before it releases latches and unpin when the intent resolution has been applied on the leaseholder. This pinning behavior is needed to ensure the correctness of the in-memory concurrency.lockTable, which must maintain the property that the replicated locks known to it are a subset of the persistent replicated locks. We are assuming here that there is no lockTable on followers. Why "simple-committed": - We don't want to have to coordinate intent resolution of these multiple intents, by mandating that the resolution happen in any particular order. - We want to guarantee that even if the commitMap is cleared (since it is a cache), we can maintain the invariant that a caller iterating over a key sees at most one intent. As we will illustrate below, providing this guarantee requires us to limit the commitMap to only contain simple-committed txns. Consider a key with timestamps t5, t4, t3, t2, t1 in decreasing order and intents for t5, t4, t2, with corresponding txns txn5, ... txn1. We consider the disposition of an intent to be either unknown or simple-committed. In this example, the disposition of the intent for t4 and t2 is simple-committed solely based on the fact that there is at least one version (provisional or otherwise) more recent that the timestamp of the intent. That is, at most one intent, the one for t5, has a disposition that needs to rely on knowledge that is not self-contained in the history. For t5, we must rely on the commitMap to decide whether is unknown or simple-committed. It is possible that some user of the intentInterleavingIter saw t5 as simple-committed and a later user sees it as unknown disposition, if the txn5 got removed from the commitMap -- such regression is harmless since the latter user will simply have to do intent resolution. Note that the intent for t5 could get resolved before those for t4 and t2, and that is also fine since the disposition of t4, t2 stays simple-committed. If txn5 is aborted and the intent for t5 removed, and txn4 is no longer in the commitMap, the disposition of t4 could change to unknown. This is also acceptable, since t5 was only serving as a "local promise" that t4 was committed, which is simply an optimization. There is still a less efficient globally available "promise" that t4 is committed, and intent resolution of t4 is how we will enact that promise. Maintaining the above guarantees requires that historical versions must not be garbage collected without resolving intents. This is acceptable since GC is not latency sensitive. This PR only introduces the changes for the intentInterleavingIter. It excludes: - A sophisticated commitMap data-structure. There are code comments sketching out what properties are desirable. - The iterForKeyVersions implementation used for intent resolution, now that we can have multiple intents. This will be straightforward. - The changes for latchless intent resolution. These should be straightforward once we have a commitMap with pinning support and we extend cockroachdb#55461 to use a consistent storage snapshot for intent resolution. - The KV layer changes to send lists of intent keys for simple-committed txns to the various ranges, so that they can add to their commitMap and remove from the in-memory lockTable. Informs cockroachdb#66867 Release note: None
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
A-kv-transactions
Relating to MVCC and the transactional model.
A-storage
Relating to our storage engine (Pebble) on-disk storage.
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
T-kv
KV Team
Neither
Engine.NewReadOnly
norEngine.NewBatch
actually grabs a stable storage snapshot. Instead, they wait until iterator creation to do so, and then cache the iterator across multiple uses. This leads to two issues.First, it is unclear when the storage snapshot is captured, making it difficult to coordinate with other state changes without more aggressive serialization using latches or the
readOnlyCmdMu
. Ideally, users would be able to grab a read-only/batch, check a few conditions, and then decide whether evaluation should proceed using the storage snapshot. This comes up in the context of replica destruction and in the context of MVCC GC.Second, both of these objects cache multiple types of iterators, including a prefix iterator and a non-prefix iterator. This means that neither reader actually provides a stable storage snapshot at any point. Users have to be ready for inconsistencies to arise between different iterators pulled from the same read-only/batch. This leads to bugs like #47219, where state read from one iterator may not agree with state read from another.
Related Slack thread: https://cockroachlabs.slack.com/archives/CAC6K3SLU/p1602087652111000.
Jira issue: CRDB-3659
The text was updated successfully, but these errors were encountered: