kv: batch evaluation should operate on a consistent storage snapshot #55461

nvanbenschoten · 2020-10-12T20:03:48Z

Neither Engine.NewReadOnly nor Engine.NewBatch actually grabs a stable storage snapshot. Instead, they wait until iterator creation to do so, and then cache the iterator across multiple uses. This leads to two issues.

First, it is unclear when the storage snapshot is captured, making it difficult to coordinate with other state changes without more aggressive serialization using latches or the readOnlyCmdMu. Ideally, users would be able to grab a read-only/batch, check a few conditions, and then decide whether evaluation should proceed using the storage snapshot. This comes up in the context of replica destruction and in the context of MVCC GC.

Second, both of these objects cache multiple types of iterators, including a prefix iterator and a non-prefix iterator. This means that neither reader actually provides a stable storage snapshot at any point. Users have to be ready for inconsistencies to arise between different iterators pulled from the same read-only/batch. This leads to bugs like #47219, where state read from one iterator may not agree with state read from another.

Related Slack thread: https://cockroachlabs.slack.com/archives/CAC6K3SLU/p1602087652111000.

Jira issue: CRDB-3659

The text was updated successfully, but these errors were encountered:

Needed for cockroachdb/cockroach#41720 and cockroachdb/cockroach#55461

For pebbleBatch and pebbleReadOnly, all iterators without timestamp hints see the same underlying engine state, with no interface change for the caller. This is done by cloning the first created pebble.Iterator. intentInterleavingIter explicitly requests a clone, when it creates two iterators, which ensures the consistency guarantee applies to all Reader implementations. Informs cockroachdb#55461 Informs cockroachdb#41720 Release note: None

58515: storage: use consistent iterators when needed, and when possible r=sumeerbhola a=sumeerbhola For pebbleBatch and pebbleReadOnly, all iterators without timestamp hints see the same underlying engine state, with no interface change for the caller. This is done by cloning the first created pebble.Iterator. intentInterleavingIter explicitly requests a clone, when it creates two iterators, which ensures the consistency guarantee applies to all Reader implementations. Informs #55461 Informs #41720 Release note: None 58623: sql: implement SQL syntax for REGIONAL BY ROW ON col r=ajstorm a=otan Implement the syntax and aliases for `REGIONAL BY ROW ON col`, which return an unimplemented error as they are not yet implemented. Release note: None Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com> Co-authored-by: Oliver Tan <otan@cockroachlabs.com>

andreimatei · 2021-06-10T18:31:39Z

Note to self: evaluation now does use a snapshot, but the snapshot is captured lazily. This issue is asking for the snapshot to be captured eagerly so, for example, if we were to drop latches earlier, we'd be guaranteed to have captured a snapshot before that point.

This is relevant for read evaluation cases where we want to release latches before evaluation. The new Reader.PinEngineStateForIterators method would be called before releasing latches. This pinning does not apply to iterators with timestamp hints, similar to how ConsistentIterators does not apply to them. So this does not help ExportRequest evaluation for incremental backups. To address this we would want a guarantee from the caller that the timestamp hints will not change for the lifetime of the Reader that is making a promise of ConsistentIterators. Informs cockroachdb#55461,cockroachdb#66485 Release note: None

@michae2

66845: storage: add Reader method to pin iterators at current engine state r=sumeerbhola a=sumeerbhola This is relevant for read evaluation cases where we want to release latches before evaluation. The new Reader.PinEngineStateForIterators method would be called before releasing latches. This pinning does not apply to iterators with timestamp hints, similar to how ConsistentIterators does not apply to them. So this does not help ExportRequest evaluation for incremental backups. To address this we would want a guarantee from the caller that the timestamp hints will not change for the lifetime of the Reader that is making a promise of ConsistentIterators. Informs #55461,#66485 Release note: None 66885: sql: add ReType to resolveAndRequireType to fix vector engine panic r=cucaroach a=cucaroach Fixes #66708 The vector engine needs exact type coercion, specifically when piping computed column values into downstream operators. Without this fix the computed column was left as an int64 instead of cast back to the required int16 type. Exposed by sqlsmith, kudos to @michae2 for the reduce help Release note: None Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com> Co-authored-by: Tommy Reilly <treilly@cockroachlabs.com>

This was originally introduced to work around limitations in the `storage.Reader` interface, where only a `RocksDB` instance could be passed to `engine.ExportToSst`. Since then, a lot has changed, and this is no longer needed. Removing this is important, as it appears to undermine cockroachdb#55461 and make cockroachdb#66485 difficult.

This change removes the Engine method from EvalContext. Removing this is important, as its existence appears to undermine cockroachdb#55461 and make cockroachdb#66485 difficult. The first place where this was used was in EndTxn's evaluation function. I don't see any reason for this. In fact, we had a TODO to fix this, which we could have addressed years ago. The second place where this was used was in RecomputeStats's evaluation function. There, it was used for two reasons. First, it was used because `storage.Batch` used to not provide a consistent view of data. They now do. It was also used to evade spanset assertions, which this commit addresses in a better way.

66917: kvcoord: assert sanity when tracking in-flight write r=andreimatei a=andreimatei Release note: None 67093: kv: remove spanset.GetDBEngine r=nvanbenschoten a=nvanbenschoten This was originally introduced to work around limitations in the `storage.Reader` interface, where only a `RocksDB` instance could be passed to `engine.ExportToSst`. Since then, a lot has changed, and this is no longer needed. Removing this is important, as it appears to undermine #55461 and make #66485 difficult. Co-authored-by: Andrei Matei <andrei@cockroachlabs.com> Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>

67121: kv: remove EvalContext.Engine r=nvanbenschoten a=nvanbenschoten This change removes the Engine method from EvalContext. Removing this is important, as its existence appears to undermine #55461 and make #66485 difficult. The first place where this was used was in EndTxn's evaluation function. I don't see any reason for this. In fact, we had a TODO to fix this, which we could have addressed years ago. The second place where this was used was in RecomputeStats's evaluation function. There, it was used for two reasons. First, it was used because `storage.Batch` used to not provide a consistent view of data. They now do. It was also used to evade spanset assertions, which this commit addresses in a better way. 67146: c-deps: remove protobuf r=dt a=dt Release note: none. Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com> Co-authored-by: David Taylor <tinystatemachine@gmail.com>

…tion This commit makes it such that we eagerly pin the engine state of the `Reader` created during the evaluation of read-only requests. Generally, reads will hold latches throughout the course of their evaluation (particularly, while they do their `MVCCScan`). Mainly, this commit paves the way for us to move to a world where we avoid holding latches during the MVCCScan. Additionally it also lets us make MVCC garbage collection latchless as described in cockroachdb#55293. There are a few notable changes in this patch: 1. Pinning the engine state eagerly runs into cockroachdb#70974. To resolve this, the closed timestamp of the `Replica` is now captured at the time the `EvalContext` is created, and not during the command evaluation of `QueryResolvedTimestampRequest`. 2. `EvalContext` now has a `ImmutableEvalContext` embedded into it. The `ImmutableEvalContext` is supposed to encapsulate state that must not change after the `EvalContext` is created. The closed timestamp of the replica is part of the `ImmutableEvalContext`. 3. `Replica` no longer fully implements the `EvalContext` interface. Instead, it implements everything but `GetClosedTimestamp()` (which is implemented by `ImmutableEvalContext` instead). Relates to cockroachdb#55293 Resolves cockroachdb#55461 Resolves cockroachdb#70974 Release note: None

…tion This commit makes it such that we eagerly pin the engine state of the `Reader` created during the evaluation of read-only requests. Today, reads will hold latches throughout the course of their evaluation (particularly, while they do their `MVCCScan`). This commit paves the way for us to move to a world where we avoid holding latches during the MVCCScan. Additionally it also lets us make MVCC garbage collection latchless as described in cockroachdb#55293. There are a few notable changes in this patch: 1. Pinning the engine state eagerly runs into cockroachdb#70974. To resolve this, the closed timestamp of the `Replica` is now captured at the time the `EvalContext` is created, and not during the command evaluation of `QueryResolvedTimestampRequest`. 2. `EvalContext` now has a `ImmutableEvalContext` embedded into it. The `ImmutableEvalContext` is supposed to encapsulate state that must not change after the `EvalContext` is created. The closed timestamp of the replica is part of the `ImmutableEvalContext`. 3. `Replica` no longer fully implements the `EvalContext` interface. Instead, it implements everything but `GetClosedTimestamp()` (which is implemented by `ImmutableEvalContext` instead). Relates to cockroachdb#55293 Resolves cockroachdb#55461 Resolves cockroachdb#70974 Release note: None

…79409 #79427 #79428 #79433 #79444 76312: kvserver, batcheval: pin Engine state during read-only command evaluation r=aayushshah15 a=aayushshah15 This commit makes it such that we eagerly pin the engine state of the `Reader` created during the evaluation of read-only requests. Generally, reads will hold latches throughout the course of their evaluation (particularly, while they do their `MVCCScan`). Mainly, this commit paves the way for us to move to a world where we avoid holding latches during the MVCCScan. Additionally, it also lets us make MVCC garbage collection latchless as described in #55293. There are a few notable changes in this patch: 1. Pinning the engine state eagerly runs into #70974. To resolve this, the closed timestamp of the `Replica` is now captured at the time the `EvalContext` is created, and not during the command evaluation of `QueryResolvedTimestampRequest`. 2. `EvalContext` now has a `ImmutableEvalContext` embedded into it. The `ImmutableEvalContext` is supposed to encapsulate state that must not change after the `EvalContext` is created. The closed timestamp of the replica is part of the `ImmutableEvalContext`. 3. `Replica` no longer fully implements the `EvalContext` interface. Instead, it implements everything but `GetClosedTimestamp()` (which is implemented by `ImmutableEvalContext` instead). Relates to #55293 Resolves #55461 Resolves #70974 Release note: None 78652: sql: implement to_reg* builtins r=otan a=e-mbrown Resolves #77838 This commit implements the `to_regclass`, `to_regnamespace`, `to_regproc`, `to_regprocedure`, `to_regrole`, and `to_regtype` builtins. Release note (<category, see below>): The `to_regclass`, `to_regnamespace`, `to_regproc`, `to_regprocedure`, `to_regrole`, and `to_regtype` builtin functions are now supported, improving compatibility with PostgreSQL. 79022: server/status: add running non-idle jobs metric r=darinpp a=darinpp Previously serverless was using the sql jobs running metric to determine if a tenant process is idle and can be shut down. With the introduction of continiously running jobs this isn't a good indicator anymore. A recent addition is a per job metrics that show running or idle. The auto scaler doesn't care about the individual jobs and only cares about the total number of jobs that a running but haven't reported as being idle. The pull rate is also very high so the retriving all the individual running/idle metrics for each job type isn't optimal. So this PR adds a single metric that just aggregates and tracks the total count of jobs running and not idle. Release justification: Bug fixes and low-risk updates to new functionality Release note: None Will be re-based once #79021 is merged 79157: cli: tweak slow decommission message r=knz a=cameronnunez Release note: None 79313: opt: do not push LIMIT into the scan of a virtual table r=msirek a=msirek Fixes #78578 Previously, a LIMIT operation could be pushed into the scan of a virtual table with an ORDER BY clause. This was inadequate because in-order scans of virtual indexes aren't supported. When an index that should provide the order requested by a query is used, a sort is actually produced under the covers: ``` EXPLAIN(vec) SELECT oid, typname FROM pg_type ORDER BY OID; info ---------------------------------- │ └ Node 1 └ *colexec.sortOp └ *sql.planNodeToRowSource ``` Functions `CanLimitFilteredScan` and `GenerateLimitedScans` are modified to avoid pushing LIMIT operations into ordered scans of virtual indexes. Release justification: Low risk fix for incorrect results in queries involving virtual system tables. Release note (bug fix): LIMIT queries with an ORDER BY clause which scan the index of a virtual system tables, such as `pg_type`, could previously return incorrect results. This is corrected by teaching the optimizer that LIMIT operations cannot be pushed into ordered scans of virtual indexes. 79346: ccl/sqlproxyccl: add rebalancer queue for connection rebalancing r=JeffSwenson a=jaylim-crl #### ccl/sqlproxyccl: add rebalancer queue for rebalance requests This commit adds a rebalancer queue implementation to the balancer component. The queue will be used for rebalance requests for the connection migration work. This is done to ensure a centralized location that invokes the TransferConnection method on the connection handles. Doing this also enables us to limit the number of concurrent transfers within the proxy. Release note: None #### ccl/sqlproxyccl: run rebalancer queue processor in the background The previous commit added a rebalancer queue. This commit connects the queue to the balancer, and runs the queue processor in the background. By the default, we limit up to 100 concurrent transfers at any point in time, and each transfer will be retried up to 3 times. Release note: None Jira issue: CRDB-14727 79362: kv: remove stale comment in processOneChange r=nvanbenschoten a=nvanbenschoten The comment was added in 2fb56bd and hasn't been accurate since 5178559. Jira issue: CRDB-14753 79368: ccl/sqlproxyccl: include DRAINING pods in the directory cache r=JeffSwenson a=jaylim-crl Previously, #67452 removed DRAINING pods from the directory cache. This commit adds that back. The connector will now need to filter for RUNNING pods manually before invoking the balancer. This is needed so that we could track DRAINING pods, and wait until 60 seconds has elapsed before transferring connections away from them. To support that, we also update the Pod's proto definition to include a StateTimestamp field to reprevent that timestamp that the state field was last updated. The plan is to have a polling mechanism every X seconds to check DRAINING pods, and use that information to start migrating connections. Release note: None Jira issue: CRDB-14759 79386: colexec: remove redundant benchmarks r=yuzefovich a=yuzefovich This commit finishes the transition of some of the benchmarks in the colexec package started in 22.1 cycle. Fixes: #75106. Release note: None Jira issue: CRDB-14783 79409: sql: refactor deps tests to use bazel r=yuzefovich a=yuzefovich This commit refactors most `VerifyNoImports` dependency tests in the sql folder to use the newly introduced bazel test utilities. Release note: None Jira issue: CRDB-14814 79427: backupccl: allow cluster restore from different tenant r=dt a=stevendanna This removes a prohibition for cluster restores with mismatched tenant IDs since we believe they are now correct as of #73831 This allows users to take a cluster backup in a tenant and restore it into another tenant. The new tenant_settings table needs special care since it may exist in the source tenant but not the target tenant when the source tenant is the system tenant. In this change, we throw an error in the case of a non-empty tenant_settings table being restored into a non-system tenant. This is a bit user-unfriendly since we detect this error rather late in the restore process. Release note: None Jira issue: CRDB-14844 79428: backupccl: Refactor encryption utility functions into their own file. r=benbardin a=benbardin Release note: None Jira issue: CRDB-14845 79433: sql: use new ALTER TENANT syntax in tests r=stetvendanna a=rafiss Release note: None 79444: roachtest: warmup follower-reads for fixed duration, not fixed number of ops r=nvanbenschoten a=nvanbenschoten Fixes #78596. This change switches the warmup phase of the follower-read roachtest suite from running a fixed number of operations (100) to running for a fixed duration (15s). This should ensure that the single-region variant of the test is given sufficient time to warm up follower reads immediately after one of its nodes is restarted. Before this change, the single-region variant was only being given about 500ms after startup to catch up on the closed timestamp, which made the test flaky. Release justification: testing only Co-authored-by: Aayush Shah <aayush.shah15@gmail.com> Co-authored-by: e-mbrown <ebsonari@gmail.com> Co-authored-by: Darin Peshev <darinp@gmail.com> Co-authored-by: Cameron Nunez <cameron@cockroachlabs.com> Co-authored-by: Mark Sirek <sirek@cockroachlabs.com> Co-authored-by: Jay <jay@cockroachlabs.com> Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com> Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com> Co-authored-by: Steven Danna <danna@cockroachlabs.com> Co-authored-by: Ben Bardin <bardin@cockroachlabs.com> Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>

We allow a key (with multiple versions) to have multiple intents, under the condition that at most one of the intents is uncommitted. To aid this behavior we introduce a commitMap, that maintains a logical set of TxnIDs of transactions that were "simple" in their behavior, and have committed, where simple is defined as all the following conditions: - No savepoint rollbacks: len(intent.IgnoredSeqNums)==0 - Single epoch, i.e., TxnMeta.Epoch==0 - Never pushed, i.e., TxnMeta.MinTimestamp==TxnMeta.WriteTimestamp For such transactions, their provisional values can be considered committed with the current version and local timestamp, i.e., we need no additional information about the txn other than the TxnID. Adding to commitMap: The earliest a txn can be added to the commitMap is when the transaction is in STAGING and has verified all the conditions for commit. That is, this can be done before the transition to COMMITTED is replicated. For the node that has the leaseholder of the range containing the txn record, this requires no external communication. For other nodes with intents for the txn, one could piggyback this information on the RPCs for async intent resolution, and add to the the commitMap before doing the intent resolution -- this piggybacking would incur 2 consensus rounds of contention. If we are willing to send the RPC earlier, it will be contention for 1 consensus round only. Note that these RPCs should also remove locks from the non-persistent lock table data-structure so should send information about the keys (like in a LockUpdate but won't remove the durable lock state). Removal from commitMap: The commitMap can be considered a cache of TxnIDs. It is helpful to have a txn in the cache until its intents have been resolved. Additionally, latchless intent resolution must pin a txn in the map before it releases latches and unpin when the intent resolution has been applied on the leaseholder. This pinning behavior is needed to ensure the correctness of the in-memory concurrency.lockTable, which must maintain the property that the replicated locks known to it are a subset of the persistent replicated locks. We are assuming here that there is no lockTable on followers. Why "simple-committed": - We don't want to have to coordinate intent resolution of these multiple intents, by mandating that the resolution happen in any particular order. - We want to guarantee that even if the commitMap is cleared (since it is a cache), we can maintain the invariant that a caller iterating over a key sees at most one intent. As we will illustrate below, providing this guarantee requires us to limit the commitMap to only contain simple-committed txns. Consider a key with timestamps t5, t4, t3, t2, t1 in decreasing order and intents for t5, t4, t2, with corresponding txns txn5, ... txn1. We consider the disposition of an intent to be either unknown or simple-committed. In this example, the disposition of the intent for t4 and t2 is simple-committed solely based on the fact that there is at least one version (provisional or otherwise) more recent that the timestamp of the intent. That is, at most one intent, the one for t5, has a disposition that needs to rely on knowledge that is not self-contained in the history. For t5, we must rely on the commitMap to decide whether is unknown or simple-committed. It is possible that some user of the intentInterleavingIter saw t5 as simple-committed and a later user sees it as unknown disposition, if the txn5 got removed from the commitMap -- such regression is harmless since the latter user will simply have to do intent resolution. Note that the intent for t5 could get resolved before those for t4 and t2, and that is also fine since the disposition of t4, t2 stays simple-committed. If txn5 is aborted and the intent for t5 removed, and txn4 is no longer in the commitMap, the disposition of t4 could change to unknown. This is also acceptable, since t5 was only serving as a "local promise" that t4 was committed, which is simply an optimization. There is still a less efficient globally available "promise" that t4 is committed, and intent resolution of t4 is how we will enact that promise. Maintaining the above guarantees requires that historical versions must not be garbage collected without resolving intents. This is acceptable since GC is not latency sensitive. This PR only introduces the changes for the intentInterleavingIter. It excludes: - A sophisticated commitMap data-structure. There are code comments sketching out what properties are desirable. - The iterForKeyVersions implementation used for intent resolution, now that we can have multiple intents. This will be straightforward. - The changes for latchless intent resolution. These should be straightforward once we have a commitMap with pinning support and we extend cockroachdb#55461 to use a consistent storage snapshot for intent resolution. - The KV layer changes to send lists of intent keys for simple-committed txns to the various ranges, so that they can add to their commitMap and remove from the in-memory lockTable. Informs cockroachdb#66867 Release note: None

nvanbenschoten added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-storage Relating to our storage engine (Pebble) on-disk storage. A-kv-transactions Relating to MVCC and the transactional model. labels Oct 12, 2020

sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Dec 23, 2020

db: add an Iterator.Clone to iterate over the same db state

e741909

Needed for cockroachdb/cockroach#41720 and cockroachdb/cockroach#55461

sumeerbhola mentioned this issue Dec 23, 2020

db: add an Iterator.Clone to iterate over the same db state cockroachdb/pebble#1033

Merged

sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Dec 23, 2020

db: add an Iterator.Clone to iterate over the same db state

b101242

Needed for cockroachdb/cockroach#41720 and cockroachdb/cockroach#55461

sumeerbhola added a commit to cockroachdb/pebble that referenced this issue Dec 23, 2020

db: add an Iterator.Clone to iterate over the same db state

108cd02

Needed for cockroachdb/cockroach#41720 and cockroachdb/cockroach#55461

nvanbenschoten mentioned this issue Jan 5, 2021

kvserver: purging outdated replicas during migrations run into nil pointers #58378

Closed

sumeerbhola mentioned this issue Jan 6, 2021

storage: use consistent iterators when needed, and when possible #58515

Merged

nvanbenschoten mentioned this issue Jun 15, 2021

kv: release latches before evaluation for read-only requests #66485

Closed

sumeerbhola mentioned this issue Jun 24, 2021

storage: add Reader method to pin iterators at current engine state #66845

Merged

andreimatei mentioned this issue Jun 29, 2021

kvserver: no sync between splits and follower reads can cause invalid reads #67016

Closed

nvanbenschoten mentioned this issue Jun 30, 2021

kv: remove spanset.GetDBEngine #67093

Merged

nvanbenschoten mentioned this issue Jul 1, 2021

kv: remove EvalContext.Engine #67121

Merged

nvanbenschoten mentioned this issue Oct 1, 2021

kv: QueryResolvedTimestamp iterator snapshot may be captured before closed timestamp #70974

Closed

erikgrinaker mentioned this issue Oct 29, 2021

kvserver: add MVCC-compliant AddSSTable variant #72085

Merged

blathers-crl bot added the T-kv KV Team label Dec 1, 2021

aayushshah15 self-assigned this Jan 18, 2022

aayushshah15 mentioned this issue Feb 9, 2022

kvserver, batcheval: pin Engine state during read-only command evaluation #76312

Merged

nvanbenschoten mentioned this issue Feb 16, 2022

kv: disallow GC requests that bump GC threshold and GC expired versions #76410

Merged

craig bot closed this as completed in 0a54b3d Apr 5, 2022

sumeerbhola mentioned this issue May 13, 2022

[WIP] storage: intentInterleavingIter support for multiple intents #81259

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kv: batch evaluation should operate on a consistent storage snapshot #55461

kv: batch evaluation should operate on a consistent storage snapshot #55461

nvanbenschoten commented Oct 12, 2020 •

edited by cockroach-jira-scripts

Loading

andreimatei commented Jun 10, 2021

kv: batch evaluation should operate on a consistent storage snapshot #55461

kv: batch evaluation should operate on a consistent storage snapshot #55461

Comments

nvanbenschoten commented Oct 12, 2020 • edited by cockroach-jira-scripts Loading

andreimatei commented Jun 10, 2021

nvanbenschoten commented Oct 12, 2020 •

edited by cockroach-jira-scripts

Loading