kv: release latches before evaluation for read-only requests #66485

nvanbenschoten · 2021-06-15T16:34:42Z

Related to, but separate from #25579.
Related to #41720 (comment).
Original discussion in https://forum.cockroachlabs.com/t/why-do-we-keep-read-commands-in-the-command-queue/360.

Currently, read-only requests hold latches across evaluation. These are read-only latches, and so they do not conflict with writes at higher timestamps than the read. However, they do still block writes at lower timestamps than the read (at least until we address #25579) and do still block non-MVCC writes like range splits. This is a problem, as the more we allow requests to block on one another, the more we open ourselves up to unexpected tail latency under rare sequences of events like we saw in #66338, where an Export (MVCC read), blocked a split (non-MVCC write), blocked a large set of writes (MVCC write).

We should re-evaluate this decision and determine whether we can drop latches earlier for reads, before they spend any time reading data. Doing so would rely on #55461. The high-level idea is that we could grab an LSM snapshot and bump the timestamp cache, and then drop the read's latches (and readOnlyCmdMu) before proceeding to evaluation.

For context, there are two reasons why we didn't do this in the past, and both have to do with wanting to wait to bump the timestamp cache until after evaluation:

we don't want to bump the timestamp cache immediately in case the read runs into intents during evaluation and is forced to retry after pushing/waiting. Bumping the timestamp cache early could cause some form of starvation to writers who lay down and intent and then want to re-write it in a subsequent retry. If readers always bumped the timestamp cache before finding the conflicting intent then they could prevent the writer from every completing.
we don't want to bump the timestamp cache for limited scans if we don't know the full extent of what they will (or won't) read. Doing so could cause false-contention.

These concerns are still both valid to some degree, but things have also changed over the past few years.

Regarding the starvation of writers, this is no longer as severe of a problem as it once was. This is because transactions can now refresh their reads if their write timestamp is bumped. While contention can still prevent a read refresh from succeeding and force a transaction to perform a full restart, this is no longer a given like it once was. Additionally, with the lockTable, we will soon have the ability to check for conflicting locks over a span of keys without performing a full scan of the data. This opens up the opportunity to scan the lockTable upfront, while holding latches, and then only proceed to bump the timestamp cache if no conflicts are found.

The second reason still seems legitimate. We don't want to bump the timestamp cache during a scan if we don't know the full width of the scan ahead of time. Doing so would often lead to limited scans (SELECT * FROM t LIMIT 1) bumping the timestamp cache across the entire range, which would trip up otherwise uncontended writes. We have recently landed changes to optimistically latch and lock for limited scans that don't know until evaluation how far they will need to scan. This feels related to me, though I don't have a clear picture about how to take advantage of optimistic evaluation to avoid holding latches across evaluation.

However, one thing that has changed regarding limited scans is that we now have a concept of a closed timestamp, which prevents writes below some fixed interval that lags present time. If a limited scan is run below this timestamp, as ExportRequests often do, there is no benefit for bumping the timestamp cache after evaluation.

So with all of these changes, it seems appropriate that we reconsider the synchronization of read-only requests.

/cc. @andreimatei

Jira issue: CRDB-8063

The text was updated successfully, but these errors were encountered:

sumeerbhola · 2021-06-24T14:23:07Z

Additionally, with the lockTable, we will soon have the ability to check for conflicting locks over a span of keys without performing a full scan of the data. This opens up the opportunity to scan the lockTable upfront, while holding latches, and then only proceed to bump the timestamp cache if no conflicts are found.

(copying the related comment from https://github.com/cockroachlabs/support/issues/1013#issuecomment-858898819)
The current thinking is not to do this in the general case. There is a cost to scanning the lock table key space in the engine twice (the evaluation needs intent interleaving too). So we will only scan the in-memory lockTable data-structure which is an incomplete picture of the intents.

However, for potentially long running reads that don't want any intents and are not transactional or care about uncertainty (and so don't need intent interleaving during evaluation), we can have a different path that does scan the lock table up front since it will only happen once. I believe this applies to ExportRequest.

This is relevant for read evaluation cases where we want to release latches before evaluation. The new Reader.PinEngineStateForIterators method would be called before releasing latches. This pinning does not apply to iterators with timestamp hints, similar to how ConsistentIterators does not apply to them. So this does not help ExportRequest evaluation for incremental backups. To address this we would want a guarantee from the caller that the timestamp hints will not change for the lifetime of the Reader that is making a promise of ConsistentIterators. Informs cockroachdb#55461,cockroachdb#66485 Release note: None

andreimatei · 2021-06-24T15:47:18Z

The current thinking is not to do this in the general case. There is a cost to scanning the lock table key space in the engine twice (the evaluation needs intent interleaving too). So we will only scan the in-memory lockTable data-structure which is an incomplete picture of the intents.

I'd like to understand this. You're saying that a read that scans the lock table (scan 1), finds 5 locks which it resolves, and then proceeds to scan the regular key space (scan 2) will need to also scan the locks in parallel with the regular key scan? Because the resolving of the locks will not actually wait for the replication of the cleanup of the respective intents before letting the read do the regular key scan?
The way I imagined we'd handle this is by the reader keeping track of the locks that it resolved as part of scan 1, and interpolating them into scan 2 without actually needing to scan the locks again. No?

sumeerbhola · 2021-06-24T17:00:34Z

You're saying that a read that scans the lock table (scan 1), finds 5 locks which it resolves, and then proceeds to scan the regular key space (scan 2) will need to also scan the locks in parallel with the regular key scan?

Correct. This is mainly to not overcomplicate the code since the evaluation needs to see one own's intents and do uncertainty checking, that the lock checking does not need to do.

The way I imagined we'd handle this is by the reader keeping track of the locks that it resolved as part of scan 1, and interpolating them into scan 2 without actually needing to scan the locks again. No?

One way to make this work efficiently is to do a scan and keep in-memory state for all the intents that can be relevant, for lock conflicts, read-your-own-writes, uncertainty checking. We would discard the in-memory state if it became too large, so this would be just the fast path. Then use a subset for the lockTable, and all of it for the request evaluation.

sumeerbhola · 2021-06-24T18:53:03Z

One way to make this work efficiently is to do a scan and keep in-memory state for all the intents that can be relevant, for lock conflicts, read-your-own-writes, uncertainty checking. We would discard the in-memory state if it became too large, so this would be just the fast path. Then use a subset for the lockTable, and all of it for the request evaluation.

My concern was that we don't want frequent failures on this fast path. I think we can use MVCCStats.IntentCount as a guide. And we're already writing specialized iterators for intent interleaving (e.g. the recent PR for faster ranged intent resolution) so this fits in nicely with that direction.

This is relevant for read evaluation cases where we want to release latches before evaluation. The new Reader.PinEngineStateForIterators method would be called before releasing latches. This pinning does not apply to iterators with timestamp hints, similar to how ConsistentIterators does not apply to them. So this does not help ExportRequest evaluation for incremental backups. To address this we would want a guarantee from the caller that the timestamp hints will not change for the lifetime of the Reader that is making a promise of ConsistentIterators. Informs cockroachdb#55461,cockroachdb#66485 Release note: None

@michae2

66845: storage: add Reader method to pin iterators at current engine state r=sumeerbhola a=sumeerbhola This is relevant for read evaluation cases where we want to release latches before evaluation. The new Reader.PinEngineStateForIterators method would be called before releasing latches. This pinning does not apply to iterators with timestamp hints, similar to how ConsistentIterators does not apply to them. So this does not help ExportRequest evaluation for incremental backups. To address this we would want a guarantee from the caller that the timestamp hints will not change for the lifetime of the Reader that is making a promise of ConsistentIterators. Informs #55461,#66485 Release note: None 66885: sql: add ReType to resolveAndRequireType to fix vector engine panic r=cucaroach a=cucaroach Fixes #66708 The vector engine needs exact type coercion, specifically when piping computed column values into downstream operators. Without this fix the computed column was left as an int64 instead of cast back to the required int16 type. Exposed by sqlsmith, kudos to @michae2 for the reduce help Release note: None Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com> Co-authored-by: Tommy Reilly <treilly@cockroachlabs.com>

This was originally introduced to work around limitations in the `storage.Reader` interface, where only a `RocksDB` instance could be passed to `engine.ExportToSst`. Since then, a lot has changed, and this is no longer needed. Removing this is important, as it appears to undermine cockroachdb#55461 and make cockroachdb#66485 difficult.

This change removes the Engine method from EvalContext. Removing this is important, as its existence appears to undermine cockroachdb#55461 and make cockroachdb#66485 difficult. The first place where this was used was in EndTxn's evaluation function. I don't see any reason for this. In fact, we had a TODO to fix this, which we could have addressed years ago. The second place where this was used was in RecomputeStats's evaluation function. There, it was used for two reasons. First, it was used because `storage.Batch` used to not provide a consistent view of data. They now do. It was also used to evade spanset assertions, which this commit addresses in a better way.

66917: kvcoord: assert sanity when tracking in-flight write r=andreimatei a=andreimatei Release note: None 67093: kv: remove spanset.GetDBEngine r=nvanbenschoten a=nvanbenschoten This was originally introduced to work around limitations in the `storage.Reader` interface, where only a `RocksDB` instance could be passed to `engine.ExportToSst`. Since then, a lot has changed, and this is no longer needed. Removing this is important, as it appears to undermine #55461 and make #66485 difficult. Co-authored-by: Andrei Matei <andrei@cockroachlabs.com> Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>

67121: kv: remove EvalContext.Engine r=nvanbenschoten a=nvanbenschoten This change removes the Engine method from EvalContext. Removing this is important, as its existence appears to undermine #55461 and make #66485 difficult. The first place where this was used was in EndTxn's evaluation function. I don't see any reason for this. In fact, we had a TODO to fix this, which we could have addressed years ago. The second place where this was used was in RecomputeStats's evaluation function. There, it was used for two reasons. First, it was used because `storage.Batch` used to not provide a consistent view of data. They now do. It was also used to evade spanset assertions, which this commit addresses in a better way. 67146: c-deps: remove protobuf r=dt a=dt Release note: none. Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com> Co-authored-by: David Taylor <tinystatemachine@gmail.com>

…tents Related to cockroachdb#80706. Related to cockroachdb#66485. This commit makes a slight modification to `pebbleMVCCScanner` that changes how it determines whether an intent is uncertain or not. Instead of consulting the version timestamp in the `MVCCMetadata` struct and comparing that against the scan's uncertainty interval, the `pebbleMVCCScanner` now scans through the uncertainty interval and uses the intent's provisional value's timestamp to determine uncertainty. The `pebbleMVCCScanner` was actually already doing this scan to compute uncertainty for other committed values in its uncertainty interval if it found that the intent was not uncertain. However, after this change, it also relies on the scan to compute uncertainty for the intent itself. This is safe, because the calling code knows that the intent has a higher timestamp than the scan, so there is no risk that the scan adds the provisional value to its result set. This change is important for two reasons: 1. it avoids the need to store the `local_timestamp` field (introduced in cockroachdb#80706) in the `MVCCMetadata` struct. 2. it moves us closer in the direction of using `MVCCMetadata` values (ts=0, essentially locks protecting provisional values) to determine read-write conflicts but then using versioned provisional values to determine uncertainty. Doing so allows us to decompose a KV scan into a separate lock table scan to detect read-write conflicts and a MVCC scan to accumulate a result set while checking for uncertainty. This will be important for cockroachdb#66485.

81125: storage: use provisional version value to determine uncertainty of intents r=sumeerbhola a=nvanbenschoten Related to #80706. Related to #66485. This commit makes a slight modification to `pebbleMVCCScanner` that changes how it determines whether an intent is uncertain or not. Instead of consulting the version timestamp in the `MVCCMetadata` struct and comparing that against the scan's uncertainty interval, the `pebbleMVCCScanner` now scans through the uncertainty interval and uses the intent's provisional value's timestamp to determine uncertainty. The `pebbleMVCCScanner` was actually already doing this scan to compute uncertainty for other committed values in its uncertainty interval if it found that the intent was not uncertain. However, after this change, it also relies on the scan to compute uncertainty for the intent itself. This is safe, because the calling code knows that the intent has a higher timestamp than the scan, so there is no risk that the scan adds the provisional value to its result set. This change is important for two reasons: 1. it avoids the need to store the `local_timestamp` field (introduced in #80706) in the `MVCCMetadata` struct. 2. it moves us closer in the direction of using `MVCCMetadata` values (ts=0, essentially locks protecting provisional values) to determine read-write conflicts but then using versioned provisional values to determine uncertainty. Doing so allows us to decompose a KV scan into a separate lock table scan to detect read-write conflicts and a MVCC scan to accumulate a result set while checking for uncertainty. This will be important for #66485. Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>

…luation This commit introduces a change to the way certain types of read-only requests are evaluated. Traditionally, read-only requests have held their latches throughout their execution. This commit allows certain qualifying reads to be able to release their latches earlier. At a high level, reads may attempt to resolve all conflicts upfront by performing a sort of "validation" phase before they perform their MVCC scan. This validation phase performs a scan of the lock table keyspace in order to find any conflicting intents that may need to be resolved before the actual evaluation of the request over the MVCC keyspace. If no conflicting intents are found, then (since cockroachdb#76312), the request is guaranteed to be fully isolated against all other concurrent requests and can be allowed to release its latches at this point. This allows the actual evaluation of the read (over the MVCC part of the keyspace) to proceed without latches being held, which is the main motivation of this work. This validation phase could be thought of as an extension to the validation that the concurrency manager already performs when requests are sequenced through it, by trying to detect any conflicting intents that have already been pulled into the in-memory lock table. Additionally, for certain types of requests that can drop their latches early, and do not need to access the `IntentHistory` for any of their parent txn's intents, this commit attempts to make their MVCC scan cheaper by eliminating the need for an `intentInterleavingIterator`. This is enabled by the observation that once the validation phase is complete, the only remaining intents in the read's declared span must be intents belonging to the reader's transaction. So if the reader doesn't need to read an intent that isn't the latest intent on a key, then it doesn't need access to the key's `IntentHistory` (which lives in the lock-table keyspace), and doesn't need to use an `intentInterleavingIterator`. Release note (performance improvement): certain types of reads will now have a far smaller contention footprint with conflicting concurrent writers Resolves cockroachdb#66485 Release justification: high benefit change to existing functionality, part of 22.2 roadmap

85993: kvserver: allow certain read-only requests to drop latches before evaluation r=aayushshah15 a=aayushshah15 This commit introduces a change to the way certain types of read-only requests are evaluated. Traditionally, read-only requests have held their latches throughout their execution. This commit allows certain qualifying reads to be able to release their latches earlier. At a high level, reads may attempt to resolve all conflicts upfront by performing a sort of "validation" phase before they perform their MVCC scan. This validation phase performs a scan of the lock table keyspace in order to find any conflicting intents that may need to be resolved before the actual evaluation of the request over the MVCC keyspace. If no conflicting intents are found, then (since #76312), the request is guaranteed to be fully isolated against all other concurrent requests and can be allowed to release its latches at this point. This allows the actual evaluation of the read (over the MVCC part of the keyspace) to proceed without latches being held, which is the main motivation of this work. This validation phase could be thought of as an extension to the validation that the concurrency manager already performs when requests are sequenced through it, by trying to detect any conflicting intents that have already been pulled into the in-memory lock table. Additionally, for certain types of requests that can drop their latches early, and do not need to access the `IntentHistory` for any of their parent txn's intents, this commit attempts to make their MVCC scan cheaper by eliminating the need for an `intentInterleavingIterator`. This is enabled by the observation that once the validation phase is complete, the only remaining intents in the read's declared span must be intents belonging to the reader's transaction. So if the reader doesn't need to read an intent that isn't the latest intent on a key, then it doesn't need access to the key's `IntentHistory` (which lives in the lock-table keyspace), and doesn't need to use an `intentInterleavingIterator`. Release note (performance improvement): certain types of reads will now have a far smaller contention footprint with conflicting concurrent writers Resolves #66485 Release justification: high benefit change to existing functionality, part of 22.2 roadmap Co-authored-by: Aayush Shah <aayush.shah15@gmail.com>

nvanbenschoten added C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-kv-transactions Relating to MVCC and the transactional model. labels Jun 15, 2021

nvanbenschoten mentioned this issue Jun 15, 2021

bulk/kv: uploading to cloud storage in ExportRequest evaluation considered harmful #66486

Closed

jlinder added the T-kv KV Team label Jun 16, 2021

sumeerbhola mentioned this issue Jun 24, 2021

storage: add Reader method to pin iterators at current engine state #66845

Merged

sumeerbhola mentioned this issue Jun 24, 2021

kv: release latches before evaluation of intent resolution #66867

Open

nvanbenschoten mentioned this issue Jun 30, 2021

kv: remove spanset.GetDBEngine #67093

Merged

nvanbenschoten mentioned this issue Jul 1, 2021

kv: remove EvalContext.Engine #67121

Merged

erikgrinaker mentioned this issue Oct 29, 2021

kvserver: add MVCC-compliant AddSSTable variant #72085

Merged

nvanbenschoten mentioned this issue May 7, 2022

storage: use provisional version value to determine uncertainty of intents #81125

Merged

craig bot closed this as completed in 7ff05f1 Nov 17, 2022

cockroach-teamcity mentioned this issue Nov 17, 2022

PR #85993 - kvserver: allow certain read-only requests to drop latches before evaluation cockroachdb/docs#15642

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kv: release latches before evaluation for read-only requests #66485

kv: release latches before evaluation for read-only requests #66485

nvanbenschoten commented Jun 15, 2021 •

edited by cockroach-jira-scripts

Loading

sumeerbhola commented Jun 24, 2021

andreimatei commented Jun 24, 2021

sumeerbhola commented Jun 24, 2021

sumeerbhola commented Jun 24, 2021

kv: release latches before evaluation for read-only requests #66485

kv: release latches before evaluation for read-only requests #66485

Comments

nvanbenschoten commented Jun 15, 2021 • edited by cockroach-jira-scripts Loading

sumeerbhola commented Jun 24, 2021

andreimatei commented Jun 24, 2021

sumeerbhola commented Jun 24, 2021

sumeerbhola commented Jun 24, 2021

nvanbenschoten commented Jun 15, 2021 •

edited by cockroach-jira-scripts

Loading