-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage/engine: expose FailOnMoreRecent MVCC{Get/Scan}Option #44473
storage/engine: expose FailOnMoreRecent MVCC{Get/Scan}Option #44473
Conversation
This commit commits the changes from `make c-deps-fmt`. This hadn't been done in a while. I don't want the diff leaking into the next commit. I was surprised to see that we don't lint against this.
10b529f
to
964640e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
save for a discussion about an unclear interaction with inconsistent
.
Folding mvccGetInternal
can be possible now that the entire scanner is back in Go, but mvccGetInternal
is still more inlined and likely more performant as a result.
Reviewed 20 of 20 files at r1, 9 of 9 files at r2.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)
c-deps/libroach/mvcc.h, line 287 at r2 (raw file):
} if (fail_on_more_recent_) {
Is it worth adding a && !inconsistent
here? Having fail_on_more_recent == true
and inconsistent == true
makes little sense, so we might as well have fail_on_more_recent_
not do anything in inconsistent mode (and the comment on MVCCGet
/MVCCScan
would also need to say which bool takes priority).
pkg/storage/engine/rocksdb.go, line 2763 at r2 (raw file):
func writeTooOldToError(readTS hlc.Timestamp, existingCTS C.DBTimestamp) error { if existingCTS.wall_time != 0 || existingCTS.logical != 0 {
Nit: Might be cleaner to do the conversion first, then existingTS != hlc.Timestamp{}
pkg/storage/engine/testdata/mvcc_histories/read_fail_on_more_recent, line 27 at r2 (raw file):
# for op in (get, scan): # for ts in (9, 10, 11): # for failOnMoreRecent in (false, true):
Might be a good idea to also toggle inconsistent
in the k=k1, ts = 9, failOnMoreRecent = true
cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/storage/engine/mvcc.go, line 2462 at r2 (raw file):
// When scanning in "fail on more recent" mode, a WriteTooOldError will be // returned if the scan observes a version with a timestamp above the read // timestamp. Similarly, a WriteIntentError will be returned if the scan
just making sure I understand: the reason we distinguish these two errors is that for the latter the transaction needs to wait behind the transaction with the intent (and possibly push it), so in the new code this will result in an addDiscoveredLock()
call on lockTable
. Is that correct?
pkg/storage/engine/pebble_mvcc_scanner.go, line 370 at r2 (raw file):
if !ownIntent { // 8. The key contains an intent which was not written by our // transaction and our read timestamp is newer than that of the
is this comment stale in light of the changes above, in that the read timestamp is not necessarily newer than the intent?
Relates to cockroachdb#40205. This change introduces a new option to `MVCCGetOptions` and `MVCCScanOptions` that causes reads to throw an error if they observe an MVCC version at a timestamp above their read timestamp. Specifically, when the option is enabled, a `WriteTooOldError` will be returned if a scan observes an MVCC version with a timestamp above the read. Similarly, a `WriteIntentError` will be returned if a scan observes another transaction's intent, even if that intent has a timestamp above the scan's read timestamp. This option will be used in the future by `ScanRequests` and `ReverseScanRequests` that intend to acquire unreplicated key-level locks, which will power `SELECT FOR UPDATE`. These scans do not want to ignore existing MVCC versions at higher timestamps like traditional scans do. Instead, they want to throw errors and force their transaction to increase its timestamp through either a refresh or a retry. This was previously prototyped in 4a8e8dc. Interestingly, this is not new logic to the MVCC layer. This behavior is exactly the same as that of the initial key-value lookup performed during MVCC writes (see `mvccPutInternal`). It's fitting that behavior needed for `SELECT FOR UPDATE` would mirror that already exhibited by the read portion of read-write operations. This also hints at an opportunity to potentially use this option to merge the two implementations and get rid of custom logic like `mvccGetInternal` that only exists on the write path. We'd need to be careful about doing so though, as this code is heavily tuned. Release note: None
964640e
to
a25a9c9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTRs!
Folding mvccGetInternal can be possible now that the entire scanner is back in Go, but mvccGetInternal is still more inlined and likely more performant as a result.
Yes, I don't think we'll be able to get it to work in practice without having an impact on performance. Instead, I'd like to see us move the other way and explore blind writes into the MVCC keyspace through a combination of the segregated lock table, a reintroduction of the write timestamp cache, and inexact MVCC stats. This will become more pressing as we look towards a "cloud-native" storage hierarchy.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @itsbilal and @sumeerbhola)
c-deps/libroach/mvcc.h, line 287 at r2 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Is it worth adding a
&& !inconsistent
here? Havingfail_on_more_recent == true
andinconsistent == true
makes little sense, so we might as well havefail_on_more_recent_
not do anything in inconsistent mode (and the comment onMVCCGet
/MVCCScan
would also need to say which bool takes priority).
Very good point. This combination wasn't working before and doesn't really make sense. It's unclear whether it should fail on intents at higher timestamps or return them without failing. I don't want to ignore the combination because that seems error-prone, but I think it's fine to reject it. Done.
pkg/storage/engine/mvcc.go, line 2462 at r2 (raw file):
Previously, sumeerbhola wrote…
just making sure I understand: the reason we distinguish these two errors is that for the latter the transaction needs to wait behind the transaction with the intent (and possibly push it), so in the new code this will result in an
addDiscoveredLock()
call onlockTable
. Is that correct?
Yes, that's exactly correct. A WriteTooOldError just means that the transaction found a write above the time that it was trying to write, so it needs to bump its timestamp. A WriteIntentError means that the transaction found another transaction's intent, so it needs to hit the new addDiscoveredLock
code and hop into the lockTable
.
pkg/storage/engine/pebble_mvcc_scanner.go, line 370 at r2 (raw file):
Previously, sumeerbhola wrote…
is this comment stale in light of the changes above, in that the read timestamp is not necessarily newer than the intent?
Yes, it was stale for a few reasons. Fixed.
pkg/storage/engine/rocksdb.go, line 2763 at r2 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Nit: Might be cleaner to do the conversion first, then
existingTS != hlc.Timestamp{}
Done.
pkg/storage/engine/testdata/mvcc_histories/read_fail_on_more_recent, line 27 at r2 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Might be a good idea to also toggle
inconsistent
in thek=k1, ts = 9, failOnMoreRecent = true
cases.
Done.
cmdScan was already doing this, so we should be consistent or it becomes confusing why scan results show up on errors cases but not get results.
bors r+ |
44473: storage/engine: expose FailOnMoreRecent MVCC{Get/Scan}Option r=nvanbenschoten a=nvanbenschoten Relates to #40205. This change introduces a new option to `MVCCGetOptions` and `MVCCScanOptions` that causes reads to throw an error if they observe an MVCC version at a timestamp above their read timestamp. Specifically, when the option is enabled, a `WriteTooOldError` will be returned if a scan observes an MVCC version with a timestamp above the read. Similarly, a `WriteIntentError` will be returned if a scan observes another transaction's intent, even if that intent has a timestamp above the scan's read timestamp. This option will be used in the future by `ScanRequests` and `ReverseScanRequests` that intend to acquire unreplicated key-level locks, which will power `SELECT FOR UPDATE`. These scans do not want to ignore existing MVCC versions at higher timestamps like traditional scans do. Instead, they want to throw errors and force their transaction to increase its timestamp through either a refresh or a retry. This was previously prototyped in 4a8e8dc. Interestingly, this is not new logic to the MVCC layer. This behavior is exactly the same as that of the initial key-value lookup performed during MVCC writes (see `mvccPutInternal`). It's fitting that behavior needed for `SELECT FOR UPDATE` would mirror that already exhibited by the read portion of read-write operations. This also hints at an opportunity to potentially use this option to merge the two implementations and get rid of custom logic like `mvccGetInternal` that only exists on the write path. We'd need to be careful about doing so though, as this code is heavily tuned. Release note: None Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Build succeeded |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I don't think we'll be able to get it to work in practice without having an impact on performance. Instead, I'd like to see us move the other way and explore blind writes into the MVCC keyspace through a combination of the segregated lock table, a reintroduction of the write timestamp cache, and inexact MVCC stats. This will become more pressing as we look towards a "cloud-native" storage hierarchy.
If we are going to do blind writes we will need to support a mode where the exclusive lock is acquired only when the preceding read/upgrade/exclusive lock is still held, otherwise we could miss an intervening write since the read, yes?
And why is this useful for "cloud-native" storage, by which you presumably mean higher latency since the files being read may not be local. My understanding is that we currently do read-modify-writes for puts and cputs in order to check the condition and to modify only certain columns in a row, and hold latches while doing so -- is the blind write idea motivated by reducing the duration of latches (though that does not quite matter since non-locking reads at lower timestamps can be concurrent)?
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @itsbilal and @sumeerbhola)
This isn't how I was picturing things working, but that does seem like an option. A more straightforward approach would be to reintroduce the write timestamp cache and insert all writes into it. Remember that the timestamp cache is a concurrent in-memory data structure with a strict memory limit and a guarantee that if a key span is inserted with a timestamp, all overlapping queries into the structure will return a timestamp equal to or greater than the insert timestamp. In other words, it efficiently provides an upper bound on all timestamps inserted in a given key range.
Yes, that is exactly what I was meaning.
The condition check is true of cputs, but not of puts. Puts perform a read-modify-write strictly to check for conflicting intents and to ensure monotonicity across writes to a given key. The lockTable can now serve this first purpose (and we expect its LSM blocks to always be cached locally) and the write timestamp cache can serve the second purpose (which is necessary for single-key linearizability). So Puts could become blind writes with respect to the LSM, which could speed up workloads like bulk
It's not particularly motivated by contention, no. It's primarily motivated by write-heavy workloads like the classic timeseries use-case. |
Relates to #40205.
This change introduces a new option to
MVCCGetOptions
andMVCCScanOptions
that causes reads to throw an error if they observe an MVCC version at a timestamp above their read timestamp. Specifically, when the option is enabled, aWriteTooOldError
will be returned if a scan observes an MVCC version with a timestamp above the read. Similarly, aWriteIntentError
will be returned if a scan observes another transaction's intent, even if that intent has a timestamp above the scan's read timestamp.This option will be used in the future by
ScanRequests
andReverseScanRequests
that intend to acquire unreplicated key-level locks, which will powerSELECT FOR UPDATE
. These scans do not want to ignore existing MVCC versions at higher timestamps like traditional scans do. Instead, they want to throw errors and force their transaction to increase its timestamp through either a refresh or a retry. This was previously prototyped in 4a8e8dc.Interestingly, this is not new logic to the MVCC layer. This behavior is exactly the same as that of the initial key-value lookup performed during MVCC writes (see
mvccPutInternal
). It's fitting that behavior needed forSELECT FOR UPDATE
would mirror that already exhibited by the read portion of read-write operations. This also hints at an opportunity to potentially use this option to merge the two implementations and get rid of custom logic likemvccGetInternal
that only exists on the write path. We'd need to be careful about doing so though, as this code is heavily tuned.Release note: None