-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: separated lock-table keyspace #41720
Comments
This sentence appears to be incomplete. Many of the time-bound iterator complications stem from deletion of the metadata key not carrying any timestamp information. How would this proposal address that?
We could test this hypothesis by changing the existing MVCC code to always do a separate seek for the metadata and version key (I'm thinking of
The interleaving of the metadata records with the versioned records is embedded in a lot of MVCC code. It would be worthwhile to enumerate the places that need to get adjusted, such as |
This proposal would get rid of (non-synthetic) metadata keys outside of their use for inline kvs (which have no timestamp). Transaction isolation would be completely separate from the MVCC keyspace. A user of time-bound iterators (along with all other requests) would first check for locks at their desired timestamp. This would all be done based on the declared spans of the request. If none exists, they could scan with a time-bound iterator without needing to worry about whether they should observe intents or not.
My strawman should have made it more clear that we wouldn't be trying to merge iterators or keep the current MVCC logic. We would check for locks in the lock table before doing the scan. We wouldn't need to scan at the same RocksDB snapshot because we already have in-mem latches to coordinate conflicting requests. The only case where this would result in different semantics is with limited scans. Right now, a limited scan will ignore an intent if it never reaches it. This proposal as is wouldn't know that it can avoid an intent that it never reaches. We already have this issue with latches though, so I don't think this is really a new issue (see #9521 + #33373). |
That works for MVCC scan, but what about MVCC put? |
It would be roughly the same thing. We'd check for conflicting locks in the lock table before performing the put. If there were no conflicts then the put would write the provisional version kv and add a lock to the lock table. This is how we could get blind writes into the MVCC keyspace. Does that answer your question or am I missing something? One complication here is when a transaction updates its own write. We'd need to store the sequence history on the lock itself. |
Yes, though I was thinking about the complication you mention below.
I was imagining this complication as well. Essentially you're proposing to decompose MVCC operations into lock checking phases and then execution phases. With MVCC put we may need to be passing info from the lock check phase to the execution phase. I only vaguely see how this would be done, though it doesn't seem infeasible. |
@nvanbenschoten Regarding "This proposal as is wouldn't know that it can avoid an intent that it never reaches. We already have this issue with latches though ..." |
Latch lifetime is often on the order of the latency of replication, so it's not always short. In fact, in a lot of cases, it's roughly the same as the lifetime of a transaction.
I wasn't imagining the change trying to do either of those proposals immediately, but they seem like good follow-on optimizations. |
Here's are a few other thoughts/clarifications from last week that are now floating somewhere around the bay area. (1) One of the largest remaining abstraction violations of this proposal (which @petermattis picked up on above) is that a transaction will need a delicate interaction with its locks when rewriting intents and when reading below the latest sequence number. This is because we'll likely need to continue storing the intent history on the lock key, which will now move to the lock table keyspace. This could all be made simpler if we included a sequence number suffix in the MVCC keyspace directly. If we did this then keys would look like With this structure, we would then be able to let a transaction write multiple versions of a key directly into the keyspace and not require any coupling with an intent history when reading or writing. If a transaction was reading at an earlier sequence then it would naturally observe only sequence numbers up to the sequence it desires. If we did this then we wouldn't even need to remove old sequence numbers during intent resolution, although I think we'd want to in practice. I think @andy-kimball's AcidLib did something like this, and I can see why. The big thing I haven't figured out here is how to deal with timestamp collisions between txns. To get the desired effect, I think we would need the portion of the keyspace before the sequence to be unique to a given transaction so that a transaction's own sequence number would never influence the values it can observe from other transactions. Something like This would also be another migration, although it's slightly less daunting than the main proposal here because we could just default missing seqnums to 0. (2) This proposal would naturally lead to us releasing read latches before performing the read itself. A read could simply grab a read latch, check locks (triggering contention handling if necessary), bump the timestamp cache, and then grab a RocksDB iterator before dropping its latches. The actual MVCC iteration could be done completely outside of latches. This would address a question that @andreimatei has had for a long time. (3) Allowing intents to be resolved without latching would be beneficial for a few reasons that are listed above. The biggest of these is that conflicting transactions wouldn't need to wait for the intent resolution to go through Raft before proceeding. In addition to this, I think the change would have another benefit which could simplify conflict resolution. Right now, requests that hit If we could efficiently check for intents/locks before evaluation then I think we could have tighter interaction between latching and locking and simplify the transaction pushing interaction. After a transaction has acquired latches it would check for conflicting locks. If it finds any, it could enter a This came up recently in a |
How would we encode the seqnum? MVCC keys are currently encoded as:
The last byte in the key is either always 0, 8, or 12, indicating no timestamp, walltime-only, or walltime+logical. We can't easily append anything else to this format without rewriting all of the keys. I think we can add an optional sequence number. This would give the following meanings to the timestamp-bytes value:
I think that works. Not exactly what I'd design if starting from scratch, but it does mean we wouldn't have to rewrite all of the data on a store for a migration. |
@petermattis That backward-compatible encoding does seem workable if we go this route. At the moment though, I don't think the rest of the idea is fully workable. The primary blockers to it are:
The way I have this working in my prototype is that we have a
The idea here would have reduced the contention footprint of transactions from 3 rounds of distributed consensus down to 2. We could mark locks as COMMITTED before going through the process of actually removing them. Other transactions would then ignore these locks during their lock checking phase. The round of Raft to remove the locks could then be done without holding latches. There would need to be a translation step for provisional values that are being moved to new timestamps during the intent resolution process. The Reducing the contention footprint from 3 writes to 2 is great, but @tbg just had a fantastic idea that would allow us to reduce it even further, down to just 1 write. The idea would address the longstanding desire to resolving intents while a transaction is implicitly committed (STAGING) instead of explicitly committed (COMMITTED). While in this phase, we could mark the lock as COMMITTED_BUT_NOT_REMOVABLE as soon as we're aware that the transaction's parallel commit has finished. Other transactions would ignore the lock just as before, but no-one would be allowed to remove the lock until the lock's owner became explicitly committed. In this way, we could get the lock out of the way of other transactions without removing evidence of an implicitly committed transaction. |
This commit removes the MVCCGet and MVCCScan methods from engine.Iterator and uses the rest of the interface to implement these methods as free functions. This restructuring allows the MVCC operations to support polymorphism of the iterator, which is what the code was intending to do when originally written. The code was moved to the current structure as a way of avoiding cgo calls when using RocksDB's iterator implementation. This is an important optimization when using RocksDB (but not Pebble) so the commit retains it through optional specializations of MVCCGet and MVCCScan. RocksDB's iterator implements this specialization but Pebble's does not need to. This isn't quite ready for a review. I'm mainly pushing it in case others want to take a look. It will be used to get the prototype of cockroachdb#41720 up and running. Benchmarks show about a 0-1% performance regression due to this change. More testing should be done if we actually want to productionize this.
@nvanbenschoten I don't fully understand the 3 rounds of distributed consensus and the optimizations to get it down to 2 rounds and then to 1 round.
After the STAGED round and all the writes being successful, the client can be informed of the commit but there are two more rounds before intents are removed. Is that correct? Is the leaseholder for the range with the transaction record supposed to drive the intent resolution? And the second and third rounds cannot be concurrent since they may be at different ranges and if the third races ahead of the second and the second needs to be retried we would have lost the intents that confirmed that the STAGED transaction could be COMMITTED, and would incorrectly think it had failed and remove all its writes? |
Yes, your understanding is correct for all three questions in this paragraph.
Neither transition needs to go through Raft. This information could all be held in memory while we wait to be able to remove the locks and during the process of removing the locks. I don't think the locks would ever be marked as COMMITTED durably. The important part is that the leaseholder can begin letting new requests observe the result of the know-committed transaction immediately after it hears that the transaction is committed, even if it doesn't remove the lock immediately. So getting the contention footprint down from 3 rounds to 2 is a clear win. Getting it down from 2 to 1 would require more communication ("a txn is committed so release its locks, but don't remove them"), so it's not obvious that it's a win in all cases, but it certainly would be in contended cases. |
I don't know if this idea already came up somewhere (it looks like we've discussed it only for non-versioned keys), but if we had a key encoding scheme which contains the seqno and txnid, we could ensure that the versions are never overwritten and thus we'd be able to use SingleDelete when we eventually delete them. I don't know if this would really give us a lot of benefit, though, since versions are only deleted when GC runs, at which point the original write has probably sunk down in the LSM to some degree (definitely with today's 25h TTL). |
There has been a small bit of water-cooler discussion about performing GC during compactions which would completely eliminate any overhead for GC'ing old data. |
It's here: #16132 |
@tbg Adding to what Peter said, GC during compactions is one of the ways we can exploit knowledge of the timestamp in the key at the Pebble level. It would need to know which time intervals are needed for backups etc. (this would be similar to how snapshot aware gc of sequence numbers works in RocksDB and Pebble), to know which "dead" versions can be thrown away. When there are dead and live versions in the same sstable one could track the amount of dead data in each sstable with a coarse age histogram. This could be used to trigger sstable rewrites (independent of normal compactions) based on a simple cost-benefit function. This won't clean all dead versions -- those that can only be known to be dead when looking at a different sstable that does not participate in compactions with this sstable will need to be deleted by a periodic scan of the key space. Another optimization could be to make these key scans cheap by segregating keys and values inside a file. |
I wonder if the ability to mark locks as non-active without removing them could allow us to get rid of the AbortSpan. A poisoning PushTxnRequest could mark the lock as ABORTED instead of removing it. It could then defer to the transaction itself to actually remove the lock. When scanning the lock table, transactions would take notice of whether their own locks have been ABORTED, which would provide the same protection against zombie transactions that the AbortSpan currently does. If we got rid of the AbortSpan lookups then this change wouldn't even be introducing any new RocksDB iterations. cc. @andreimatei. |
This commit removes the MVCCGet and MVCCScan methods from engine.Iterator and uses the rest of the interface to implement these methods as free functions. This restructuring allows the MVCC operations to support polymorphism of the iterator, which is what the code was intending to do when originally written. The code was moved to the current structure as a way of avoiding cgo calls when using RocksDB's iterator implementation. This is an important optimization when using RocksDB (but not Pebble) so the commit retains it through optional specializations of MVCCGet and MVCCScan. RocksDB's iterator implements this specialization but Pebble's does not need to. This isn't quite ready for a review. I'm mainly pushing it in case others want to take a look. It will be used to get the prototype of cockroachdb#41720 up and running. Benchmarks show about a 0-1% performance regression due to this change. More testing should be done if we actually want to productionize this.
This commit removes the MVCCGet and MVCCScan methods from engine.Iterator and uses the rest of the interface to implement these methods as free functions. This restructuring allows the MVCC operations to support polymorphism of the iterator, which is what the code was intending to do when originally written. The code was moved to the current structure as a way of avoiding cgo calls when using RocksDB's iterator implementation. This is an important optimization when using RocksDB (but not Pebble) so the commit retains it through optional specializations of MVCCGet and MVCCScan. RocksDB's iterator implements this specialization but Pebble's does not need to. This isn't quite ready for a review. I'm mainly pushing it in case others want to take a look. It will be used to get the prototype of cockroachdb#41720 up and running. Benchmarks show about a 0-1% performance regression due to this change. More testing should be done if we actually want to productionize this.
42210: storage/engine: lift Go MVCC implementation above Iterator interface r=nvanbenschoten a=nvanbenschoten This commit removes the `MVCCGet` and `MVCCScan` methods from `engine.Iterator` and uses the rest of the interface to implement these methods as free functions. This restructuring allows the MVCC operations to support polymorphism of the iterator, which is what the code was intending to do when originally written. The code was moved to the current structure as a way of avoiding cgo calls when using RocksDB's iterator implementation. This is an important optimization when using RocksDB (but not Pebble) so the commit retains it through optional specializations of `MVCCGet` and `MVCCScan`. RocksDB's iterator implements this specialization but Pebble's does not need to. This isn't quite ready for a review. I'm mainly pushing it in case others want to take a look. It will be used to get the prototype of #41720 up and running. Benchmarks show about a 0-1% performance regression due to this change. More testing should be done if we actually want to productionize this. cc. @petermattis @itsbilal Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
The segregated lock table also helps bulk operations, based on a discussion with @dt (that I've tried to capture below):
|
Adds a long running migration, SeparatedIntentsMigration, that does the following steps: 1) Iterate through all range descriptors using a RangeIterator 2) Calls a new ranged write command, Barrier, that is a no-op during evaluation but waits on any other writes on that range to finish before returning 3) Call ScanInterleavedIntents, a new ranged read command, to return interleaved intents encountered over key range 4) Iterate over returned intents then push all those txns and resolve any intents returned. The barrier in step 2 ensures that no future writes after that point will write interleaved intents. The actual disabling of the setting to write interleaved intents happens in the next commit. Part of cockroachdb#41720. Release note: None.
Adds a long running migration, SeparatedIntentsMigration, that does the following steps: 1) Iterate through all range descriptors using a RangeIterator 2) Calls a new ranged write command, Barrier, that is a no-op during evaluation but waits on any other writes on that range to finish before returning 3) Call ScanInterleavedIntents, a new ranged read command, to return interleaved intents encountered over key range 4) Iterate over returned intents then push all those txns and resolve any intents returned. The barrier in step 2 ensures that no future writes after that point will write interleaved intents. The actual disabling of the setting to write interleaved intents happens in the next commit. Part of cockroachdb#41720. Release note: None.
Adds a long running migration, SeparatedIntentsMigration, that does the following steps: 1) Iterate through all range descriptors using a RangeIterator 2) Calls a new ranged write command, Barrier, that is a no-op during evaluation but waits on any other writes on that range to finish before returning 3) Call ScanInterleavedIntents, a new ranged read command, to return interleaved intents encountered over key range 4) Iterate over returned intents then push all those txns and resolve any intents returned. The barrier in step 2 ensures that no future writes after that point will write interleaved intents. The actual disabling of the setting to write interleaved intents happens in the next commit. Part of cockroachdb#41720. Release note: None.
Adds a long running migration, SeparatedIntentsMigration, that does the following steps: 1) Iterate through all range descriptors using a RangeIterator 2) Calls a new ranged write command, Barrier, that is a no-op during evaluation but waits on any other writes on that range to finish before returning 3) Call ScanInterleavedIntents, a new ranged read command, to return interleaved intents encountered over key range 4) Iterate over returned intents then push all those txns and resolve any intents returned. The barrier in step 2 ensures that no future writes after that point will write interleaved intents. The actual disabling of the setting to write interleaved intents happens in the next commit. Part of cockroachdb#41720. Release note: None.
Adds a long running migration, SeparatedIntentsMigration, that does the following steps: 1) Iterate through all range descriptors using a RangeIterator 2) Calls a new ranged write command, Barrier, that is a no-op during evaluation but waits on any other writes on that range to finish before returning 3) Call ScanInterleavedIntents, a new ranged read command, to return interleaved intents encountered over key range 4) Iterate over returned intents then push all those txns and resolve any intents returned. The barrier in step 2 ensures that no future writes after that point will write interleaved intents. The actual disabling of the setting to write interleaved intents happens in the next commit. Part of cockroachdb#41720. Release note: None.
Adds a long running migration, SeparatedIntentsMigration, that does the following steps: 1) Iterate through all range descriptors using a RangeIterator 2) Calls a new ranged write command, Barrier, that is a no-op during evaluation but waits on any other writes on that range to finish before returning 3) Call ScanInterleavedIntents, a new ranged read command, to return interleaved intents encountered over key range 4) Iterate over returned intents then push all those txns and resolve any intents returned. The barrier in step 2 ensures that no future writes after that point will write interleaved intents. The actual disabling of the setting to write interleaved intents happens in the next commit. Part of cockroachdb#41720. Release note: None.
Adds a long running migration, SeparatedIntentsMigration, that does the following steps: 1) Iterate through all range descriptors using a RangeIterator 2) Calls a new ranged write command, Barrier, that is a no-op during evaluation but waits on any other writes on that range to finish before returning 3) Call ScanInterleavedIntents, a new ranged read command, to return interleaved intents encountered over key range 4) Iterate over returned intents then push all those txns and resolve any intents returned. The barrier in step 2 ensures that no future writes after that point will write interleaved intents. The actual disabling of the setting to write interleaved intents happens in the next commit. Part of cockroachdb#41720. Release note: None.
Adds a long running migration, SeparatedIntentsMigration, that does the following steps: 1) Iterate through all range descriptors using a RangeIterator 2) Calls a new ranged write command, Barrier, that is a no-op during evaluation but waits on any other writes on that range to finish before returning 3) Call ScanInterleavedIntents, a new ranged read command, to return interleaved intents encountered over key range 4) Iterate over returned intents then push all those txns and resolve any intents returned. The barrier in step 2 ensures that no future writes after that point will write interleaved intents. The actual disabling of the setting to write interleaved intents happens in the next commit. Part of cockroachdb#41720. Release note: None.
Adds a long running migration, SeparatedIntentsMigration, that does the following steps: 1) Iterate through all range descriptors using a RangeIterator 2) Calls a new ranged write command, Barrier, that is a no-op during evaluation but waits on any other writes on that range to finish before returning 3) Call ScanInterleavedIntents, a new ranged read command, to return interleaved intents encountered over key range 4) Iterate over returned intents then push all those txns and resolve any intents returned. The barrier in step 2 ensures that no future writes after that point will write interleaved intents. The actual disabling of the setting to write interleaved intents happens in the next commit. Part of cockroachdb#41720. Release note: None.
Adds a long running migration, SeparatedIntentsMigration, that does the following steps: 1) Iterate through all range descriptors using a RangeIterator 2) Calls a new ranged write command, Barrier, that is a no-op during evaluation but waits on any other writes on that range to finish before returning 3) Call ScanInterleavedIntents, a new ranged read command, to return interleaved intents encountered over key range 4) Iterate over returned intents then push all those txns and resolve any intents returned. The barrier in step 2 ensures that no future writes after that point will write interleaved intents. The actual disabling of the setting to write interleaved intents happens in the next commit. Part of cockroachdb#41720. Release note: None.
Adds a long running migration, SeparatedIntentsMigration, that does the following steps: 1) Iterate through all range descriptors using a RangeIterator 2) Calls a new ranged write command, Barrier, that is a no-op during evaluation but waits on any other writes on that range to finish before returning 3) Call ScanInterleavedIntents, a new ranged read command, to return interleaved intents encountered over key range 4) Iterate over returned intents then push all those txns and resolve any intents returned. The barrier in step 2 ensures that no future writes after that point will write interleaved intents. The actual disabling of the setting to write interleaved intents happens in the next commit. Part of cockroachdb#41720. Release note: None.
Adds a long running migration, SeparatedIntentsMigration, that does the following steps: 1) Iterate through all range descriptors using a RangeIterator 2) Calls a new ranged write command, Barrier, that is a no-op during evaluation but waits on any other writes on that range to finish before returning 3) Call ScanInterleavedIntents, a new ranged read command, to return interleaved intents encountered over key range 4) Iterate over returned intents then push all those txns and resolve any intents returned. The barrier in step 2 ensures that no future writes after that point will write interleaved intents. The actual disabling of the setting to write interleaved intents happens in the next commit. Part of cockroachdb#41720. Release note: None.
Adds a long running migration, SeparatedIntentsMigration, that does the following steps: 1) Iterate through all range descriptors using a RangeIterator 2) Calls a new ranged write command, Barrier, that is a no-op during evaluation but waits on any other writes on that range to finish before returning 3) Call ScanInterleavedIntents, a new ranged read command, to return interleaved intents encountered over key range 4) Iterate over returned intents then push all those txns and resolve any intents returned. The barrier in step 2 ensures that no future writes after that point will write interleaved intents. The actual disabling of the setting to write interleaved intents happens in the next commit. Part of cockroachdb#41720. Release note: None.
Adds a long running migration, SeparatedIntentsMigration, that does the following steps: 1) Iterate through all range descriptors using a RangeIterator 2) Calls a new ranged write command, Barrier, that is a no-op during evaluation but waits on any other writes on that range to finish before returning 3) Call ScanInterleavedIntents, a new ranged read command, to return interleaved intents encountered over key range 4) Iterate over returned intents then push all those txns and resolve any intents returned. The barrier in step 2 ensures that no future writes after that point will write interleaved intents. The actual disabling of the setting to write interleaved intents happens in the next commit. Part of cockroachdb#41720. Release note: None.
66445: kvserver: Add migration for interleaved intents to separated r=nvanbenschoten a=itsbilal Adds a long running migration, SeparatedIntentsMigration, that does the following steps: 1) Iterate through all range descriptors using a RangeIterator 2) Calls a new ranged write command, Barrier, that is a no-op during evaluation but waits on any other writes on that range to finish before returning 3) Call ScanInterleavedIntents, a new ranged read command, to return interleaved intents encountered over key range 4) Iterate over returned intents then push all those txns and resolve any intents returned. The barrier in step 2 ensures that no future writes after that point will write interleaved intents. The actual disabling of the setting to write interleaved intents happens in the next commit. Part of #41720. Release note: None. 68252: sql: add metric for schema change job user/database errors r=postamar a=fqazi Fixes: #68111 Previously, we had no way of tracking the reason a schema change job failed in our metrics. This was inadequate because we had no way of knowing why schema jobs failed. To address this, this patch will add metrics for tracking failures due to constraint violations and database errors. Release note (sql change): Add a new metrics to track schema job failure (sql.schema_changer.errors.all, sql.schema_changer.errors.constraint_violation, sql.schema_changer.errors.uncategorized), errors inside the crdb_internal.feature_usage table. Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com> Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
Now that the migration is checked in, is it time to close this issue? |
Closing this, as it seems like we did all we are going to do in 21.2 |
Summary
Move lock portion of write intents to a separate lock-table keyspace instead of storing it inline with MVCC data.
Motivation
Write intents are currently made up of an MVCCMetadata key and a provisional value key. Both are stored inline in the global MVCC keyspace, with the former stored at timestamp 0 of the write intent's key and the latter stored at the transaction's provisional commit timestamp at the time of writing the record.
Here's what this looks like visually for an intent at key "b":
We can think of the information stored in the MVCCMetadata key as a "write lock" and the information stored in the provisional value key as the "value under lock".
Storing the provisional value where it likely will remain after the transaction has committed is desirable because we only need to write it once. However, storing the write lock inline with the MVCC data is less of a clear win. Here are the only benefits I can think of for doing so:
I propose we move it to a separate "lock-table" keyspace for all of the following reasons:
SingleDelete
for raft log truncation #8979)In addition to all of these benefits, I don't think the one downside (multiple storage engine scans) is as bad as it sounds. The lock table's RocksDB block will likely always remain in cache, so it's unlikely that this will have a noticeable impact on performance. In fact, the speedup due to the simplification of MVCC code (not to mention the other optimizations this allows) will likely outweigh the overhead of separated lock checking.
Strawman Proposal
/Local/<key>/lock/<txnID>
. This ends up being in-line with transaction record keys, which isn't really a problem. It hints at the equivalence between these two concepts.This will require a fairly serious migration, but it seems like an important prereq for a number of desirable projects.
gz#9005
The text was updated successfully, but these errors were encountered: