
storage: allow multiple intents #5861

Open
tbg opened this issue Apr 5, 2016 · 7 comments
Labels
A-kv-client Relating to the KV client and the KV interface. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) C-wishlist A wishlist feature. T-kv KV Team
Milestone

Comments

@tbg
Member

tbg commented Apr 5, 2016

From an email thread with Andrew Kimball.

Andrew:

I've been taking a look at the Yabandeh paper and thinking about how it might affect the Cockroach algorithm. The paper does a good job explaining why WSI (write snapshot-isolation) is a useful concept, and why it guarantees serializable semantics. However, the paper argues that WSI can be as efficient as SI for real-world workloads. I feel skeptical about this claim, because SI only needs to abort transactions on write-write conflicts, whereas WSI needs to abort on read-write conflicts. Since read-write conflicts are more common, WSI is going to abort more.

I confess that I only skimmed the section in the paper about their implementation, as they use a centralized "status oracle". I don't think that's a good idea in a system like Cockroach, which should avoid single points of failure. Even if the status oracle consists of several "failure-resistant" machines linked by consensus, there are still delays during re-configuration that would affect all ongoing transactions. A large-scale distributed system should be built to gracefully degrade on failures rather than going immediately from 100% => 0%, as in that case. Also, in the case of nodes distributed over a wide geographic area (i.e. the globe), a central machine is going to greatly increase the overall latency of the system, as many nodes will be hundreds of milliseconds away from it. If the ranges I'm committing to are nearby, and if I find no conflicts, then I should be able to commit immediately, with no coordination necessary with machines elsewhere in the world.

I therefore conclude that this paper does not undermine the algorithm used by Cockroach for SI. I think it's still very much worthwhile to support SI, as there should be many fewer aborts in important real-world workloads. However, I do think that Cockroach's serializable algorithm could be improved, based on the concepts from the paper. Hopefully my knowledge of Cockroach's serializable algorithm is not too far out-of-date, otherwise this might not make much sense.

In particular, there is no need to check for write-write conflicts in serializable mode. So, if a txn is trying to lay down a write intent, it could simply lay it down right beside another write intent, rather than aborting the other txn. This is true as long as the candidate commit time of the txn is > the last time that row may have been read. For example, using the notation from the paper:

r1[x] w2[x] w1[x] c2 c1

In this case, since no conflict was registered (since write-write conflicts are ignored), txn1's commit timestamp would not be pushed forward in logical time, and so would simply equal the starting timestamp: Ts(txn1) = Tc(txn1). The same situation would be true for txn2. The equivalent serial history would be (assuming Ts(txn1) < Ts(txn2)):

r1[x] w1[x] c1 w2[x] c2

Happily, Cockroach's current serializable rules already prevent read-write conflicts. When it comes time to commit a txn, Cockroach already checks if the txn's commit time is different than its start time. If it is different, then it's possible that it may have read an older version of some value that another txn later modified. I believe Cockroach simply aborts the txn in that case. However, another possibility is remembering the read-set of the txn and then re-checking all previously read values to ensure they haven't been changed in the interval [Ts(txn), Tc(txn)]. If no changes have occurred, then it's safe to go forward with the commit.

One other interesting point. Consider History 6 in the paper, which will force an abort due to the rules of WSI:

H6: r1[x] r2[z] w2[x] w1[y] c2 c1

This is actually a serializable history, and does not need to be aborted:

H7: r1[x] w1[y] c1 r2[z] w2[x] c2

The rules that Cockroach uses would not abort this. Here is the sequence:

  • txn1 reads x at Ts(txn1).
  • txn2 reads z at Ts(txn2), where we'll assume that Ts(txn2) > Ts(txn1).
  • txn2 writes x at Ts(txn2). Because the timestamp of the write intent is > the timestamp of the last read of that row, no conflict occurs.
  • txn1 writes y at Ts(txn1).
  • txn2 tries to commit. Because Tc(txn2) = Ts(txn2), it commits at logical time Ts(txn2).
  • txn1 tries to commit. Because Tc(txn1) = Ts(txn1), it commits at logical time Ts(txn1).

Thus we have arrived at the serializable history H7 with no abort. So while the Cockroach algorithm aborts cases that Yabandeh would not, there are cases where the reverse is true. Furthermore, Cockroach could avoid the other aborts if it remembered a transaction's read-set, as I suggest above as a possibility.

~Andy

P.S. I sometimes like to use an alternate history representation which factors in logical time as well as absolute (wall-clock) time:

t0: r1[x]             w1[y]       c1
t1:       r2[z] w2[x]       c2

In this notation, absolute time reads from left to right, and logical time reads from top to bottom. This makes it more clear that txn1 commits after txn2 in absolute time, but before it in logical time.

Jira issue: CRDB-6193

@tbg tbg added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) C-investigation Further steps needed to qualify. C-label will change. labels Apr 5, 2016
@tbg tbg added this to the Post 1.0 milestone Apr 5, 2016
@tbg
Member Author

tbg commented Apr 5, 2016

Having multiple write intents on a key is a topic we've discussed a few times (@mrtracy was interested in it back in the day, and @andreimatei more recently).

It's an interesting concept, but it does make things more complicated internally. As you mentioned, read-write conflicts are (usually) more common than write-write, so optimizing write-write conflicts away only affects a minority of workloads. But no doubt there will be some for which it does make a big difference. My main concern is that the additional complexity would have a measurable impact on the case in which it isn't required (i.e. the common case) and would ultimately distract us at this point. I have it on my internal "maybe one day" list.

For example,

  • Meta keys. Currently we have implicit meta keys for keys without an intent on them (i.e. just a single key-value pair), and an explicit meta key when there is an intent. With multiple intents, any old version of the key could be an intent, so we need a new concept of meta keys, presumably listing the versions of the keys which are intents, and their associated transactions as well.
  • Snapshot transactions which have been pushed still need to see their own writes, so finding the right version to read from one of the extended meta keys above is more work (i.e. meta keys must support lookup by transaction ID).
  • Snapshot gets more complicated generally. When a SNAPSHOT intent gets pushed, it currently gets pushed further than the read (this makes sense for clock uncertainty reasons), which means that there's a gap between what the read timestamp cache prohibits and what's possible. In particular, when not being careful, a transaction could read at T1, write based on that read to the same key, have that write pushed to a higher timestamp T2, and yet another transaction could come along and lay an intent at some timestamp between T1 and T2. No bueno! To prevent this, we need to populate the read timestamp cache on pushes of a SNAPSHOT transaction - all can be done, but you can see the complexity creep in there as well.
  • Conflict resolution gets more intricate. A reader may want to access a key and find multiple intents there, and must push all of them simultaneously. That means that the way priorities work gets skewed (since chances of success vary based on how many intents you run into).
  • There may be some issues with the way we implement replay protection, though the last time I thought about it I didn't find anything concrete - just more cases to consider.
  • Uncertainty restarts would apply to writes as well: If a write finds a newer write intent in its near future, I think it may need to push or restart (with optimizations we already have in place for the read case). For example, writing increasing numbers on a single client thread should result in increasing versions of the key even when going through multiple coordinators with clock skew. That means that some transactions would still have to restart (even though that's still better than a write-write conflict).

Those are some of the immediate reasons that I wouldn't want to make this change unless we have a good reason that it's a major game changer. Due to the migration story, it's unlikely to happen until we have the next large breaking major release, anyway.

However, another possibility is remembering the read-set of the txn and then re-checking all previously read values to ensure they haven't been changed in the interval [Ts(txn), Tc(txn)]. If no changes have occurred, then it's safe to go forward with the commit.

That's very expensive. For txns with large read sets even more so, but even for small ones the extra round-trips are going to disqualify this (perhaps unless everything happens on a single node). I believe Postgres does something like this, but they can afford it due to having all of that information local to the transaction.

@tbg
Member Author

tbg commented Apr 5, 2016

Andrew:

Good stuff. I agree that the list of disadvantages with multiple intents is daunting enough that it's not worth allowing unless something major changes in the underlying representation that makes it much easier. As you say, write-write conflicts are quite uncommon, especially when you only consider the case where the value was not previously read. Even a transaction as simple as X = X + 1 will read the value before writing and trigger a read-write conflict that would force abort/restart.

On keeping the read-set, you would keep the list of ranges that you read (i.e. not the individual read keys) on the coordinator as it executed the query. If the # ranges got too high, you could always fall back to just aborting/restarting. Most OLTP queries would have just a few read ranges. That said, this is just a possible optimization that would only be worthwhile after proving that it's justified on real workloads. Definitely not of any short-term concern.

@bdarnell
Contributor

bdarnell commented Apr 5, 2016

Even a transaction as simple as X = X + 1 will read the value before writing and trigger a read-write conflict that would force abort/restart.

Not necessarily: UPDATE t SET X = X+1 WHERE pk=$1 could be turned into a KV-level increment command (not quite the same as our current increment command, e.g. don't create a value where one doesn't exist and don't return a value), and these increments would be commutative. Commutative operations would be a powerful use of multiple intents because they don't have to interfere with each other. Neha Narula has done a lot of work on exploiting commutativity to improve parallelism and laying down multiple intents for commutative operations would be one way to apply these principles to cockroachdb.

There's no question that this would be an extremely complex change to make, but I think the benefits could be substantial and go beyond things that today manifest as write-write conflicts.

@tbg tbg added A-kv-client Relating to the KV client and the KV interface. C-wishlist A wishlist feature. and removed C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) labels Aug 21, 2018
@nvanbenschoten
Member

A middle ground that @andreimatei and I arrived at would be to keep a history of a single transaction's intents on a key while not adjusting our rules for write-write conflicts between different transactions.

The idea is that we could store prior versions of a given intent in the MVCC metadata key instead of overwriting the intent entirely when a transaction writes to the same key multiple times. Doing so would be a fairly self-contained change which would not alter performance characteristics of any transaction that does not write to the same key multiple times. The necessary changes would be:

  • MVCCMetadata would get a new IntentHistory []struct{seq int32, val []byte, more?} field
  • Transactional writes which find an existing intent at a lower sequence number would add the value of the intent to the IntentHistory before overwriting it.
  • Transactional requests would no longer return replay errors when they find an intent with a larger sequence number. Instead, they would look through the intent history to determine their result.

The obvious utility of this change is that it would allow us to support idempotent intra-transactional requests (see #26915). Requests in a txn at sequence numbers beneath the current intent's could ignore that intent and look back at the intent history. This would allow them to produce a deterministic result without the risk of missing intents that they should have seen.

@nvanbenschoten
Member

cc. @andy-kimball. I figure you might have opinions here. This is one of your earliest issues, after all.

@andy-kimball
Contributor

My main opinion is that I'm definitely in favor of idempotent designs in distributed systems. It tends to pay off many times over the course of the system's lifetime, often in unexpected ways. So if there's a reasonable way to make Cockroach more idempotent, I'm in favor.

craig bot pushed a commit that referenced this issue Dec 6, 2018
32688: storage: Collect IntentHistory for transactions r=ridwanmsharif a=ridwanmsharif

First part of making transactions more idempotent. This change
adds history of all writes on a key by the transaction. This
can be used to create save points and rollback to certain points
inside a transaction. More on why this is needed is explained here:
#5861 (comment)

Release note: None

Co-authored-by: Ridwan Sharif <ridwan@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Dec 12, 2018
33001: storage: allow transactions to run at a lower sequence r=ridwanmsharif a=ridwanmsharif

Continuing off #32688, final part of #5861 (comment).

Adds support to have transactions run at a lower sequence
at a given key. It asserts the value it computes with the
value written for the sequence in the intent history instead
of returning a retry-able error.

Release note: None

Co-authored-by: Ridwan Sharif <ridwan@cockroachlabs.com>
@lunevalex lunevalex added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) and removed C-investigation Further steps needed to qualify. C-label will change. labels Mar 26, 2021
@jlinder jlinder added the T-kv KV Team label Jun 16, 2021
@github-actions

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@github-actions github-actions bot closed this as not planned (stale) Oct 9, 2023