
storage: allow multiple intents #5861

Open
tbg opened this issue Apr 5, 2016 · 7 comments
Labels
A-kv-client Relating to the KV client and the KV interface. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) C-wishlist A wishlist feature. T-kv KV Team
Milestone

Comments

@tbg
Member

tbg commented Apr 5, 2016

From an email thread with Andrew Kimball.

Andrew:

I've been taking a look at the Yabandeh paper and thinking about how it might affect the Cockroach algorithm. The paper does a good job explaining why WSI (write snapshot-isolation) is a useful concept, and why it guarantees serializable semantics. However, the paper argues that WSI can be as efficient as SI for real-world workloads. I feel skeptical about this claim, because SI only needs to abort transactions on write-write conflicts, whereas WSI needs to abort on read-write conflicts. Since read-write conflicts are more common, WSI is going to abort more.

I confess that I only skimmed the section in the paper about their implementation, as they use a centralized "status oracle". I don't think that's a good idea in a system like Cockroach, which should avoid single points of failure. Even if the status oracle consists of several "failure-resistant" machines linked by consensus, there are still delays during re-configuration that would affect all ongoing transactions. A large-scale distributed system should be built to gracefully degrade on failures rather than going immediately from 100% => 0%, as in that case. Also, in the case of nodes distributed over a wide geographic area (i.e. the globe), a central machine is going to greatly increase the overall latency of the system, as many nodes will be hundreds of milliseconds away from it. If the ranges I'm committing to are nearby, and if I find no conflicts, then I should be able to commit immediately, with no coordination necessary with machines elsewhere in the world.

I therefore conclude that this paper does not undermine the algorithm used by Cockroach for SI. I think it's still very much worthwhile to support SI, as there should be many fewer aborts in important real-world workloads. However, I do think that Cockroach's serializable algorithm could be improved, based on the concepts from the paper. Hopefully my knowledge of Cockroach's serializable algorithm is not too far out-of-date, otherwise this might not make much sense.

In particular, there is no need to check for write-write conflicts in serializable mode. So, if a txn is trying to lay down a write intent, it could simply lay it down right beside another write intent, rather than aborting the other txn. This is true as long as the candidate commit time of the txn is > the last time that row may have been read. For example, using the notation from the paper:

r1[x] w2[x] w1[x] c2 c1

In this case, since no conflict was registered (since write-write conflicts are ignored), txn1's commit timestamp would not be pushed forward in logical time, and so would simply equal the starting timestamp: Ts(txn1) = Tc(txn1). The same situation would be true for txn2. The equivalent serial history would be (assuming Ts(txn1) < Ts(txn2)):

r1[x] w1[x] c1 w2[x] c2

Happily, Cockroach's current serializable rules already prevent read-write conflicts. When it comes time to commit a txn, Cockroach already checks if the txn's commit time is different than its start time. If it is different, then it's possible that it may have read an older version of some value that another txn later modified. I believe Cockroach simply aborts the txn in that case. However, another possibility is remembering the read-set of the txn and then re-checking all previously read values to ensure they haven't been changed in the interval [Ts(txn), Tc(txn)]. If no changes have occurred, then it's safe to go forward with the commit.

One other interesting point. Consider History 6 in the paper, which will force an abort due to the rules of WSI:

H6: r1[x] r2[z] w2[x] w1[y] c2 c1

This is actually a serializable history, and does not need to be aborted:

H7: r1[x] w1[y] c1 r2[z] w2[x] c2

The rules that Cockroach uses would not abort this. Here is the sequence:

  • txn1 reads x at Ts(txn1).
  • txn2 reads z at Ts(txn2), where we'll assume that Ts(txn2) > Ts(txn1).
  • txn2 writes x at Ts(txn2). Because the timestamp of the write intent is > the timestamp of the last read of that row, no conflict occurs.
  • txn1 writes y at Ts(txn1).
  • txn2 tries to commit. Because Tc(txn2) = Ts(txn2), it commits at logical time Ts(txn2).
  • txn1 tries to commit. Because Tc(txn1) = Ts(txn1), it commits at logical time Ts(txn1).

Thus we have arrived at the serializable history H7 with no abort. So while the Cockroach algorithm aborts cases that Yabandeh would not, there are cases where the reverse is true. Furthermore, Cockroach could avoid the other aborts if it remembered a transaction's read-set, as I suggest above as a possibility.

~Andy

P.S. I sometimes like to use an alternate history representation which factors in logical time as well as absolute (wall-clock) time:

t0: r1[x]             w1[y]       c1
t1:       r2[z] w2[x]       c2

In this notation, absolute time reads from left to right, and logical time reads from top to bottom. This makes it more clear that txn1 commits after txn2 in absolute time, but before it in logical time.

Jira issue: CRDB-6193

@tbg tbg added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) C-investigation Further steps needed to qualify. C-label will change. labels Apr 5, 2016
@tbg tbg added this to the Post 1.0 milestone Apr 5, 2016
@tbg
Member Author

tbg commented Apr 5, 2016

Having multiple write intents on a key is a topic we've discussed a few times (@mrtracy was interested in it back in the day, and @andreimatei more recently).

It's an interesting concept, but it does make things more complicated internally. As you mentioned, read-write conflicts are (usually) more common than write-write, so optimizing write-write conflicts away only affects a minority of workloads. But no doubt there will be some for which it does make a big difference. My main concern is that the additional complexity would have a measurable impact on the case in which it isn't required (i.e. the common case) and would ultimately distract us at this point. I have it on my internal "maybe one day" list.

For example,

  • Meta keys. Currently we have implicit meta keys for keys without an intent on them (i.e. just a single key-value pair), and an explicit meta key when there is an intent. With multiple intents, any old version of the key could be an intent, so we need a new concept of meta keys, presumably listing the versions of the keys which are intents, and their associated transactions as well.
  • Snapshot transactions which have been pushed still need to see their own writes, so finding the right version to read from one of the extended meta keys above is more work (i.e. meta keys must support lookup by transaction ID).
  • Snapshot gets more complicated generally. When a SNAPSHOT intent gets pushed, it currently gets pushed further than the read (this makes sense for clock uncertainty reasons), which means that there's a gap between what the read timestamp cache prohibits and what's possible. In particular, when not being careful, a transaction could read at T1, write based on that read to the same key, have that write pushed to a higher timestamp T2, and yet another transaction could come along and lay an intent at some timestamp between T1 and T2. No bueno! To prevent this, we need to populate the read timestamp cache on pushes of a SNAPSHOT transaction - all can be done, but you can see the complexity creep in there as well.
  • Conflict resolution gets more intricate. A reader may want to access a key and find multiple intents there, and must push all of them simultaneously. That means that the way priorities work gets skewed (since chances of success vary based on how many intents you run into).
  • There may be some issues with the way we implement replay protection, though the last time I thought about it I didn't find anything concrete - just more cases to consider.
  • Uncertainty restarts would apply to writes as well: If a write finds a newer write intent in its near future, I think it may need to push or restart (with optimizations we already have in place for the read case). For example, writing increasing numbers on a single client thread should result in increasing versions of the key even when going through multiple coordinators with clock skew. That means that some transactions would still have to restart (even though that's still better than a write-write conflict).

Those are some of the immediate reasons that I wouldn't want to make this change unless we have a good reason that it's a major game changer. Due to the migration story, it's unlikely to happen until we have the next large breaking major release, anyway.

However, another possibility is remembering the read-set of the txn and then re-checking all previously read values to ensure they haven't been changed in the interval [Ts(txn), Tc(txn)]. If no changes have occurred, then it's safe to go forward with the commit.

That's very expensive. For txns with large read sets even more so, but even for small ones the extra round-trips are going to disqualify this (perhaps unless everything happens on a single node). I believe Postgres does something like this, but they can afford it due to having all of that information local to the transaction.

@tbg
Member Author

tbg commented Apr 5, 2016

Andrew:

Good stuff. I agree that the list of disadvantages with multiple intents is daunting enough that it's not worth allowing unless something major changes in the underlying representation that makes it much easier. As you say, write-write conflicts are quite uncommon, especially when you only consider the case where the value was not previously read. Even a transaction as simple as X = X + 1 will read the value before writing and trigger a read-write conflict that would force abort/restart.

On keeping the read-set, you would keep the list of ranges that you read (i.e. not the individual read keys) on the coordinator as it executed the query. If the # ranges got too high, you could always fall back to just aborting/restarting. Most OLTP queries would have just a few read ranges. That said, this is just a possible optimization that would only be worthwhile after proving that it's justified on real workloads. Definitely not of any short-term concern.

@bdarnell
Contributor

bdarnell commented Apr 5, 2016

Even a transaction as simple as X = X + 1 will read the value before writing and trigger a read-write conflict that would force abort/restart.

Not necessarily: UPDATE t SET X = X+1 WHERE pk=$1 could be turned into a KV-level increment command (not quite the same as our current increment command, e.g. don't create a value where one doesn't exist and don't return a value), and these increments would be commutative. Commutative operations would be a powerful use of multiple intents because they don't have to interfere with each other. Neha Narula has done a lot of work on exploiting commutativity to improve parallelism and laying down multiple intents for commutative operations would be one way to apply these principles to cockroachdb.

There's no question that this would be an extremely complex change to make, but I think the benefits could be substantial and go beyond things that today manifest as write-write conflicts.

@tbg tbg added A-kv-client Relating to the KV client and the KV interface. C-wishlist A wishlist feature. and removed C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) labels Aug 21, 2018
@nvanbenschoten
Member

A middle ground that @andreimatei and I arrived at would be to keep a history of a single transaction's intents on a key while not adjusting our rules for write-write conflicts between different transactions.

The idea is that we could store prior versions of a given intent in the MVCC metadata key instead of overwriting the intent entirely when a transaction writes to the same key multiple times. Doing so would be a fairly self-contained change which would not alter performance characteristics of any transaction that does not write to the same key multiple times. The necessary changes would be:

  • MVCCMetadata would get a new IntentHistory []struct{seq int32, val []byte, more?} field
  • Transactional writes which find an existing intent at a lower sequence number would add the value of the intent to the IntentHistory before overwriting it.
  • Transactional requests would no longer return replay errors when they find an intent with a larger sequence number. Instead, they would look through the intent history to determine their result.

The obvious utility of this change is that it would allow us to support idempotent intra-transactional requests (see #26915). Requests in a txn at sequence numbers beneath the current intent's could ignore that intent and look back at the intent history. This would allow them to produce a deterministic result without the risk of missing intents that they should have seen.

@nvanbenschoten
Member

cc. @andy-kimball. I figure you might have opinions here. This is one of your earliest issues, after all.

@andy-kimball
Contributor

My main opinion is that I'm definitely in favor of idempotent designs in distributed systems. It tends to pay off many times over the course of the system's lifetime, often in unexpected ways. So if there's a reasonable way to make Cockroach more idempotent, I'm in favor.

craig bot pushed a commit that referenced this issue Dec 6, 2018
32688: storage: Collect IntentHistory for transactions r=ridwanmsharif a=ridwanmsharif

First part of making transactions more idempotent. This change
adds history of all writes on a key by the transaction. This
can be used to create save points and rollback to certain points
inside a transaction. More on why this is needed is explained here:
#5861 (comment)

Release note: None

Co-authored-by: Ridwan Sharif <ridwan@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Dec 12, 2018
33001: storage: allow transactions to run at a lower sequence r=ridwanmsharif a=ridwanmsharif

Continuing off #32688, final part of #5861 (comment).

Adds support to have transactions run at a lower sequence
at a given key. It asserts the value it computes with the
value written for the sequence in the intent history instead
of returning a retry-able error.

Release note: None

Co-authored-by: Ridwan Sharif <ridwan@cockroachlabs.com>
@lunevalex lunevalex added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) and removed C-investigation Further steps needed to qualify. C-label will change. labels Mar 26, 2021
@jlinder jlinder added the T-kv KV Team label Jun 16, 2021
@github-actions

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@github-actions github-actions bot closed this as not planned (stale) Oct 9, 2023