storage: allow multiple intents #5861
Having multiple write intents on a key is a topic we've discussed a few times (@mrtracy was interested in it back in the day, and @andreimatei more recently). It's an interesting concept, but it does make things more complicated internally. As you mentioned, read-write conflicts are (usually) more common than write-write, so optimizing write-write conflicts away only affects a minority of workloads to begin with. But no doubt there will be some for which it does make a big difference. My main concern is that the additional complexity would have a measurable impact on the case in which it isn't required (i.e. the common case) and would ultimately distract us at this point. I have it on my internal "maybe one day" list. For example,
Those are some of the immediate reasons that I wouldn't want to make this change unless we have a good reason that it's a major game changer. Due to the migration story, it's unlikely to happen until we have the next large breaking major release, anyway.
That's very expensive. For txns with large read sets even more so, but even for small ones the extra round-trips are going to disqualify this (perhaps unless everything happens on a single node). I believe Postgres does something like this, but they can afford it due to having all of that information local to the transaction.
Andrew: Good stuff. I agree that the list of disadvantages with multiple intents is daunting enough that it's not worth allowing unless something major changes in the underlying representation that makes it much easier. As you say, write-write conflicts are quite uncommon, especially when you only consider the case where the value was not previously read. Even a transaction as simple as X = X + 1 will read the value before writing and trigger a read-write conflict that would force an abort/restart. On keeping the read-set, you would keep the list of ranges that you read (i.e. not the individual read keys) on the coordinator as it executed the query. If the number of ranges got too high, you could always fall back to just aborting/restarting. Most OLTP queries would have just a few read ranges. That said, this is just a possible optimization that would only be worthwhile after proving that it's justified on real workloads. Definitely not of any short-term concern.
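As a rough illustration of the idea above, the coordinator could track coarse read spans (not individual keys) and fall back to a plain abort/restart once the set grows too large. This is only a sketch under stated assumptions: the names below (`readSpanTracker`, `record`, `canRevalidate`) are hypothetical and not part of the CockroachDB codebase.

```go
package main

import "fmt"

// span is a coarse [start, end) key range read by the transaction.
type span struct {
	start, end string
}

// readSpanTracker accumulates the ranges a transaction has read, up to a
// fixed budget. Past the budget we stop tracking, and the caller must fall
// back to aborting/restarting when the commit timestamp gets pushed.
type readSpanTracker struct {
	spans    []span
	maxSpans int
	overflow bool
}

// record adds a read span, or flips into overflow mode once the budget is hit.
func (t *readSpanTracker) record(start, end string) {
	if t.overflow {
		return
	}
	if len(t.spans) >= t.maxSpans {
		t.spans = nil
		t.overflow = true
		return
	}
	t.spans = append(t.spans, span{start, end})
}

// canRevalidate reports whether the read set is still small enough to
// re-check at commit time instead of restarting the transaction.
func (t *readSpanTracker) canRevalidate() bool { return !t.overflow }

func main() {
	tr := &readSpanTracker{maxSpans: 4}
	tr.record("a", "b")
	tr.record("c", "d")
	fmt.Println("can revalidate:", tr.canRevalidate()) // true: only two spans tracked
}
```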
Not necessarily: There's no question that this would be an extremely complex change to make, but I think the benefits could be substantial and go beyond things that today manifest as write-write conflicts.
A middle ground that @andreimatei and I arrived at would be to keep a history of a single transaction's intents on a key while not adjusting our rules for write-write conflicts between different transactions. The idea is that we could store prior versions of a given intent in the MVCC metadata key instead of overwriting the intent entirely when a transaction writes to the same key multiple times. Doing so would be a fairly self-contained change which would not alter performance characteristics of any transaction that does not write to the same key multiple times. The necessary changes would be:
The obvious utility of this change is that it would allow us to support idempotent intra-transactional requests (see #26915). Requests in a txn at sequence numbers beneath the current intent's could ignore that intent and look back at the intent history. This would allow them to produce a deterministic result without the risk of missing intents that they should have seen.
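To make the proposal concrete, here is a minimal sketch of what keeping an intent history in the MVCC metadata and reading at a lower sequence number might look like. It is an illustration only, not the actual change list or CockroachDB's MVCCMetadata representation; the names (`intentMeta`, `sequencedIntent`, `readAtSeq`) are invented for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// sequencedIntent is one prior write by the same transaction on this key,
// tagged with the sequence number of the request that wrote it.
type sequencedIntent struct {
	seq   int32
	value []byte
}

// intentMeta is a simplified stand-in for the MVCC metadata key: the current
// provisional value plus the history of earlier writes by the same txn.
type intentMeta struct {
	curSeq   int32
	curValue []byte
	history  []sequencedIntent // ascending by seq
}

// write records a new provisional value, pushing the previous one into the
// history instead of overwriting the intent entirely.
func (m *intentMeta) write(seq int32, value []byte) {
	if m.curValue != nil {
		m.history = append(m.history, sequencedIntent{m.curSeq, m.curValue})
	}
	m.curSeq, m.curValue = seq, value
}

// readAtSeq returns the value visible to a request at the given sequence
// number: the current intent if it is old enough, otherwise the newest
// history entry at or below seq. A replayed (idempotent) request uses this
// instead of observing writes "from its future".
func (m *intentMeta) readAtSeq(seq int32) ([]byte, error) {
	if seq >= m.curSeq {
		return m.curValue, nil
	}
	for i := len(m.history) - 1; i >= 0; i-- {
		if m.history[i].seq <= seq {
			return m.history[i].value, nil
		}
	}
	return nil, errors.New("no value written at or below this sequence")
}

func main() {
	var m intentMeta
	m.write(1, []byte("v1"))
	m.write(3, []byte("v3"))
	v, _ := m.readAtSeq(2) // a replayed request at seq 2 sees v1, not v3
	fmt.Println(string(v))
}
```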
cc. @andy-kimball. I figure you might have opinions here. This is one of your earliest issues, after all.
My main opinion is that I'm definitely in favor of idempotent designs in distributed systems. It tends to pay off many times over the course of the system's lifetime, often in unexpected ways. So if there's a reasonable way to make Cockroach more idempotent, I'm in favor.
32688: storage: Collect IntentHistory for transactions r=ridwanmsharif a=ridwanmsharif First part of making transactions more idempotent. This change adds a history of all writes on a key by the transaction. This can be used to create savepoints and roll back to certain points inside a transaction. More on why this is needed is explained here: #5861 (comment) Release note: None Co-authored-by: Ridwan Sharif <ridwan@cockroachlabs.com>
33001: storage: allow transactions to run at a lower sequence r=ridwanmsharif a=ridwanmsharif Continuing off #32688, final part of #5861 (comment). Adds support for transactions to run at a lower sequence at a given key. It asserts that the value it computes matches the value written for that sequence in the intent history, instead of returning a retryable error. Release note: None Co-authored-by: Ridwan Sharif <ridwan@cockroachlabs.com>
From an email thread with Andrew Kimball.
Andrew:
I've been taking a look at the Yabandeh paper and thinking about how it might affect the Cockroach algorithm. The paper does a good job explaining why WSI (write snapshot-isolation) is a useful concept, and why it guarantees serializable semantics. However, the paper argues that WSI can be as efficient as SI for real-world workloads. I feel skeptical about this claim, because SI only needs to abort transactions on write-write conflicts, whereas WSI needs to abort on read-write conflicts. Since read-write conflicts are more common, WSI is going to abort more.
I confess that I only skimmed the section in the paper about their implementation, as they use a centralized "status oracle". I don't think that's a good idea in a system like Cockroach, which should avoid single points of failure. Even if the status oracle consists of several "failure-resistant" machines linked by consensus, there are still delays during re-configuration that would affect all ongoing transactions. A large-scale distributed system should be built to gracefully degrade on failures rather than going immediately from 100% => 0%, as in that case. Also, in the case of nodes distributed over a wide geographic area (i.e. the globe), a central machine is going to greatly increase the overall latency of the system, as many nodes will be hundreds of milliseconds away from it. If the ranges I'm committing to are nearby, and if I find no conflicts, then I should be able to commit immediately, with no coordination necessary with machines elsewhere in the world.
I therefore conclude that this paper does not undermine the algorithm used by Cockroach for SI. I think it's still very much worthwhile to support SI, as there should be many fewer aborts in important real-world workloads. However, I do think that Cockroach's serializable algorithm could be improved, based on the concepts from the paper. Hopefully my knowledge of Cockroach's serializable algorithm is not too far out-of-date, otherwise this might not make much sense.
In particular, there is no need to check for write-write conflicts in serializable mode. So, if a txn is trying to lay down a write intent, it could simply lay it down right beside another write intent, rather than aborting the other txn. This is true as long as the candidate commit time of the txn is > the last time that row may have been read. For example, using the notation from the paper:
In this case, since no conflict was registered (since write-write conflicts are ignored), txn1's commit timestamp would not be pushed forward in logical time, and so would simply equal the starting timestamp: Ts(txn1) = Tc(txn1). The same situation would be true for txn2. The equivalent serial history would be (assuming Ts(txn1) < Ts(txn2)):
Happily, Cockroach's current serializable rules already prevent read-write conflicts. When it comes time to commit a txn, Cockroach already checks if the txn's commit time is different than its start time. If it is different, then it's possible that it may have read an older version of some value that another txn later modified. I believe Cockroach simply aborts the txn in that case. However, another possibility is remembering the read-set of the txn and then re-checking all previously read values to ensure they haven't been changed in the interval [Ts(txn), Tc(txn)]. If no changes have occurred, then it's safe to go forward with the commit.
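To make the "remember the read-set and re-check it" idea concrete, here is a hedged sketch: at commit time, if the commit timestamp Tc was pushed above the start timestamp Ts, the coordinator re-checks each remembered read span and confirms nothing changed in (Ts, Tc]; only if that check fails does it abort. The interfaces and names below (`kvReader`, `changedInWindow`, `canCommitAtPushedTimestamp`) are invented for illustration and are not Cockroach's actual API.

```go
package main

import "fmt"

// readSpan is a key range the transaction read at its start timestamp Ts.
type readSpan struct{ start, end string }

// kvReader stands in for whatever lets the coordinator ask a range whether
// any key in a span was written within a given timestamp window.
type kvReader interface {
	// changedInWindow reports whether any key in [start, end) has a version
	// with timestamp in (after, upTo].
	changedInWindow(start, end string, after, upTo int64) (bool, error)
}

// canCommitAtPushedTimestamp re-validates the read set after the commit
// timestamp tc was pushed above the start timestamp ts. If no read span
// changed in (ts, tc], the reads are still valid at tc and the transaction
// can commit there instead of restarting.
func canCommitAtPushedTimestamp(r kvReader, reads []readSpan, ts, tc int64) (bool, error) {
	if tc == ts {
		return true, nil // nothing was pushed; nothing to check
	}
	for _, s := range reads {
		changed, err := r.changedInWindow(s.start, s.end, ts, tc)
		if err != nil {
			return false, err
		}
		if changed {
			return false, nil // someone wrote under one of our reads; must restart
		}
	}
	return true, nil
}

// memReader is a toy in-memory implementation: key -> latest write timestamp.
type memReader map[string]int64

func (m memReader) changedInWindow(start, end string, after, upTo int64) (bool, error) {
	for k, ts := range m {
		if k >= start && k < end && ts > after && ts <= upTo {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	r := memReader{"b": 7}
	ok, _ := canCommitAtPushedTimestamp(r, []readSpan{{"a", "c"}}, 5, 10)
	fmt.Println("can commit at pushed timestamp:", ok) // false: "b" changed at ts 7
}
```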
One other interesting point. Consider History 6 in the paper, which will force an abort due to the rules of WSI:
This is actually a serializable history, and does not need to be aborted:
The rules that Cockroach uses would not abort this. Here is the sequence:
Thus we have arrived at the serializable history H7 with no abort. So while the Cockroach algorithm aborts cases that Yabandeh would not, there are cases where the reverse is true. Furthermore, Cockroach could avoid the other aborts if it remembered a transaction's read-set, as I suggest above as a possibility.
~Andy
P.S. I sometimes like to use an alternate history representation which factors in logical time as well as absolute (wall-clock) time:
In this notation, absolute time reads from left to right, and logical time reads from top to bottom. This makes it more clear that txn1 commits after txn2 in absolute time, but before it in logical time.
Jira issue: CRDB-6193