- Feature Name: SQL savepoints
- Status: draft
- Start Date: 2019-10-14
- Authors: andrei knz, with technical input from nathan tbg lucy radu
- RFC PR: #41569
- Supporting tech note: #42116 or here after this PR merges.
- Cockroach Issue: #10735 #28842
Remember, you can submit a PR with your RFC before the text is complete. Refer to the README for details.
Table of contents:
- Summary
- Motivation
- Guide-level explanation
- Reference-level explanation
- Drawbacks
- Rationale and Alternatives
- Unresolved questions
This RFC proposes to implement SQL savepoints as supported by PostgreSQL.
Savepoints enable a client to partially roll back a transaction.
This is a feature that is often requested by users, and used prominently in the test suites of 3rd party tools, in particular ORMs. We want to enable these test suites so as to increase our compatibility coverage.
The addition of SQL savepoints is enabled by recent changes to the KV layer, whereby write intents now preserve the history of sequence numbers that wrote to a key throughout a transaction.
The proposal here also incidentally addresses CockroachDB's own Halloween problem (issue #28842) without additional work: the introduction of sequencing steps inside SQL statements ensures that SQL mutations cannot read their own writes any more.
SQL savepoints are prominently used by 3rd party tools and frameworks. They are key to implement nested transactions, which is a common programming idiom from Java and other object-oriented languages.
A savepoint is a special mark inside a transaction that allows all commands that are executed after it was established to be rolled back, restoring the transaction state to what it was at the time of the savepoint.
See also: https://www.postgresql.org/docs/current/sql-savepoint.html
- Syntax and introduction
- Nesting
- Savepoint name scoping
- Multiple-level commit/rollback
- Behavior in case of errors
- Relationship with client-side retries
-
to establish a savepoint inside a transaction:
SAVEPOINT savepoint_name
The usual PostgreSQL identifier rules apply:
SAVEPOINT foo
andSAVEPOINT Foo
define the same savepoint, whereasSAVEPOINT "Foo"
defines another. -
to roll back a transaction partially to a previously established savepoint:
ROLLBACK TO SAVEPOINT savepoint_name
-
to forget a savepoint, and keep the effects of statements executed after the savepoint was established:
RELEASE SAVEPOINT savepoint_name
For example:
BEGIN;
INSERT INTO table1 VALUES (1);
SAVEPOINT my_savepoint;
INSERT INTO table1 VALUES (2);
ROLLBACK TO SAVEPOINT my_savepoint;
INSERT INTO table1 VALUES (3);
COMMIT;
The above transaction will insert the values 1 and 3, but not 2.
Schema changes and other DDL are supported "under" savepoints and can be partially rolled back without rolling back the entire transaction.
For example:
BEGIN;
CREATE TABLE u(x INT);
SAVEPOINT foo;
CREATE TABLE t(x INT);
INSERT INTO t(x) VALUES (1);
ROLLBACK TO SAVEPOINT foo;
INSERT INTO u(x) VALUES(1);
SAVEPOINT bar;
CREATE TABLE t(x TEXT);
RELEASE SAVEPOINT foo;
INSERT INTO t(x) VALUES ('a');
COMMIT;
This inserts 1 into u and 'a' into t. The table t
with an INT column
does not exist after the transaction commits.
Note that the ordering of DDL vs DML statements remain restricted as per previously, none of the known limitations are lifted in this work.
Savepoints can be nested.
For example:
BEGIN;
INSERT INTO table1 VALUES (1);
SAVEPOINT my_savepoint;
INSERT INTO table1 VALUES (2);
SAVEPOINT my_savepoint2;
INSERT INTO table1 VALUES (3);
ROLLBACK TO SAVEPOINT my_savepoint2;
INSERT INTO table1 VALUES (4);
RELEASE my_savepoint;
COMMIT;
This inserts values 1, 2 and 4 but not 3.
Changes partially committed by a savepoint release can be rolled back by an outer savepoint.
For example:
BEGIN;
INSERT INTO table1 VALUES (1);
SAVEPOINT my_savepoint;
INSERT INTO table1 VALUES (2);
SAVEPOINT my_savepoint2;
INSERT INTO table1 VALUES (3);
RELEASE SAVEPOINT my_savepoint2;
ROLLBACK TO SAVEPOINT my_savepoint;
COMMIT;
This inserts only value 1. The value 3 is rolled back alongside 2.
As an extension to the SQL standard, PostgreSQL allows a
SAVEPOINT
statement to shadow an earlier savepoint with the same
name. The name refers to the new savepoint until released/rolled back,
after which the name reverts to referring to the previous savepoint.
For example:
BEGIN;
INSERT INTO table1 VALUES (1);
SAVEPOINT my_savepoint;
INSERT INTO table1 VALUES (2);
SAVEPOINT my_savepoint;
INSERT INTO table1 VALUES (3);
ROLLBACK TO SAVEPOINT my_savepoint;
INSERT INTO table1 VALUES (4);
RELEASE SAVEPOINT my_savepoint;
COMMIT;
This inserts values 1, 2 and 4, but not 3.
RELEASE SAVEPOINT
and ROLLBACK TO SAVEPOINT
can refer to
a savepoint "higher" in the nesting hierarchy. When this occurs, all
the savepoints "under" the nesting are automatically released/rolled
back too.
For example:
BEGIN;
SAVEPOINT foo;
INSERT INTO table1 VALUES (1);
SAVEPOINT bar;
INSERT INTO table1 VALUES (2);
RELEASE SAVEPOINT foo;
COMMIT;
This inserts both 1 and 2.
BEGIN;
SAVEPOINT foo;
INSERT INTO table1 VALUES (1);
SAVEPOINT bar;
INSERT INTO table1 VALUES (2);
ROLLBACK TO SAVEPOINT foo;
COMMIT;
This inserts nothing: both inserts are rolled back.
BEGIN;
SAVEPOINT foo;
SAVEPOINT bar;
ROLLBACK TO SAVEPOINT foo;
RELEASE SAVEPOINT bar; -- error: savepoint "bar" does not exist
COMMIT;
This demonstrates that the name "bar" is not visible after it was rolled back over.
If a SQL error occurs "under" a savepoint, it is possible to recover an open, "healthy" txn by rolling back the savepoint without rolling back the txn. (An exception to this is discussed below.)
For example:
kena=> create table u(x int unique);
CREATE TABLE
kena=> insert into u(x) values(1);
INSERT 0 1
kena=> begin; \
savepoint foo; \
insert into u(x) values (1); \
rollback to savepoint foo; \
insert into u(x) values (2); \
commit;
BEGIN
SAVEPOINT
ERROR: duplicate key value violates unique constraint "u_x_key"
DETAIL: Key (x)=(1) already exists.
ROLLBACK
INSERT 0 1
COMMIT
kena=> select * from u;
x
---
1
2
(2 rows)
(i.e. the second insert succeeds even though the first insert encountered an error)
In the first implementation, an exception to this will be the handling of retry errors. These will not be cancellable with a savepoint rollback. Instead, a client-side retry loop must be implemented using the regular mechanism (see next section).
The mechanism previously used by CockroachDB to drive client-side retries remains valid, initially with the following restrictions:
- only savepoints with a name starting with the special prefix
cockroach_restart
(includingcockroach_restart
itself but alsocockroach_restart123134
etc) are to be considered a marker for client-side retries. - restart savepoints can only be used at the outmost level of nesting,
i.e.
begin; savepoint cockroach_restart; savepoint foo
is OK, butbegin; savepoint foo; savepoint cockroach_restart
is not.
In other words, a retry error requires clients to unwind the entire savepoint hierarchy.
If the transaction encounters a retry error, and subsequently attempts
to roll back / release a savepoint other than a top-level
cockroach_restart
, another retry error is raised (to force the
client into continuing the error processing).
There is a glimmer of hope for an implementation that is able to handle retries at the level of a single savepoint (the innermost one) but this would be more complex and is left out of scope for this implementation.
(more restrictions TBD as informed by implementation)
- Design overview
- Bill of work (preliminary)
- SQL executor changes
- Savepoints and schema changes
- Savepoint rollbacks, MVCC and storage
- TxnCoordSender changes
The overall design can be understood as follows:
-
at the SQL/KV interface, KV operations are associated with sequence numbers (seqnums):
- write operations generate new seqnums, which are stored inside write intents
- the txn object also contains a list of sequence numbers (or ranges thereof) that are rolled back and are to be ignored during reads.
- read operations operate "at" a particular seqnum- a MVCC read that encounters an intent ignores the values written at later seqnums and returns the most recent value at that seqnum instead, also substracting any past value at seqnums marked to be ignored due to rollbacks.
- intent resolution also throws away any values inside the rolled back seqnum ranges.
-
outside of SQL, reads operate at the seqnum of the latest write as is done today; this preserves read-your-own-writes even inside individual KV batches
-
inside SQL, the executor configures the sql/kv interface so that all reads performed on behalf of an "execution step" (usually, a single statement) are performed at the seqnum of the last write in the previous step.
- most statements consist of just one step, whereby the seqnum boundary is in-between statements. However, with CTE mutations, FK checks and cascading updates, there may be multiple steps inside a single complex statement.
- for CPuts specifically, the condition part ignores the configured read seqnum and checks the most recent (non rolled back) write instead. This is needed in mutations.
-
a savepoint is defined by:
- when adding a savepoint, by saving the current write seqnum,
- to roll back a savepoint, by marking the seqnums in-between as rolled back inside the txn object, to be ignored during reads (see above).
- savepoint release is a no-op in kv, and simply updates a name-seqnum mapping inside SQL.
-
conceptually, savepoints define a stack-like structure. Any savepoint has at most one savepoint open "under" it. When releasing or rolling back to a savepoint higher in the stack, all the savepoints in-between are also released or rolled back. In the SQL executor, this is managed using a stack-like data structure.
-
to introduce savepoint semantics in schema updates / DDL, the descriptor caches are extended to store the seqnum alongside the cached descs. Cache entries are not considered/evicted if the seqnum has been rolled back.
Initial analysis, pre-impl, suggests the following Work to be performed:
- SQL execution changes:
- define the new executor
Step()
method and call it where relevant. - integrate with mutation and FK semantics
- connect the current savepoint/rollback logic to the new txncoordsender interface
- define the new executor
- SQL metadata:
- evict entries from desc caches upon savepoint rollback
- Storage changes to support rollbacks:
- #41612
- extend the txn proto with a list of ignored seqnum ranges
- extend the MVCC read logic to skip over ignored seqnum ranges
- extend the intent resolution logic to skip over ignored seqnum ranges
- TxnCoordSender changes:
- extend the TxnCoordSender interface with methods to mark a SQL step, create a savepoint and rewind (= mark seqnum ranges as ignored)
- rework the error handling logic in txncoordsender (and perhaps other layers) to avoid marking the txn object as in-error for non-permanent errors
- When rolling back to a savepoint, we'll also want to stop
considering writes with those sequence numbers as in-flight from
the perspective of a parallel commit. This logic will live in the
txnPipeliner
.
To support nesting and shadowing, the SQL executor maintains a naming environment: a list of mappings from names to savepoints.
In Go, using []struct{name,sp}
or equivalent.
(We do not use a simple map
because savepoints form a stack.)
The most recent, innermost savepoint information is at the end of the list.
When defining a savepoint, a new entry is appended. When releasing or rolling back a savepoint, the latest mapping for that sp name is removed:
- Trying to release or rollback a savepoint using a name that does not exist in that list results in an error 3B001 "savepoint does not exist".
- When the target of a RELEASE is an entry that's earlier than the latest entry, all the entries from that entry to the last are also popped out and released.
- Similarly, if the target of a ROLLBACK is an earlier entry, all the entries in-between are also removed from the environment and (conceptually) rolled back.
- (TBD:) uses of savepoint syntax with
cockroach_restart
special names first verify that the stack of savepoints is empty. If it is not, a usage/syntax/unimplemented error is produced.
The executor also provides a method Step()
on the execution context
(likely planner
or whatever is most relevant at the time of
implementation):
- this is called every time a SQL execution step is reached:
- at the end of execution of each regular statement
- after each top-level mutation CTE in a multi-mutation statement
- in-between the FK and cascading action phases, if any
- the implementation of
Step()
forwards the call to the RootTxn's TxnCoordSenderStep()
method (new), which in turn saves the current write seqnum as reference seqnum for future reads.
At a high level, mutations read at the seqnum of the previous statement and write at new seqnums. In particular, they don't read their own writes and this solves issue #28842.
This raises a couple "interesting" topics:
- Uniqueness violations
- ON CONFLICT processing
- FK existence checks under a single mutation
- FK cascading actions under a single mutation
- Multiple mutations with CTEs
There are really two cases:
-
we insert/modify a single row, and doing so creating a duplicate of a row that was modified in a previous statement (or sequencing step). This case is simple and transparently handled by "read at seqnum of previous step".
-
we insert/modify the same row two times inside the same mutation statement, or two rows such that they are duplicate according to some unique index. Here the problem is seemingly that the 2nd row update will not see the first.
However, when looking more closely there is no new problem here.
All writes to a unique index go through a KV CPut
on the uniqueness key.
By ensuring that CPuts
read their condition at the current write
seqnum, we can always pick up the latest write and detect duplicates.
(CPut will still skip over ignored / rolled back seqnums like other KV ops. It's only the target read seqnum that's ratcheted up to the present for CPut, in contrast to other mvcc ops that will be blocked by the configured target read seqnum.)
This opens a question of whether we need a variant of CPut which does not do this. TBD. (Initial analysis says no.)
Question arises of what to do when the result of ON CONFLICT processing changes a row in a read-modify-write fashion. For example:
INSERT INTO t(x) VALUES (1), (1) ON CONFLICT(x) DO UPDATE SET x = t.x + excluded.x
-- ^^^^^^^^ notice the dup row
Here conceptually the INSERT suggests that the 2nd ON CONFLICT resolution will observe the row as it was left over by the 1st. This would not work with "read at seqnum of previous statement".
The answer here is from a previous discussion around mutations that observed the following:
-
postgres does not support updating the same row two times in an ON CONFLICT clause.
-
it is desirable to batch / pre-compute the ON CONFLICT result values concurrently with the mutation for performance, and we've already established back in 2018 that the lack of support for updating the same row twice in pg makes this optimization possible.
-
the implementation was updated when bringing this logic under the CBO
From here, it follows that we don't care about "read at seqnum" inconsistencies as the current CBO logic already assumes that it's fine to miss earlier conflict resolutions.
FK existence checks must observe the data values post-mutation. For this we introduce a step boundary between the end of the "run" phase (where results were produced for the client) and the FK existence checks.
This way the reads for FK existence checks can see all the writes by the mutation.
(To be checked)
Postgres uses post-statement triggers to process FK cascading actions and existence checks. Cascading actions that result in mutations to other tables themselves append more triggers to run.
Each subsequent step in this cascade of effects is able to read its own writes (for futher FK checks).
We emulate this in CockroachDB by introducing a step boundary between iterations of the cascading algorithm.
It's possible for a single statement to define multiple mutations for example:
WITH
a AS (INSERT ... RETURNING ...),
b AS (INSERT ... RETURNING ...)
SELECT ...
PostgreSQL does not guarantee that the effect of one mutation is visible to another, or even to the later read-only parts of the statement. In fact it requires that all mutations operate on the same data at the beginning of the statement:
More specifically: https://www.postgresql.org/docs/12/queries-with.html
The sub-statements in WITH are executed concurrently with each other and with the main query. Therefore, when using data-modifying statements in WITH, the order in which the specified updates actually happen is unpredictable. All the statements are executed with the same snapshot (see Chapter 13), so they cannot “see” one another's effects on the target tables. This alleviates the effects of the unpredictability of the actual order of row updates, and means that RETURNING data is the only way to communicate changes between different WITH sub-statements and the main query.
So with the logic proposed so far, all the mutations inside the same statement execute from the same read seqnum.
If there is FK work to be done, the first sequencing step necessary for FK checks (to advance the read seqnum) will only occur after all mutations have completed.
(The observations from Uniqueness violations above apply here as well.)
Background:
- descriptors used in a txn are loaded in a desc cache (TableCollection + db desc cache) that's (currently) invalidated at the end of the txn
- DDL statements work as follows:
- they modify the descriptor by adding "pending mutation" records and persist the descs embedding these records using the current KV txn context
- additional DDL may add more mutations and perform further KV updates
- at the end of a SQL txn, after the KV txn commits, a "schema changer" job/process is kicked off to resolve the mutation entries on descriptors. The mutation records are read using AS OF SYSTEM TIME with the KV commit timestamp and the final desc post-mutations is written back to kv (possibly at a later timestamp).
To implement SQL savepoints:
-
the descriptor caches must be invalidated when a savepoint is rolled back. To achieve this:
- we extend the caches to store the savepoint object (write seqnum) under which a descriptor was loaded (= the most recent savepoint that's still open).
- when releasing a savepoint, we re-annotate all the cached descriptors cached "under" the savepoint(s) being released, to become cached "under" the savepoint that's still open. (Alternatively, we could flush the desc caches entirely. This is simpler but perf impact unknown. TBD.)
- when rolling back a savepoint, we evict entries from the cache that correspond to a seqnum being rolled back.
-
we let the further processing of DDL statements proceed using KV operations as usual. We expect that the rest of the KV/SQL semantics will make schema changes "just work":
- pending mutation records are cancelled/rolled back as any other KV write
- when the KV txn commits, only the KV writes corresponding to non-rolledback savepoints have been persisted, so any DDL changes "under" savepoints become invisible.
- the schema changer task that starts after that will thus only see the remaining (committed/released) mutation records and the DDL semantics should "just work".
(This will be validated by experimentation in a prototype).
Background:
- for a given KV pair, any write by a txn is done via a write intent
- if there is another write inside the txn, we append the new write
to the intent already laid. After two writes, we'd have separately:
- two seqnums
- two values
- two mvcc timestamps
We need to keep both values around even without savepoints because we may reorder KV reads and writes. For example, in the sequence write1-read1-write2, if the read1 operation happens to be processed after write2, we still want it to see only write1 that is logically in its past.
The way this works today is that each KV operation also has a metadata field which tells it "at which seqnum" it should read. The MVCC read logic, when it encounters an intent, scans the entries in the intent from last to first, and ignores all writes performed at a later seqnum.
For savepoints, this needs to be extended as follows:
-
the txn metadata, which is available in the context of every MVCC read, is extended by a new field "list of ignored seqnum ranges".
-
when a MVCC reads finds an intent and scans to find the first value that's no later than the current read seqnum, it should also skip over any value written at a seqnum also in the ignore list.
Savepoint rollbacks are implemented by populating the ignore list with the range of seqnums generated from the point the savepoint was last established, to the point of the rollback.
This storage-specific part of the work is described in this issue: #41612
- Background: Txn interface between SQL and KV
- Overview of TxnCoordSender changes
- SQL / KV API definition
This RFC assumes understanding from the TxnCoordSender tech note at #42116 or here after this PR merges.
On the "all-is-well" path (no errors) we want to associate SQL savepoints with seqnums, and clean up the read-your-own-writes semantics of SQL mutations.
The code changes will introduce the following principles for SQL transactions (non-SQL txns are unaffected):
- reads will always be performed at the seqnum
of the latest sequencing point (and not just the latest KV write
performed). Sequencing points are introduced by a new TCS
Step()
method called by SQL on the RootTxn where relevant:
- in-between statements
- in between the mutation and FK check / cascading actions
This method takes the latest write seqnum generated before it was called and copies it to become the read seqnum for every read performed thereafter.
-
there cannot be in-flight LeafTxns active when the
Step()
method is called, so that it's always clear which last invocation ofStep()
any given LeafTxn refers to. -
seqnums will not reset to 0 when a txn epoch is increemented so that seqnums can continue to increase monotonically across txn epochs and there is no seqnum reuse across epochs.
-
a SQL savepoint token is a copy of the seqnum captured by the last call to
Step()
, together with the minimal additional state sufficient to partially roll back the txn (this is discussed further below). -
a SQL savepoint release checks and reports any currently deferred error (see tech note for definition, e.g. txn push or WriteTooOld).
-
a SQL savepoint rollback is implemented as a
SavepointRollback()
method in the TCS, which takes as argument the SQL savepoint token where to rollback, computes the range of seqnums from the target savepoint to the current last generated seqnum, and then populates the current txn object to mark this entire range as rolled back. (This information will live in the transaction record and be available throughTxnMeta
to every KV op in MVCC, which need it.) -
there cannot be any in-flight LeafTxn active when a savepoint is rolled back, so that the list of ignored seqnum ranges can never change "under" a LeafTxn concurrently.
The SQL executor is responsible for organizing SQL execution so as to
prevent LeafTxns existing concurrently between the special Step()
and SavepointRollback()
operations.
See the section of the tech note for background information.
Error kind | Prior situation | New (proposed) situation |
---|---|---|
recoverable errors with in-place recovery | Auto-retry/adjust internal to TCS, txn object remains live, no client error | (unchanged) |
recoverable errors with txn restart | txn object re-init with epoch bump, retry error | unchanged, but see (2) below |
deferred retry errors (eg WriteTooOld) | error stored, temporarily hidden from client, re-reported during commit | unchanged, but see (3) below |
transient processing errors | TCS trashed + txn aborted | TCS + txn remain active, no txn state change, see (1) below |
transaction aborts | TCS trashed + txn aborted | (unchanged) |
unhandled errors | TCS trashed + txn aborted | (unchanged) |
-
avoid trashing the TCS when innocuous KV errors happen
The first and main change is to relax the notion of "unrecoverable error" (see tech note for a definition).
Today, transient processing errors like reading from a historical ts that's been GCed, an invalid CPut condition, etc all cause the TCS to move to the "txnError" state after which no operation is ever possible any more. The RFC proposes to change this so that only internal assertion errors cause a TCS to become fully invalidated.
Instead, KV errors like CPut condition errors will simply generate an error object (as they already do) and nothing else, and this error will flow back to SQL where it can be dealt with as usual.
The new behavior is that it will be possible to continue issuing KV requests via the TCS after such an error occurs.
-
The main change from the Seqnums and savepoints section above decouples the seqnum increments from epoch increments. This means that the "internal" error handling performed by TCS on retry errors will not invalidate seqnum ranges (and rolled back seqnum ranges after SQL savepoint rollbacks).
The new behavior is that it becomes possible to recover from a retry error (other than WriteTooOld, see point 3 below) using a SQL savepoint rollback.
-
any deferred error (currently, just
WriteTooOld
) is deferred merely to the first next savepoint RELEASE. If present it is reported then. If a client opens a new savepoint while there is a pending WriteTooOld error, that error state is preserved in the savepoint token and restored when the savepoint is rolled back.(TBD: whether this is correct/desirable. There is discussion about whether WriteTooOld is still handled this way at all.)
From the perspective of SQL clients:
Error type | Prior situation | New (proposed) situation |
---|---|---|
transaction aborts | no recovery possible | (unchanged) |
transient processing errors | no recovery possible | can be recovered using savepoint rollback |
unhandled errors | no recovery possible | (unchanged) |
recoverable errors with in-place recovery | automatic recovery, invisible to client | (unchanged) |
recoverable errors with txn restart | retry error, must start from beginning of txn | can be recovered using savepoint rollback |
deferred retry errors (eg WriteTooOld) | error reported during commit, entire txn can be retried | can be recovered using savepoint rollback |
-
Step()
method on TCS and*client.Txn
:-
new
StepsEnabled
boolean on the TCS and/ortxnSeqNumAllocator
interceptor. Set totrue
on first call toStep()
. Enables "all reads from last step seqnum" instead of the default "all reads from last write seqnum". -
new
Step()
method on TCS. -
new
Step()
method onTxnSender
interface. -
new
Step()
method onclient.Txn
, forwards to theTxnSender
.
-
-
GetSavepoint() (SavepointToken, error)
method on TCS and*client.Txn
, returns aSavepointToken
, to be used exclusively on RootTxns without LeafTxns active.-
new
GetSavepoint()
method on TCS.Initially:
type SavepointToken struct { SeqNum enginepb.TxnSeq Timestamp hlc.Timestamp RefreshedTimestamp hlc.Timestamp OrigTimestampWasObserved bool InFlightWriteIndex int // see explanation below Epoch int32 // (may be removed entirely)
}
2) conceptually asks all interceptors to "build a savepoint" although first implementation will be as simple as getting the current counter value in the `txnSeqNumAllocator`. 3) new `GetSavepoint()` on `TxnSender` interface 4) new `GetSavepoint()` on `client.Txn`, forwards to the `TxnSender`.
-
-
RollbackToSavepoint(SavepointToken) error
method on TCS and*client.Txn
, to be used exclusively on RootTxns without LeafTxns active.-
new
RollbackToSavepoint()
method on TCS. -
performs the seqnum invalidation in the txn record, as described above. Restores the additional state. In particular, the in flight write slice (
InFlightWrites
in theroachpb.Transaction
record) is truncated to the position indicated byInFlightWriteIndex
). -
new
RollbackToSavepoint()
method onTxnSender
interface -
new
RollbackToSavepoint()
method onclient.Txn
, forwards toTxnSender
.
-
This feature introduces more complexity in the SQL executor.
(to be populated)
There are two general design directions:
-
mark certain ranges of sequence numbers as "rolled back" in the transaction record itself, to be ignored during MVCC reads.
Pros: rollbacks are cheaper
Cons: reads are more expensive
Cons: after a rollback the intents are "still there" and create contention with concurrent txns that touch the rolled back intents.
-
proactively iterate through all intents generated for the current txn and remove the rolled back sequence numbers from the generated intents, including removing the intent if the sequence numbers being rolled back are the only ones remaining.
Pros: reads are cheaper
Pros: rolled back intents also roll back the contention
Pros: may simplify the row locking story (although Tobias found out that PostgreSQL is pretty bad on this so we don't need this pro to be at least as good as pg)
Cons: rollbacks are more expensive
Nathan recommends approach 1, sensing that it requires less work.
(none at this time)