tx/rm_stm: fix handling of replay commit requests #21340

Merged: 1 commit, Jul 11, 2024

@bharathv bharathv commented Jul 10, 2024

A commit replay can happen, for example, in the following cases:

  1. Coordinator marked the tx as prepared, dispatched remote commit_tx and
    crashed. On recovery, it notices the transaction is prepared and
    attempts to roll it forward (recommit).

  2. Coordinator lost leadership while dispatching remote commit_tx and
    got frozen. A new leader takes over, commits and bumps the sequence
    number to start a new transaction. Meanwhile the original coordinator
    recovers from the frozen state and keeps retrying for a brief period
    (rare case).
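
Case 1 can be illustrated with a toy model. This is a minimal sketch in Python, not Redpanda code: `CoordinatorTx`, `TxStatus`, `recover` and the `send_commit_tx` callback are hypothetical names, and it assumes the coordinator durably records the "prepared" status before dispatching commit_tx.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, List


class TxStatus(Enum):
    ONGOING = auto()
    PREPARED = auto()    # commit decision is durable, partitions not yet confirmed
    COMMITTED = auto()


@dataclass
class CoordinatorTx:
    # Hypothetical coordinator-side record of one transaction.
    producer_id: int
    epoch: int
    tx_seq: int
    partitions: List[str]
    status: TxStatus = TxStatus.ONGOING


def recover(tx: CoordinatorTx,
            send_commit_tx: Callable[[str, int, int, int], bool]) -> bool:
    """Roll a prepared transaction forward after a (simulated) restart.

    The coordinator cannot tell which partitions applied the commit
    before the crash, so it re-sends commit_tx to every partition.
    Partitions that already committed see the request as a replay.
    """
    if tx.status is not TxStatus.PREPARED:
        return False
    # Build the full list first so every partition receives the request.
    acks = [send_commit_tx(p, tx.producer_id, tx.epoch, tx.tx_seq)
            for p in tx.partitions]
    if all(acks):
        tx.status = TxStatus.COMMITTED
    return all(acks)
```

In this model, a partition that keeps rejecting the replayed commit_tx leaves the transaction in `PREPARED` on every recovery attempt.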

Prior to this commit, if the partition had lost the transaction state, it
rejected a replayed commit as invalid. The transaction state can be lost
due to producer eviction or because the sequence number advanced. This
can result in a stuck transaction on the coordinator (particularly in
example 1): it cannot be rolled forward or back, even with a higher
producer epoch, effectively resulting in an unusable producer.

With this commit, if there is no local inflight transaction for the
given producer, the partition simply assumes the transaction was
committed to make forward progress. This is a valid assumption for the
following reasons:

  1. The local transaction state is only evicted if the transaction is
    committed or aborted, so one of these must have happened.
  2. The fact that there is a commit_tx() from a coordinator means the tx
    is guaranteed to be "prepared" on the coordinator, so the only
    way it could have been sealed already is by an earlier commit_tx.
Fixes: #19954
Fixes: #18184

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.1.x
  • v23.3.x
  • v23.2.x

Release Notes

  • none

@bharathv

/ci-repeat 1
skip-units
dt-repeat=50 tests/rptest/transactions/transactions_test.py::TxUpgradeTest

@bharathv bharathv changed the title tx/rm_stm: fix dealing with replay commit requests tx/rm_stm: fix handling of replay commit requests Jul 11, 2024
@bharathv bharathv marked this pull request as ready for review July 11, 2024 04:01
@bharathv bharathv merged commit c18c786 into redpanda-data:dev Jul 11, 2024
21 checks passed
@bharathv bharathv deleted the fix-19954-2 branch July 11, 2024 14:59