Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add thrashing of old txn sessions to prevent OOM #12477

Merged
merged 11 commits into from
Aug 30, 2023

Conversation

rystsov
Copy link
Contributor

@rystsov rystsov commented Jul 27, 2023

PR adds tx thrashing to abort & evict old txn sessions once the number of known txn session passes a threshold (new max_transactions_per_coordinator property).

Fixes #12424

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.2.x
  • v23.1.x
  • v22.3.x

Release Notes

  • none

@rystsov
Copy link
Contributor Author

rystsov commented Jul 28, 2023

/ci-repeat 5
dt-repeat=1

@rystsov
Copy link
Contributor Author

rystsov commented Jul 29, 2023

/ci-repeat 5
dt-repeat=1

@rystsov
Copy link
Contributor Author

rystsov commented Jul 30, 2023

/ci-repeat 5
dt-repeat=1

@rystsov rystsov requested review from bharathv and mmaslankaprv July 31, 2023 15:48
@rystsov rystsov changed the title WIP: 12424 fix Add thrashing of old txn sessions to prevent OOM Jul 31, 2023
@rystsov rystsov marked this pull request as ready for review July 31, 2023 15:51
@rystsov rystsov force-pushed the issue-12424 branch 2 times, most recently from 3a91912 to 54c87cd Compare August 2, 2023 02:10
@rystsov
Copy link
Contributor Author

rystsov commented Aug 2, 2023

/ci-repeat 5
dt-repeat=1

@rystsov
Copy link
Contributor Author

rystsov commented Aug 2, 2023

/ci-repeat 5
dt-repeat=1

@rystsov rystsov requested a review from mmaslankaprv August 3, 2023 15:44
src/v/cluster/tm_stm_cache.h Outdated Show resolved Hide resolved
src/v/cluster/tm_stm_cache.h Show resolved Hide resolved
src/v/utils/rwlock.h Show resolved Hide resolved
src/v/cluster/tm_stm.h Outdated Show resolved Hide resolved
src/v/cluster/tx_gateway_frontend.cc Show resolved Hide resolved
@mmaslankaprv
Copy link
Member

I am wondering if in the case when tx_id limit is breached we should simply return an error until old transactions are expired completed ? Also why other Kafka protocol implementations don't have a limit for max concurrent transactions ?

@rystsov
Copy link
Contributor Author

rystsov commented Aug 11, 2023

I am wondering if in the case when tx_id limit is breached we should simply return an error until old transactions are expired completed?

We can't rely on expiration because it's controlled by the users and Flink recommends to set it to pretty large values

Also why other Kafka protocol implementations don't have a limit for max concurrent transactions?

They don't have the limit but they have the same problem: see KIP-936. Once it's approved and implemented we'll support to be compatible but we can't wait and to fix existing problem we'll use the limit so far.

src/v/utils/rwlock.h Show resolved Hide resolved
src/v/cluster/tm_stm_cache.cc Show resolved Hide resolved
src/v/cluster/tm_stm.cc Show resolved Hide resolved
src/v/cluster/tm_stm.h Outdated Show resolved Hide resolved
src/v/cluster/tm_stm_cache.h Outdated Show resolved Hide resolved
src/v/cluster/tm_stm.cc Outdated Show resolved Hide resolved
src/v/cluster/tm_stm.h Outdated Show resolved Hide resolved
src/v/cluster/tm_stm_cache.cc Outdated Show resolved Hide resolved
src/v/cluster/tx_gateway_frontend.cc Show resolved Hide resolved
src/v/utils/rwlock.h Show resolved Hide resolved
@rystsov
Copy link
Contributor Author

rystsov commented Aug 16, 2023

Known issue: #12659

@mmaslankaprv
Copy link
Member

I am wondering if in the case when tx_id limit is breached we should simply return an error until old transactions are expired completed?

We can't rely on expiration because it's controlled by the users and Flink recommends to set it to pretty large values

Also why other Kafka protocol implementations don't have a limit for max concurrent transactions?

They don't have the limit but they have the same problem: see KIP-936. Once it's approved and implemented we'll support to be compatible but we can't wait and to fix existing problem we'll use the limit so far.

The KIP-936 is mentioning producer_id not transactional_id so it is still not clear to me how this will help ?

@rystsov
Copy link
Contributor Author

rystsov commented Aug 17, 2023

I am wondering if in the case when tx_id limit is breached we should simply return an error until old transactions are expired completed?

We can't rely on expiration because it's controlled by the users and Flink recommends to set it to pretty large values

Also why other Kafka protocol implementations don't have a limit for max concurrent transactions?

They don't have the limit but they have the same problem: see KIP-936. Once it's approved and implemented we'll support to be compatible but we can't wait and to fix existing problem we'll use the limit so far.

The KIP-936 is mentioning producer_id not transactional_id so it is still not clear to me how this will help?

Reread it, yeah it mentioned only producer id caused OOMs. Maybe they'll introduce quotas to txn coordinator too

src/v/cluster/tm_stm.h Outdated Show resolved Hide resolved
src/v/cluster/tx_gateway_frontend.cc Show resolved Hide resolved
rystsov added 11 commits August 29, 2023 12:07
pure refactoring, nothing new added
`limit_init_tm_tx` will be developed into controlling the number
of active tx sessions. Once the number of txs passes the threshold
the method will terminate old session to make room for new txes to
prevent OOM.
LRU tx list is added to track & look up oldest tx session
`limit_init_tm_tx` is updated into controlling the number of active
tx sessions. Once the number of tx sessions on a single controller
passes the threshold (`max_transactions_per_coordinator`) the method
terminates old tx session to make room for new txes to prevent OOM.
@rystsov
Copy link
Contributor Author

rystsov commented Aug 30, 2023

Only known issues fail:

@piyushredpanda piyushredpanda merged commit c159efd into redpanda-data:dev Aug 30, 2023
@vbotbuildovich
Copy link
Collaborator

/backport v23.2.x

@vbotbuildovich
Copy link
Collaborator

/backport v23.1.x

@vbotbuildovich
Copy link
Collaborator

Failed to run cherry-pick command. I executed the commands below:

git checkout -b backport-pr-12477-v23.1.x-248 remotes/upstream/v23.1.x
git cherry-pick -x cdd6481925721ab10f26aedb52d0318038762eca 6e4fce3b81a23f58cf0105fa0afd0ceacfc9a0ea 679baa6179bbac06312636848aca33fd0c87519f 48d85d29c57280d5167889fae3bcaac5e1867336 03d4a0e2798d87bc565b341b944930fcb3fb82e1 b4ed101182bafa92609c6be86e8515c59455fce7 37d745f3a85d85edc324e7739439af62388edbdb dc9b56ec32b1a648b3ea262987de4c5d54f37572 4bf7d351654cbe4bce41a7562bba3b466e44f5ce 1da806427e71f3888dcf03257a3552d1a94978cc b831fd8ad5bf00b43e7a39228ba0f3a02a0c10e2

Workflow run logs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Limit the number of active transactions to prevent OOM
6 participants