rpc/conn_cache: eager cleanup of connection cache on shutdown #8847

Merged (8 commits, May 2, 2023)

Conversation

bharathv (Contributor)

At shutdown, eager cleanup of connection cache entries can result in faster termination of RPCs/connections to suspended nodes. This patch includes two main changes.

  • Hooks up an app-level abort source to the connection cache that triggers the cleanup function on notification (see the sketch below).
  • Removes the use of naked transport pointers in the protocol layer and elsewhere, as they are prone to use-after-free (UAF) bugs.

Fixes #7981
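As a rough illustration of the first change, here is a minimal sketch of hooking an application-level abort source into a cache-like class so that an abort notification triggers eager cleanup. This is not the actual Redpanda code: the class and member names (cache_like, register_abort_source, _sub, _shutting_down) are illustrative; only Seastar's ss::abort_source::subscribe API is assumed.

#include <optional>

#include <seastar/core/abort_source.hh>

namespace ss = seastar;

// Illustrative cache-like class (hypothetical names); the real
// connection_cache wiring in this patch is not reproduced here.
class cache_like {
public:
    // Register for abort notifications from an app-level abort source.
    void register_abort_source(ss::abort_source& as) {
        auto sub = as.subscribe([this]() noexcept {
            // The callback must be cheap and non-blocking: it only marks the
            // cache as shutting down; the actual connection teardown happens
            // in the cache's asynchronous cleanup path.
            _shutting_down = true;
        });
        if (sub) {
            _sub.emplace(std::move(*sub));
        }
    }

    bool shutting_down() const { return _shutting_down; }

private:
    std::optional<ss::abort_source::subscription> _sub;
    bool _shutting_down{false};
};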

Backports Required

  • none - not a bug fix
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v22.3.x
  • v22.2.x
  • v22.1.x

UX Changes

Release Notes

  • none

@bharathv (Contributor Author)

A couple of alternate approaches I prototyped but didn't like:

  • Hooking up a client-supplied abort_source via client_opts. This is tricky to implement because the client typically resides on a different shard than the transport, so the abort source notifications potentially involve cross-shard communication, which is prone to bugs and not easy to reason about.
  • Hooking up a (shard-local) abort source to the transport. This boils down to passing an abort source reference in the transport c'tor, but that didn't seem natural to me (and the code turned out to be ugly).

Rather, cleaning up at the cache layer seemed less complicated and easier to reason about. Wdyt?

@bharathv (Contributor Author)

/ci-repeat 10

@dotnwat (Member)

dotnwat commented Feb 21, 2023

@bharathv got a merge conflict

@bharathv (Contributor Author)

@bharathv got a merge conflict

Rebased.

@andrwng left a comment (Contributor)

I like the idea of having a sharded top-level abort source that we can plumb down starting at the application layer. Curious if you have thoughts on how much or how little we should extend this to other subsystems.

LGTM pending CI

Comment on lines +97 to +102
return ss::do_with(
cache.get(node_id),
[connection_timeout = connection_timeout.timeout_at(),
f = std::forward<Func>(f)](auto& transport_ptr) mutable {
return transport_ptr->get_connected(connection_timeout)
.then([f = std::forward<Func>(f)](
Contributor

Just curious if you tried coroutinizing this as some unsharded template method, and if you did but opted for this, what was the snag? The inferred signature? The convert() calls?

Contributor Author

I didn't try coroutinizing this; just playing it safe, I didn't want to introduce new bugs.
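For context, a coroutinized version might look roughly like the sketch below. This is only a shape, not the real helper: with_connected is a hypothetical name, the template parameters and return type are simplified, and the real code's error handling and convert() calls are omitted. It assumes get_connected() and f both return futures.

#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

namespace ss = seastar;

// Hypothetical, simplified shape of a coroutinized helper. The coroutine
// frame keeps transport_ptr alive across the co_awaits, which is the job
// ss::do_with performs in the continuation-passing version above.
template <typename TransportPtr, typename Timeout, typename Func>
ss::future<> with_connected(TransportPtr transport_ptr, Timeout timeout, Func f) {
    auto connected = co_await transport_ptr->get_connected(timeout);
    co_await f(connected);
}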

@@ -61,11 +61,14 @@ ss::future<> connection_cache::remove(model::node_id n) {
/// \brief closes all client connections
ss::future<> connection_cache::stop() {
auto units = co_await _mutex.get_units();
co_await parallel_for_each(_cache, [](auto& it) {
// Exchange ensures the cache is invalidated and concurrent
// accesses wait on the mutex to populate new entries.
Member

we were holding the lock before, so wouldn't the concurrent accesses have been waiting (or maybe the cache was checked without acquiring the lock)?

Contributor Author

or maybe the cache was checked without acquiring the lock

This, via connection_cache::contains() and get().

auto& [_, cli] = it;
return cli->stop();
});
_cache.clear();
cache.clear();
Member

the cache is now going to go out of scope on the next line and cleared in any case
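Assembling the visible diff fragments, the exchange-based stop() under discussion looks roughly like the sketch below. This is a hedged reconstruction, not the verbatim merged code; it relies on connection_cache's existing _mutex and _cache members shown in the hunk, and the local variable name cache is taken from the lines above.

// Hedged reconstruction of the pattern under discussion (not the verbatim
// diff): swap the map out while holding the mutex, then stop the clients
// owned by the local copy.
ss::future<> connection_cache::stop() {
    auto units = co_await _mutex.get_units();
    // Exchange invalidates the member map, so concurrent contains()/get()
    // callers observe an empty cache and go through the mutex to repopulate
    // entries (or bail out during shutdown).
    auto cache = std::exchange(_cache, {});
    co_await parallel_for_each(cache, [](auto& it) {
        auto& [_, cli] = it;
        return cli->stop();
    });
    // Not strictly required: `cache` is a local and is destroyed when the
    // coroutine returns anyway, which is the point raised above.
    cache.clear();
}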

Comment on lines 358 to +361
app_signal.wait().get();
trigger_abort_source();
Member

There is an abort_source in app_signal already. Can we use (or expand and use) that?

Contributor Author

I did begin with that thought but ran into an issue: invoking this new sharded abort_source via invoke_all() returns a future, and I can't block on it in app_signal's abort_source callback. To work around it, I would have to gate on it in an async fiber, and that seemed unnecessarily complex.
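To illustrate the snag (an assumption-laden sketch, not the actual application code): triggering a sharded abort source is itself asynchronous, so it cannot be done inline from a synchronous abort callback. Seastar's sharded invoke_on_all is used here as a stand-in for what the comment calls invoke_all(); the start/stop lifecycle of the sharded object is omitted.

#include <seastar/core/abort_source.hh>
#include <seastar/core/sharded.hh>

namespace ss = seastar;

// Sketch: requesting an abort on every shard returns a future, so whoever
// triggers it must be able to wait on (or track) that future. That is
// awkward from inside app_signal's synchronous abort_source callback, which
// is why the diff above calls trigger_abort_source() after app_signal.wait()
// returns instead.
ss::future<> trigger_abort_source(ss::sharded<ss::abort_source>& as) {
    return as.invoke_on_all(
      [](ss::abort_source& shard_local) { shard_local.request_abort(); });
}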

@bharathv left a comment (Contributor Author)

Thanks for the reviews, I forgot about this one. Addressing comments; will shortly rebase to fix conflicts.

@bharathv force-pushed the cleanup_conn_cache branch from 419da08 to 5b51cae (March 14, 2023 05:27)
@bharathv (Contributor Author)

Last force-push is a rebase to fix conflicts.

@bharathv requested review from dotnwat and andrwng (March 14, 2023 15:55)
Commit messages:

  • Introduces a decoupled shutdown method that clears the transport map.
  • Adds an abort source to be used as a trigger for shutting down the connection cache.
  • Currently we use a raw transport pointer at the protocol level and it is unsafe. Replace it with a shared pointer.
  • There is an occasional UAF due to early shutdown of the cache if we do not keep the transport pointer alive until the send finishes (see the sketch below).
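The last two commits cover the UAF fix. Below is a hedged illustration of the idea using hypothetical stand-in types (some_transport, do_send, send_keeping_alive), not the actual protocol code; only the Seastar lw_shared_ptr and future APIs are assumed.

#include <seastar/core/future.hh>
#include <seastar/core/shared_ptr.hh>

namespace ss = seastar;

// Hypothetical stand-in for the transport used at the protocol level.
struct some_transport {
    ss::future<> do_send() { return ss::make_ready_future<>(); }
};

// Holding an ss::lw_shared_ptr for the duration of the send (instead of a
// naked pointer) keeps the transport alive until the send finishes, even if
// the connection cache drops its reference during an eager shutdown.
ss::future<> send_keeping_alive(ss::lw_shared_ptr<some_transport> t) {
    return t->do_send().finally([t] {
        // Capturing `t` in the continuation extends its lifetime past the
        // completion of do_send().
    });
}

A caller would create such a pointer with ss::make_lw_shared<some_transport>() and pass it by value, so every in-flight send holds its own reference.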
@bharathv force-pushed the cleanup_conn_cache branch from 5b51cae to 6946abb (April 18, 2023 06:11)
@bharathv (Contributor Author)

Last force-push is a rebase.

@dotnwat (Member)

dotnwat commented Apr 21, 2023

/ci-repeat

@andrwng (Contributor)

andrwng commented May 1, 2023

LGTM, the CI failures look like known flakes

@andrwng (Contributor)

andrwng commented May 1, 2023

/ci-repeat

@bharathv (Contributor Author)

bharathv commented May 2, 2023

Test failure: #9315 (unrelated, known issue).

Successfully merging this pull request may close these issues.

ARM: shutdown hang in NodeOperationFuzzyTest.test_node_operations