rpc/conn_cache: eager cleanup of connection cache on shutdown #8847

Merged (8 commits, May 2, 2023)

Conversation

bharathv (Contributor)

At shutdown, eager cleanup of connection cache entries can result in faster termination of RPCs/connections to suspended nodes. This patch includes two main changes.

  • Hooks up an app-level abort source to the connection cache that triggers the cleanup function on notification (see the sketch below).
  • Removes the use of naked transport pointers in the protocol layer and elsewhere, as they are prone to use-after-free (UAF) bugs.

Fixes #7981
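As a rough illustration of the first change, here is a minimal sketch of hooking an application-level abort source into a cache-like class so that an abort notification triggers eager cleanup. This is not the actual Redpanda code: the class and member names (cache_like, register_abort_source, _sub, _shutting_down) are illustrative; only Seastar's ss::abort_source::subscribe API is assumed.

#include <optional>

#include <seastar/core/abort_source.hh>

namespace ss = seastar;

// Illustrative cache-like class (hypothetical names); the real
// connection_cache wiring in this patch is not reproduced here.
class cache_like {
public:
    // Register for abort notifications from an app-level abort source.
    void register_abort_source(ss::abort_source& as) {
        auto sub = as.subscribe([this]() noexcept {
            // The callback must be cheap and non-blocking: it only marks the
            // cache as shutting down; the actual connection teardown happens
            // in the cache's asynchronous cleanup path.
            _shutting_down = true;
        });
        if (sub) {
            _sub.emplace(std::move(*sub));
        }
    }

    bool shutting_down() const { return _shutting_down; }

private:
    std::optional<ss::abort_source::subscription> _sub;
    bool _shutting_down{false};
};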

Backports Required

  • none - not a bug fix
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v22.3.x
  • v22.2.x
  • v22.1.x

UX Changes

Release Notes

  • none

@bharathv (Contributor Author)

A couple of alternate approaches I prototyped but didn't like:

  • Hooking up a client-supplied abort_source via client_opts. This is tricky to implement because the client typically resides on a different shard than the transport, so the abort source notifications potentially involve cross-shard communication, which is prone to bugs and not easy to reason about.
  • Hooking up a (shard-local) abort source to the transport. This boils down to passing an abort source reference in the transport c'tor, but that didn't seem natural to me (and the code turned out to be ugly).

Rather, cleaning up at the cache layer seemed less complicated and easier to reason about. Wdyt?

@bharathv (Contributor Author)

/ci-repeat 10

@dotnwat (Member)

dotnwat commented Feb 21, 2023

@bharathv got a merge conflict

@bharathv (Contributor Author)

@bharathv got a merge conflict

Rebased.

@andrwng left a comment (Contributor)

I like the idea of having a sharded top-level abort source that we can plumb down starting at the application layer. Curious if you have thoughts on how much or how little we should extend this to other subsystems.

LGTM pending CI

Comment on lines +97 to +102
return ss::do_with(
cache.get(node_id),
[connection_timeout = connection_timeout.timeout_at(),
f = std::forward<Func>(f)](auto& transport_ptr) mutable {
return transport_ptr->get_connected(connection_timeout)
.then([f = std::forward<Func>(f)](
Contributor

Just curious if you tried coroutinizing this as some unsharded template method, and if you did but opted for this, what was the snag? The inferred signature? The convert() calls?

Contributor Author

I didn't try coroutinizing this; just playing it safe, I didn't want to introduce new bugs.
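For context, a coroutinized version might look roughly like the sketch below. This is only a shape, not the real helper: with_connected is a hypothetical name, the template parameters and return type are simplified, and the real code's error handling and convert() calls are omitted. It assumes get_connected() and f both return futures.

#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

namespace ss = seastar;

// Hypothetical, simplified shape of a coroutinized helper. The coroutine
// frame keeps transport_ptr alive across the co_awaits, which is the job
// ss::do_with performs in the continuation-passing version above.
template <typename TransportPtr, typename Timeout, typename Func>
ss::future<> with_connected(TransportPtr transport_ptr, Timeout timeout, Func f) {
    auto connected = co_await transport_ptr->get_connected(timeout);
    co_await f(connected);
}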

@@ -61,11 +61,14 @@ ss::future<> connection_cache::remove(model::node_id n) {
/// \brief closes all client connections
ss::future<> connection_cache::stop() {
auto units = co_await _mutex.get_units();
co_await parallel_for_each(_cache, [](auto& it) {
// Exchange ensures the cache is invalidated and concurrent
// accesses wait on the mutex to populate new entries.
Member

we were holding the lock before, so wouldn't the concurrent accesses have been waiting (or maybe the cache was checked without acquiring the lock)?

Contributor Author

or maybe the cache was checked without acquiring the lock

This, via connection_cache::contains() and get().

auto& [_, cli] = it;
return cli->stop();
});
_cache.clear();
cache.clear();
Member

the cache is now going to go out of scope on the next line and cleared in any case
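Assembling the visible diff fragments, the exchange-based stop() under discussion looks roughly like the sketch below. This is a hedged reconstruction, not the verbatim merged code; it relies on connection_cache's existing _mutex and _cache members shown in the hunk, and the local variable name cache is taken from the lines above.

// Hedged reconstruction of the pattern under discussion (not the verbatim
// diff): swap the map out while holding the mutex, then stop the clients
// owned by the local copy.
ss::future<> connection_cache::stop() {
    auto units = co_await _mutex.get_units();
    // Exchange invalidates the member map, so concurrent contains()/get()
    // callers observe an empty cache and go through the mutex to repopulate
    // entries (or bail out during shutdown).
    auto cache = std::exchange(_cache, {});
    co_await parallel_for_each(cache, [](auto& it) {
        auto& [_, cli] = it;
        return cli->stop();
    });
    // Not strictly required: `cache` is a local and is destroyed when the
    // coroutine returns anyway, which is the point raised above.
    cache.clear();
}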

Comment on lines 358 to +361
app_signal.wait().get();
trigger_abort_source();
Member

There is an abort_source in app_signal already. Can we use (or expand and use) that?

Contributor Author

I did begin with that thought but ran into an issue: invoking this new sharded abort_source via invoke_all() returns a future, and I can't block on it in app_signal's abort_source callback. To work around it, I would have to gate on it in an async fiber, and that seemed unnecessarily complex.
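To illustrate the snag (an assumption-laden sketch, not the actual application code): triggering a sharded abort source is itself asynchronous, so it cannot be done inline from a synchronous abort callback. Seastar's sharded invoke_on_all is used here as a stand-in for what the comment calls invoke_all(); the start/stop lifecycle of the sharded object is omitted.

#include <seastar/core/abort_source.hh>
#include <seastar/core/sharded.hh>

namespace ss = seastar;

// Sketch: requesting an abort on every shard returns a future, so whoever
// triggers it must be able to wait on (or track) that future. That is
// awkward from inside app_signal's synchronous abort_source callback, which
// is why the diff above calls trigger_abort_source() after app_signal.wait()
// returns instead.
ss::future<> trigger_abort_source(ss::sharded<ss::abort_source>& as) {
    return as.invoke_on_all(
      [](ss::abort_source& shard_local) { shard_local.request_abort(); });
}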

@bharathv left a comment (Contributor Author)

Thanks for the reviews, I forgot about this one. Addressing comments; will shortly rebase to fix conflicts.

@bharathv force-pushed the cleanup_conn_cache branch from 419da08 to 5b51cae (March 14, 2023 05:27)
@bharathv (Contributor Author)

Last force-push is a rebase to fix conflicts.

@bharathv requested review from dotnwat and andrwng (March 14, 2023 15:55)
Commit messages:

  • Introduces a decoupled shutdown method that clears the transport map.
  • Adds an abort source to be used as a trigger for shutting down the connection cache.
  • Currently we use a raw transport pointer at the protocol level and it is unsafe. Replace it with a shared pointer.
  • There is an occasional UAF due to early shutdown of the cache if we do not keep the transport pointer alive until the send finishes (see the sketch below).
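The last two commits cover the UAF fix. Below is a hedged illustration of the idea using hypothetical stand-in types (some_transport, do_send, send_keeping_alive), not the actual protocol code; only the Seastar lw_shared_ptr and future APIs are assumed.

#include <seastar/core/future.hh>
#include <seastar/core/shared_ptr.hh>

namespace ss = seastar;

// Hypothetical stand-in for the transport used at the protocol level.
struct some_transport {
    ss::future<> do_send() { return ss::make_ready_future<>(); }
};

// Holding an ss::lw_shared_ptr for the duration of the send (instead of a
// naked pointer) keeps the transport alive until the send finishes, even if
// the connection cache drops its reference during an eager shutdown.
ss::future<> send_keeping_alive(ss::lw_shared_ptr<some_transport> t) {
    return t->do_send().finally([t] {
        // Capturing `t` in the continuation extends its lifetime past the
        // completion of do_send().
    });
}

A caller would create such a pointer with ss::make_lw_shared<some_transport>() and pass it by value, so every in-flight send holds its own reference.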
@bharathv force-pushed the cleanup_conn_cache branch from 5b51cae to 6946abb (April 18, 2023 06:11)
@bharathv (Contributor Author)

Last force-push is a rebase.

@dotnwat (Member)

dotnwat commented Apr 21, 2023

/ci-repeat

@andrwng (Contributor)

andrwng commented May 1, 2023

LGTM, the CI failures look like known flakes

@andrwng (Contributor)

andrwng commented May 1, 2023

/ci-repeat

@bharathv (Contributor Author)

bharathv commented May 2, 2023

Test failure: #9315 (unrelated, known issue).

Successfully merging this pull request may close these issues.

ARM: shutdown hang in NodeOperationFuzzyTest.test_node_operations