
Allow followers to catch up with leader to return freshest describeACL responses #12716

Merged

merged 5 commits into redpanda-data:dev on Sep 11, 2023

Conversation

@graphcareful (Contributor) commented Aug 10, 2023

This PR addresses the issue of stale describeACLs responses being returned when a modification to an ACL was issued shortly beforehand.

This situation currently occurs because all requests that wish to edit ACL state are proxied to the controller, while requests to query this state may be handled by a follower. Followers only have the most up-to-date copy of the current state once their respective controller_stms catch up with the leader's. Since the system is eventually consistent w.r.t. replicating controller commands to followers, there is no guarantee that a follower returns the freshest result set.

The proposed fix in this PR is to create a new endpoint that allows a follower to query the last applied offset of the controller log from the leader node. Once obtained, the follower can wait for this offset to be applied within its local controller_stm. When this wait completes, it is guaranteed that the follower is at least as up to date as the leader was at the time the query was made.
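As a rough illustration of that flow, here is a toy, self-contained model. All names below are illustrative stand-ins, not Redpanda types; the real implementation is Seastar-based and lives in cluster::security_frontend, and the review discussion further down moves the design from the leader's last-applied offset to its committed offset, confirmed by a linearizable barrier.

#include <cstdint>
#include <iostream>
#include <optional>

using offset_t = int64_t;

struct leader_state {
    offset_t last_applied{0}; // how far the leader's controller_stm has applied
};

struct follower_state {
    offset_t applied{0}; // how far the local controller_stm has applied
};

// Step 1: ask the controller leader (via the new RPC) how far it has gotten.
std::optional<offset_t>
query_leader_offset(const leader_state& leader, bool leader_reachable) {
    if (!leader_reachable) {
        return std::nullopt; // RPC failed: caller serves local, possibly stale, state
    }
    return leader.last_applied;
}

// Step 2: wait until the local controller_stm has applied at least that offset.
void wait_until_caught_up(follower_state& follower, offset_t target) {
    while (follower.applied < target) {
        ++follower.applied; // stand-in for replaying one controller command
    }
}

int main() {
    leader_state leader{.last_applied = 10};  // e.g. the ACL change landed at offset 10
    follower_state follower{.applied = 7};    // follower has not replayed it yet

    if (auto target = query_leader_offset(leader, /*leader_reachable=*/true)) {
        wait_until_caught_up(follower, *target);
    }
    // The follower now reflects everything the leader had applied when the
    // query was made, so a describeACLs answer from here is fresh.
    std::cout << "follower applied offset: " << follower.applied << '\n';
}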

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.2.x
  • v23.1.x
  • v22.3.x

Release Notes

Improvements

  • Modifications to avoid stale responses returned from DescribeACLs requests

src/v/cluster/service.cc (outdated)
model::offset last_applied;
errc result;

auto serde_fields() { return std::tie(last_applied); }
Contributor:

I think you need errc here too.

Are there roundtrip tests or serde backwards compat tests we can add for this?

graphcareful (author):

nice catch
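For reference, the fix being agreed on here is to list both members in serde_fields() so the error code actually round-trips. A minimal sketch with stand-in types, not the exact merged definition:

#include <cstdint>
#include <tuple>

using offset_t = int64_t;                  // stand-in for model::offset
enum class errc : int8_t { success = 0 };  // stand-in for cluster::errc

struct controller_last_applied_reply {
    offset_t last_applied{};
    errc result{errc::success};

    // Every serialized member must appear here; omitting `result` (the bug
    // the reviewer caught) would silently drop it on the wire.
    auto serde_fields() { return std::tie(last_applied, result); }
};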

src/v/kafka/server/handlers/describe_acls.cc (resolved)
@@ -158,6 +162,7 @@ class service : public controller_service {
ss::future<partition_state_reply>
do_get_partition_state(partition_state_request);

std::unique_ptr<controller>& _controller;
Contributor:

Can this be a raw pointer to the controller instead of a reference to a unique_ptr?

This class should not be able to do things like reset the pointer.
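Roughly what the reviewer is asking for: hold a non-owning raw pointer so the service can call into the controller but cannot reset or replace the owning unique_ptr. Illustrative stand-in types only, not the actual diff:

#include <memory>

struct controller {}; // stand-in for cluster::controller

class service {
public:
    // Constructed from owner.get(): the service can observe the controller
    // but has no handle through which to destroy or re-seat the ownership.
    explicit service(controller* c)
      : _controller(c) {}

private:
    controller* _controller; // non-owning
};

int main() {
    auto owner = std::make_unique<controller>();
    service svc(owner.get());
    (void)svc;
}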

@@ -339,4 +339,56 @@ security_frontend::get_bootstrap_user_creds_from_env() {
std::in_place, std::move(username), std::move(credentials));
}

ss::future<result<model::offset>> security_frontend::get_leader_last_applied(
Contributor:

nit: get_remote_leader_last_applied?

@@ -135,6 +135,11 @@
"name": "get_partition_state",
"input_type": "partition_state_request",
"output_type": "partition_state_reply"
},
{
"name": "get_controller_last_applied",
Contributor:

wondering if this should be last applied or committed? Theoretically last_applied can be < committed_offset if the stm is lagging, so the response may contain an offset below the committed offset of the ACL command.

graphcareful (author):

Isn't that the idea though? We want to return the last applied and wait on that. If the stm is lagging, then that's the same response that would have been returned if we had invoked the command on the leader anyway.

Contributor:

Not sure I follow. The specific case I had in mind is a change of leadership: if this request hits a leader different from the one that actually committed+applied the ACL request, there is no guarantee that last_applied >= the committed offset of the ACL batch. Consider the following sequence of actions.

  • create/delete-acl - broker 0 (current controller_leader) - raft committed_offset = applied_offset = 10 (command waits until applied_offset = committed_offset so that is guaranteed)
  • leadership change - broker 1 (new controller leader) - raft_committed_offset is at least 10 (raft leadership guarantee) but there is no guarantee on applied_offset, theoretically it could be lagging
  • now a follower makes a request to broker 1, it returns a stale applied_offset

graphcareful (author):

I see your point: if at any given time an elected leader applied the command, the effect is considered performed, but in the situation you described we could still be reporting stale state. +1

controller_last_applied_reply{
.result = errc::not_leader_controller});
}
return _controller->get_last_applied_offset().then(
Contributor:

do we need a linearizable barrier here before we return committed offset? Theoretically a stale follower (making this request) could hit a stale leader (broker that thinks it is a leader but is not).

graphcareful (author):

Yeah, that is a good point; then it could be the case that after all of this waiting is done, a stale response is still returned.

Comment on lines 372 to 395
if (leader == _self) {
co_return make_error_code(errc::success);
}
@bharathv (Contributor), Aug 10, 2023:

In this case don't we need to wait until the stm catches up? e.g. the ACL is committed but the stm is not yet aware of it. I think we ultimately care whether the stm applied it (so that we can query the state) and not whether it is raft committed.

graphcareful (author):

This PR has followers wait until the leader's last applied, whatever the value of that is now. If this logic is applied to the leader, the last applied has already been applied, so there is no need to wait. So the semantics now roughly match between how the command works when handled by a follower or by the leader.

Contributor:

Same as the other comment..

If this logic is applied to the leader, the last applied has already been applied,

This could be a different leader than the one that actually applied the command?

Comment on lines 762 to 764
vassert(
r.error().category() == raft::error_category(),
"Unexpected error_category encountered");
Contributor:

nit: do we need this assert? it doesn't seem critical enough to crash the broker if it fires.

serde::version<0>,
serde::compat_version<0>> {
using rpc_adl_exempt = std::true_type;
model::offset last_applied;
Contributor:

last_committed?

Comment on lines 772 to 774
default:
return controller_committed_offset_reply{
.result = errc::replication_error};
Contributor:

what does replication_error mean? a default of no_leader seems logical to me.

graphcareful (author):

no_leader would have already hit the case above

Contributor:

I mean not returning replication_error at all, it seems a little out of place in this context of get_last_committed_offset? Also the barrier doesn't replicate anything.

src/v/cluster/controller.h (resolved)
@@ -339,4 +339,59 @@ security_frontend::get_bootstrap_user_creds_from_env() {
std::in_place, std::move(username), std::move(credentials));
Contributor:

nit: commit message s/last_applied/last_committed.

self.superuser.username,
self.superuser.password,
self.superuser.algorithm)
described = superclient.acl_list()
Contributor:

q: does it always go to a leader? in which case it always returns the freshest response, wondering if we can loop through all the brokers.

Member, quoting bharathv:
q: does it always go to a leader? in which case it always returns the freshest response, wondering if we can loop through all the brokers.

@graphcareful ?

graphcareful (author):

I spoke to Bharath about this offline; I probably should have updated the comments here. In any case, the answer is that a random broker will be chosen.

tests/rptest/tests/rpk_acl_test.py (outdated)
assert 'CREATE' in described, "Failed to modify ACL"

# Network partition the leader away from the rest of the cluster
fi = make_failure_injector(self.redpanda)
Contributor:

nit: better to use the with construct? (to heal_all() in case something throws)

graphcareful (author):

this is done so I can pass in the node I want to isolate

Contributor:

not sure what you mean, I was thinking something like..

with make_failure_injector(self.redpanda) as fi:
   fi.isolate(controller)
   <... do something..>

Comment on lines 72 to 74
time.sleep(3)

# Of the other remaining nodes, none can be declared a leader before
Contributor:

something missing here? Don't we need to make a request to the isolated node?

graphcareful (author):

There is a missing acl_list() call here, nice catch.

@emaxerrno (Contributor):

@graphcareful so we can deterministically repro the issue and fix w/ this PR. the cover letter seems unclear.

graphcareful (author):

@graphcareful so we can deterministically repro the issue and fix w/ this PR. the cover letter seems unclear.

Sorry, what is the ask? Also, the tests fail reliably when run without the changes made here. Let me know how I can make the cover letter clearer too.


# the election timeout occurs; also the "current" leader is technically
# stale so it cannot be sure its returning the freshest data either. In
# all cases the log below should be observed on the node handling the req.
self.redpanda.search_log_any(
Contributor:

need to assert the return value?


graphcareful (author):

During development a bug with linearizable_barrier was discovered; a fix is in the works. Making note that this PR depends on the fix here: #12990

@graphcareful requested a review from bharathv August 24, 2023 17:24
bharathv previously approved these changes Aug 24, 2023
@bharathv (Contributor) left a comment:

couple of nits.

Comment on lines 371 to 394
auto leader = _leaders.local().get_leader(model::controller_ntp);
if (!leader) {
co_return make_error_code(errc::no_leader_controller);
}

result<model::offset> leader_committed = model::offset{};
const auto now = model::timeout_clock::now();
if (leader == _self) {
leader_committed = co_await ss::smp::submit_to(
controller_stm_shard, [this, timeout = now + timeout]() {
return ss::with_timeout(
timeout, _controller->linearizable_barrier())
.handle_exception_type([](const ss::timed_out_error&) {
return result<model::offset>(errc::timeout);
});
});
} else {
/// Waiting up until the leader committed offset means its possible that
/// waiting on an offset higher then neccessary is performed but the
/// alternative of waiting on the last_applied isn't a complete solution
/// as its possible that this offset is behind the actual last_applied
/// of a previously elected leader, resulting in a stale response being
/// returned.
leader_committed = co_await get_leader_committed(*leader, timeout);
}

if (leader_committed.has_error()) {
co_return leader_committed.error();
}
Contributor:

nit: I think this should belong in get_leader_committed.
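For illustration, the restructuring this nit suggests might look roughly like the following, reusing the names from the snippet above. It is a sketch only, not compilable on its own, and get_remote_leader_committed is a hypothetical helper holding the existing remote-RPC body:

// Hypothetical shape: get_leader_committed() owns the "am I the leader?"
// branch, so callers only ask for the offset and then wait on it locally.
ss::future<result<model::offset>> security_frontend::get_leader_committed(
  model::node_id leader, model::timeout_clock::duration timeout) {
    const auto deadline = model::timeout_clock::now() + timeout;
    if (leader == _self) {
        // Leader path: a local linearizable barrier both confirms leadership
        // and yields an up-to-date committed offset.
        co_return co_await ss::smp::submit_to(
          controller_stm_shard, [this, deadline]() {
              return ss::with_timeout(
                       deadline, _controller->linearizable_barrier())
                .handle_exception_type([](const ss::timed_out_error&) {
                    return result<model::offset>(errc::timeout);
                });
          });
    }
    // Follower path: ask the leader over the internal RPC (hypothetical
    // helper containing the previous remote lookup).
    co_return co_await get_remote_leader_committed(leader, timeout);
}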

@@ -1037,7 +1037,7 @@ def _kafka_conn_settings(self):
]
Contributor:

commit message needs an update (no election timeout bump)

bharathv previously approved these changes Aug 25, 2023
Rob Blafford added 5 commits September 9, 2023 23:57
- When handling this request the responding broker must be the current
leader and will return the last_committed offset within its respective
controller log.

- This offset represents the highest command that has been completely
processed by the controller log.
- This routine will contact the leader for the last_committed offset within
the controller log, then wait for its local stm to catch up to at least that
point.
- Before handling describe ACLs requests, ensure that if this node is a
follower, its processing of controller commands has caught up with the
leader's as of the point in time the request was received.
- This raises the election timeout to 10s while network partitioning a
leader then making a describeACLs request.

- In this scenario any node that is queried should be reporting that
stale results may be returned. Previous followers will be reporting this
because they cannot reach the leader, and the stale leader will be
reporting this because it cannot inject a barrier since it had been
network partitioned.
@graphcareful merged commit 70cb23b into redpanda-data:dev Sep 11, 2023
9 checks passed
@dotnwat (Member) left a comment:

@graphcareful @bharathv

what is the expected behavior during a rolling upgrade when a describe ACLs request hits the controller that is on an older version and doesn't support this new RPC?

This might be more common than we expect, too. If a rolling upgrade is having trouble, jumping on console or inspecting the system could well involve looking at ACLs.

});
}

ss::future<std::error_code> security_frontend::wait_until_caughtup_with_leader(
Member:

Why is this a security frontend interface? This seems like it should be generic.

graphcareful (author):

I didn't want to offer this as a generic solution when there are currently no other consumers; figured someone else could break it out if they needed. I'm fine with either FWIW.

Member:

figured someone else could break it out if they needed

how will they find it?

Comment on lines +403 to +408
/// Waiting up until the leader committed offset means its possible that
/// waiting on an offset higher then neccessary is performed but the
/// alternative of waiting on the last_applied isn't a complete solution
/// as its possible that this offset is behind the actual last_applied
/// of a previously elected leader, resulting in a stale response being
/// returned.
Member:

I don't understand this comment.

We have an offset in the controller log and we want to wait until the controller log has been replayed locally up to at least that position. It seems like that is what we are doing, so I'm not sure what the nuance is that the comment seems to be describing. Could you elaborate?

graphcareful (author):

Yes, that is what is going on. The comment is just explaining that we are waiting on an offset that is probably higher than necessary; specifically, why leader_committed is the chosen offset to wait on instead of leader_applied.

src/v/cluster/security_frontend.cc (resolved)

graphcareful (author):

what is the expected behavior during a rolling upgrade when a describe ACLs request hits the controller that is on an older version and doesn't support this new RPC?

The RPC will fail, and in all cases when this occurs the behavior is to fall back to handling the request the way it was handled before these changes.

The changes in this PR make a best effort to grab the freshest ACL state, but if that can't happen within a predefined timeout, the state that exists on the current node is returned and a log line is printed that describes the scenario.

@bharathv (Contributor):

what is the expected behavior during a rolling upgrade when a describe ACLs request hits the controller that is on an older version and doesn't support this new RPC?

The RPC will fail, and in all cases when this occurs the behavior is to fall back to handling the request the way it was handled before these changes.

The changes in this PR make a best effort to grab the freshest ACL state, but if that can't happen within a predefined timeout, the state that exists on the current node is returned and a log line is printed that describes the scenario.

Yep, it is on a best effort basis. The same outcome is possible even in a fully upgraded cluster where the caller has trouble reaching the controller (RPC timeout for example). We just log the error, move on and return whatever we have in the local cache.
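A sketch of that best-effort pattern on the handler side. Only wait_until_caughtup_with_leader and the fall-back-to-local-state behavior come from the thread above; the handler shape, logger, log wording, and the handle_describe_acls_locally helper are illustrative assumptions:

// Try to catch up with the controller leader first; if that fails for any
// reason (old broker without the RPC, timeout, no reachable leader), log it
// and answer from the local, possibly stale, ACL store anyway.
auto ec = co_await security_frontend.wait_until_caughtup_with_leader(timeout);
if (ec) {
    vlog(
      klog.info,
      "Could not catch up with controller leader before describing ACLs, "
      "results may be stale: {}",
      ec.message());
}
co_return handle_describe_acls_locally(request); // hypothetical local path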

@dotnwat (Member) commented Sep 13, 2023:

@bharathv @graphcareful thanks for looking at my review.

I was thinking that when the RPC server received a request for an unknown method we wanted errors to be logged, and so to avoid unnecessary errors in the log we needed to use feature gates to ensure invalid RPCs were never dispatched.

But it does seem we log those unknown method messages at debug level.

Sorry about the noise there!

piyushredpanda added a commit that referenced this pull request on Dec 13, 2023:

Backport of #12716 avoid returning stale describeACL responses