
Allow followers to catch up with leader to return freshest describeACL responses #12716

Merged

merged 5 commits into redpanda-data:dev on Sep 11, 2023

Conversation

@graphcareful (Contributor) commented Aug 10, 2023

This PR addresses the issue of stale describeACLs responses being returned when a modification to an ACL was issued shortly beforehand.

This situation currently occurs because all requests that wish to edit ACL state are proxied to the controller, while requests to query this state may be handled by a follower. Followers only have the most up-to-date copy of the current state once their respective controller_stms catch up with the leader's. Since the system is eventually consistent w.r.t. replicating controller commands to followers, there is no guarantee that a follower returns the freshest result set.

The proposed fix in this PR is to create a new endpoint that allows a follower to query the last applied offset of the controller log from the leader node. Once obtained, the follower can wait for this offset to be applied within its local controller_stm. When this wait completes, it is guaranteed that the follower is at least as up to date as the leader was at the time the query was made.
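As a rough illustration of that flow, here is a toy, self-contained model. All names below are illustrative stand-ins, not Redpanda types; the real implementation is Seastar-based and lives in cluster::security_frontend, and the review discussion further down moves the design from the leader's last-applied offset to its committed offset, confirmed by a linearizable barrier.

#include <cstdint>
#include <iostream>
#include <optional>

using offset_t = int64_t;

struct leader_state {
    offset_t last_applied{0}; // how far the leader's controller_stm has applied
};

struct follower_state {
    offset_t applied{0}; // how far the local controller_stm has applied
};

// Step 1: ask the controller leader (via the new RPC) how far it has gotten.
std::optional<offset_t>
query_leader_offset(const leader_state& leader, bool leader_reachable) {
    if (!leader_reachable) {
        return std::nullopt; // RPC failed: caller serves local, possibly stale, state
    }
    return leader.last_applied;
}

// Step 2: wait until the local controller_stm has applied at least that offset.
void wait_until_caught_up(follower_state& follower, offset_t target) {
    while (follower.applied < target) {
        ++follower.applied; // stand-in for replaying one controller command
    }
}

int main() {
    leader_state leader{.last_applied = 10};  // e.g. the ACL change landed at offset 10
    follower_state follower{.applied = 7};    // follower has not replayed it yet

    if (auto target = query_leader_offset(leader, /*leader_reachable=*/true)) {
        wait_until_caught_up(follower, *target);
    }
    // The follower now reflects everything the leader had applied when the
    // query was made, so a describeACLs answer from here is fresh.
    std::cout << "follower applied offset: " << follower.applied << '\n';
}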

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.2.x
  • v23.1.x
  • v22.3.x

Release Notes

Improvements

  • Modifications to avoid stale responses returned from DescribeACLs requests

src/v/cluster/service.cc (outdated)
model::offset last_applied;
errc result;

auto serde_fields() { return std::tie(last_applied); }
Contributor:

I think you need errc here too.

Are there roundtrip tests or serde backwards compat tests we can add for this?

graphcareful (author):

nice catch
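For reference, the fix being agreed on here is to list both members in serde_fields() so the error code actually round-trips. A minimal sketch with stand-in types, not the exact merged definition:

#include <cstdint>
#include <tuple>

using offset_t = int64_t;                  // stand-in for model::offset
enum class errc : int8_t { success = 0 };  // stand-in for cluster::errc

struct controller_last_applied_reply {
    offset_t last_applied{};
    errc result{errc::success};

    // Every serialized member must appear here; omitting `result` (the bug
    // the reviewer caught) would silently drop it on the wire.
    auto serde_fields() { return std::tie(last_applied, result); }
};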

src/v/kafka/server/handlers/describe_acls.cc (resolved)
@@ -158,6 +162,7 @@ class service : public controller_service {
ss::future<partition_state_reply>
do_get_partition_state(partition_state_request);

std::unique_ptr<controller>& _controller;
Contributor:

Can this be a raw pointer to the controller instead of a reference to a unique_ptr?

This class should not be able to do things like reset the pointer.
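Roughly what the reviewer is asking for: hold a non-owning raw pointer so the service can call into the controller but cannot reset or replace the owning unique_ptr. Illustrative stand-in types only, not the actual diff:

#include <memory>

struct controller {}; // stand-in for cluster::controller

class service {
public:
    // Constructed from owner.get(): the service can observe the controller
    // but has no handle through which to destroy or re-seat the ownership.
    explicit service(controller* c)
      : _controller(c) {}

private:
    controller* _controller; // non-owning
};

int main() {
    auto owner = std::make_unique<controller>();
    service svc(owner.get());
    (void)svc;
}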

@@ -339,4 +339,56 @@ security_frontend::get_bootstrap_user_creds_from_env() {
std::in_place, std::move(username), std::move(credentials));
}

ss::future<result<model::offset>> security_frontend::get_leader_last_applied(
Contributor:

nit: get_remote_leader_last_applied?

@@ -135,6 +135,11 @@
"name": "get_partition_state",
"input_type": "partition_state_request",
"output_type": "partition_state_reply"
},
{
"name": "get_controller_last_applied",
Contributor:

wondering if this should be last applied or committed? Theoretically last_applied can be < committed_offset if the stm is lagging, so the response may contain an offset below the committed offset of the ACL command.

graphcareful (author):

Isn't that the idea though? We want to return the last applied and wait on that. If the stm is lagging, then that's the same response that would have been returned if we had invoked the command on the leader anyway.

Contributor:

Not sure I follow. The specific case I had in mind is a change of leadership: if this request hits a leader different from the one that actually committed+applied the ACL request, there is no guarantee that last_applied >= the committed offset of the ACL batch. Consider the following sequence of actions.

  • create/delete-acl - broker 0 (current controller_leader) - raft committed_offset = applied_offset = 10 (command waits until applied_offset = committed_offset so that is guaranteed)
  • leadership change - broker 1 (new controller leader) - raft_committed_offset is at least 10 (raft leadership guarantee) but there is no guarantee on applied_offset, theoretically it could be lagging
  • now a follower makes a request to broker 1, it returns a stale applied_offset

graphcareful (author):

I see your point: if at any given time an elected leader applied the command, the effect is considered performed, but in the situation you described we could still be reporting stale state. +1

controller_last_applied_reply{
.result = errc::not_leader_controller});
}
return _controller->get_last_applied_offset().then(
Contributor:

do we need a linearizable barrier here before we return committed offset? Theoretically a stale follower (making this request) could hit a stale leader (broker that thinks it is a leader but is not).

graphcareful (author):

Yeah, that is a good point; then it could be the case that after all of this waiting is done, a stale response is still returned.

Comment on lines 372 to 395
if (leader == _self) {
co_return make_error_code(errc::success);
}
@bharathv (Contributor), Aug 10, 2023:

In this case don't we need to wait until the stm catches up? e.g. the ACL is committed but the stm is not yet aware of it. I think we ultimately care whether the stm applied it (so that we can query the state) and not whether it is raft committed.

graphcareful (author):

This PR has followers wait until the leader's last applied, whatever the value of that is now. If this logic is applied to the leader, the last applied has already been applied, so there is no need to wait. So the semantics now roughly match between how the command works when handled by a follower or by the leader.

Contributor:

Same as the other comment..

If this logic is applied to the leader, the last applied has already been applied,

This could be a different leader than the one that actually applied the command?

Comment on lines 762 to 764
vassert(
r.error().category() == raft::error_category(),
"Unexpected error_category encountered");
Contributor:

nit: do we need this assert? it doesn't seem critical enough to crash the broker if it fires.

serde::version<0>,
serde::compat_version<0>> {
using rpc_adl_exempt = std::true_type;
model::offset last_applied;
Contributor:

last_committed?

Comment on lines 772 to 774
default:
return controller_committed_offset_reply{
.result = errc::replication_error};
Contributor:

what does replication_error mean? a default of no_leader seems logical to me.

graphcareful (author):

no_leader would have already hit the case above

Contributor:

I mean not returning replication_error at all, it seems a little out of place in this context of get_last_committed_offset? Also the barrier doesn't replicate anything.

src/v/cluster/controller.h (resolved)
@@ -339,4 +339,59 @@ security_frontend::get_bootstrap_user_creds_from_env() {
std::in_place, std::move(username), std::move(credentials));
Contributor:

nit: commit message s/last_applied/last_committed.

self.superuser.username,
self.superuser.password,
self.superuser.algorithm)
described = superclient.acl_list()
Contributor:

q: does it always go to a leader? in which case it always returns the freshest response, wondering if we can loop through all the brokers.

Member, quoting bharathv:
q: does it always go to a leader? in which case it always returns the freshest response, wondering if we can loop through all the brokers.

@graphcareful ?

graphcareful (author):

I spoke to Bharath about this offline; I probably should have updated the comments here. In any case, the answer is that a random broker will be chosen.

tests/rptest/tests/rpk_acl_test.py (outdated)
assert 'CREATE' in described, "Failed to modify ACL"

# Network partition the leader away from the rest of the cluster
fi = make_failure_injector(self.redpanda)
Contributor:

nit: better to use the with construct? (to heal_all() in case something throws)

graphcareful (author):

this is done so I can pass in the node I want to isolate

Contributor:

not sure what you mean, I was thinking something like..

with make_failure_injector(self.redpanda) as fi:
   fi.isolate(controller)
   <... do something..>

Comment on lines 72 to 74
time.sleep(3)

# Of the other remaining nodes, none can be declared a leader before
Contributor:

something missing here? Don't we need to make a request to the isolated node?

graphcareful (author):

There is a missing acl_list() call here, nice catch.

@emaxerrno (Contributor):

@graphcareful so we can deterministically repro the issue and fix w/ this PR. the cover letter seems unclear.

graphcareful (author):

@graphcareful so we can deterministically repro the issue and fix w/ this PR. the cover letter seems unclear.

Sorry, what is the ask? Also, the tests fail reliably when run without the changes made here. Let me know how I can make the cover letter clearer too.


# the election timeout occurs; also the "current" leader is technically
# stale so it cannot be sure its returning the freshest data either. In
# all cases the log below should be observed on the node handling the req.
self.redpanda.search_log_any(
Contributor:

need to assert the return value?


graphcareful (author):

During development a bug with linearizable_barrier was discovered; a fix is in the works. Making note that this PR depends on the fix here: #12990

@graphcareful requested a review from bharathv August 24, 2023 17:24
bharathv previously approved these changes Aug 24, 2023
@bharathv (Contributor) left a comment:

couple of nits.

Comment on lines 371 to 394
auto leader = _leaders.local().get_leader(model::controller_ntp);
if (!leader) {
co_return make_error_code(errc::no_leader_controller);
}

result<model::offset> leader_committed = model::offset{};
const auto now = model::timeout_clock::now();
if (leader == _self) {
leader_committed = co_await ss::smp::submit_to(
controller_stm_shard, [this, timeout = now + timeout]() {
return ss::with_timeout(
timeout, _controller->linearizable_barrier())
.handle_exception_type([](const ss::timed_out_error&) {
return result<model::offset>(errc::timeout);
});
});
} else {
/// Waiting up until the leader committed offset means its possible that
/// waiting on an offset higher then neccessary is performed but the
/// alternative of waiting on the last_applied isn't a complete solution
/// as its possible that this offset is behind the actual last_applied
/// of a previously elected leader, resulting in a stale response being
/// returned.
leader_committed = co_await get_leader_committed(*leader, timeout);
}

if (leader_committed.has_error()) {
co_return leader_committed.error();
}
Contributor:

nit: I think this should belong in get_leader_committed.
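For illustration, the restructuring this nit suggests might look roughly like the following, reusing the names from the snippet above. It is a sketch only, not compilable on its own, and get_remote_leader_committed is a hypothetical helper holding the existing remote-RPC body:

// Hypothetical shape: get_leader_committed() owns the "am I the leader?"
// branch, so callers only ask for the offset and then wait on it locally.
ss::future<result<model::offset>> security_frontend::get_leader_committed(
  model::node_id leader, model::timeout_clock::duration timeout) {
    const auto deadline = model::timeout_clock::now() + timeout;
    if (leader == _self) {
        // Leader path: a local linearizable barrier both confirms leadership
        // and yields an up-to-date committed offset.
        co_return co_await ss::smp::submit_to(
          controller_stm_shard, [this, deadline]() {
              return ss::with_timeout(
                       deadline, _controller->linearizable_barrier())
                .handle_exception_type([](const ss::timed_out_error&) {
                    return result<model::offset>(errc::timeout);
                });
          });
    }
    // Follower path: ask the leader over the internal RPC (hypothetical
    // helper containing the previous remote lookup).
    co_return co_await get_remote_leader_committed(leader, timeout);
}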

@@ -1037,7 +1037,7 @@ def _kafka_conn_settings(self):
]
Contributor:

commit message needs an update (no election timeout bump)

bharathv previously approved these changes Aug 25, 2023
Rob Blafford added 5 commits September 9, 2023 23:57
- When handling this request the responding broker must be the current
leader and will return the last_committed offset within its respective
controller log.

- This offset represents the highest command that has been completely
processed by the controller log.
- This routine will contact the leader for the last_committed offset within
the controller log, then wait for its local stm to catch up to at least that
point.
- Before handling describe ACLs requests, ensure that if this node is a
follower, its processing of controller commands has caught up with the
leader's as of the point in time the request was received.
- This raises the election timeout to 10s while network partitioning a
leader then making a describeACLs request.

- In this scenario any node that is queried should be reporting that
stale results may be returned. Previous followers will be reporting this
because they cannot reach the leader, and the stale leader will be
reporting this because it cannot inject a barrier since it had been
network partitioned.
@graphcareful merged commit 70cb23b into redpanda-data:dev Sep 11, 2023
9 checks passed
@dotnwat (Member) left a comment:

@graphcareful @bharathv

what is the expected behavior during a rolling upgrade when a describe ACLs request hits the controller that is on an older version and doesn't support this new RPC?

This might be more common than we expect, too. If a rolling upgrade is having trouble, jumping on console or inspecting the system could well involve looking at ACLs.

});
}

ss::future<std::error_code> security_frontend::wait_until_caughtup_with_leader(
Member:

Why is this a security frontend interface? This seems like it should be generic.

graphcareful (author):

I didn't want to offer this as a generic solution when there are currently no other consumers; figured someone else could break it out if they needed. I'm fine with either FWIW.

Member:

figured someone else could break it out if they needed

how will they find it?

Comment on lines +403 to +408
/// Waiting up until the leader committed offset means its possible that
/// waiting on an offset higher then neccessary is performed but the
/// alternative of waiting on the last_applied isn't a complete solution
/// as its possible that this offset is behind the actual last_applied
/// of a previously elected leader, resulting in a stale response being
/// returned.
Member:

I don't understand this comment.

We have an offset in the controller log and we want to wait until the controller log has been replayed locally up to at least that position. It seems like that is what we are doing, so I'm not sure what the nuance is that the comment seems to be describing. Could you elaborate?

graphcareful (author):

Yes, that is what is going on. The comment is just explaining that we are waiting on an offset that is probably higher than necessary; specifically, why leader_committed is the chosen offset to wait on instead of leader_applied.

src/v/cluster/security_frontend.cc (resolved)

graphcareful (author):

what is the expected behavior during a rolling upgrade when a describe ACLs request hits the controller that is on an older version and doesn't support this new RPC?

The RPC will fail, and in all cases when this occurs the behavior is to fall back to handling the request the way it was handled before these changes.

The changes in this PR make a best effort to grab the freshest ACL state, but if that can't happen within a predefined timeout, the state that exists on the current node is returned and a log line is printed that describes the scenario.

@bharathv (Contributor):

what is the expected behavior during a rolling upgrade when a describe ACLs request hits the controller that is on an older version and doesn't support this new RPC?

The RPC will fail, and in all cases when this occurs the behavior is to fall back to handling the request the way it was handled before these changes.

The changes in this PR make a best effort to grab the freshest ACL state, but if that can't happen within a predefined timeout, the state that exists on the current node is returned and a log line is printed that describes the scenario.

Yep, it is on a best effort basis. The same outcome is possible even in a fully upgraded cluster where the caller has trouble reaching the controller (RPC timeout for example). We just log the error, move on and return whatever we have in the local cache.
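A sketch of that best-effort pattern on the handler side. Only wait_until_caughtup_with_leader and the fall-back-to-local-state behavior come from the thread above; the handler shape, logger, log wording, and the handle_describe_acls_locally helper are illustrative assumptions:

// Try to catch up with the controller leader first; if that fails for any
// reason (old broker without the RPC, timeout, no reachable leader), log it
// and answer from the local, possibly stale, ACL store anyway.
auto ec = co_await security_frontend.wait_until_caughtup_with_leader(timeout);
if (ec) {
    vlog(
      klog.info,
      "Could not catch up with controller leader before describing ACLs, "
      "results may be stale: {}",
      ec.message());
}
co_return handle_describe_acls_locally(request); // hypothetical local path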

@dotnwat (Member) commented Sep 13, 2023:

@bharathv @graphcareful thanks for looking at my review.

I was thinking that when the RPC server received a request for an unknown method we wanted errors to be logged, and so to avoid unnecessary errors in the log we needed to use feature gates to ensure invalid RPCs were never dispatched.

But it does seem we log those unknown method messages at debug level.

Sorry about the noise there!

piyushredpanda added a commit that referenced this pull request on Dec 13, 2023:

Backport of #12716 avoid returning stale describeACL responses