Allow followers to catch up with leader to return freshest describeACL responses #12716
Conversation
src/v/cluster/types.h
Outdated
    model::offset last_applied;
    errc result;

    auto serde_fields() { return std::tie(last_applied); }
I think you need `errc` here too.
Are there roundtrip tests or serde backwards compat tests we can add for this?
nice catch
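For reference, a minimal sketch of the agreed fix plus a roundtrip check, assuming the repo's serde helpers (`serde::to_iobuf`/`serde::from_iobuf`) and Boost.Test; the struct layout mirrors the diff above:

    // Corrected serde_fields: both members are listed, otherwise `result`
    // would be silently dropped during (de)serialization.
    struct controller_last_applied_reply
      : serde::envelope<
          controller_last_applied_reply,
          serde::version<0>,
          serde::compat_version<0>> {
        using rpc_adl_exempt = std::true_type;

        model::offset last_applied;
        errc result;

        auto serde_fields() { return std::tie(last_applied, result); }
    };

    // Hypothetical roundtrip test body: encode, decode, compare both fields.
    auto original = controller_last_applied_reply{
      .last_applied = model::offset{42}, .result = errc::success};
    auto decoded = serde::from_iobuf<controller_last_applied_reply>(
      serde::to_iobuf(original));
    BOOST_REQUIRE(decoded.last_applied == original.last_applied);
    BOOST_REQUIRE(decoded.result == original.result);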
src/v/cluster/service.h
Outdated
@@ -158,6 +162,7 @@ class service : public controller_service {
    ss::future<partition_state_reply>
    do_get_partition_state(partition_state_request);

    std::unique_ptr<controller>& _controller;
Can this be a raw pointer to the controller instead of a reference to a unique_ptr? This class should not be able to do things like reset the pointer.
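A sketch of what that suggestion could look like (illustrative, not the final diff):

    class service : public controller_service {
        // ...
        // Non-owning observer; the service cannot reset() or re-seat the
        // controller it merely uses. Lifetime is managed by the owner.
        controller* _controller;
    };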
src/v/cluster/security_frontend.cc
Outdated
@@ -339,4 +339,56 @@ security_frontend::get_bootstrap_user_creds_from_env() {
      std::in_place, std::move(username), std::move(credentials));
}

ss::future<result<model::offset>> security_frontend::get_leader_last_applied( |
nit: `get_remote_leader_last_applied`?
src/v/cluster/controller.json
Outdated
@@ -135,6 +135,11 @@
        "name": "get_partition_state",
        "input_type": "partition_state_request",
        "output_type": "partition_state_reply"
    },
    {
        "name": "get_controller_last_applied",
Wondering if this should be last applied or committed? Theoretically last_applied can be < committed_offset if the stm is lagging, so the response may contain an offset < the committed offset of the ACL command.
Isn't that the idea though? We want to return the last applied and wait on that. If the stm is lagging, then that's the same response that would have been returned if we had invoked the command on the leader anyway.
Not sure I follow... The specific case I had in mind is a change of leadership: if this request hits a leader different from the one that actually committed+applied the ACL request, there is no guarantee that last_applied >= the committed_offset of the ACL batch. Consider the following sequence of actions:
- create/delete-acl - broker 0 (current controller_leader) - raft committed_offset = applied_offset = 10 (command waits until applied_offset = committed_offset so that is guaranteed)
- leadership change - broker 1 (new controller leader) - raft_committed_offset is at least 10 (raft leadership guarantee) but there is no guarantee on applied_offset, theoretically it could be lagging
- now a follower makes a request to broker 1, it returns a stale applied_offset
I see your point: if at any given time an elected leader applied the command, the effect is considered performed, but in the situation you described we could still be reporting stale state. +1
src/v/cluster/service.cc
Outdated
        controller_last_applied_reply{
          .result = errc::not_leader_controller});
    }
    return _controller->get_last_applied_offset().then(
Do we need a linearizable barrier here before we return the committed offset? Theoretically a stale follower (making this request) could hit a stale leader (a broker that thinks it is the leader but is not).
Yeah, that is a good point; then it could be the case that after all of this waiting is done, a stale response is still returned.
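A hypothetical sketch of the barrier-first reply (the reply type and fields mirror the diff above; exact placement is illustrative):

    // Confirm leadership with a linearizable barrier before replying, so a
    // stale leader yields an error instead of a stale offset.
    return _controller->linearizable_barrier().then(
      [](result<model::offset> barrier) {
          if (barrier.has_error()) {
              return controller_last_applied_reply{
                .result = errc::not_leader_controller};
          }
          return controller_last_applied_reply{
            .last_applied = barrier.value(), .result = errc::success};
      });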
src/v/cluster/security_frontend.cc
Outdated
    if (leader == _self) {
        co_return make_error_code(errc::success);
    }
In this case, don't we need to wait until the stm catches up? e.g. the ACL is committed but the stm is not yet aware of it. I think we ultimately care whether the stm applied it (so that we can query the state) and not whether it is raft committed.
This PR has followers wait until the leader's last applied, whatever that value is now. If this logic is applied to the leader, the last applied has already been applied, so there is no need to wait. So the semantics now roughly match between how the command works when handled by a follower or a leader.
Same as the other comment...

> If this logic is applied to the leader, the last applied has already been applied,

This could be a different leader than the one that actually applied the command?
src/v/cluster/service.cc
Outdated
    vassert(
      r.error().category() == raft::error_category(),
      "Unexpected error_category encountered");
nit: do we need this assert? It doesn't seem critical enough to crash the broker if it fires.
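A non-fatal alternative could look roughly like this (the logging call and message are assumptions; the reply field comes from the diff):

    // Log the unexpected error category and map it to an error reply
    // instead of crashing the broker with a vassert.
    if (r.error().category() != raft::error_category()) {
        vlog(
          clusterlog.warn,
          "Unexpected error category waiting on barrier: {}",
          r.error().message());
        return controller_committed_offset_reply{
          .result = errc::replication_error};
    }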
src/v/cluster/types.h
Outdated
  serde::version<0>,
  serde::compat_version<0>> {
    using rpc_adl_exempt = std::true_type;
    model::offset last_applied;
last_committed?
src/v/cluster/service.cc
Outdated
    default:
        return controller_committed_offset_reply{
          .result = errc::replication_error};
What does replication_error mean here? A default of no_leader seems logical to me.
no_leader would have already hit the case above
I mean not returning replication_error at all; it seems a little out of place in this context of get_last_committed_offset? Also, the barrier doesn't replicate anything.
@@ -339,4 +339,59 @@ security_frontend::get_bootstrap_user_creds_from_env() {
      std::in_place, std::move(username), std::move(credentials));
nit: commit message s/last_applied/last_committed.
            self.superuser.username,
            self.superuser.password,
            self.superuser.algorithm)
        described = superclient.acl_list()
q: does it always go to a leader? in which case it always returns the freshest response, wondering if we can loop through all the brokers.
I spoke about this with Bharath offline; probably should have updated the comments here. In any case, the answer is that a random broker will be chosen.
tests/rptest/tests/rpk_acl_test.py
Outdated
        assert 'CREATE' in described, "Failed to modify ACL"

        # Network partition the leader away from the rest of the cluster
        fi = make_failure_injector(self.redpanda)
nit: better to use a `with` construct? (to heal_all() in case something throws)
this is done so I can pass in the node I want to isolate
not sure what you mean, I was thinking something like...

    with make_failure_injector(self.redpanda) as fi:
        fi.isolate(controller)
        # ... do something ...
tests/rptest/tests/rpk_acl_test.py
Outdated
        time.sleep(3)

        # Of the other remaining nodes, none can be declared a leader before
something missing here? Don't we need to make a request to the isolated node?
There is a missing acl_list() call here, nice catch.
@graphcareful so we can deterministically repro the issue and fix it with this PR. The cover letter seems unclear.
Sorry, what is the ask? Also, the tests fail reliably when run without the changes made here. Let me know how I can make the cover letter clearer too.
tests/rptest/tests/rpk_acl_test.py
Outdated
        # the election timeout occurs; also the "current" leader is technically
        # stale so it cannot be sure it's returning the freshest data either. In
        # all cases the log below should be observed on the node handling the req.
        self.redpanda.search_log_any(
need to assert the return value?
During development a bug with linearizable_barrier was discovered; a fix is in the works. Making a note that this PR depends on the fix here: #12990
couple of nits.
src/v/cluster/security_frontend.cc
Outdated
    auto leader = _leaders.local().get_leader(model::controller_ntp);
    if (!leader) {
        co_return make_error_code(errc::no_leader_controller);
    }

    result<model::offset> leader_committed = model::offset{};
    const auto now = model::timeout_clock::now();
    if (leader == _self) {
        leader_committed = co_await ss::smp::submit_to(
          controller_stm_shard, [this, timeout = now + timeout]() {
              return ss::with_timeout(
                       timeout, _controller->linearizable_barrier())
                .handle_exception_type([](const ss::timed_out_error&) {
                    return result<model::offset>(errc::timeout);
                });
          });
    } else {
        /// Waiting up until the leader committed offset means it's possible
        /// that waiting on an offset higher than necessary is performed, but
        /// the alternative of waiting on the last_applied isn't a complete
        /// solution, as it's possible that this offset is behind the actual
        /// last_applied of a previously elected leader, resulting in a stale
        /// response being returned.
        leader_committed = co_await get_leader_committed(*leader, timeout);
    }

    if (leader_committed.has_error()) {
        co_return leader_committed.error();
    }
nit: I think this should belong in `get_leader_committed`.
@@ -1037,7 +1037,7 @@ def _kafka_conn_settings(self):
    ]
commit message needs an update (no election timeout bump)
- When handling this request the responding broker must be the current leader and will return the last_committed offset within its respective controller log.
- This offset represents the highest command that has been completely processed by the controller log.
- This routine will contact the leader for the last_committed offset within the controller log, then wait for its local stm to have caught up to at least that point.
- Before handling describe ACLs requests, ensure that if this node is a follower, its processing of controller commands has caught up with the leader's as of the point in time the request was received.
- This raises the election timeout to 10s while network partitioning a leader, then making a describeACLs request.
- In this scenario any node that is queried should report that stale results may be returned. Previous followers will report this because they cannot reach the leader, and the stale leader will report this because it cannot inject a barrier since it has been network partitioned.
What is the expected behavior during a rolling upgrade when a describe ACLs request hits a controller that is on an older version and doesn't support this new RPC? This might be more common than we expect, too: if a rolling upgrade is having trouble, jumping on console or inspecting the system might involve ACLs being inspected.
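One way such a guard could look, as a purely hypothetical sketch (the flag name `describe_acls_catchup` and the `_feature_table` member are invented; as discussed later in the thread, the actual behavior is a best-effort fallback rather than a gate):

    // Hypothetical feature gate: skip the new catch-up RPC entirely until
    // every broker in the cluster understands it.
    if (!_feature_table.local().is_active(
          features::feature::describe_acls_catchup)) {
        // Old behavior: answer from the local cache without waiting.
        co_return make_error_code(errc::success);
    }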
    });
}

ss::future<std::error_code> security_frontend::wait_until_caughtup_with_leader( |
Why is this a security frontend interface? This seems like it should be generic.
I didn't want to offer this as a generic solution when there are currently no other consumers; figured someone else could break it out if they needed it. I'm fine with either, FWIW.
> figured someone else could break it out if they needed

How will they find it?
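For illustration, the generic placement the reviewer describes might look something like this (the signature and placement are assumptions, not part of the diff):

    // A generic, discoverable home for the wait: callers other than the
    // security frontend could find and reuse it on the controller itself.
    class controller {
    public:
        // Wait until the local controller_stm has applied at least the
        // leader's committed offset, or fail with a timeout/leadership error.
        ss::future<std::error_code>
        wait_until_caught_up_with_leader(model::timeout_clock::duration timeout);
    };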
    /// Waiting up until the leader committed offset means it's possible that
    /// waiting on an offset higher than necessary is performed, but the
    /// alternative of waiting on the last_applied isn't a complete solution,
    /// as it's possible that this offset is behind the actual last_applied
    /// of a previously elected leader, resulting in a stale response being
    /// returned.
I don't understand this comment.
We have an offset in the controller log and we want to wait until the controller log has been replayed locally up to at least that position. It seems like that is what we are doing, so I'm not sure what the nuance is that the comment seems to be describing. Could you elaborate?
Yes, that is what is going on. The comment is just explaining that we are waiting for an offset that is probably higher than necessary; specifically, why `leader_committed` is the chosen offset to wait on instead of `leader_applied`.
The RPC will fail, and in all cases when this occurs the behavior is to resort to handling the request the way it was handled before these changes. The changes in this PR make a best effort to grab the freshest ACL state, but if that can't happen within a predefined timeout, the state that exists on the current node is returned and a log is printed that describes the scenario.
Yep, it is on a best-effort basis. The same outcome is possible even in a fully upgraded cluster where the caller has trouble reaching the controller (an RPC timeout, for example). We just log the error, move on, and return whatever we have in the local cache.
@bharathv @graphcareful thanks for looking at my review. I was thinking that when the RPC server received a request for an unknown method we wanted errors to be logged, and so to avoid unnecessary errors in the log we needed to use feature gates to ensure invalid RPCs were never dispatched. But it does seem we log those unknown-method messages at debug level. Sorry about the noise there!
…ader Backport of #12716 avoid returning stale describeACL responses
This PR attempts to solve the issue of stale describeACLs responses returned when a preceding modification of an ACL was issued shortly before.
This situation currently occurs because all requests that wish to edit ACL state get proxied to the controller, while all requests to query this state may be handled by a follower. Followers will only have the most up-to-date copy of the current state when their respective controller_stms catch up with the leader's. Since the system is eventually consistent w.r.t. replicating the controller commands to followers, there is no guarantee that a follower returns the freshest result set.
The proposed fix in this PR is to create a new endpoint that allows a follower to query the last applied offset of the controller log from the leader node. Once obtained, the follower can then wait for this offset to be applied within its local controller_stm. When this wait completes, it's guaranteed that the follower is up to date with the leader at least to the offset the leader was at when the query was initially made.
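As a rough sketch of that flow, using names that appear in the diff (`get_leader_committed`, `wait_until_caughtup_with_leader`); `_controller->wait_for_offset` here is a hypothetical stand-in for "wait until the local stm has applied at least this offset":

    // Follower-side flow: 1) ask the (possibly remote) leader for its
    // committed controller offset, 2) wait for the local controller_stm to
    // apply up to at least that offset before serving the describe request.
    ss::future<std::error_code> security_frontend::wait_until_caughtup_with_leader(
      model::timeout_clock::duration timeout) {
        auto leader = _leaders.local().get_leader(model::controller_ntp);
        if (!leader) {
            co_return make_error_code(errc::no_leader_controller);
        }
        // Fetch the offset the leader has committed (RPC on the follower path,
        // linearizable barrier when this node is itself the leader).
        auto leader_committed = co_await get_leader_committed(*leader, timeout);
        if (leader_committed.has_error()) {
            co_return leader_committed.error();
        }
        // Block until the local stm has applied at least that offset, so reads
        // reflect the leader's state as of the moment the query was made.
        co_await _controller->wait_for_offset(
          leader_committed.value(), model::timeout_clock::now() + timeout);
        co_return make_error_code(errc::success);
    }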
Backports Required
Release Notes
Improvements