
CI Failure (TimeoutError: failed to wait until status condition) in PartitionBalancerTest.test_unavailable_nodes #9340

Closed
andrwng opened this issue Mar 9, 2023 · 4 comments · Fixed by #9384
Labels: area/replication, ci-failure, kind/bug (Something isn't working), sev/high (loss of availability, pathological performance degradation, recoverable corruption)

Comments

@andrwng (Contributor) commented Mar 9, 2023

https://buildkite.com/redpanda/redpanda/builds/24682#0186c43f-d263-4b00-ae7d-c8212dc3b008

Module: rptest.tests.partition_balancer_test
Class:  PartitionBalancerTest
Method: test_unavailable_nodes
test_id:    rptest.tests.partition_balancer_test.PartitionBalancerTest.test_unavailable_nodes
status:     FAIL
run time:   4 minutes 28.549 seconds


    TimeoutError('failed to wait until status condition')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 49, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/partition_balancer_test.py", line 387, in test_unavailable_nodes
    self.wait_until_ready(expected_unavailable_node=node)
  File "/root/tests/rptest/tests/partition_balancer_test.py", line 199, in wait_until_ready
    return self.wait_until_status(predicate, timeout_sec=timeout_sec)
  File "/root/tests/rptest/tests/partition_balancer_test.py", line 176, in wait_until_status
    return wait_until_result(
  File "/root/tests/rptest/util.py", line 88, in wait_until_result
    wait_until(wrapped_condition, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: failed to wait until status condition
@dlex (Contributor) commented Mar 10, 2023

Could these be related? #9315

@ztlpn (Contributor) commented Mar 10, 2023

Some of the partition movements could not finish because of leaderless partitions. For example, in the original run https://buildkite.com/redpanda/redpanda/builds/24682#0186c43f-d263-4b00-ae7d-c8212dc3b008 the partition kafka/topic-vtcmglqqxc/16 had the following problem:

$ grep kafka/topic-vtcmglqqxc/16 docker-rp-19/redpanda.log |  grep 'shard 1' | grep vote_stm
...
TRACE 2023-03-09 02:57:24,309 [shard 1] raft - [group_id:17, {kafka/topic-vtcmglqqxc/16}] prevote_stm.cc:50 - Sending prevote request to {id: {4}, revision: {35}} from {id: {5}, revision: {35}} with tout: 858206
TRACE 2023-03-09 02:57:24,309 [shard 1] raft - [group_id:17, {kafka/topic-vtcmglqqxc/16}] prevote_stm.cc:50 - Sending prevote request to {id: {1}, revision: {108}} from {id: {5}, revision: {35}} with tout: 858206
TRACE 2023-03-09 02:57:24,313 [shard 1] raft - [group_id:17, {kafka/topic-vtcmglqqxc/16}] prevote_stm.cc:89 - prevote ack: node {id: {4}, revision: {35}} caught up with a candidate
TRACE 2023-03-09 02:57:24,313 [shard 1] raft - [group_id:17, {kafka/topic-vtcmglqqxc/16}] vote_stm.cc:309 - Voting for self in term 2
INFO  2023-03-09 02:57:24,326 [shard 1] raft - [group_id:17, {kafka/topic-vtcmglqqxc/16}] vote_stm.cc:52 - Sending vote request to {id: {4}, revision: {35}} with timeout 1500
INFO  2023-03-09 02:57:24,326 [shard 1] raft - [group_id:17, {kafka/topic-vtcmglqqxc/16}] vote_stm.cc:52 - Sending vote request to {id: {1}, revision: {108}} with timeout 1500
INFO  2023-03-09 02:57:24,342 [shard 1] raft - [group_id:17, {kafka/topic-vtcmglqqxc/16}] vote_stm.cc:77 - vote reply from {id: {4}, revision: {35}} - {term:{2}, target_node_id{id: {5}, revision: {35}}, vote_granted: 1, log_ok:1}
TRACE 2023-03-09 02:57:24,342 [shard 1] raft - [group_id:17, {kafka/topic-vtcmglqqxc/16}] vote_stm.cc:255 - vote acks in term 2 from: {{id: {5}, revision: {35}}, {id: {4}, revision: {35}}}
INFO  2023-03-09 02:57:24,342 [shard 1] raft - [group_id:17, {kafka/topic-vtcmglqqxc/16}] vote_stm.cc:264 - becoming the leader term:2
TRACE 2023-03-09 02:57:25,809 [shard 1] raft - [group_id:17, {kafka/topic-vtcmglqqxc/16}] prevote_stm.cc:67 - prevote ack from {id: {1}, revision: {108}} timed out
INFO  2023-03-09 02:57:31,843 [shard 1] raft - [group_id:17, {kafka/topic-vtcmglqqxc/16}] vote_stm.cc:276 - unable to replicate configuration as a leader - error code: 10 - raft::errc::timeout

and that's the last vote_stm log.

So it looks like after docker-rp-19 won the election in term 2, it tried to replicate the configuration, failed, and then nothing happened. This looks like a legitimate bug in the leader election code, hence sev/high.
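
For illustration, here is a minimal, self-contained sketch (hypothetical C++, not the actual Redpanda raft code) of the failure mode the log suggests: the candidate wins the election, fails to replicate the configuration, and then simply gives up instead of retrying or stepping down, leaving the group leaderless.

#include <iostream>

enum class errc { success, timeout };

struct consensus_stub {
    bool leader = false;

    errc replicate_configuration() {
        // Stand-in for the step that timed out in the log:
        // "unable to replicate configuration as a leader - error code: 10"
        return errc::timeout;
    }

    void on_election_won(int term) {
        leader = true;
        std::cout << "becoming the leader term:" << term << "\n";
        if (replicate_configuration() != errc::success) {
            // Suspected bug pattern: on failure we return without retrying,
            // stepping down, or re-arming the election timer, so no further
            // vote_stm activity happens and the partition stays leaderless.
            std::cout << "unable to replicate configuration as a leader\n";
            return;
        }
        std::cout << "configuration replicated, leadership established\n";
    }
};

int main() {
    consensus_stub node5;
    node5.on_election_won(2);
}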

@ztlpn added the sev/high (loss of availability, pathological performance degradation, recoverable corruption) label on Mar 10, 2023
@ztlpn (Contributor) commented Mar 10, 2023

Some recent changes in this area: #9273. Could they be related?

mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Mar 10, 2023
If a node is not the leader, it should not update the follower sequence number, as
it may have advanced after the sequence number was reset.

Fixes: redpanda-data#9340

Signed-off-by: Michal Maslanka <michal@redpanda.com>
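
As an illustration of the condition described above, here is a minimal sketch (hypothetical types and names, not the actual fix in #9384) of a reply handler that only advances a follower's sequence number while the node is still the leader, so a stale in-flight reply cannot move the counter after it has been reset.

#include <algorithm>
#include <cstdint>
#include <unordered_map>

using seq_t = uint64_t;

struct follower_state {
    std::unordered_map<int, seq_t> last_received_seq; // node id -> sequence
    bool is_leader = false;

    void step_down() {
        // Losing leadership resets the per-follower sequence tracking.
        last_received_seq.clear();
        is_leader = false;
    }

    void on_append_entries_reply(int node_id, seq_t reply_seq) {
        // The guard: skip the update entirely when not the leader, otherwise
        // a reply that was in flight before the reset could advance the
        // sequence number past the freshly reset value.
        if (!is_leader) {
            return;
        }
        auto& seq = last_received_seq[node_id];
        seq = std::max(seq, reply_seq);
    }
};

int main() {
    follower_state f;
    f.is_leader = true;
    f.on_append_entries_reply(4, 10);
    f.step_down();
    f.on_append_entries_reply(4, 11); // late reply is ignored after the reset
}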
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Mar 10, 2023 (cherry picked from commit 8788c2d)
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Mar 10, 2023 (cherry picked from commit 8788c2d)
ZeDRoman pushed a commit to vbotbuildovich/redpanda that referenced this issue Mar 10, 2023 (cherry picked from commit 8788c2d)
ballard26 pushed a commit to ballard26/redpanda that referenced this issue Mar 20, 2023