
CI Failure (TimeoutError: failed to wait until status condition) in PartitionBalancerTest.test_unavailable_nodes #9340

Closed
andrwng opened this issue Mar 9, 2023 · 4 comments · Fixed by #9384
Labels: area/replication, ci-failure, kind/bug (Something isn't working), sev/high (loss of availability, pathological performance degradation, recoverable corruption)

Comments

@andrwng (Contributor) commented Mar 9, 2023

https://buildkite.com/redpanda/redpanda/builds/24682#0186c43f-d263-4b00-ae7d-c8212dc3b008

Module: rptest.tests.partition_balancer_test
Class:  PartitionBalancerTest
Method: test_unavailable_nodes
test_id:    rptest.tests.partition_balancer_test.PartitionBalancerTest.test_unavailable_nodes
status:     FAIL
run time:   4 minutes 28.549 seconds


    TimeoutError('failed to wait until status condition')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 49, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/partition_balancer_test.py", line 387, in test_unavailable_nodes
    self.wait_until_ready(expected_unavailable_node=node)
  File "/root/tests/rptest/tests/partition_balancer_test.py", line 199, in wait_until_ready
    return self.wait_until_status(predicate, timeout_sec=timeout_sec)
  File "/root/tests/rptest/tests/partition_balancer_test.py", line 176, in wait_until_status
    return wait_until_result(
  File "/root/tests/rptest/util.py", line 88, in wait_until_result
    wait_until(wrapped_condition, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: failed to wait until status condition
@dlex (Contributor) commented Mar 10, 2023

Could these be related? #9315

@ztlpn (Contributor) commented Mar 10, 2023

Some of the partition movements could not finish because of leaderless partitions. For example, in the original run https://buildkite.com/redpanda/redpanda/builds/24682#0186c43f-d263-4b00-ae7d-c8212dc3b008 the partition kafka/topic-vtcmglqqxc/16 had the following problem:

$ grep kafka/topic-vtcmglqqxc/16 docker-rp-19/redpanda.log |  grep 'shard 1' | grep vote_stm
...
TRACE 2023-03-09 02:57:24,309 [shard 1] raft - [group_id:17, {kafka/topic-vtcmglqqxc/16}] prevote_stm.cc:50 - Sending prevote request to {id: {4}, revision: {35}} from {id: {5}, revision: {35}} with tout: 858206
TRACE 2023-03-09 02:57:24,309 [shard 1] raft - [group_id:17, {kafka/topic-vtcmglqqxc/16}] prevote_stm.cc:50 - Sending prevote request to {id: {1}, revision: {108}} from {id: {5}, revision: {35}} with tout: 858206
TRACE 2023-03-09 02:57:24,313 [shard 1] raft - [group_id:17, {kafka/topic-vtcmglqqxc/16}] prevote_stm.cc:89 - prevote ack: node {id: {4}, revision: {35}} caught up with a candidate
TRACE 2023-03-09 02:57:24,313 [shard 1] raft - [group_id:17, {kafka/topic-vtcmglqqxc/16}] vote_stm.cc:309 - Voting for self in term 2
INFO  2023-03-09 02:57:24,326 [shard 1] raft - [group_id:17, {kafka/topic-vtcmglqqxc/16}] vote_stm.cc:52 - Sending vote request to {id: {4}, revision: {35}} with timeout 1500
INFO  2023-03-09 02:57:24,326 [shard 1] raft - [group_id:17, {kafka/topic-vtcmglqqxc/16}] vote_stm.cc:52 - Sending vote request to {id: {1}, revision: {108}} with timeout 1500
INFO  2023-03-09 02:57:24,342 [shard 1] raft - [group_id:17, {kafka/topic-vtcmglqqxc/16}] vote_stm.cc:77 - vote reply from {id: {4}, revision: {35}} - {term:{2}, target_node_id{id: {5}, revision: {35}}, vote_granted: 1, log_ok:1}
TRACE 2023-03-09 02:57:24,342 [shard 1] raft - [group_id:17, {kafka/topic-vtcmglqqxc/16}] vote_stm.cc:255 - vote acks in term 2 from: {{id: {5}, revision: {35}}, {id: {4}, revision: {35}}}
INFO  2023-03-09 02:57:24,342 [shard 1] raft - [group_id:17, {kafka/topic-vtcmglqqxc/16}] vote_stm.cc:264 - becoming the leader term:2
TRACE 2023-03-09 02:57:25,809 [shard 1] raft - [group_id:17, {kafka/topic-vtcmglqqxc/16}] prevote_stm.cc:67 - prevote ack from {id: {1}, revision: {108}} timed out
INFO  2023-03-09 02:57:31,843 [shard 1] raft - [group_id:17, {kafka/topic-vtcmglqqxc/16}] vote_stm.cc:276 - unable to replicate configuration as a leader - error code: 10 - raft::errc::timeout

and that's the last vote_stm log.

So it looks like after docker-rp-19 won the election in term 2, it tried to replicate the configuration, failed, and then nothing happened. This looks like a legitimate bug in the leader election code, hence sev/high.
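
For illustration, here is a minimal, self-contained sketch (hypothetical C++, not the actual Redpanda raft code) of the failure mode the log suggests: the candidate wins the election, fails to replicate the configuration, and then simply gives up instead of retrying or stepping down, leaving the group leaderless.

#include <iostream>

enum class errc { success, timeout };

struct consensus_stub {
    bool leader = false;

    errc replicate_configuration() {
        // Stand-in for the step that timed out in the log:
        // "unable to replicate configuration as a leader - error code: 10"
        return errc::timeout;
    }

    void on_election_won(int term) {
        leader = true;
        std::cout << "becoming the leader term:" << term << "\n";
        if (replicate_configuration() != errc::success) {
            // Suspected bug pattern: on failure we return without retrying,
            // stepping down, or re-arming the election timer, so no further
            // vote_stm activity happens and the partition stays leaderless.
            std::cout << "unable to replicate configuration as a leader\n";
            return;
        }
        std::cout << "configuration replicated, leadership established\n";
    }
};

int main() {
    consensus_stub node5;
    node5.on_election_won(2);
}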

@ztlpn added the sev/high (loss of availability, pathological performance degradation, recoverable corruption) label on Mar 10, 2023
@ztlpn (Contributor) commented Mar 10, 2023

Some recent changes in this area: #9273. Could they be related?

mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Mar 10, 2023
If a node is not the leader, it should not update the follower sequence number, as
it may have advanced after the sequence number was reset.

Fixes: redpanda-data#9340

Signed-off-by: Michal Maslanka <michal@redpanda.com>
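
As an illustration of the condition described above, here is a minimal sketch (hypothetical types and names, not the actual fix in #9384) of a reply handler that only advances a follower's sequence number while the node is still the leader, so a stale in-flight reply cannot move the counter after it has been reset.

#include <algorithm>
#include <cstdint>
#include <unordered_map>

using seq_t = uint64_t;

struct follower_state {
    std::unordered_map<int, seq_t> last_received_seq; // node id -> sequence
    bool is_leader = false;

    void step_down() {
        // Losing leadership resets the per-follower sequence tracking.
        last_received_seq.clear();
        is_leader = false;
    }

    void on_append_entries_reply(int node_id, seq_t reply_seq) {
        // The guard: skip the update entirely when not the leader, otherwise
        // a reply that was in flight before the reset could advance the
        // sequence number past the freshly reset value.
        if (!is_leader) {
            return;
        }
        auto& seq = last_received_seq[node_id];
        seq = std::max(seq, reply_seq);
    }
};

int main() {
    follower_state f;
    f.is_leader = true;
    f.on_append_entries_reply(4, 10);
    f.step_down();
    f.on_append_entries_reply(4, 11); // late reply is ignored after the reset
}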
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Mar 10, 2023 (cherry picked from commit 8788c2d)
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Mar 10, 2023 (cherry picked from commit 8788c2d)
ZeDRoman pushed a commit to vbotbuildovich/redpanda that referenced this issue Mar 10, 2023 (cherry picked from commit 8788c2d)
ballard26 pushed a commit to ballard26/redpanda that referenced this issue Mar 20, 2023