CI Failure ("can't fetch stable replicas") in PartitionMoveInterruption.test_cancelling_partition_move #9243

Closed
ztlpn opened this issue Mar 2, 2023 · 8 comments · Fixed by #11905
Labels: area/replication, ci-failure, kind/bug, sev/medium
Comments

ztlpn commented Mar 2, 2023

https://buildkite.com/redpanda/redpanda/builds/24275#0186a0e1-3750-40f1-8228-0be893ddf7dc

Module: rptest.tests.partition_move_interruption_test
Class:  PartitionMoveInterruption
Method: test_cancelling_partition_move
Arguments:
{
  "compacted": true,
  "recovery": "no_recovery",
  "replication_factor": 3,
  "unclean_abort": true
}
====================================================================================================
test_id:    rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancelling_partition_move.replication_factor=3.unclean_abort=True.recovery=no_recovery.compacted=True
status:     FAIL
run time:   2 minutes 50.916 seconds


    TimeoutError("can't fetch stable replicas for kafka/topic-mbfxjwngaf/16 within 90 sec")
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 49, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/partition_move_interruption_test.py", line 181, in test_cancelling_partition_move
    self._random_move_and_cancel(unclean_abort)
  File "/root/tests/rptest/tests/partition_move_interruption_test.py", line 133, in _random_move_and_cancel
    self._request_move_cancel(unclean_abort=unclean_abort,
  File "/root/tests/rptest/tests/partition_movement.py", line 285, in _request_move_cancel
    self._wait_post_cancel(topic, partition, previous_assignment,
  File "/root/tests/rptest/tests/partition_movement.py", line 146, in _wait_post_cancel
    result_configuration = admin.wait_stable_configuration(
  File "/root/tests/rptest/services/admin.py", line 262, in wait_stable_configuration
    return wait_until_result(
  File "/root/tests/rptest/util.py", line 88, in wait_until_result
    wait_until(wrapped_condition, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: can't fetch stable replicas for kafka/topic-mbfxjwngaf/16 within 90 sec
Test requested 7 nodes, used only 6
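
For context, the check that times out is admin.wait_stable_configuration (called from _wait_post_cancel): it repeatedly queries the Admin API partition endpoint on the cluster nodes until they all agree on a leader that belongs to the reported replica set. Below is a minimal sketch of that polling loop, assuming requests and a fixed node list; it is illustrative, not the actual rptest/services/admin.py code:

import time
import requests

def wait_stable_configuration(hosts, namespace, topic, partition, timeout_sec=90):
    # Poll /v1/partitions/<ns>/<topic>/<partition> on every node until they all
    # report the same replica set and a leader that belongs to it (sketch only).
    deadline = time.time() + timeout_sec
    while time.time() < deadline:
        infos = [
            requests.get(
                f"http://{host}:9644/v1/partitions/{namespace}/{topic}/{partition}"
            ).json() for host in hosts
        ]
        leaders = {info["leader_id"] for info in infos}
        replica_sets = {
            tuple(sorted(r["node_id"] for r in info["replicas"])) for info in infos
        }
        if len(leaders) == 1 and len(replica_sets) == 1:
            leader = next(iter(leaders))
            if leader is not None and leader in next(iter(replica_sets)):
                return infos[0]
        time.sleep(1)
    raise TimeoutError(
        f"can't fetch stable replicas for {namespace}/{topic}/{partition} "
        f"within {timeout_sec} sec")

In the later reproduction analyzed below, the loop never converged because one node kept reporting a different leader than the rest.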
ztlpn commented Jun 21, 2023

So this is pretty interesting (dissecting logs for build 30615). Looks like a very subtle bug in force-cancellation.

Background:

  • ntp in question is __consumer_offsets/15
  • first an update was issued at revision 63: 2/1, 3/1, 4/1 -> 2/0, 3/0, 1/0
  • then a force-cancellation at revision 64

The first-order symptom is that the force-cancellation finished, but leadership info diverged between nodes: node docker-rp-18 thought that the leader was 4:

[DEBUG - 2023-06-05 16:51:40,484 - admin - _get_configuration - lineno:131]: Dispatching GET http://docker-rp-18:9644/v1/partitions/kafka/__consumer_offsets/15
[DEBUG - 2023-06-05 16:51:40,494 - admin - _get_configuration - lineno:139]: Response OK, JSON: {'ns': 'kafka', 'topic': '__consumer_offsets', 'partition_id': 15, 'status': 'done', 'leader_id': 4, 'raft_group_id': 36, 'replicas': [{'node_id': 3, 'core': 1}, {'node_id': 4, 'core': 1}, {'node_id': 2, 'core': 1}]}
[DEBUG - 2023-06-05 16:51:40,494 - admin - _get_stable_configuration - lineno:213]: got leader:4 but observed 2 before

while everyone else thought that the leader was 2 (which was true at the raft level).

The reason was that the entry in partition_leaders_table on node docker-rp-18 had revision 63, while everywhere else it had revision 50 (and the incoming metadata updates also carried revision 50, so they were rejected).
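
A toy model of why those revision-50 updates were dropped: leader-table updates are presumably gated on the revision, so a node whose entry already carries revision 63 will reject updates stamped with the older revision 50. The real logic lives in the C++ partition_leaders_table; the Python below only illustrates the effect:

class PartitionLeadersTable:
    # Toy, revision-gated leader cache (illustrative only).
    def __init__(self):
        self._leaders = {}  # ntp -> (update_revision, leader_id)

    def update_leader(self, ntp, revision, leader_id):
        current = self._leaders.get(ntp)
        if current is not None and revision < current[0]:
            return False  # stale update: older revision than what we already have
        self._leaders[ntp] = (revision, leader_id)
        return True

table = PartitionLeadersTable()
# docker-rp-18 recorded the leader with update revision 63 ...
table.update_leader("kafka/__consumer_offsets/15", revision=63, leader_id=4)
# ... so later metadata updates carrying revision 50 (leader 2) are rejected,
# and the node keeps reporting a leader that no longer matches raft reality.
assert not table.update_leader("kafka/__consumer_offsets/15", revision=50, leader_id=2)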

How did it happen? Approximate timeline:

  • node docker-rp-21 (node id: 4, leader in term 4) executes the force_abort_update delta @ 16:51:04,765 and appends the aborting configuration (with revision 64, offset 4061) @ 16:51:04,780
  • same node docker-rp-21 (immediately after the configuration was flushed to the local log) issues the update finished command @ 16:51:04,842
  • for some reason this configuration doesn't get replicated to other nodes (I guess other nodes were busy with cross-shard updates)
  • on node docker-rp-20 (node id: 3) the update and force-abort-update are not fully executed (because the finishing command already appears), so the node didn't wait for the updated configurations to be replicated, and the latest configuration there still has the initial revision 50.
  • after a series of elections, node docker-rp-9 (node id: 2) wins the election in term 7 @ 16:51:16,559
  • it then tries to replicate its current configuration (revision: 50, offset: 4056) to other nodes.
  • the log at node docker-rp-21 gets truncated, and the configurations with revisions 63 and 64 are erased.
  • the only place with some memory about the configuration with revision 63 is the partition leaders table on node docker-rp-18 (id: 1)

Now, I guess the immediate symptom can be fixed by something like #9300, but the fact that the configuration revision goes backwards doesn't look right.

I think the proper fix is to delay issuing the update_finished command. Until when? I would say until the configuration with revision 64 is committed.
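
In sketch form, the proposed ordering change looks like this (invented names, Python stand-ins for what is really C++ in the controller/raft path):

class FakeRaft:
    # Minimal stand-in for the raft group, just enough to show the ordering.
    def __init__(self):
        self.dirty_offset = 4061      # offset of the forced (revision 64) configuration
        self.committed_offset = 4056  # majority has only seen the old configuration so far

    def append_forced_configuration(self, revision):
        return self.dirty_offset  # force-abort appends the aborting configuration locally

    def wait_until_committed(self, offset):
        # in the real fix this would block until a majority replicated `offset`;
        # here we just pretend commitment advanced
        self.committed_offset = max(self.committed_offset, offset)

def force_abort_update(raft, ntp, forced_revision):
    offset = raft.append_forced_configuration(forced_revision)
    # key change: do NOT issue update_finished right after the local flush;
    # wait until the forced configuration is committed, so a leader elected
    # on the old configuration can no longer truncate it away
    raft.wait_until_committed(offset)
    return f"update_finished({ntp}, revision={forced_revision})"

print(force_abort_update(FakeRaft(), "kafka/__consumer_offsets/15", 64))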

ztlpn added the sev/medium label on Jun 21, 2023
piyushredpanda commented:

@ztlpn coming in hot with the analysis.

mmaslankaprv assigned mmaslankaprv and unassigned ztlpn on Jul 6, 2023
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Jul 6, 2023
When forcibly aborting reconfiguration we should wait for the new leader
to be elected in the configuration that the partition was forced to.
This way we can be certain that the new configuration will finally be
replicated to the majority of nodes even though the leader may not
exist at the time when the configuration is replicated.

Fixes: redpanda-data#9243

Signed-off-by: Michal Maslanka <michal@redpanda.com>
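
In other words, the merged fix delays update_finished until a leader has been elected among the replicas the partition was forced to; such a leader will eventually replicate the forced configuration to a majority even if it did not exist when the configuration was first written. A small illustration with hypothetical helpers (the real change is in the C++ controller code):

import time

def current_leader(ntp):
    # stand-in for querying raft leadership; invented for this sketch
    return 2

def wait_for_leader_in_forced_configuration(ntp, forced_replicas, timeout_sec=30):
    # only once a leader from the forced replica set exists is it safe to
    # consider the forced reconfiguration finished
    deadline = time.time() + timeout_sec
    while time.time() < deadline:
        leader = current_leader(ntp)
        if leader is not None and leader in forced_replicas:
            return leader
        time.sleep(0.5)
    raise TimeoutError(f"no leader elected in forced configuration for {ntp}")

wait_for_leader_in_forced_configuration("kafka/__consumer_offsets/15", {2, 3, 4})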