CI Failure ("can't fetch stable replicas") in PartitionMoveInterruption.test_cancelling_partition_move #9243

Closed
ztlpn opened this issue Mar 2, 2023 · 8 comments · Fixed by #11905
Labels: area/replication, ci-failure, kind/bug, sev/medium
Comments

ztlpn commented Mar 2, 2023

https://buildkite.com/redpanda/redpanda/builds/24275#0186a0e1-3750-40f1-8228-0be893ddf7dc

Module: rptest.tests.partition_move_interruption_test
Class:  PartitionMoveInterruption
Method: test_cancelling_partition_move
Arguments:
{
  "compacted": true,
  "recovery": "no_recovery",
  "replication_factor": 3,
  "unclean_abort": true
}
====================================================================================================
test_id:    rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancelling_partition_move.replication_factor=3.unclean_abort=True.recovery=no_recovery.compacted=True
status:     FAIL
run time:   2 minutes 50.916 seconds


    TimeoutError("can't fetch stable replicas for kafka/topic-mbfxjwngaf/16 within 90 sec")
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 49, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/partition_move_interruption_test.py", line 181, in test_cancelling_partition_move
    self._random_move_and_cancel(unclean_abort)
  File "/root/tests/rptest/tests/partition_move_interruption_test.py", line 133, in _random_move_and_cancel
    self._request_move_cancel(unclean_abort=unclean_abort,
  File "/root/tests/rptest/tests/partition_movement.py", line 285, in _request_move_cancel
    self._wait_post_cancel(topic, partition, previous_assignment,
  File "/root/tests/rptest/tests/partition_movement.py", line 146, in _wait_post_cancel
    result_configuration = admin.wait_stable_configuration(
  File "/root/tests/rptest/services/admin.py", line 262, in wait_stable_configuration
    return wait_until_result(
  File "/root/tests/rptest/util.py", line 88, in wait_until_result
    wait_until(wrapped_condition, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: can't fetch stable replicas for kafka/topic-mbfxjwngaf/16 within 90 sec
Test requested 7 nodes, used only 6
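
For context, the check that times out is admin.wait_stable_configuration (called from _wait_post_cancel): it repeatedly queries the Admin API partition endpoint on the cluster nodes until they all agree on a leader that belongs to the reported replica set. Below is a minimal sketch of that polling loop, assuming requests and a fixed node list; it is illustrative, not the actual rptest/services/admin.py code:

import time
import requests

def wait_stable_configuration(hosts, namespace, topic, partition, timeout_sec=90):
    # Poll /v1/partitions/<ns>/<topic>/<partition> on every node until they all
    # report the same replica set and a leader that belongs to it (sketch only).
    deadline = time.time() + timeout_sec
    while time.time() < deadline:
        infos = [
            requests.get(
                f"http://{host}:9644/v1/partitions/{namespace}/{topic}/{partition}"
            ).json() for host in hosts
        ]
        leaders = {info["leader_id"] for info in infos}
        replica_sets = {
            tuple(sorted(r["node_id"] for r in info["replicas"])) for info in infos
        }
        if len(leaders) == 1 and len(replica_sets) == 1:
            leader = next(iter(leaders))
            if leader is not None and leader in next(iter(replica_sets)):
                return infos[0]
        time.sleep(1)
    raise TimeoutError(
        f"can't fetch stable replicas for {namespace}/{topic}/{partition} "
        f"within {timeout_sec} sec")

In the later reproduction analyzed below, the loop never converged because one node kept reporting a different leader than the rest.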
ztlpn commented Jun 21, 2023

So this is pretty interesting (dissecting logs for build 30615). Looks like a very subtle bug in force-cancellation.

Background:

  • ntp in question is __consumer_offsets/15
  • first an update was issued at revision 63: 2/1, 3/1, 4/1 -> 2/0, 3/0, 1/0
  • then a force-cancellation at revision 64

The first-order symptom is that the force-cancellation finished, but leadership info diverged between nodes: node docker-rp-18 thought that the leader was 4:

[DEBUG - 2023-06-05 16:51:40,484 - admin - _get_configuration - lineno:131]: Dispatching GET http://docker-rp-18:9644/v1/partitions/kafka/__consumer_offsets/15
[DEBUG - 2023-06-05 16:51:40,494 - admin - _get_configuration - lineno:139]: Response OK, JSON: {'ns': 'kafka', 'topic': '__consumer_offsets', 'partition_id': 15, 'status': 'done', 'leader_id': 4, 'raft_group_id': 36, 'replicas': [{'node_id': 3, 'core': 1}, {'node_id': 4, 'core': 1}, {'node_id': 2, 'core': 1}]}
[DEBUG - 2023-06-05 16:51:40,494 - admin - _get_stable_configuration - lineno:213]: got leader:4 but observed 2 before

while everyone else thought that the leader was 2 (which was true at the raft level).

The reason was that the entry in partition_leaders_table on node docker-rp-18 had revision 63, while everywhere else it had revision 50 (and the incoming metadata updates also carried revision 50, so they were rejected).
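
A toy model of why those revision-50 updates were dropped: leader-table updates are presumably gated on the revision, so a node whose entry already carries revision 63 will reject updates stamped with the older revision 50. The real logic lives in the C++ partition_leaders_table; the Python below only illustrates the effect:

class PartitionLeadersTable:
    # Toy, revision-gated leader cache (illustrative only).
    def __init__(self):
        self._leaders = {}  # ntp -> (update_revision, leader_id)

    def update_leader(self, ntp, revision, leader_id):
        current = self._leaders.get(ntp)
        if current is not None and revision < current[0]:
            return False  # stale update: older revision than what we already have
        self._leaders[ntp] = (revision, leader_id)
        return True

table = PartitionLeadersTable()
# docker-rp-18 recorded the leader with update revision 63 ...
table.update_leader("kafka/__consumer_offsets/15", revision=63, leader_id=4)
# ... so later metadata updates carrying revision 50 (leader 2) are rejected,
# and the node keeps reporting a leader that no longer matches raft reality.
assert not table.update_leader("kafka/__consumer_offsets/15", revision=50, leader_id=2)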

How did it happen? Approximate timeline:

  • node docker-rp-21 (node id: 4, leader in term 4) executes the force_abort_update delta @ 16:51:04,765 and appends the aborting configuration (with revision 64, offset 4061) @ 16:51:04,780
  • same node docker-rp-21 (immediately after the configuration was flushed to the local log) issues the update finished command @ 16:51:04,842
  • for some reason this configuration doesn't get replicated to other nodes (I guess other nodes were busy with cross-shard updates)
  • on node docker-rp-20 (node id: 3) the update and force-abort-update are not fully executed (because the finishing command already appears), so the node didn't wait for the updated configurations to be replicated, and the latest configuration there still has the initial revision 50.
  • after a series of elections, node docker-rp-9 (node id: 2) wins the election in term 7 @ 16:51:16,559
  • it then tries to replicate its current configuration (revision: 50, offset: 4056) to other nodes.
  • the log at node docker-rp-21 gets truncated, and the configurations with revisions 63 and 64 are erased.
  • the only place with some memory about the configuration with revision 63 is the partition leaders table on node docker-rp-18 (id: 1)

Now, I guess the immediate symptom can be fixed by something like #9300, but the fact that the configuration revision goes backwards doesn't look right.

I think the proper fix is to delay issuing the update_finished command. Until when? I would say until the configuration with revision 64 is committed.
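
In sketch form, the proposed ordering change looks like this (invented names, Python stand-ins for what is really C++ in the controller/raft path):

class FakeRaft:
    # Minimal stand-in for the raft group, just enough to show the ordering.
    def __init__(self):
        self.dirty_offset = 4061      # offset of the forced (revision 64) configuration
        self.committed_offset = 4056  # majority has only seen the old configuration so far

    def append_forced_configuration(self, revision):
        return self.dirty_offset  # force-abort appends the aborting configuration locally

    def wait_until_committed(self, offset):
        # in the real fix this would block until a majority replicated `offset`;
        # here we just pretend commitment advanced
        self.committed_offset = max(self.committed_offset, offset)

def force_abort_update(raft, ntp, forced_revision):
    offset = raft.append_forced_configuration(forced_revision)
    # key change: do NOT issue update_finished right after the local flush;
    # wait until the forced configuration is committed, so a leader elected
    # on the old configuration can no longer truncate it away
    raft.wait_until_committed(offset)
    return f"update_finished({ntp}, revision={forced_revision})"

print(force_abort_update(FakeRaft(), "kafka/__consumer_offsets/15", 64))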

ztlpn added the sev/medium label on Jun 21, 2023
piyushredpanda commented:

@ztlpn coming in hot with the analysis.

mmaslankaprv assigned mmaslankaprv and unassigned ztlpn on Jul 6, 2023
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Jul 6, 2023
When forcibly aborting reconfiguration we should wait for the new leader
to be elected in the configuration that the partition was forced to.
This way we can be certain that the new configuration will finally be
replicated to the majority of nodes even though the leader may not
exist at the time when the configuration is replicated.

Fixes: redpanda-data#9243

Signed-off-by: Michal Maslanka <michal@redpanda.com>
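
In other words, the merged fix delays update_finished until a leader has been elected among the replicas the partition was forced to; such a leader will eventually replicate the forced configuration to a majority even if it did not exist when the configuration was first written. A small illustration with hypothetical helpers (the real change is in the C++ controller code):

import time

def current_leader(ntp):
    # stand-in for querying raft leadership; invented for this sketch
    return 2

def wait_for_leader_in_forced_configuration(ntp, forced_replicas, timeout_sec=30):
    # only once a leader from the forced replica set exists is it safe to
    # consider the forced reconfiguration finished
    deadline = time.time() + timeout_sec
    while time.time() < deadline:
        leader = current_leader(ntp)
        if leader is not None and leader in forced_replicas:
            return leader
        time.sleep(0.5)
    raise TimeoutError(f"no leader elected in forced configuration for {ntp}")

wait_for_leader_in_forced_configuration("kafka/__consumer_offsets/15", {2, 3, 4})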