Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI Failure (ducktape.errors.TimeoutError: topic_storage_purged) in TopicDeleteCloudStorageTest.topic_delete_unavailable_test #14065

Closed
mmaslankaprv opened this issue Oct 10, 2023 · 15 comments · Fixed by #14429
Labels
area/cloud-storage Shadow indexing subsystem ci-failure kind/bug Something isn't working sev/low Bugs which are non-functional paper cuts, e.g. typos, issues in log messages

Comments

@mmaslankaprv
Copy link
Member

mmaslankaprv commented Oct 10, 2023

https://buildkite.com/redpanda/redpanda/builds/38604

Module: rptest.tests.topic_delete_test
Class: TopicDeleteCloudStorageTest
Method: topic_delete_unavailable_test
Arguments: {
    "cloud_storage_type": 2
}
test_id:    TopicDeleteCloudStorageTest.topic_delete_unavailable_test
status:     FAIL
run time:   66.323 seconds

TimeoutError('')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 264, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 481, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/utils/mode_checks.py", line 63, in f
    return func(*args, **kwargs)
  File "/root/tests/rptest/services/cluster.py", line 82, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/topic_delete_test.py", line 625, in topic_delete_unavailable_test
    wait_until(lambda: topic_storage_purged(self.redpanda, self.topic),
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError

JIRA Link: CORE-1496

@mmaslankaprv mmaslankaprv added kind/bug Something isn't working ci-failure labels Oct 10, 2023
@andijcr andijcr added the area/cloud-storage Shadow indexing subsystem label Oct 13, 2023
@andijcr
Copy link
Contributor

andijcr commented Oct 13, 2023

failure in debug

the files that are not being deleted, by the test

[INFO  - 2023-10-10 07:53:16,154 - topic_delete_test - topic_storage_purged - lineno:106]: Files remain for topic topic-jizgvjvxsg:
[INFO  - 2023-10-10 07:53:16,154 - topic_delete_test - topic_storage_purged - lineno:112]:   docker-rp-13: tx.snapshot
[INFO  - 2023-10-10 07:53:16,154 - topic_delete_test - topic_storage_purged - lineno:112]:   docker-rp-13: 64-1-v1.base_index
[INFO  - 2023-10-10 07:53:16,154 - topic_delete_test - topic_storage_purged - lineno:112]:   docker-rp-13: 66-1-v1.log
[INFO  - 2023-10-10 07:53:16,154 - topic_delete_test - topic_storage_purged - lineno:112]:   docker-rp-13: 72-1-v1.base_index
[INFO  - 2023-10-10 07:53:16,154 - topic_delete_test - topic_storage_purged - lineno:112]:   docker-rp-13: archival_metadata.snapshot
[INFO  - 2023-10-10 07:53:16,154 - topic_delete_test - topic_storage_purged - lineno:112]:   docker-rp-13: 68-1-v1.base_index
[INFO  - 2023-10-10 07:53:16,154 - topic_delete_test - topic_storage_purged - lineno:112]:   docker-rp-13: 70-1-v1.log
[INFO  - 2023-10-10 07:53:16,154 - topic_delete_test - topic_storage_purged - lineno:112]:   docker-rp-13: 72-1-v1.log
[INFO  - 2023-10-10 07:53:16,154 - topic_delete_test - topic_storage_purged - lineno:112]:   docker-rp-13: 74-1-v1.log
[INFO  - 2023-10-10 07:53:16,154 - topic_delete_test - topic_storage_purged - lineno:112]:   docker-rp-13: 70-1-v1.base_index
[INFO  - 2023-10-10 07:53:16,154 - topic_delete_test - topic_storage_purged - lineno:112]:   docker-rp-13: 66-1-v1.base_index
[INFO  - 2023-10-10 07:53:16,155 - topic_delete_test - topic_storage_purged - lineno:112]:   docker-rp-13: 68-1-v1.log
[INFO  - 2023-10-10 07:53:16,155 - topic_delete_test - topic_storage_purged - lineno:112]:   docker-rp-13: 64-1-v1.log

looking in docker-rp-13, there are a couple of instances of

DEBUG 2023-10-10 07:52:48,747 [shard 0:main] archival - [fiber1] - upload_housekeeping_service.cc:163 - Activating upload housekeeping
DEBUG 2023-10-10 07:52:48,747 [shard 0:main] archival - upload_housekeeping_service.cc:467 - housekeeping_workflow::resume, state: active, backlog: 1
DEBUG 2023-10-10 07:52:48,747 [shard 0:au  ] archival - upload_housekeeping_service.cc:354 - Running job purger with quota 4
INFO  2023-10-10 07:52:48,747 [shard 0:au  ] archival - purger.cc:350 - Running with 4 quota, 1 topic lifecycle markers
DEBUG 2023-10-10 07:52:48,747 [shard 0:au  ] archival - purger.cc:374 - Grace period for {{kafka}/{topic-jizgvjvxsg}} is still in effect for. Skipping scrub.
INFO  2023-10-10 07:52:48,747 [shard 0:au  ] archival - upload_housekeeping_service.cc:405 - housekeeping_workflow, transition to idle state from active, 1 jobs executed, 0 jobs failed. Housekeeping round lasted approx. 3 sec. Job execution time in the round: 0 sec
DEBUG 2023-10-10 07:52:48,747 [shard 0:au  ] archival - upload_housekeeping_service.cc:311 - housekeeping_workflow, BG job, state: idle, backlog: 1

and then no more until timeout, as if the scrubber stopped running

@andijcr andijcr added the sev/medium Bugs that do not meet criteria for high or critical, but are more severe than low. label Oct 13, 2023
@andijcr andijcr self-assigned this Oct 13, 2023
@andijcr
Copy link
Contributor

andijcr commented Oct 17, 2023

investigating some recent pr that might influence this

#13720
#13916
#13918
#13919

@andijcr
Copy link
Contributor

andijcr commented Oct 17, 2023

so the inconsistency is on partition 1.

partition 0 1 and 2 go through deletion @ 07:52:43

DEBUG 2023-10-10 07:52:43,894 [shard 0:main] cluster - topic_table.cc:166 - Created lifecycle marker for topic {{kafka}/{topic-jizgvjvxsg}} 30
TRACE 2023-10-10 07:52:43,894 [shard 0:main] cluster - controller_backend.cc:970 - [{kafka/topic-jizgvjvxsg/0}] (retry 0) executing operation: {type: deletion, ntp: {kafka/topic-jizgvjvxsg/0}, offset: 48, new_assignment: { id: 0, group_id: 1, replicas: {{node_id: 2, shard: 1}, {node_id: 4, shard: 1}, {node_id: 1, shard: 1}} }, previous_replica_set: {nullopt}, replica_revisions: {nullopt}}}
TRACE 2023-10-10 07:52:43,894 [shard 0:main] cluster - partition_leaders_table.cc:282 - removing {kafka/topic-jizgvjvxsg/0} with revision 30 matched by revision 48
TRACE 2023-10-10 07:52:43,894 [shard 0:main] cluster - controller_backend.cc:970 - [{kafka/topic-jizgvjvxsg/1}] (retry 0) executing operation: {type: deletion, ntp: {kafka/topic-jizgvjvxsg/1}, offset: 48, new_assignment: { id: 1, group_id: 2, replicas: {{node_id: 2, shard: 1}, {node_id: 3, shard: 1}, {node_id: 1, shard: 1}} }, previous_replica_set: {nullopt}, replica_revisions: {nullopt}}}
TRACE 2023-10-10 07:52:43,894 [shard 0:main] cluster - partition_leaders_table.cc:282 - removing {kafka/topic-jizgvjvxsg/1} with revision 30 matched by revision 48
TRACE 2023-10-10 07:52:43,894 [shard 0:main] cluster - controller_backend.cc:970 - [{kafka/topic-jizgvjvxsg/2}] (retry 0) executing operation: {type: deletion, ntp: {kafka/topic-jizgvjvxsg/2}, offset: 48, new_assignment: { id: 2, group_id: 3, replicas: {{node_id: 4, shard: 1}, {node_id: 2, shard: 0}, {node_id: 3, shard: 1}} }, previous_replica_set: {nullopt}, replica_revisions: {nullopt}}}
TRACE 2023-10-10 07:52:43,894 [shard 0:main] cluster - partition_leaders_table.cc:282 - removing {kafka/topic-jizgvjvxsg/2} with revision 30 matched by revision 48
DEBUG 2023-10-10 07:52:43,894 [shard 1:main] cluster - topic_table.cc:166 - Created lifecycle marker for topic {{kafka}/{topic-jizgvjvxsg}} 30
TRACE 2023-10-10 07:52:43,894 [shard 1:main] cluster - partition_leaders_table.cc:282 - removing {kafka/topic-jizgvjvxsg/0} with revision 30 matched by revision 48
TRACE 2023-10-10 07:52:43,894 [shard 1:main] cluster - partition_leaders_table.cc:282 - removing {kafka/topic-jizgvjvxsg/1} with revision 30 matched by revision 48
TRACE 2023-10-10 07:52:43,894 [shard 1:main] cluster - partition_leaders_table.cc:282 - removing {kafka/topic-jizgvjvxsg/2} with revision 30 matched by revision 48
TRACE 2023-10-10 07:52:43,894 [shard 1:main] cluster - controller_backend.cc:970 - [{kafka/topic-jizgvjvxsg/0}] (retry 0) executing operation: {type: deletion, ntp: {kafka/topic-jizgvjvxsg/0}, offset: 48, new_assignment: { id: 0, group_id: 1, replicas: {{node_id: 2, shard: 1}, {node_id: 4, shard: 1}, {node_id: 1, shard: 1}} }, previous_replica_set: {nullopt}, replica_revisions: {nullopt}}}
TRACE 2023-10-10 07:52:43,894 [shard 1:main] cluster - controller_backend.cc:970 - [{kafka/topic-jizgvjvxsg/1}] (retry 0) executing operation: {type: deletion, ntp: {kafka/topic-jizgvjvxsg/1}, offset: 48, new_assignment: { id: 1, group_id: 2, replicas: {{node_id: 2, shard: 1}, {node_id: 3, shard: 1}, {node_id: 1, shard: 1}} }, previous_replica_set: {nullopt}, replica_revisions: {nullopt}}}
TRACE 2023-10-10 07:52:43,894 [shard 1:main] cluster - controller_backend.cc:970 - [{kafka/topic-jizgvjvxsg/2}] (retry 0) executing operation: {type: deletion, ntp: {kafka/topic-jizgvjvxsg/2}, offset: 48, new_assignment: { id: 2, group_id: 3, replicas: {{node_id: 4, shard: 1}, {node_id: 2, shard: 0}, {node_id: 3, shard: 1}} }, previous_replica_set: {nullopt}, replica_revisions: {nullopt}}}
INFO  2023-10-10 07:52:43,894 [shard 0:main] cluster - topic_updates_dispatcher.cc:76 - dispatched to cores: cluster::errc:0 (local delete true)
TRACE 2023-10-10 07:52:43,894 [shard 0:main] cluster - topic_updates_dispatcher.cc:85 - Deallocating ntp: {{kafka}/{topic-jizgvjvxsg}}, in_progress ops: {}
TRACE 2023-10-10 07:52:43,894 [shard 0:main] cluster - allocation_node.cc:122 - deallocation [node: 2, core: 1], total allocated: 2
TRACE 2023-10-10 07:52:43,894 [shard 0:main] cluster - allocation_node.cc:122 - deallocation [node: 4, core: 1], total allocated: 2
TRACE 2023-10-10 07:52:43,894 [shard 0:main] cluster - allocation_node.cc:122 - deallocation [node: 1, core: 1], total allocated: 2
TRACE 2023-10-10 07:52:43,894 [shard 0:main] cluster - topic_updates_dispatcher.cc:546 - Deallocated ntp: {kafka/topic-jizgvjvxsg/0}, current assignment: {node_id: 2, shard: 1},{node_id: 4, shard: 1},{node_id: 1, shard: 1}, to_delete: {node_id: 2, shard: 1},{node_id: 4, shard: 1},{node_id: 1, shard: 1}
TRACE 2023-10-10 07:52:43,894 [shard 0:main] cluster - allocation_node.cc:122 - deallocation [node: 2, core: 1], total allocated: 1
TRACE 2023-10-10 07:52:43,894 [shard 0:main] cluster - allocation_node.cc:122 - deallocation [node: 3, core: 1], total allocated: 2
TRACE 2023-10-10 07:52:43,894 [shard 0:main] cluster - allocation_node.cc:122 - deallocation [node: 1, core: 1], total allocated: 1
TRACE 2023-10-10 07:52:43,894 [shard 0:main] cluster - topic_updates_dispatcher.cc:546 - Deallocated ntp: {kafka/topic-jizgvjvxsg/1}, current assignment: {node_id: 2, shard: 1},{node_id: 3, shard: 1},{node_id: 1, shard: 1}, to_delete: {node_id: 2, shard: 1},{node_id: 3, shard: 1},{node_id: 1, shard: 1}
TRACE 2023-10-10 07:52:43,894 [shard 0:main] cluster - allocation_node.cc:122 - deallocation [node: 4, core: 1], total allocated: 1
TRACE 2023-10-10 07:52:43,894 [shard 0:main] cluster - allocation_node.cc:122 - deallocation [node: 2, core: 0], total allocated: 0
TRACE 2023-10-10 07:52:43,894 [shard 0:main] cluster - allocation_node.cc:122 - deallocation [node: 3, core: 1], total allocated: 1
TRACE 2023-10-10 07:52:43,894 [shard 0:main] cluster - topic_updates_dispatcher.cc:546 - Deallocated ntp: {kafka/topic-jizgvjvxsg/2}, current assignment: {node_id: 4, shard: 1},{node_id: 2, shard: 0},{node_id: 3, shard: 1}, to_delete: {node_id: 4, shard: 1},{node_id: 2, shard: 0},{node_id: 3, shard: 1}
DEBUG 2023-10-10 07:52:43,894 [shard 0:main] cluster - controller_backend.cc:906 - [{kafka/topic-jizgvjvxsg/0}] (retry 0) finished operation: {type: deletion, revision: 48, assignment: { id: 0, group_id: 1, replicas: {{node_id: 2, shard: 1}, {node_id: 4, shard: 1}, {node_id: 1, shard: 1}} }, previous assignment: {nullopt}}
DEBUG 2023-10-10 07:52:43,894 [shard 0:main] cluster - controller_backend.cc:906 - [{kafka/topic-jizgvjvxsg/1}] (retry 0) finished operation: {type: deletion, revision: 48, assignment: { id: 1, group_id: 2, replicas: {{node_id: 2, shard: 1}, {node_id: 3, shard: 1}, {node_id: 1, shard: 1}} }, previous assignment: {nullopt}}
DEBUG 2023-10-10 07:52:43,894 [shard 0:main] cluster - controller_backend.cc:906 - [{kafka/topic-jizgvjvxsg/2}] (retry 0) finished operation: {type: deletion, revision: 48, assignment: { id: 2, group_id: 3, replicas: {{node_id: 4, shard: 1}, {node_id: 2, shard: 0}, {node_id: 3, shard: 1}} }, previous assignment: {nullopt}}
TRACE 2023-10-10 07:52:43,894 [shard 0:main] storage - readers_cache.cc:327 - {redpanda/controller/0} - removing reader: [0,48] lower_bound: 49
TRACE 2023-10-10 07:52:43,895 [shard 1:main] cluster - shard_table.h:134 - [{kafka/topic-jizgvjvxsg/0}] erasing from shard table, rev: 48
TRACE 2023-10-10 07:52:43,895 [shard 1:main] cluster - shard_table.h:134 - [{kafka/topic-jizgvjvxsg/2}] erasing from shard table, rev: 48
DEBUG 2023-10-10 07:52:43,895 [shard 1:main] cluster - controller_backend.cc:906 - [{kafka/topic-jizgvjvxsg/1}] (retry 0) finished operation: {type: deletion, revision: 48, assignment: { id: 1, group_id: 2, replicas: {{node_id: 2, shard: 1}, {node_id: 3, shard: 1}, {node_id: 1, shard: 1}} }, previous assignment: {nullopt}}
TRACE 2023-10-10 07:52:43,895 [shard 0:main] cluster - shard_table.h:134 - [{kafka/topic-jizgvjvxsg/0}] erasing from shard table, rev: 48
TRACE 2023-10-10 07:52:43,895 [shard 0:main] cluster - shard_table.h:134 - [{kafka/topic-jizgvjvxsg/2}] erasing from shard table, rev: 48

then files for 0 are removed @ 07:52:43

INFO  2023-10-10 07:52:43,982 [shard 1:main] storage - log_manager.cc:546 - Asked to remove: {kafka/topic-jizgvjvxsg/0}

files for 1 are removed @ 07:53:18

INFO  2023-10-10 07:53:18,690 [shard 1:main] storage - log_manager.cc:546 - Asked to remove: {kafka/topic-jizgvjvxsg/2}

and the timeout (30 seconds) expires and the node is terminated

so i'm leaning on recent changes making this process take longer overall. The question is if bumping the timeout is the right thing to do

@andijcr
Copy link
Contributor

andijcr commented Oct 19, 2023

Increasing the timeout works, but the original failing result is kinda weird, considering the code.

the delete_partition op should delete all files then launch a background thread to finalize the remote partition, and this takes a sub-second. then the op should immediately process the next delete_partition op.
this is what is done on a normal test run.

in a failing test run instead, I see that after the first delete_operation op the code does nothing on the shard and waits for housekeeping to kick in to process the next, which would mean that either the topic_delta deque contains only one element each time, or that the previous delete_partion op generated an exception (but i don't see traces of this in the log).

edit: once local delete of a partition is requested, _pending_deltas gets all the partition assignments for a topic, and then notifies the fetch_delta() watcher

@andijcr
Copy link
Contributor

andijcr commented Oct 19, 2023

so the sequence of events from https://ci-artifacts.dev.vectorized.cloud/redpanda/38604/018b186e-7e74-47df-853e-7211a9498a1e/vbuild/ducktape/results/2023-10-10--001/TopicDeleteCloudStorageTest/topic_delete_unavailable_test/cloud_storage_type=CloudStorageType.ABS/143/
seems to be

  • topic deletion is requested
  • the request eventually gets propagated to topic_table::do_local_delete, which in turns fills _pending_deltas with all the partitions assignments, and notify the watchers
  • controller_backend::fetch_deltas is a watcher, it receives a copy of _pending_deltas, saves it, and launches reconcile_ntp that for each delta (partition) deletes local files and performs finalization of the remote partition.

In the regular case, all the partitions get processed one after another, but in the slow case only one partition gets processed immediately, and the other after 30 seconds.
Since the partitions move together, it implies that an exception stops the loop after the first. But controller_backend::housekeeping runs every 1s and performs the same reconcile_ntp op, so it would clear the rest quickly.
So this could mean that the partitions are removed separately, with a delay between op of ~30secs, but I don't see a mechanism for this.

@andijcr
Copy link
Contributor

andijcr commented Oct 19, 2023

these lines are after we call Stopping archiver on partition 2

TRACE 2023-10-10 07:52:51,833 [shard 1:au  ] archival - [fiber9 kafka/topic-jizgvjvxsg/2] - ntp_archiver_service.cc:606 - upload_until_term_change: got units (current 0), paused=false
DEBUG 2023-10-10 07:52:51,833 [shard 1:au  ] archival - [fiber9 kafka/topic-jizgvjvxsg/2] - ntp_archiver_service.cc:1976 - Uploading next candidates called for {kafka/topic-jizgvjvxsg/2}
DEBUG 2023-10-10 07:52:51,833 [shard 1:au  ] archival - [fiber9 kafka/topic-jizgvjvxsg/2] - ntp_archiver_service.cc:1623 - scheduling uploads, start offset: 74, last offset: 117, upload kind: non-compacted, uploads remaining: 4
DEBUG 2023-10-10 07:52:51,833 [shard 1:au  ] archival - archival_policy.cc:114 - Upload policy for {kafka/topic-jizgvjvxsg/2} invoked, start offset: 74
DEBUG 2023-10-10 07:52:51,833 [shard 1:au  ] archival - archival_policy.cc:185 - Upload policy for {kafka/topic-jizgvjvxsg/2}: can't find candidate, only non-data batches to upload (kafka start_offset: 72, kafka last_stable_offset: 72)
DEBUG 2023-10-10 07:52:51,833 [shard 1:au  ] archival - [fiber9 kafka/topic-jizgvjvxsg/2] - ntp_archiver_service.cc:1464 - upload candidate not found, start_upload_offset: 74, last_stable_offset: 117
DEBUG 2023-10-10 07:52:51,833 [shard 1:au  ] archival - [fiber9 kafka/topic-jizgvjvxsg/2] - ntp_archiver_service.cc:1679 - scheduled 0 uploads for upload kind: non-compacted, uploads remaining: 4
DEBUG 2023-10-10 07:52:51,833 [shard 1:au  ] archival - [fiber9 kafka/topic-jizgvjvxsg/2] - ntp_archiver_service.cc:1702 - no uploads started for segment upload kind: non-compacted, returning
DEBUG 2023-10-10 07:52:51,833 [shard 1:au  ] archival - [fiber9 kafka/topic-jizgvjvxsg/2] - ntp_archiver_service.cc:1702 - no uploads started for segment upload kind: compacted, returning
TRACE 2023-10-10 07:52:51,833 [shard 1:au  ] archival - [fiber9 kafka/topic-jizgvjvxsg/2] - ntp_archiver_service.cc:1967 - Segment uploads complete: 0 successful uploads
DEBUG 2023-10-10 07:52:51,833 [shard 1:au  ] archival - [fiber9 kafka/topic-jizgvjvxsg/2] - ntp_archiver_service.cc:949 - [upload_loop_epilogue] Manifest is clean, skipping upload
TRACE 2023-10-10 07:52:51,833 [shard 1:au  ] archival - [fiber9 kafka/topic-jizgvjvxsg/2] - ntp_archiver_service.cc:690 - upload_until_term_change: released units (current 1)
TRACE 2023-10-10 07:52:51,833 [shard 1:au  ] archival - [fiber9 kafka/topic-jizgvjvxsg/2] - ntp_archiver_service.cc:703 - Nothing to upload, applying backoff algorithm

@andijcr
Copy link
Contributor

andijcr commented Oct 19, 2023

demarcation log line: partition::stop

DEBUG 2023-10-10 07:53:18,680 [shard 1:main] cluster - partition.cc:579 - Stopped partition {kafka/topic-jizgvjvxsg/2}

@andijcr
Copy link
Contributor

andijcr commented Oct 23, 2023

current direction is unserstanding who is holding this fiber active

TRACE 2023-10-10 07:53:01,847 [shard 1:au  ] archival - [fiber9 kafka/topic-jizgvjvxsg/2] - ntp_archiver_service.cc:606 - upload_until_term_change: got units (current 0), paused=false
DEBUG 2023-10-10 07:53:01,848 [shard 1:au  ] archival - [fiber9 kafka/topic-jizgvjvxsg/2] - ntp_archiver_service.cc:1976 - Uploading next candidates called for {kafka/topic-jizgvjvxsg/2}
DEBUG 2023-10-10 07:53:01,848 [shard 1:au  ] archival - [fiber9 kafka/topic-jizgvjvxsg/2] - ntp_archiver_service.cc:1623 - scheduling uploads, start offset: 74, last offset: 117, upload kind: non-compacted, uploads remaining: 4
DEBUG 2023-10-10 07:53:01,848 [shard 1:au  ] archival - archival_policy.cc:114 - Upload policy for {kafka/topic-jizgvjvxsg/2} invoked, start offset: 74
DEBUG 2023-10-10 07:53:01,848 [shard 1:au  ] archival - archival_policy.cc:185 - Upload policy for {kafka/topic-jizgvjvxsg/2}: can't find candidate, only non-data batches to upload (kafka start_offset: 72, kafka last_stable_offset: 72)
DEBUG 2023-10-10 07:53:01,848 [shard 1:au  ] archival - [fiber9 kafka/topic-jizgvjvxsg/2] - ntp_archiver_service.cc:1464 - upload candidate not found, start_upload_offset: 74, last_stable_offset: 117
DEBUG 2023-10-10 07:53:01,848 [shard 1:au  ] archival - [fiber9 kafka/topic-jizgvjvxsg/2] - ntp_archiver_service.cc:1679 - scheduled 0 uploads for upload kind: non-compacted, uploads remaining: 4
DEBUG 2023-10-10 07:53:01,848 [shard 1:au  ] archival - [fiber9 kafka/topic-jizgvjvxsg/2] - ntp_archiver_service.cc:1702 - no uploads started for segment upload kind: non-compacted, returning
DEBUG 2023-10-10 07:53:01,848 [shard 1:au  ] archival - [fiber9 kafka/topic-jizgvjvxsg/2] - ntp_archiver_service.cc:1702 - no uploads started for segment upload kind: compacted, returning
TRACE 2023-10-10 07:53:01,848 [shard 1:au  ] archival - [fiber9 kafka/topic-jizgvjvxsg/2] - ntp_archiver_service.cc:1967 - Segment uploads complete: 0 successful uploads
DEBUG 2023-10-10 07:53:01,848 [shard 1:au  ] archival - [fiber9 kafka/topic-jizgvjvxsg/2] - ntp_archiver_service.cc:949 - [upload_loop_epilogue] Manifest is clean, skipping upload
TRACE 2023-10-10 07:53:01,848 [shard 1:au  ] archival - [fiber9 kafka/topic-jizgvjvxsg/2] - ntp_archiver_service.cc:690 - upload_until_term_change: released units (current 1)
TRACE 2023-10-10 07:53:01,848 [shard 1:au  ] archival - [fiber9 kafka/topic-jizgvjvxsg/2] - ntp_archiver_service.cc:703 - Nothing to upload, applying backoff algorithm

well after the archival for partition 2 is stopped. it's a symptom of ntp_archiver::stop not having completed, and this is holding data deletion. for the other partition/nodes, the archival is stop is immediate

@andijcr
Copy link
Contributor

andijcr commented Oct 24, 2023

some other comments:
redpanda_13_deadlock.zip

npt_archiver::stop gets called by partition::stop and it completes, because in the log we see this
DEBUG 2023-10-10 07:52:44,045 [shard 1:main] cluster - partition.cc:558 - Stopping archiver on partition: {kafka/topic-jizgvjvxsg/2}
before calling stop on the archiver and we see this
DEBUG 2023-10-10 07:53:18,680 [shard 1:main] cluster - partition.cc:579 - Stopped partition {kafka/topic-jizgvjvxsg/2}

that comes after awaiting ntp_archiver::stop(),
but the process took >30 seconds (failing the test)

while for partition 0 it's immediate

DEBUG 2023-10-10 07:52:43,971 [shard 1:main] cluster - partition.cc:558 - Stopping archiver on partition: {kafka/topic-jizgvjvxsg/0}
DEBUG 2023-10-10 07:52:43,971 [shard 1:main] cluster - partition.cc:579 - Stopped partition {kafka/topic-jizgvjvxsg/0}

the difference is that rp_13 is leader for partition 2 and not 0, so it could be beneficial to look to the node leader for partition 0

given that fiber9 kafka/topic-jizgvjvxsg/2 is alive for a long period of time, something might be holding the npt_archiver gate, or blocking on the scrubber and preventing the abort source to fire

the scrubber prints this before waiting on gate.close()

DEBUG 2023-10-10 07:52:44,045 [shard 1:main] cluster - partition.cc:558 - Stopping archiver on partition: {kafka/topic-jizgvjvxsg/2}
INFO  2023-10-10 07:52:44,045 [shard 1:main] archival - scrubber.cc:153 - Stopping scrubber (2)...

the (2) means that there are 2 holders for the gate, which seems strange. There should be at max 3 holders for the gate. 1 releases it immediately just after start, the other 2:

  • void scrubber::release() release 1 and should be called before by DEBUG 2023-10-10 07:52:44,045 [shard 1:main] archival - [fiber1] - upload_housekeeping_service.cc:190 - Deregistering job: scrubber
  • scrubber::run(retry_chain_node& rtc_node, run_quota_t quota) holds the gate and runs in a loop, but the period of gate holding is ended by this:
    DEBUG 2023-10-10 07:52:30,869 [shard 1:au ] archival - [fiber9 kafka/topic-jizgvjvxsg/2] - scrubber.cc:63 - Skipping cloud partition scrub: next scrub in 0m
    and it does not hold the gate across suspension points in shard 1, so it should not contribute for the count

So the count of 2 is not clear

Then this is the list of methods that hold ntp_archiver's gate

  1. upload_until_abort. Seems unlikely to hang, listens for abort source except for a section in https://github.com/redpanda-data/redpanda/blame/894cd42be66b6113419564e104c5ad571fa7f3f1/src/v/archival/ntp_archiver_service.cc#L696-L706 but these calls listen to the abort source
  2. download_manifest. not called
  3. upload_manifest. last log DEBUG 2023-10-10 07:52:30,499 [shard 1:au ] archival - [fiber9 kafka/topic-jizgvjvxsg/2] - ntp_archiver_service.cc:1035 - [upload_loop_epilogue] Uploading partition manifest, insync_offset=115, path={"60000000/meta/kafka/topic-jizgvjvxsg/2_30/manifest.bin"} so success, the function terminates and no gate is held DEBUG 2023-10-10 07:52:30,502 [shard 1:au ] cloud_storage - [fiber9~76~0|1|9997ms] - remote.cc:385 - Successfully uploaded partition manifest to {"60000000/meta/kafka/topic-jizgvjvxsg/2_30/manifest.bin"}
  4. get_aborted_transactions. no log in the code. rm_stm (should exist) holds a read_lock on a mutex that has a write lock taken during local state removal (check this). The lock protects rm_stm::do_aborted_transactions, but this should not hang. still a possible candidate. it's waited by ntp_archiver::schedule_uploads, but this terminates. it's called by upload, but i don't see this hanging.
  5. upload_tx. not called
  6. upload_next_candidate. (there is a mutex) last call: DEBUG 2023-10-10 07:53:11,852 [shard 1:au ] archival - [fiber9 kafka/topic-jizgvjvxsg/2] - ntp_archiver_service.cc:1976 - Uploading next candidates called for {kafka/topic-jizgvjvxsg/2} and ends with TRACE 2023-10-10 07:53:11,852 [shard 1:au ] archival - [fiber9 kafka/topic-jizgvjvxsg/2] - ntp_archiver_service.cc:1967 - Segment uploads complete: 0 successful uploads late but the function does not hang. might be useful to sequence what was called before this.
  7. maybe_truncate_manifest. not called
  8. find_reupload_candidate. not called
  9. upload it shouldn't and is connected either to the internal abort source or by adjacent_segment_merger's abort source, but it's already stopped
  10. prepare_transfer_leadership. not called

next step:
check the node leader for partition 0
check get_aborted_transactions if it can hangs due to the read lock

@andijcr
Copy link
Contributor

andijcr commented Oct 24, 2023

Thanks to @VladLazar for pointing out the source of
INFO 2023-10-10 07:52:44,045 [shard 1:main] archival - scrubber.cc:153 - Stopping scrubber (2)...

retry_chain_node anomaly_detection_rtc(5min, 100ms, &rtc_node);

this retry chain node is not aware of the abort source of the scrubber. a fiber from scrubber::run is created before scrubber::stop, it get's to hang on network operations due to the firewall, and holds the scrubber's gate. at shutdown, the retry_chain is destroyed and the partition deletion completes.

A quick solution is to have a finer-grained retry_chain_node for each network operation, and 30 secs of timeout instead of the current 5min.

another solution would be to modify retry_chain_node with the possibility to listen for multiple abort sources, with an or combinator, so as to be able to respond to scrubber::stop in a direct manner

@andijcr andijcr added sev/low Bugs which are non-functional paper cuts, e.g. typos, issues in log messages and removed sev/medium Bugs that do not meet criteria for high or critical, but are more severe than low. labels Oct 24, 2023
@andijcr
Copy link
Contributor

andijcr commented Oct 24, 2023

sev/low because with this finding we know that the hang is not due to a deadlock, it's a timeout of 5min holding the stop gate

@piyushredpanda
Copy link
Contributor

Thanks!
While sev/low is good, I'd love to see the fix as well :)

@VladLazar
Copy link
Contributor

Andrea and I discussed this today. I'll look into fixing this in the next few days since Andrea's gonna be out for the rest of the week.

@mmaslankaprv
Copy link
Member Author

appeared again in: #14995

@piyushredpanda
Copy link
Contributor

Not seen in at least two months, closing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cloud-storage Shadow indexing subsystem ci-failure kind/bug Something isn't working sev/low Bugs which are non-functional paper cuts, e.g. typos, issues in log messages
Projects
None yet
4 participants