
CI Failure (timeout) in test_archival_service_rpfixture #14254

Closed
VladLazar opened this issue Oct 18, 2023 · 12 comments · Fixed by #15169

Labels: area/cloud-storage (Shadow indexing subsystem), ci-failure, kind/bug (Something isn't working), rpunit (unit test ci-failure, not ducktape)

Comments

@VladLazar (Contributor)

https://buildkite.com/redpanda/redpanda/builds/39032#018b384f-4683-4253-9ec8-dc24f384f602

This happened on #13896, so it could be related to that, but @nvartolomei mentioned he's seen the issue previously.

VladLazar added the kind/bug, ci-failure, area/cloud-storage, and rpunit labels on Oct 18, 2023
@andijcr (Contributor) commented Oct 20, 2023

Maybe related to the partition::stop() block I'm seeing in #14065?

@VladLazar (Contributor, Author)

> Maybe related to the partition::stop() block I'm seeing in #14065?

Perhaps, but for #14065, it un-wedges itself eventually.

@andijcr (Contributor) commented Nov 2, 2023

As a datapoint: just observing the output in the terminal, I see the test hang here and then continue.

Likely from test_archival_policy_search_when_a_segment_is_compacted:

INFO  2023-11-02 15:49:38,772 [shard 0:main] client_pool - client_pool.cc:164 - Stopping client pool: 2
INFO  2023-11-02 15:49:38,772 [shard 0:main] client_pool - client_pool.cc:184 - Stopped client pool
INFO  2023-11-02 15:49:38,772 [shard 0:main] kafka - server.cc:280 - kafka rpc protocol - Stopping 1 listeners
INFO  2023-11-02 15:49:38,772 [shard 0:main] kafka - server.cc:292 - kafka rpc protocol - Shutting down 0 connections
INFO  2023-11-02 15:49:38,772 [shard 0:main] rpc - server.cc:280 - vectorized internal rpc protocol - Stopping 1 listeners
INFO  2023-11-02 15:49:38,772 [shard 0:main] rpc - server.cc:292 - vectorized internal rpc protocol - Shutting down 0 connections
INFO  2023-11-02 15:49:38,772 [shard 0:main] client_pool - client_pool.cc:189 - Shutting down client pool: 20
INFO  2023-11-02 15:49:38,772 [shard 0:main] client_pool - client_pool.cc:203 - Shut down of client pool complete
INFO  2023-11-02 15:49:38,772 [shard 0:main] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:266 - Stopping
INFO  2023-11-02 15:49:38,772 [shard 0:main] raft - [group_id:1, {kafka/test-topic/42}] consensus.cc:266 - Stopping
INFO  2023-11-02 15:49:41,601 [shard 0:main] cluster - partition_balancer_planner.cc:1528 - counts rebalancing objective in domain 0:      1 ->      1
INFO  2023-11-02 15:49:41,602 [shard 0:main] cluster - partition_balancer_planner.cc:1528 - counts rebalancing objective in domain -1:      1 ->      1
INFO  2023-11-02 15:49:41,602 [shard 0:main] cluster - partition_balancer_backend.cc:420 - last status: in_progress; violations: unavailable nodes: 0, full nodes: 0; nodes to rebalance count: 1; on demand rebalance requested: false; updates in progress: 0; action counts: reassignments: 0, cancellations: 0, failed: 0; counts rebalancing finished: true, force refresh health report: false
WARN  2023-11-02 15:49:41,602 [shard 0:main] cluster - partition_balancer_backend.cc:443 - submitting finish_reallocations for node 1 failed, error: raft protocol shutting down

@abhijat (Contributor) commented Nov 27, 2023

Ran this on repeat in CI, and in the debug build I did see it get blocked after one test failed:

t=1701077829694116: /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-03788b59bfc51b527-1/redpanda/vtools/src/v/archival/tests/ntp_archiver_test.cc(1780): fatal error: in "test_manifest_spillover": critical check uploaded.size() != 0 has failed
t=1701077829694116: INFO  2023-11-27 09:37:09,694 [shard 0:main] test - ntp_archiver_test.cc:1766 - Reconciling storage bucket
t=1701077829694116: test_archival_service_rpfixture: /vectorized/include/seastar/core/gate.hh:91: seastar::gate::~gate(): Assertion `!_count && "gate destroyed with outstanding requests"' failed.
t=1701077829694116: ==51==WARNING: ASan is ignoring requested __asan_handle_no_return: stack type: default top: 0x7f226d2bc000; bottom 0x7ffd974b2000; size: 0xffffff24d5e0a000 (-941304537088)
t=1701077829694116: False positive error reports may follow
t=1701077829694116: For details see https://github.com/google/sanitizers/issues/189
t=1701077829694116: unknown location(0): fatal error: in "test_manifest_spillover": signal: SIGABRT (application abort requested)
t=1701077829694116: /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-03788b59bfc51b527-1/redpanda/vtools/src/v/archival/tests/ntp_archiver_test.cc(1780): last checkpoint
t=1701077829694116: /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-03788b59bfc51b527-1/redpanda/vtools/src/v/archival/tests/ntp_archiver_test.cc(1797): Leaving test case "test_manifest_spillover"; testing time: 3501920us
t=1701077829694116: /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-03788b59bfc51b527-1/redpanda/vtools/src/v/archival/tests/ntp_archiver_test.cc(1802): Entering test case "test_upload_with_gap_blocked"

This seems to have gotten stuck at around 3:07 PM IST, after having started at 2:29 PM IST, so roughly 40 minutes in. The build was cancelled manually at 6 PM IST, and it did not appear to make any progress in the intervening 3.5 hours, at least going by the Buildkite logs.

test_manifest_spillover failed, and we then get a subsequent assertion from the gate; this should be fixed with a defer that cleans up if the test fails.

test_upload_with_gap_blocked is entered but no further logs are seen.
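
For context, the gate assertion above is Seastar's standard complaint when a seastar::gate is destroyed while operations are still holding it open. A minimal sketch of the gate lifecycle that avoids it (plain Seastar, not Redpanda code):

#include <chrono>
#include <seastar/core/app-template.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/gate.hh>
#include <seastar/core/sleep.hh>

// Minimal Seastar sketch of the gate lifecycle behind the assertion above.
seastar::future<> gate_lifecycle() {
    seastar::gate g;
    // A background operation enters the gate and leaves it when done.
    auto bg = seastar::with_gate(g, [] {
        return seastar::sleep(std::chrono::milliseconds(10));
    });
    // If `g` were destroyed here, with the sleep still in flight, its
    // destructor would assert: "gate destroyed with outstanding requests".
    co_await g.close();      // waits for every entered operation to leave
    co_await std::move(bg);  // already resolved once the gate is closed
}

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] { return gate_lifecycle(); });
}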

@abhijat (Contributor) commented Nov 27, 2023

Another Buildkite run, with just the debug build and all logs turned on, runs into some sort of internal error, but downloading the raw logs again shows the test run stuck at:

t=1701092691580116: DEBUG 2023-11-27 13:44:51,580 [shard 0:main] test - ntp_archiver_test.cc:1739 - apply spillover, spillover manifest size: 4096
t=1701092691580116: INFO  2023-11-27 13:44:51,580 [shard 0:main] test - ntp_archiver_test.cc:1763 - new_so: {nullopt}, new_kafka: {nullopt}, archive_so: -9223372036854775808, archive_kafka: -9223372036854775808, archive_clean: -9223372036854775808
t=1701092691580116: /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-03788b59bfc51b527-1/redpanda/vtools/src/v/archival/tests/ntp_archiver_test.cc(1780): fatal error: in "test_manifest_spillover": critical check uploaded.size() != 0 has failed
t=1701092691580116: INFO  2023-11-27 13:44:51,580 [shard 0:main] test - ntp_archiver_test.cc:1766 - Reconciling storage bucket
t=1701092691580116: test_archival_service_rpfixture: /vectorized/include/seastar/core/gate.hh:91: seastar::gate::~gate(): Assertion `!_count && "gate destroyed with outstanding requests"' failed.
t=1701092691580116: ==873==WARNING: ASan is ignoring requested __asan_handle_no_return: stack type: default top: 0x7f8a8deff000; bottom 0x7ffc5a511000; size: 0xffffff8e339ee000 (-488760221696)
t=1701092691580116: False positive error reports may follow
t=1701092691580116: For details see https://github.com/google/sanitizers/issues/189
t=1701092691580116: unknown location(0): fatal error: in "test_manifest_spillover": signal: SIGABRT (application abort requested)
t=1701092691580116: /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-03788b59bfc51b527-1/redpanda/vtools/src/v/archival/tests/ntp_archiver_test.cc(1780): last checkpoint
t=1701092691581116: /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-03788b59bfc51b527-1/redpanda/vtools/src/v/archival/tests/ntp_archiver_test.cc(1797): Leaving test case "test_manifest_spillover"; testing time: 3498975us
t=1701092691581116: /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-03788b59bfc51b527-1/redpanda/vtools/src/v/archival/tests/ntp_archiver_test.cc(1802): Entering test case "test_upload_with_gap_blocked"

Meanwhile, the Buildkite webpage is stuck at:

⚠️ Oops, something went wrong when we tried to fetch your log
We've been notified of this error, and we'll take a look at it very soon. In the mean time, perhaps try reloading the page?

@abhijat (Contributor) commented Nov 27, 2023

t=1701098806322116: DEBUG 2023-11-27 15:26:46,322 [shard 0:main] test - ntp_archiver_test.cc:1739 - apply spillover, spillover manifest size: 4096
t=1701098806322116: INFO  2023-11-27 15:26:46,322 [shard 0:main] test - ntp_archiver_test.cc:1763 - new_so: {nullopt}, new_kafka: {nullopt}, archive_so: -9223372036854775808, archive_kafka: -9223372036854775808, archive_clean: -9223372036854775808
t=1701098806322116: /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-03788b59bfc51b527-1/redpanda/vtools/src/v/archival/tests/ntp_archiver_test.cc(1780): fatal error: in "test_manifest_spillover": critical check uploaded.size() != 0 has failed
t=1701098806322116: INFO  2023-11-27 15:26:46,322 [shard 0:main] test - ntp_archiver_test.cc:1766 - Reconciling storage bucket
t=1701098806322116: test_archival_service_rpfixture: /vectorized/include/seastar/core/gate.hh:91: seastar::gate::~gate(): Assertion `!_count && "gate destroyed with outstanding requests"' failed.
t=1701098806322116: ==212==WARNING: ASan is ignoring requested __asan_handle_no_return: stack type: default top: 0x7f46772eb000; bottom 0x7ffd44a01000; size: 0xffffff49328ea000 (-785130807296)
t=1701098806322116: False positive error reports may follow
t=1701098806322116: For details see https://github.com/google/sanitizers/issues/189
t=1701098806322116: unknown location(0): fatal error: in "test_manifest_spillover": signal: SIGABRT (application abort requested)
t=1701098806322116: /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-03788b59bfc51b527-1/redpanda/vtools/src/v/archival/tests/ntp_archiver_test.cc(1780): last checkpoint
t=1701098806322116: /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-03788b59bfc51b527-1/redpanda/vtools/src/v/archival/tests/ntp_archiver_test.cc(1797): Leaving test case "test_manifest_spillover"; testing time: 3499867us
t=1701098806322116: /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-03788b59bfc51b527-1/redpanda/vtools/src/v/archival/tests/ntp_archiver_test.cc(1802): Entering test case "test_upload_with_gap_blocked"

Another run failed at the same spot, blocked for 35 minutes. This seems like at least one place where the test hangs; I will create a PR to fix it.

@abhijat (Contributor) commented Nov 28, 2023

Artificially failing the test at the same assertion point does not trigger the hang locally. There are the following areas to investigate here:

  1. fixing the test failure itself
  2. cleaning up if the test does fail, e.g. by shutting down the archiver
  3. determining whether it is this test that hangs or the next one (test_upload_with_gap_blocked)

@abhijat (Contributor) commented Nov 28, 2023

The test hang in CI (when it fails) shows a different set of errors than when the test is failed locally (artificially, using a bogus comparison, etc.).

In CI (the gate-related assert and the abort request are visible):

t=1701077829694116: /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-03788b59bfc51b527-1/redpanda/vtools/src/v/archival/tests/ntp_archiver_test.cc(1780): fatal error: in "test_manifest_spillover": critical check uploaded.size() != 0 has failed
t=1701077829694116: INFO  2023-11-27 09:37:09,694 [shard 0:main] test - ntp_archiver_test.cc:1766 - Reconciling storage bucket
t=1701077829694116: test_archival_service_rpfixture: /vectorized/include/seastar/core/gate.hh:91: seastar::gate::~gate(): Assertion `!_count && "gate destroyed with outstanding requests"' failed.
t=1701077829694116: ==51==WARNING: ASan is ignoring requested __asan_handle_no_return: stack type: default top: 0x7f226d2bc000; bottom 0x7ffd974b2000; size: 0xffffff24d5e0a000 (-941304537088)
t=1701077829694116: False positive error reports may follow
t=1701077829694116: For details see https://github.com/google/sanitizers/issues/189
t=1701077829694116: unknown location(0): fatal error: in "test_manifest_spillover": signal: SIGABRT (application abort requested)
t=1701077829694116: /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-03788b59bfc51b527-1/redpanda/vtools/src/v/archival/tests/ntp_archiver_test.cc(1780): last checkpoint

Locally:

INFO  2023-11-28 10:56:25,963 [shard 0:main] test - ntp_archiver_test.cc:1769 - Found /10000000/meta/kafka/test-topic/42_0/manifest.bin.514.770.514.771.514.770
INFO  2023-11-28 10:56:25,964 [shard 0:main] test - ntp_archiver_test.cc:1769 - Found /10000000/meta/kafka/test-topic/42_0/manifest.bin.771.1027.771.1028.771.1027
/development/redpanda/src/v/archival/tests/ntp_archiver_test.cc(1780): fatal error: in "test_manifest_spillover": critical check uploaded.size() == 0 has failed
INFO  2023-11-28 10:56:25,966 [shard 0:main] client_pool - client_pool.cc:189 - Shutting down client pool: 2
INFO  2023-11-28 10:56:25,966 [shard 0:main] client_pool - client_pool.cc:203 - Shut down of client pool complete
INFO  2023-11-28 10:56:25,969 [shard 0:main] client_pool - client_pool.cc:164 - Stopping client pool: 2

In this test, we add a set of segments to the archival STM before triggering spillover. For each added segment we should see a corresponding log line. In CI, when the test fails, no such logs appear:

t=1701077829693116: DEBUG 2023-11-27 09:37:09,692 [shard 0:main] test - ntp_archiver_test.cc:1700 - stm add segments
t=1701077829693116: DEBUG 2023-11-27 09:37:09,693 [shard 0:main] test - ntp_archiver_test.cc:1710 - stm add segments completed
t=1701077829693116: DEBUG 2023-11-27 09:37:09,693 [shard 0:main] test - ntp_archiver_test.cc:1715 - archival metadata stm last offset: 0
...
t=1701077829694116: INFO  2023-11-27 09:37:09,694 [shard 0:main] test - ntp_archiver_test.cc:1763 - new_so: {nullopt}, new_kafka: {nullopt}, archive_so: -9223372036854775808, archive_kafka: -9223372036854775808, archive_clean: -9223372036854775808

The updated offsets (all stuck at the int64 minimum, i.e. unset) also indicate that no segments were added. When the test is run locally, these logs are visible:

INFO  2023-11-28 10:56:23,521 [shard 0:main] cluster - ntp: {kafka/test-topic/42} - archival_metadata_stm.cc:941 - new remote segment added (name: {0-1-v1.log}, meta: {is_compacted: false, size_bytes: 1, base_offset: 0, committed_offset: 0, base_timestamp: {timestamp: 0}, max_timestamp: {timestamp: 0}, delta_offset: 0, ntp_revision: 0, archiver_term: 1, segment_term: 1, delta_offset_end: 0, sname_format: {v1}, metadata_size_hint: 0}remote start_offset: 0, last_offset: 1536, highest_producer_id: -9223372036854775808)
INFO  2023-11-28 10:56:23,522 [shard 0:main] cluster - ntp: {kafka/test-topic/42} - archival_metadata_stm.cc:941 - new remote segment added (name: {1-1-v1.log}, meta: {is_compacted: false, size_bytes: 1, base_offset: 1, committed_offset: 1, base_timestamp: {timestamp: 1}, max_timestamp: {timestamp: 1}, delta_offset: 0, ntp_revision: 0, archiver_term: 1, segment_term: 1, delta_offset_end: 0, sname_format: {v1}, metadata_size_hint: 0}remote start_offset: 0, last_offset: 1536, highest_producer_id: -9223372036854775808)
INFO  2023-11-28 10:56:23,522 [shard 0:main] cluster - ntp: {kafka/test-topic/42} - archival_metadata_stm.cc:941 - new remote segment added (name: {2-1-v1.log}, meta: {is_compacted: false, size_bytes: 1, base_offset: 2, committed_offset: 2, base_timestamp: {timestamp: 2}, max_timestamp: {timestamp: 2}, delta_offset: 0, ntp_revision: 0, archiver_term: 1, segment_term: 1, delta_offset_end: 0, sname_format: {v1}, metadata_size_hint: 0}remote start_offset: 0, last_offset: 1536, highest_producer_id: -9223372036854775808)
INFO  2023-11-28 10:56:23,522 [shard 0:main] cluster - ntp: {kafka/test-topic/42} - archival_metadata_stm.cc:941 - new remote segment added (name: {3-1-v1.log}, meta: {is_compacted: false, size_bytes: 1, base_offset: 3, committed_offset: 3, base_timestamp: {timestamp: 3}, max_timestamp: {timestamp: 3}, delta_offset: 0, ntp_revision: 0, archiver_term: 1, segment_term: 1, delta_offset_end: 0, sname_format: {v1}, metadata_size_hint: 0}remote start_offset: 0, last_offset: 1536, highest_producer_id: -9223372036854775808)
...
INFO  2023-11-28 10:56:25,959 [shard 0:main] test - ntp_archiver_test.cc:1763 - new_so: {1028}, new_kafka: {1028}, archive_so: 0, archive_kafka: 0, archive_clean: 0

To recreate the same scenario (no segments added to the STM), I removed the part that adds segments to the STM; the test then hangs locally just as in CI:

INFO  2023-11-28 11:25:47,664 [shard 0:main] raft - [group_id:1, {kafka/test-topic/42}] vote_stm.cc:281 - became the leader term: 2
DEBUG 2023-11-28 11:25:47,679 [shard 0:main] test - ntp_archiver_test.cc:1700 - stm add segments
DEBUG 2023-11-28 11:25:47,679 [shard 0:main] test - ntp_archiver_test.cc:1710 - stm add segments completed
DEBUG 2023-11-28 11:25:47,679 [shard 0:main] test - ntp_archiver_test.cc:1715 - archival metadata stm last offset: 0
TRACE 2023-11-28 11:25:47,679 [shard 0:main] features - feature_table.cc:493 - Feature cloud_storage_scrubbing already active
DEBUG 2023-11-28 11:25:47,680 [shard 0:main] archival - ntp_archiver_service.cc:257 - created ntp_archiver {kafka/test-topic/42} in term 2
DEBUG 2023-11-28 11:25:47,680 [shard 0:main] test - ntp_archiver_test.cc:1739 - apply spillover, spillover manifest size: 4096
/development/redpanda/src/v/archival/tests/ntp_archiver_test.cc(1780): fatal error: in "test_manifest_spillover": critical check uploaded.size() != 0 has failed
DEBUG 2023-11-28 11:25:47,680 [shard 0:main] archival - [fiber96 kafka/test-topic/42] - ntp_archiver_service.cc:2450 - Manifest size: 312, manifest size limit (x2): 8192
INFO  2023-11-28 11:25:47,680 [shard 0:main] test - ntp_archiver_test.cc:1763 - new_so: {nullopt}, new_kafka: {nullopt}, archive_so: -9223372036854775808, archive_kafka: -9223372036854775808, archive_clean: -9223372036854775808
unknown location(0): fatal error: in "test_manifest_spillover": signal: SIGABRT (application abort requested)
/development/redpanda/src/v/archival/tests/ntp_archiver_test.cc(1780): last checkpoint
/development/redpanda/src/v/archival/tests/ntp_archiver_test.cc(1797): Leaving test case "test_manifest_spillover"; testing time: 2833986us
/development/redpanda/src/v/archival/tests/ntp_archiver_test.cc(1802): Entering test case "test_upload_with_gap_blocked"
INFO  2023-11-28 11:25:47,680 [shard 0:main] test - ntp_archiver_test.cc:1766 - Reconciling storage bucket
test_archival_service_rpfixture: /development/redpanda/vbuild/debug/clang/rp_deps_install/include/seastar/core/gate.hh:91: seastar::gate::~gate(): Assertion `!_count && "gate destroyed with outstanding requests"' failed.
==409617==WARNING: ASan is ignoring requested __asan_handle_no_return: stack type: default top: 0x7f16eb9bc000; bottom 0x7fff34ac0000; size: 0xffffff17b6efc000 (-997658214400)
False positive error reports may follow
For details see https://github.com/google/sanitizers/issues/189

The test run hangs at this point, so the issue seems closely related to the STM state.

@abhijat (Contributor) commented Nov 28, 2023

The assertion that hangs the test run comes from the scrubber inside the archiver being destroyed; the scrubber seems to have some requests in flight when its gate is destroyed. Adding a defer to stop the archiver seems to fix the hang (see the sketch below), but why the original test, test_manifest_spillover, sometimes fails to add segments remains to be seen. That can perhaps be tracked in a separate ticket.
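
A minimal sketch of the kind of defer-based cleanup being described, using a toy stand-in for the archiver/scrubber; names such as toy_archiver are illustrative only, and this is not the actual patch:

#include <chrono>
#include <seastar/core/app-template.hh>
#include <seastar/core/gate.hh>
#include <seastar/core/sleep.hh>
#include <seastar/core/thread.hh>
#include <seastar/util/defer.hh>

// Toy stand-in for the archiver: it owns a gate with a background fiber,
// mimicking the scrubber request that is still in flight when the gate dies.
struct toy_archiver {
    seastar::gate _gate;
    void start() {
        // Background work held open by the gate.
        (void)seastar::with_gate(_gate, [] {
            return seastar::sleep(std::chrono::milliseconds(100));
        });
    }
    seastar::future<> stop() { return _gate.close(); }
};

// Runs inside a seastar::thread (like the fixture tests), so .get() is allowed.
void test_body() {
    toy_archiver archiver;
    archiver.start();
    // The deferred action runs on scope exit whether the test passes or a
    // failed assertion throws, so the gate is always closed before the
    // archiver is destroyed.
    auto stop_archiver = seastar::defer([&]() noexcept {
        try {
            archiver.stop().get();
        } catch (...) {
            // Best-effort cleanup in a test; ignore shutdown errors.
        }
    });
    // ... test assertions would go here; a thrown failure no longer leaves
    // the gate with outstanding requests.
}

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] {
        return seastar::async([] { test_body(); });
    });
}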

@dotnwat (Member) commented Dec 6, 2023

I see this in #15344; it looks like the same thing.

dotnwat reopened this Dec 6, 2023
@abhijat (Contributor) commented Feb 12, 2024

> I see this in #15344; it looks like the same thing.

It looks like a different issue:

t=1701892909662/var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-0d3b71f9868f3b66e-1/redpanda/redpanda/src/v/archival/tests/ntp_archiver_test.cc(1802): fatal error: in "test_manifest_spillover": critical check uploaded.size() != 0 has failed

@piyushredpanda (Contributor)

This issue hasn't recurred for more than 2 months; closing.
