CI Failure (timeout) in test_archival_service_rpfixture
#14254
Comments
Maybe related to the partition::stop() block I'm seeing in #14065? |
As a data point: just watching the output in the terminal, I see the test hang here and then continue, likely from test_archival_policy_search_when_a_segment_is_compacted
|
I ran this on repeat in CI, and in the debug build I did see it get blocked after one test failed:
It seems to have got stuck at around 3:07 PM IST, roughly 40 minutes after it started at 2:29 PM IST. The build was cancelled manually at 6 PM IST, so it did not appear to progress for the intervening period of nearly 3 hours, at least in the Buildkite logs.
|
Another Buildkite run, with just the debug build and all logs turned on, ran into some sort of internal error, but downloading the raw logs again shows the test run stuck at:
While the Buildkite webpage is stuck at:
|
Another run failed at the same spot - blocked for 35 minutes. This seems like at least one spot where the test hangs. Will create a PR to fix this. |
Artificially failing the test at the same assertion point does not trigger the test hang locally. There are the following areas to investigate here:
|
When the test hangs in CI (after a failure), it shows a different set of errors than when it fails locally (artificially failed using a bogus comparison, etc.). In CI, the gate-related assert and the request to abort are visible:
locally:
In this test we add a set of segments to the archival STM before triggering spillover. For each added segment we should see some log lines. In CI, when the test fails, no such logs are seen:
The updated offsets also indicate that no segments were added. When the test is run locally, these logs are visible:
To recreate the same scenario (no segments added to the STM), I removed the part of the test that adds segments to the STM; the test then hangs locally just like in CI:
The test run hangs at this point, so the issue seems closely related to the STM state. |
The assertion that hangs the test run comes from the scrubber in the archiver being destroyed; the scrubber seems to have some requests in flight when the gate is destroyed. Adding a |
I see this in #15344; it looks like the same thing. |
It looks like a different issue:
|
This issue hasn't reoccurred for more than 2 months; closing. |
https://buildkite.com/redpanda/redpanda/builds/39032#018b384f-4683-4253-9ec8-dc24f384f602
This happened on #13896, so it could be related to that, but @nvartolomei mentioned he's seen the issue previously.