CI Failure (critical check f.available() has failed) in `test_concurrent_append_flush` #13035

Lazin · 2023-08-28T07:46:49Z

The test_concurrent_append_flush test from the storage_single_thread_rpunit has failed.

https://buildkite.com/redpanda/redpanda/builds/35783#018a3818-bc30-40c6-8b1e-d45bc58308b6/6-5451

The error message:

log_segment_appender_test.cc(355): fatal error: in "test_concurrent_append_flush": critical check f.available() has failed


andijcr · 2023-09-21T13:59:29Z

can't reproduce locally, but it's kind of a fuzz test so it makes sense that the behavior is difficult to reproduce.

the comment in the test states:

    // now we expect all the prior flush futures to be available
    // we don't guarantee this is in the API currently but it is how it
    // works currently and we might as well assert it
    for (auto& f : futs) {
        BOOST_REQUIRE(f.available());
        f.get(); // propagate any exception
    }

so it might be a symptom of an underlying bug in segment_appender (the source of these futures).

will include further logging to generate a bit more context around the failure.

update:

also a check failed:

/var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-0c2b37205f0b6375f-1/redpanda/redpanda/src/v/storage/tests/log_segment_appender_test.cc(373): Entering test case "test_concurrent_append_flush"
/var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-0c2b37205f0b6375f-1/redpanda/redpanda/src/v/storage/tests/log_segment_appender_test.cc(348): error: in "test_concurrent_append_flush": check access(appender).inflight_dispatched() == 0 has failed [1 != 0]
fatal error: in "test_concurrent_append_flush": critical check f.available() has failed

so this error makes a bit more sense

piyushredpanda · 2023-09-21T20:23:19Z

Do we not have a fixed seed for the fuzz? (I am assuming that fuzz is driven off a seed for reproducibility)

andijcr · 2023-09-29T18:00:03Z

from https://buildkite.com/redpanda/redpanda/builds/37956#018ae10e-3932-417e-964d-8b4bf1aa5355
(obviously it didn't hit the new context)

14711:/var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-00c96147bbc1e2de6-1/redpanda/redpanda/src/v/storage/tests/log_segment_appender_test.cc(392): Entering test case "test_concurrent_append_flush"
14712:/var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-00c96147bbc1e2de6-1/redpanda/redpanda/src/v/storage/tests/log_segment_appender_test.cc(362): error: in "test_concurrent_append_flush": check access(appender).inflight_dispatched() == 0 has failed [1 != 0]
14713:fatal error: in "test_concurrent_append_flush": critical check f.available() has failed
14714:/var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-00c96147bbc1e2de6-1/redpanda/redpanda/src/v/storage/tests/log_segment_appender_test.cc(392): Leaving test case "test_concurrent_append_flush"; testing time: 2718423us

andijcr · 2023-09-29T18:04:15Z

@travisdowns what do you think, is this failure a sev/high, since it's in the context of writes?

travisdowns · 2023-10-03T15:13:55Z

@andijcr - it looks potentially scary since it's on the write path, high might be a good tag until we've diagnosed it a bit more?

travisdowns · 2023-10-03T15:14:44Z

@piyushredpanda wrote:

Do we not have a fixed seed for the fuzz? (I am assuming that fuzz is driven off a seed for reproducibility)

Currently we are not using a fixed seed, but it's a good idea (or at least output the seed).

However, a fixed seed doesn't guarantee much here, as the behavior of the underlying segment appender is timing dependent anyway.
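A minimal sketch of the "at least output the seed" idea, in plain standard C++ rather than the actual test code. The `action` enum, `make_actions`, `pick_and_log_seed`, and `check_replay` are hypothetical names introduced for illustration; the real test's action set and random helpers may look quite different. The sketch draws one seed per invocation, prints it, and derives the whole action sequence from it, so a failing sequence can at least be regenerated (timing-dependent behavior in the appender is still not reproduced by this).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

// Hypothetical action set; the real test's actions may differ.
enum class action : std::uint8_t { append, flush, truncate };

// Build the whole action sequence from a single seed.
// The output of std::mt19937_64 is fully specified by the standard, so the
// same seed yields the same sequence everywhere. The raw engine output is
// reduced with % because the mapping used by std::uniform_int_distribution
// is implementation-defined.
std::vector<action> make_actions(std::uint64_t seed, std::size_t count) {
    std::mt19937_64 rng(seed);
    std::vector<action> out;
    out.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        out.push_back(static_cast<action>(rng() % 3));
    }
    return out;
}

// Pick a fresh seed per invocation and print it, so that a failing run
// can be replayed later by passing the logged value to make_actions().
std::uint64_t pick_and_log_seed() {
    std::random_device rd;
    std::uint64_t hi = rd();
    std::uint64_t lo = rd();
    std::uint64_t seed = (hi << 32) | lo;
    std::cerr << "test seed: " << seed << '\n';
    return seed;
}

// Replaying a logged seed must reproduce the original sequence exactly.
void check_replay(std::uint64_t seed, std::size_t count) {
    assert(make_actions(seed, count) == make_actions(seed, count));
}
```

A test would call `pick_and_log_seed()` once, pass the result to `make_actions`, and include the seed in whatever context is printed on failure.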

travisdowns · 2023-10-03T21:27:46Z

So the underlying problem is the assumption that when a flush() future x2 resolves, and x2 was created after another flush future x1 on the same appender, x1 will necessarily also be available (i.e., flush futures resolve "in order"). Almost always x1 will be available, but one way it can happen that it is not is:

First, segment_appender::flush() is called, which will return future x1: this future comes from the "backup path", which is only executed if all of the following are true:

1. There are no pending bytes to flush in the appender (if there are, we enqueue a _flush_op and flush them, which will trigger the fdatasync on the primary path after the write finishes).
2. file_byte_offset() > _flushed_offset, because otherwise we have already flushed everything in the appender and the flush is a no-op.
3. There are no writes tracked in _inflight, since otherwise we can enqueue a _flush_op as above.

An easy way this can happen is if the inactive segment timer fires: this will dispatch a write of the current chunk without subsequently flushing it, so the conditions above will be met. However, that's not what happens here: there is actually a write+flush in progress when the x1 flush call occurs, but it is in between the dma write and the fdatasync, i.e., at some point after this loop. Here the _inflight array has already been popped and can be empty, even though the fdatasync is still to occur (call this sync0).

So flush x1 will trigger the backup path and do _out.flush().

Then sync0 completes its fdatasync and advances _flushed_offset to be equal to _stable_offset and _committed_offset (_flush_bytes_pending is 0 this whole time). So the subsequent flush x2 sees file_byte_offset() == _flushed_offset (condition 2 does not hold) and resolves immediately with a ready future. However, x1 is unresolved since it will only resolve once its _out.flush() resolves. This fdatasync by x1 is redundant in the sense that it could have simply waited for the flush in progress rather than issuing a second flush, but it seems harmless to me.

This situation probably arises at other times during the fuzz test but goes undetected as we only check this condition at the end of the test when we inspect the returned futures.

To fix this we can just remove this check, as its assumption about flush() future resolution order is not valid.
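The interleaving described above can be replayed in a small toy model. This is only a sketch under stated assumptions: it is plain standard C++, not Seastar and not the real segment_appender; `flush_token`, `toy_appender`, and `replay` are hypothetical names, and the model covers only the backup path and the no-op path of flush().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Toy stand-in for a future: a shared "done" flag. Not seastar::future.
struct flush_token {
    std::shared_ptr<bool> done = std::make_shared<bool>(false);
    bool available() const { return *done; }
};

// Toy model of the appender state relevant to flush(). Field names are
// illustrative and only loosely mirror those named in the analysis.
struct toy_appender {
    std::uint64_t file_offset = 0;          // bytes written so far
    std::uint64_t flushed_offset = 0;       // bytes known to be synced
    std::size_t inflight = 0;               // writes still tracked
    std::vector<flush_token> backup_syncs;  // outstanding backup-path syncs

    // A write finished its DMA step and was popped from the inflight list,
    // but its fdatasync ("sync0") has not completed yet.
    void write_done_sync_pending(std::uint64_t bytes) {
        file_offset += bytes;
        inflight = 0;
    }

    flush_token flush() {
        flush_token t;
        if (file_offset == flushed_offset) {
            *t.done = true;             // nothing to flush: ready immediately
        } else {
            assert(inflight == 0);      // the model only covers the backup path
            backup_syncs.push_back(t);  // resolves when its own sync finishes
        }
        return t;
    }

    // sync0 completes and advances the flushed offset.
    void complete_sync0() { flushed_offset = file_offset; }

    // The syncs issued on the backup path complete later.
    void complete_backup_syncs() {
        for (auto& t : backup_syncs) {
            *t.done = true;
        }
        backup_syncs.clear();
    }
};

// Availability of x1 and x2 observed right after x2 is created.
struct outcome {
    bool x1;
    bool x2;
};

outcome replay() {
    toy_appender a;
    a.write_done_sync_pending(4096);
    flush_token x1 = a.flush();  // backup path: still pending
    a.complete_sync0();          // flushed offset catches up
    flush_token x2 = a.flush();  // no-op path: ready immediately
    outcome o{x1.available(), x2.available()};
    a.complete_backup_syncs();
    assert(x1.available());      // x1 does resolve, just later than x2
    return o;
}
```

In this model `replay()` observes x2 as available while x1 is not, which is the state the `BOOST_REQUIRE(f.available())` check rejects. Waiting on every flush future instead of asserting availability sidesteps the ordering assumption.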

nvartolomei · 2023-11-16T15:36:28Z

https://buildkite.com/redpanda/redpanda/builds/41233#018bd87e-0c60-4f4a-aca8-4abbe6b12163

michael-redpanda · 2023-11-30T21:17:29Z

https://buildkite.com/redpanda/redpanda/builds/42064#018c2182-6347-4533-b75d-f92a35029443

travisdowns · 2023-12-01T16:07:35Z

I will push a fix for this today.

run_concurrent_append_flush is a fuzzer-like test and we may have hard-to-diagnose failures there (e.g., see issue redpanda-data#13035) and to help diagnose it we want to capture some information from the segment_appender at each step of the test. Introduce segment_appender_info to do this.

Relates to log_segment_appender_test::test_concurrent_append_flush, which is a fuzzer-style test: log test context and output it when we fail. In the storage_single_thread_rpunit concurrent flush test we now log test context which will be printed if the test fails. Critically, this includes the seed used to generate the random series of actions to be performed on the appender. In addition, we generate a single seed per invocation and then use that seed rather than the random helper methods, which use an unspecified random seed each time. Finally, we record more information about the operations performed in the test and output the full action sequence on failure. Issue redpanda-data#13035.

In test_concurrent_append_flush, which is a fuzzer style test, we now get() all futures returned by flush calls during the fuzz portion, instead of only the last flush. It is possible in some cases for prior futures to be unavailable even after the last future has resolved which caused occasional CI failures. See 13035 for more analysis. Fixes redpanda-data#13035.


abhijat · 2023-12-05T04:36:39Z

https://buildkite.com/redpanda/redpanda/builds/42233#018c3559-f926-479d-8dda-07632c55474e


run_concurrent_append_flush is a fuzzer-like test and we may have hard-to-diagnose failures there (e.g., see issue redpanda-data#13035) and to help diagnose it we want to capture some information from the segment_appender at each step of the test. Introduce segment_appender_info to do this. (cherry picked from commit 4e4a1e3)

Relates to log_segment_appender_test::test_concurrent_append_flush, which is a fuzzer-style test: log test context and output it when we fail. In the storage_single_thread_rpunit concurrent flush test we now log test context which will be printed if the test fails. Critically, this includes the seed used to generate the random series of actions to be performed on the appender. In addition, we generate a single seed per invocation and then use that seed rather than the random helper methods, which use an unspecified random seed each time. Finally, we record more information about the operations performed in the test and output the full action sequence on failure. Issue redpanda-data#13035. (cherry picked from commit b02c28c)

In test_concurrent_append_flush, which is a fuzzer style test, we now get() all futures returned by flush calls during the fuzz portion, instead of only the last flush. It is possible in some cases for prior futures to be unavailable even after the last future has resolved which caused occasional CI failures. See 13035 for more analysis. Fixes redpanda-data#13035. (cherry picked from commit cc82d0d)

Lazin added kind/bug Something isn't working ci-failure labels Aug 28, 2023

Lazin mentioned this issue Aug 28, 2023

rptest: Test spillover and timequeries #13034

Merged

7 tasks

Lazin added the area/storage label Aug 28, 2023

rystsov added the ci-ignore Automatic ci analysis tools ignore this issue label Sep 1, 2023

dotnwat added the sev/medium Bugs that do not meet criteria for high or critical, but are more severe than low. label Sep 5, 2023

andrwng mentioned this issue Sep 18, 2023

cloud_metadata: plug upload loop into application #11212

Merged

8 tasks

andijcr self-assigned this Sep 21, 2023

andijcr mentioned this issue Sep 27, 2023

storage/log_segment_appender_test: add test context for run_concurret_append_flush #13733

Merged

7 tasks

BenPope mentioned this issue Sep 29, 2023

security: Break dependency from v::config to v::security #13797

Merged

7 tasks

travisdowns self-assigned this Oct 3, 2023

piyushredpanda unassigned andijcr Oct 3, 2023

michael-redpanda mentioned this issue Nov 30, 2023

audit: Disable auditing in recovery mode #15228

Merged

7 tasks

travisdowns mentioned this issue Dec 2, 2023

Fix CI failure in test_concurrent_append_flush #15271

Merged

7 tasks

abhijat mentioned this issue Dec 5, 2023

ducktape/cloud_storage: Adjust delta expectation #15256

Merged

7 tasks

dotnwat mentioned this issue Dec 11, 2023

admin: split out handlers for usage, transactions #15381

Merged

7 tasks

travisdowns closed this as completed in #15271 Dec 18, 2023

vbotbuildovich mentioned this issue Dec 18, 2023

[v23.3.x] CI Failure (critical check f.available() has failed) in test_concurrent_append_flush #15728

Closed

abhijat mentioned this issue Apr 11, 2024

[v23.2.x] CORE-1752: cst: improved logging (manual backport) #17794

Merged

6 tasks
