Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI Failure (critical check f.available() has failed) in test_concurrent_append_flush #13035

Closed
Lazin opened this issue Aug 28, 2023 · 11 comments · Fixed by #15271
Closed

CI Failure (critical check f.available() has failed) in test_concurrent_append_flush #13035

Lazin opened this issue Aug 28, 2023 · 11 comments · Fixed by #15271
Assignees
Labels
area/storage ci-failure ci-ignore Automatic ci analysis tools ignore this issue kind/bug Something isn't working sev/medium Bugs that do not meet criteria for high or critical, but are more severe than low.

Comments

@Lazin
Copy link
Contributor

Lazin commented Aug 28, 2023

The test_concurrent_append_flush test from the storage_single_thread_rpunit has failed.

https://buildkite.com/redpanda/redpanda/builds/35783#018a3818-bc30-40c6-8b1e-d45bc58308b6/6-5451

The error message:

log_segment_appender_test.cc(355): �[4;31;49mfatal error: in "test_concurrent_append_flush": critical check f.available() has failed�[0;39;49m
@Lazin Lazin added kind/bug Something isn't working ci-failure labels Aug 28, 2023
@rystsov rystsov added the ci-ignore Automatic ci analysis tools ignore this issue label Sep 1, 2023
@dotnwat dotnwat added the sev/medium Bugs that do not meet criteria for high or critical, but are more severe than low. label Sep 5, 2023
@andijcr andijcr self-assigned this Sep 21, 2023
@andijcr
Copy link
Contributor

andijcr commented Sep 21, 2023

can't reproduce locally, but it's kind of a fuzz test so it makes sense that the behavior is difficult to reproduce.

the comment in the test states:

    // now we expect all the prior flush futures to be available
    // we don't guarantee this is in the API currently but it is how it
    // works currently and we might as well assert it
    for (auto& f : futs) {
        BOOST_REQUIRE(f.available());
        f.get(); // propagate any exception
    }

so it might be a symptom of an underlying bug in segment_appender (the source of these futures).

will include further logging to generate a bit more context around the failure.

update:

also a check failed:

/var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-0c2b37205f0b6375f-1/redpanda/redpanda/src/v/storage/tests/log_segment_appender_test.cc(373): Entering test case "test_concurrent_append_flush"
/var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-0c2b37205f0b6375f-1/redpanda/redpanda/src/v/storage/tests/log_segment_appender_test.cc(348): error: in "test_concurrent_append_flush": check access(appender).inflight_dispatched() == 0 has failed [1 != 0]
fatal error: in "test_concurrent_append_flush": critical check f.available() has failed

so this error makes a bit more sense

@piyushredpanda
Copy link
Contributor

Do we not have a fixed seed for the fuzz? (I am assuming that fuzz is driven off a seed for reproducibility)

@andijcr
Copy link
Contributor

andijcr commented Sep 29, 2023

from https://buildkite.com/redpanda/redpanda/builds/37956#018ae10e-3932-417e-964d-8b4bf1aa5355
(obviously it didn't hit the new context)

14711:/var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-00c96147bbc1e2de6-1/redpanda/redpanda/src/v/storage/tests/log_segment_appender_test.cc(392): Entering test case "test_concurrent_append_flush"
14712:/var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-00c96147bbc1e2de6-1/redpanda/redpanda/src/v/storage/tests/log_segment_appender_test.cc(362): error: in "test_concurrent_append_flush": check access(appender).inflight_dispatched() == 0 has failed [1 != 0]
14713:fatal error: in "test_concurrent_append_flush": critical check f.available() has failed
14714:/var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-00c96147bbc1e2de6-1/redpanda/redpanda/src/v/storage/tests/log_segment_appender_test.cc(392): Leaving test case "test_concurrent_append_flush"; testing time: 2718423us

@andijcr
Copy link
Contributor

andijcr commented Sep 29, 2023

@travisdowns what do you think, is this failure a sev/high, since it's in the context of writes?

@travisdowns
Copy link
Member

@andijcr - it looks potentially scary since it's on the write path, high might be a good tag until we've diagnosed it a bit more?

@travisdowns
Copy link
Member

@piyushredpanda wrote:

Do we not have a fixed seed for the fuzz? (I am assuming that fuzz is driven off a seed for reproducibility)

Currently we are not using a fixed seed, but it's a good idea (or at least output the seed).

H/e fixed seed doesn't guarantee too much here as the behavior of the underlying segment appender is timing dependent anyway.

@travisdowns travisdowns self-assigned this Oct 3, 2023
@travisdowns
Copy link
Member

travisdowns commented Oct 3, 2023

So the underlying problem is the assumption that when a flush() future x2 resolves which was created after another flush future x1 on the same appender, x1 will necessarily also be available (i.e., flush futures resolve "in order"). Almost always x will be available, but one way it can happen that it does not is:

First, segment_appender::flush() is called which will return future x1: this future comes from the "backup path" here which is only executed if all of the following are true:

  1. There are no pending bytes to flush in the appender (if there are, we enqueue a _flush_op and flush them, which will trigger the fdatasync on the primary path after the write finishes.

  2. file_byte_offset() > _flushed_offset because otherwise we have already flushed everything in the appender and the flush is a no-op

  3. There are no writes tracked in _inflight, since otherwise we can enqueue a _flush_op as above.

An easy way this can happen is if the inactive segment timer fires: this will dispatch a write of the current chunk without subsequently flushing it. So the conditions above will be met. However, that's not what happens here: there is actually a write+flush in progress when the x1 flush call occurs, but it is in-between the dma write and the fdatasync, i.e, at some point after this loop. Here the _inflight array has already been popped and can be empty, even though the fdatasync is still to occur (call this sync0).

So flush x1 will trigger the backup path and do _out.flush().

Then, the sync0 completes its fdatasync and advances _flushed_offset to be equal to _stable_offset and _committed_offset (_flush_bytes_pending is 0 this whole time). So the subsequent flush x2 sees file_byte_offset() == _flushed_offset (condition 2) and resolves immediately with a ready future. However, x1 is unresolved since it will only resolve on once it's _out.flush() resolves. This fdatasync by x1 is redundant in the sense that it could have simply waited for the flush in progress rather than issuing a second flush, but it seems harmless to me.

This situation probably arises at other times during the fuzz test but goes undetected as we only check this condition at the end of the test when we inspect the returned futures.

To fix this we can just remove this check, as its assumption about flush() future resolution order is not valid.

@nvartolomei
Copy link
Contributor

@michael-redpanda
Copy link
Contributor

@travisdowns
Copy link
Member

I will push a fix for this today.

travisdowns added a commit to travisdowns/redpanda that referenced this issue Dec 2, 2023
run_concurrent_append_flush is a fuzzer-like test and we may have
hard-to-diagnose failures there (e.g., see issue redpanda-data#13035) and to help
diagnose it we want to capture
some information from the segment_appender at each step of the
test.

Introduce segment_appender_info to do this.
travisdowns added a commit to travisdowns/redpanda that referenced this issue Dec 2, 2023
Relates to log_segment_appender_test::test_concurrent_append_flush,
which is a fuzzer-style test, and output it when we fail.

In storage_single_thread_rpunit concurrent flush test we now log
test context which will be printed if the test fails. Critically this
includes the seem used to generate the random series of actions to
be performed on the appender.

In addition we generate a single seed per invocation and then use that
seed rather than the random helper methods which use an unspecified
random seed each time.

Finally we record more information about the operations performed in
test and output the full action sequence on failure.

Issue redpanda-data#13035.
travisdowns added a commit to travisdowns/redpanda that referenced this issue Dec 2, 2023
In test_concurrent_append_flush, which is a fuzzer style test,
we now get() all futures returned by flush calls during the fuzz
portion, instead of only the last flush.

It is possible in some cases for prior futures to be unavailable
even after the last future has resolved which caused occasional
CI failures. See 13035 for more analysis.

Fixes redpanda-data#13035.
travisdowns added a commit to travisdowns/redpanda that referenced this issue Dec 2, 2023
Relates to log_segment_appender_test::test_concurrent_append_flush,
which is a fuzzer-style test, and output it when we fail.

In storage_single_thread_rpunit concurrent flush test we now log
test context which will be printed if the test fails. Critically this
includes the seem used to generate the random series of actions to
be performed on the appender.

In addition we generate a single seed per invocation and then use that
seed rather than the random helper methods which use an unspecified
random seed each time.

Finally we record more information about the operations performed in
test and output the full action sequence on failure.

Issue redpanda-data#13035.
travisdowns added a commit to travisdowns/redpanda that referenced this issue Dec 2, 2023
In test_concurrent_append_flush, which is a fuzzer style test,
we now get() all futures returned by flush calls during the fuzz
portion, instead of only the last flush.

It is possible in some cases for prior futures to be unavailable
even after the last future has resolved which caused occasional
CI failures. See 13035 for more analysis.

Fixes redpanda-data#13035.
@abhijat
Copy link
Contributor

abhijat commented Dec 5, 2023

travisdowns added a commit to travisdowns/redpanda that referenced this issue Dec 11, 2023
run_concurrent_append_flush is a fuzzer-like test and we may have
hard-to-diagnose failures there (e.g., see issue redpanda-data#13035) and to help
diagnose it we want to capture
some information from the segment_appender at each step of the
test.

Introduce segment_appender_info to do this.
travisdowns added a commit to travisdowns/redpanda that referenced this issue Dec 11, 2023
Relates to log_segment_appender_test::test_concurrent_append_flush,
which is a fuzzer-style test, and output it when we fail.

In storage_single_thread_rpunit concurrent flush test we now log
test context which will be printed if the test fails. Critically this
includes the seem used to generate the random series of actions to
be performed on the appender.

In addition we generate a single seed per invocation and then use that
seed rather than the random helper methods which use an unspecified
random seed each time.

Finally we record more information about the operations performed in
test and output the full action sequence on failure.

Issue redpanda-data#13035.
travisdowns added a commit to travisdowns/redpanda that referenced this issue Dec 11, 2023
In test_concurrent_append_flush, which is a fuzzer style test,
we now get() all futures returned by flush calls during the fuzz
portion, instead of only the last flush.

It is possible in some cases for prior futures to be unavailable
even after the last future has resolved which caused occasional
CI failures. See 13035 for more analysis.

Fixes redpanda-data#13035.
travisdowns added a commit to travisdowns/redpanda that referenced this issue Dec 11, 2023
Relates to log_segment_appender_test::test_concurrent_append_flush,
which is a fuzzer-style test, and output it when we fail.

In storage_single_thread_rpunit concurrent flush test we now log
test context which will be printed if the test fails. Critically this
includes the seem used to generate the random series of actions to
be performed on the appender.

In addition we generate a single seed per invocation and then use that
seed rather than the random helper methods which use an unspecified
random seed each time.

Finally we record more information about the operations performed in
test and output the full action sequence on failure.

Issue redpanda-data#13035.
travisdowns added a commit to travisdowns/redpanda that referenced this issue Dec 11, 2023
In test_concurrent_append_flush, which is a fuzzer style test,
we now get() all futures returned by flush calls during the fuzz
portion, instead of only the last flush.

It is possible in some cases for prior futures to be unavailable
even after the last future has resolved which caused occasional
CI failures. See 13035 for more analysis.

Fixes redpanda-data#13035.
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Dec 18, 2023
run_concurrent_append_flush is a fuzzer-like test and we may have
hard-to-diagnose failures there (e.g., see issue redpanda-data#13035) and to help
diagnose it we want to capture
some information from the segment_appender at each step of the
test.

Introduce segment_appender_info to do this.

(cherry picked from commit 4e4a1e3)
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Dec 18, 2023
Relates to log_segment_appender_test::test_concurrent_append_flush,
which is a fuzzer-style test, and output it when we fail.

In storage_single_thread_rpunit concurrent flush test we now log
test context which will be printed if the test fails. Critically this
includes the seem used to generate the random series of actions to
be performed on the appender.

In addition we generate a single seed per invocation and then use that
seed rather than the random helper methods which use an unspecified
random seed each time.

Finally we record more information about the operations performed in
test and output the full action sequence on failure.

Issue redpanda-data#13035.

(cherry picked from commit b02c28c)
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Dec 18, 2023
In test_concurrent_append_flush, which is a fuzzer style test,
we now get() all futures returned by flush calls during the fuzz
portion, instead of only the last flush.

It is possible in some cases for prior futures to be unavailable
even after the last future has resolved which caused occasional
CI failures. See 13035 for more analysis.

Fixes redpanda-data#13035.

(cherry picked from commit cc82d0d)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/storage ci-failure ci-ignore Automatic ci analysis tools ignore this issue kind/bug Something isn't working sev/medium Bugs that do not meet criteria for high or critical, but are more severe than low.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants