Fixed race condition in disk_log_impl::truncate_prefix
#21324
Conversation
I had some initial concerns about whether we should be taking read or write locks on the segment earlier than we are today, but after staring at this for a bit, I think this makes sense. Great find!
new failures in https://buildkite.com/redpanda/redpanda/builds/51509#0190b603-dc86-4835-b105-6dcff19192b6:
new failures in https://buildkite.com/redpanda/redpanda/builds/51509#0190b603-dc88-4eaa-a173-397c56268725:
new failures in https://buildkite.com/redpanda/redpanda/builds/51641#0190c068-d198-4d25-b607-7195e6f10086:
new failures in https://buildkite.com/redpanda/redpanda/builds/51641#0190c068-d197-4263-bef1-1267ed95edc7:
new failures in https://buildkite.com/redpanda/redpanda/builds/51858#0190d99d-8ec0-420d-ab69-1e5ad487f7af:
new failures in https://buildkite.com/redpanda/redpanda/builds/52039#0190eab0-a9be-4575-9103-3ca771c980ac:
new failures in https://buildkite.com/redpanda/redpanda/builds/52083#0190eed6-9de0-4afc-a737-9fe9ffbfc3da:
new failures in https://buildkite.com/redpanda/redpanda/builds/52191#0190fdc5-b9dd-4bbd-ae36-22ac33cfd811:
new failures in https://buildkite.com/redpanda/redpanda/builds/52191#0190fdc5-b9d2-4ecd-8aac-c559b586d11f:
new failures in https://buildkite.com/redpanda/redpanda/builds/52464#0191220a-6543-432f-a7e8-e0fdc561cb2a:
new failures in https://buildkite.com/redpanda/redpanda/builds/52464#0191220a-6545-40b1-b7aa-39fe92ddf336:
new failures in https://buildkite.com/redpanda/redpanda/builds/52464#0191220a-6541-4aaa-9efe-12e75975200c:
new failures in https://buildkite.com/redpanda/redpanda/builds/52464#01912d27-fc7c-4e39-abd1-74d1d57a5563:
new failures in https://buildkite.com/redpanda/redpanda/builds/53032#01915a52-6f84-4a3d-893a-2c54cc7fe788:
new failures in https://buildkite.com/redpanda/redpanda/builds/53123#019169f3-9f22-4e5d-b9c3-7ce94194145b:
new failures in https://buildkite.com/redpanda/redpanda/builds/53673#01919887-5513-4f01-9166-b1782a63aeca:
new failures in https://buildkite.com/redpanda/redpanda/builds/53673#01919887-5512-4721-9e85-96bde9017bc1:
non flaky failures in https://buildkite.com/redpanda/redpanda/builds/58339#01934a49-bc1c-4f66-a198-bcf3aff56156:
non flaky failures in https://buildkite.com/redpanda/redpanda/builds/58339#01934a49-bc1f-4f14-a66a-ce037b975712:
non flaky failures in https://buildkite.com/redpanda/redpanda/builds/58548#01935389-b385-48d1-bd21-a29fc520d66b:
non flaky failures in https://buildkite.com/redpanda/redpanda/builds/58548#01935389-b388-4f7d-90c8-9c3be969b299:
Force-pushed from f218661 to 3355c14
src/v/storage/disk_log_impl.cc (Outdated)
 * naturally yield to the appender and it will be retried
 */
if (ptr->has_outstanding_locks()) {
    return ss::sleep(1ms);
why can't we just wait for the write lock here?
In order to avoid forcing the caller to wait for the segment write lock. The write lock is blocked whenever a reader is active, so it may take a long time to close the segment.
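A minimal standalone sketch of the trade-off described here, using plain C++ with std::shared_mutex as a stand-in for the segment's read/write lock (an illustration only, not redpanda or seastar code): an exclusive acquisition cannot complete while any reader still holds the lock, so a caller that blocks on the write lock waits for the slowest reader.

```cpp
#include <chrono>
#include <iostream>
#include <mutex>
#include <shared_mutex>
#include <thread>

int main() {
    using namespace std::chrono_literals;
    std::shared_mutex segment_lock; // stand-in for the segment rwlock

    // A long-running reader keeps a shared (read) lock for a while.
    std::thread reader([&] {
        std::shared_lock read_lock(segment_lock);
        std::this_thread::sleep_for(500ms);
    });

    // Give the reader a head start so it actually holds the lock.
    std::this_thread::sleep_for(50ms);

    auto start = std::chrono::steady_clock::now();
    {
        // Blocking on the write lock: this cannot proceed until the reader
        // releases its read lock, so the caller stalls here for most of the
        // reader's 500ms.
        std::unique_lock write_lock(segment_lock);
        auto waited = std::chrono::duration_cast<std::chrono::milliseconds>(
          std::chrono::steady_clock::now() - start);
        std::cout << "write lock acquired after " << waited.count() << " ms\n";
    }
    reader.join();
    return 0;
}
```

Polling with a short sleep, as the earlier revision above does with ss::sleep(1ms), keeps the truncation path from parking behind that wait, at the cost of retrying later.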
Force-pushed from 41c9f46 to cc27381
/ci-repeat 1
This mostly LGTM. I'm thinking that:
- it probably makes sense to wait a bit before backporting, to see if there is additional fallout/races that pop up in CI
- wondering if we need to be taking the write lock every time we pop from _segs. That would at least be a clear heuristic for other developers that's easily auditable as a reviewer (a rough sketch of this idea follows)
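A rough sketch of that heuristic, with hypothetical names and plain C++ types rather than the actual redpanda API: funnel every removal of the front segment through a single helper that takes that segment's write lock first, so the rule lives in one auditable place.

```cpp
#include <deque>
#include <memory>
#include <mutex>
#include <shared_mutex>

// Stand-in for a log segment; appenders would hold rw_lock shared while
// writing, so taking it exclusively excludes in-flight appends.
struct segment {
    std::shared_mutex rw_lock;
};

class segment_set {
public:
    void push_back(std::shared_ptr<segment> s) {
        _segs.push_back(std::move(s));
    }

    bool empty() const { return _segs.empty(); }

    // The only way to drop the front segment: acquire its write lock first,
    // then unlink it, so the pop can never race with an appender.
    std::shared_ptr<segment> pop_front_locked() {
        auto front = _segs.front();
        std::unique_lock<std::shared_mutex> write_lock(front->rw_lock);
        _segs.pop_front();
        return front;
    }

private:
    std::deque<std::shared_ptr<segment>> _segs;
};

int main() {
    segment_set segs;
    segs.push_back(std::make_shared<segment>());
    auto removed = segs.pop_front_locked();
    return (segs.empty() && removed) ? 0 : 1;
}
```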
/ci-repeat 1
2 similar comments
/ci-repeat 1
/ci-repeat 1
Added a log entry showing the next offset used to initialize the log appender. Signed-off-by: Michał Maślanka <michal@redpanda.com>
Force-pushed from 130bb74 to fff6346
Retry command for Build#58339: please wait until all jobs are finished before running the slash command.
Force-pushed from fff6346 to 0a2b0e9
Retry command for Build#58548: please wait until all jobs are finished before running the slash command.
Fixed a race condition which may lead to a situation in which the same offset was assigned to two different records appended to the log. The race condition happened when one fiber was appending batches to the log while another one was prefix truncating it (Raft does this when taking a snapshot). In this situation it could happen that batches were appended to the segment which was about to be deleted in `remove_segment_permanently`. This led to a situation in which an appended batch was lost and the same offset was assigned to the next one, which in turn triggered an assertion in `mux_state_machine`. Fixes: Signed-off-by: Michał Maślanka <michal@redpanda.com>
Added a test validating concurrent writing and prefix truncation of a log. Signed-off-by: Michał Maślanka <michal@redpanda.com>
Removed the line that informed the user about outstanding locks keeping the segment alive, as the lock is now cleared before the segment is closed.
Force-pushed from 0a2b0e9 to 09feb36
/backport v24.3.x
/backport v24.2.x
/backport v24.1.x
auto ptr = _segs.front();
_segs.pop_front();
_probe->add_bytes_prefix_truncated(ptr->file_size());
return remove_segment_permanently(ptr, "remove_prefix_full_segments");
// evict readers before trying to grab a write lock to prevent
// contention
return _readers_cache->evict_segment_readers(ptr).then(
  [this, ptr](readers_cache::range_lock_holder cache_lock) {
      /**
       * If segment has outstanding locks wait for it to be unlocked
       * before prefixing truncating it, this way the prefix
       * truncation will not overlap with appends.
       */
      return ptr->write_lock().then(
        [this, ptr, cache_lock = std::move(cache_lock)](
          ss::rwlock::holder lock_holder) {
            _segs.pop_front();
Prior to this change, ptr and the segment popped are the same:
auto ptr = _segs.front();
_segs.pop_front();
But now there are two interleaved scheduling points between setting ptr and popping the segment. What guarantees that they are still the same? Does it matter if they are different?
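One hypothetical way to make that invariant explicit, sketched standalone in plain C++ (this is not code from the PR, just an illustration of the re-check the question points at): after the waits complete, verify that the captured segment is still the front of the set, and skip the pop otherwise.

```cpp
#include <cassert>
#include <deque>
#include <memory>

struct segment {};

// Pops 'expected' only if it is still the front of the set; returns whether
// anything was removed.
bool pop_front_if_still(
  std::deque<std::shared_ptr<segment>>& segs,
  const std::shared_ptr<segment>& expected) {
    if (segs.empty() || segs.front() != expected) {
        return false; // the set changed while we were waiting on the locks
    }
    segs.pop_front();
    return true;
}

int main() {
    std::deque<std::shared_ptr<segment>> segs;
    auto first = std::make_shared<segment>();
    auto second = std::make_shared<segment>();
    segs.push_back(first);
    segs.push_back(second);

    auto captured = segs.front(); // analogous to capturing 'ptr'
    segs.pop_front();             // simulate another fiber removing it first

    assert(!pop_front_if_still(segs, captured)); // front is now 'second'
    auto next = segs.front();
    assert(pop_front_if_still(segs, next));      // removes 'second'
    return segs.empty() ? 0 : 1;
}
```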
s->tombstone();
if (s->has_outstanding_locks()) {
    vlog(
      stlog.info,
      "Segment has outstanding locks. Might take a while to close:{}",
      s->reader().filename());
}
"as now the lock is cleared before the segment is closed."
But isn't this called in many contexts (e.g. truncation, compaction), not just when the segment is being closed?
Fixed a race condition which may lead to a situation in which the same offset was assigned to two different records appended to the log. The race condition happened when one fiber was appending batches to the log while another one was prefix truncating it (Raft does this when taking a snapshot). In this situation it could happen that batches were appended to the segment which was about to be deleted in remove_segment_permanently. This led to a situation in which an appended batch was lost and the same offset was assigned to the next one, which in turn triggered an assertion in mux_state_machine.
Fixes: #20758
Fixes: CORE-6888
Fixes: CORE-8007
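A toy, single-threaded replay of the interleaving described above; the names and the offset bookkeeping are simplifications for illustration, not redpanda's actual structures. An append lands in the segment that prefix truncation is about to delete, the batch disappears together with the segment, and the next record is assigned the same offset again.

```cpp
#include <cassert>
#include <iostream>
#include <memory>
#include <vector>

// Toy segment: just the base offsets of the batches written into it.
struct segment {
    std::vector<long> batches;
};

int main() {
    // A log with one segment holding offsets 0..2; the next offset is 3.
    std::vector<std::shared_ptr<segment>> segs{std::make_shared<segment>()};
    segs.back()->batches = {0, 1, 2};
    long next_offset = 3;

    // Fiber T: prefix truncation up to offset 3 captures the segment it is
    // going to remove, then hits a scheduling point and yields.
    auto doomed = segs.front();

    // Fiber A: an append slips in during that yield and lands in the
    // doomed segment.
    long lost = next_offset++; // record assigned offset 3
    doomed->batches.push_back(lost);

    // Fiber T resumes: the segment is removed permanently, taking the batch
    // at offset 3 with it, and the log is left starting at the truncation
    // point, so (in this toy model) the next offset is rebuilt as 3.
    segs.clear();
    segs.push_back(std::make_shared<segment>());
    next_offset = 3;

    long reused = next_offset++; // a different record gets offset 3 again
    segs.back()->batches.push_back(reused);

    std::cout << "offset " << lost << " was lost and offset " << reused
              << " was assigned to a second record\n";
    assert(lost == reused);
    return 0;
}
```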
Backports Required
Release Notes