
Fixed race condition in disk_log_impl::truncate_prefix #21324

Merged — 4 commits merged into redpanda-data:dev from fix-20758 on Nov 25, 2024

Conversation

@mmaslankaprv (Member) commented Jul 10, 2024

Fixed a race condition which could lead to the same offset being assigned to two
different records appended to the log. The race occurred when one fiber was
appending batches to the log while another was prefix-truncating it (Raft does
this when taking a snapshot). In this situation the batches could be appended to
a segment that was about to be deleted in remove_segment_permanently. As a
result, an appended batch was lost and the same offset was assigned to the next
one, which in turn triggered an assertion in mux_state_machine.

Fixes: #20758
Fixes: CORE-6888
Fixes: CORE-8007
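
To make the failure mode concrete, here is a minimal, self-contained sketch of the interleaving described above. The segment/log_model types below are a hypothetical toy model, not the actual Redpanda storage classes; it only illustrates how an append that races with prefix truncation can land in a segment that is already being removed. The fix in this PR closes that window by coordinating truncation with the segment write lock, as shown in the diff further down.

    #include <cstdio>
    #include <deque>
    #include <memory>
    #include <vector>

    // Toy model of a log: a deque of segments, each recording the offsets of
    // the batches appended to it. Hypothetical, simplified for illustration.
    struct segment {
        std::vector<long> offsets;
        bool removed = false;
    };

    struct log_model {
        std::deque<std::shared_ptr<segment>> segs;
        long next_offset = 0;

        // Appender fiber: first picks the active segment...
        std::shared_ptr<segment> active() { return segs.back(); }

        // ...then appends to it, possibly after a scheduling point.
        void append_to(const std::shared_ptr<segment>& s) {
            s->offsets.push_back(next_offset++);
            if (s->removed) {
                // The batch went into a segment that is being deleted: it is
                // effectively lost, and the same offset range ends up being
                // handed out again.
                std::puts("appended to a segment scheduled for removal - batch lost");
            }
        }

        // Prefix truncation: pops the front segment and deletes it without
        // coordinating with an in-flight append (this is the race).
        void truncate_prefix() {
            auto ptr = segs.front();
            segs.pop_front();
            ptr->removed = true; // stands in for remove_segment_permanently
        }
    };

    int main() {
        log_model log;
        log.segs.push_back(std::make_shared<segment>()); // single active segment

        auto seg = log.active();   // appender fiber: picks the active segment
        log.truncate_prefix();     // truncating fiber runs at a scheduling point
        log.append_to(seg);        // appender fiber: the batch lands in a dead segment
        return 0;
    }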

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x
  • v23.3.x

Release Notes

  • none

andrwng previously approved these changes Jul 12, 2024

@andrwng (Contributor) left a comment

I had some initial concerns about whether we should be taking read or write locks on the segment earlier than we are today, but after staring at this for a bit, I think this makes sense. Great find!

Review thread on src/v/storage/disk_log_impl.cc (outdated, resolved)
@andrwng dismissed their stale review on July 12, 2024 22:51

Suspect there's still a race here

@mmaslankaprv requested review from bharathv and andrwng on July 15, 2024 09:39
@vbotbuildovich (Collaborator) commented Jul 15, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/51509#0190b603-dc86-4835-b105-6dcff19192b6:

"rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.ABS.2.virtual_host.test_case=.TS_Read==True.SegmentRolledByTimeout==True"

new failures in https://buildkite.com/redpanda/redpanda/builds/51509#0190b603-dc88-4eaa-a173-397c56268725:

"rptest.tests.partition_movement_test.SIPartitionMovementTest.test_cross_shard.num_to_upgrade=0.cloud_storage_type=CloudStorageType.ABS"

new failures in https://buildkite.com/redpanda/redpanda/builds/51641#0190c068-d198-4d25-b607-7195e6f10086:

"rptest.tests.partition_movement_test.SIPartitionMovementTest.test_cross_shard.num_to_upgrade=0.cloud_storage_type=CloudStorageType.S3"

new failures in https://buildkite.com/redpanda/redpanda/builds/51641#0190c068-d197-4263-bef1-1267ed95edc7:

"rptest.tests.partition_movement_test.SIPartitionMovementTest.test_cross_shard.num_to_upgrade=0.cloud_storage_type=CloudStorageType.ABS"

new failures in https://buildkite.com/redpanda/redpanda/builds/51858#0190d99d-8ec0-420d-ab69-1e5ad487f7af:

"rptest.tests.partition_movement_test.SIPartitionMovementTest.test_cross_shard.num_to_upgrade=0.cloud_storage_type=CloudStorageType.S3"

new failures in https://buildkite.com/redpanda/redpanda/builds/52039#0190eab0-a9be-4575-9103-3ca771c980ac:

"rptest.tests.e2e_topic_recovery_test.EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.cloud_storage_type=CloudStorageType.S3"

new failures in https://buildkite.com/redpanda/redpanda/builds/52083#0190eed6-9de0-4afc-a737-9fe9ffbfc3da:

"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_decommissioning_and_upgrade"
"rptest.tests.e2e_topic_recovery_test.EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.cloud_storage_type=CloudStorageType.ABS"

new failures in https://buildkite.com/redpanda/redpanda/builds/52191#0190fdc5-b9dd-4bbd-ae36-22ac33cfd811:

"rptest.tests.e2e_topic_recovery_test.EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.cloud_storage_type=CloudStorageType.ABS"

new failures in https://buildkite.com/redpanda/redpanda/builds/52191#0190fdc5-b9d2-4ecd-8aac-c559b586d11f:

"rptest.tests.e2e_topic_recovery_test.EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.cloud_storage_type=CloudStorageType.S3"

new failures in https://buildkite.com/redpanda/redpanda/builds/52464#0191220a-6543-432f-a7e8-e0fdc561cb2a:

"rptest.tests.e2e_topic_recovery_test.EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.retention.local.target.bytes.1024.redpanda.remote.write.True.redpanda.remote.read.True.cloud_storage_type=CloudStorageType.ABS"

new failures in https://buildkite.com/redpanda/redpanda/builds/52464#0191220a-6545-40b1-b7aa-39fe92ddf336:

"rptest.tests.e2e_topic_recovery_test.EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.retention.local.target.bytes.1024.redpanda.remote.write.True.redpanda.remote.read.True.cloud_storage_type=CloudStorageType.S3"

new failures in https://buildkite.com/redpanda/redpanda/builds/52464#0191220a-6541-4aaa-9efe-12e75975200c:

"rptest.tests.partition_movement_test.SIPartitionMovementTest.test_cross_shard.num_to_upgrade=0.cloud_storage_type=CloudStorageType.ABS"

new failures in https://buildkite.com/redpanda/redpanda/builds/52464#01912d27-fc7c-4e39-abd1-74d1d57a5563:

"rptest.tests.partition_movement_test.SIPartitionMovementTest.test_cross_shard.num_to_upgrade=0.cloud_storage_type=CloudStorageType.ABS"

new failures in https://buildkite.com/redpanda/redpanda/builds/53032#01915a52-6f84-4a3d-893a-2c54cc7fe788:

"rptest.tests.partition_movement_test.SIPartitionMovementTest.test_cross_shard.num_to_upgrade=0.cloud_storage_type=CloudStorageType.ABS"

new failures in https://buildkite.com/redpanda/redpanda/builds/53123#019169f3-9f22-4e5d-b9c3-7ce94194145b:

"rptest.tests.consumer_group_test.ConsumerGroupStaticMembersRebalance.test_static_member_rejoining_group"

new failures in https://buildkite.com/redpanda/redpanda/builds/53673#01919887-5513-4f01-9166-b1782a63aeca:

"rptest.tests.e2e_topic_recovery_test.EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.retention.local.target.bytes.1024.redpanda.remote.write.True.redpanda.remote.read.True.cloud_storage_type=CloudStorageType.S3"

new failures in https://buildkite.com/redpanda/redpanda/builds/53673#01919887-5512-4721-9e85-96bde9017bc1:

"rptest.tests.e2e_topic_recovery_test.EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.retention.local.target.bytes.1024.redpanda.remote.write.True.redpanda.remote.read.True.cloud_storage_type=CloudStorageType.ABS"
"rptest.tests.partition_movement_test.SIPartitionMovementTest.test_cross_shard.num_to_upgrade=0.cloud_storage_type=CloudStorageType.S3"

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/58339#01934a49-bc1c-4f66-a198-bcf3aff56156:

"rptest.tests.e2e_topic_recovery_test.EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.retention.local.target.bytes.1024.redpanda.remote.write.True.redpanda.remote.read.True.cloud_storage_type=CloudStorageType.S3"
"rptest.tests.partition_movement_test.SIPartitionMovementTest.test_cross_shard.num_to_upgrade=0.cloud_storage_type=CloudStorageType.S3"

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/58339#01934a49-bc1f-4f14-a66a-ce037b975712:

"rptest.tests.e2e_topic_recovery_test.EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.cloud_storage_type=CloudStorageType.ABS"

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/58548#01935389-b385-48d1-bd21-a29fc520d66b:

"rptest.tests.partition_movement_test.SIPartitionMovementTest.test_cross_shard.num_to_upgrade=0.cloud_storage_type=CloudStorageType.S3"

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/58548#01935389-b388-4f7d-90c8-9c3be969b299:

"rptest.tests.partition_movement_test.SIPartitionMovementTest.test_cross_shard.num_to_upgrade=2.cloud_storage_type=CloudStorageType.S3"

@vbotbuildovich (Collaborator) commented Jul 15, 2024

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/51509#0190b603-dc88-4eaa-a173-397c56268725

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/51668#0190c1c8-9bd0-44eb-b8c8-50ed7cb6d062

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/51668#0190c1ca-f001-4347-939f-a8584431cef5

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/51858#0190d99d-8ebe-47cd-9e1a-28715332f487

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/52039#0190eab0-a9ba-4987-9dd2-391bf49f6fe7

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/52083#0190eebe-6119-48e2-8842-05ccdb5b8fc5

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/52083#0190eed6-9de0-4afc-a737-9fe9ffbfc3da

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/52191#0190fdc5-b9db-4f93-acdd-277474441116

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/52191#0190fdc5-b9d4-42bf-957c-2ca7015f5eab

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/52191#0190fdc5-b9dd-4bbd-ae36-22ac33cfd811

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/52464#0191220a-6543-432f-a7e8-e0fdc561cb2a

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/53032#01915a51-35ed-408e-80ed-45d4b7c3d191

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/53123#019169f3-9f1e-4a79-b1b4-44fd6954a455

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/58548#0193536f-1dd3-41f3-b894-09a7f9d92910
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/58548#01935389-b384-40e1-a229-a90e52e181b5

@mmaslankaprv force-pushed the fix-20758 branch 2 times, most recently from f218661 to 3355c14, on July 17, 2024 10:04
Review thread on src/v/storage/tests/storage_e2e_test.cc (outdated, resolved)
Review thread on src/v/storage/tests/storage_e2e_test.cc (outdated, resolved)
Review thread on the following snippet:

         * naturally yield to the appender and it will be retried
         */
        if (ptr->has_outstanding_locks()) {
            return ss::sleep(1ms);

Contributor:

Why can't we just wait for the write lock here?

@mmaslankaprv (Member, Author):

In order to avoid forcing the caller to wait for the segment write lock. The write lock is blocked whenever a reader is active, hence it may take a long time to close the segment.
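
To illustrate the trade-off being discussed, below is a small self-contained sketch using std::shared_mutex as a stand-in for the segment read/write lock (an analogy, not the Seastar/Redpanda API): parking on the write lock stalls the caller for as long as any reader is active, while the back-off-and-retry shape of the snippet above returns control to the caller so the appender can make progress in the meantime.

    #include <chrono>
    #include <mutex>
    #include <shared_mutex>
    #include <thread>

    std::shared_mutex segment_lock; // hypothetical stand-in for the segment rwlock

    // Option A: block until the exclusive (write) lock is available. If a reader
    // holds the shared lock for a long time, the truncating caller is stuck here
    // for just as long.
    void truncate_blocking() {
        std::unique_lock lock(segment_lock);
        // ... remove the segment ...
    }

    // Option B (the shape of the snippet above): if the lock cannot be taken,
    // back off briefly and let the caller retry, yielding to the appender instead
    // of forcing the caller to wait behind active readers.
    bool try_truncate_with_backoff() {
        std::unique_lock lock(segment_lock, std::try_to_lock);
        if (!lock.owns_lock()) {
            std::this_thread::sleep_for(std::chrono::milliseconds(1)); // ~ ss::sleep(1ms)
            return false; // caller retries later
        }
        // ... remove the segment ...
        return true;
    }

    int main() {
        while (!try_truncate_with_backoff()) {
            // retry until no reader or writer holds the segment lock
        }
        return 0;
    }

The sleep-and-retry variant trades some truncation latency for never parking the calling fiber on a potentially long-held lock.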

@mmaslankaprv force-pushed the fix-20758 branch 2 times, most recently from 41c9f46 to cc27381, on July 22, 2024 07:30
@mmaslankaprv (Member, Author):

/ci-repeat 1

@ztlpn previously approved these changes Jul 26, 2024
Review thread on src/v/storage/disk_log_impl.cc (outdated, resolved)
Review thread on src/v/storage/disk_log_appender.cc (resolved)
andrwng previously approved these changes Aug 6, 2024

@andrwng (Contributor) left a comment

This mostly LGTM. I'm thinking that:

  • it probably makes sense to wait a bit before backporting, to see if there is additional fallout/races that pop up in CI
  • wondering if we need to be taking the write lock every time we pop from _segs. That would at least be a clear heuristic for other developers that's easily auditable as a reviewer (see the sketch below)
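
A rough sketch of what that invariant could look like, again on a hypothetical toy model rather than the real disk_log_impl (the real segment exposes a futures-based write_lock(); a std::shared_mutex stands in here): every removal from the segment deque goes through a single helper that first acquires the segment's write lock, so a reviewer only has to check that nothing pops the deque directly.

    #include <deque>
    #include <memory>
    #include <mutex>
    #include <shared_mutex>

    // Hypothetical, simplified segment with its own read/write lock.
    struct segment {
        std::shared_mutex rw;
    };

    class log_model {
    public:
        void add_segment(std::shared_ptr<segment> s) { segs_.push_back(std::move(s)); }

        // The single place that removes the front segment. Acquiring the
        // exclusive lock first guarantees no appender or reader still holds the
        // segment at the moment it leaves the deque.
        std::shared_ptr<segment> pop_front_locked() {
            auto ptr = segs_.front();
            std::unique_lock guard(ptr->rw); // wait out readers and writers
            segs_.pop_front();
            return ptr; // caller may now delete the segment's files safely
        }

    private:
        std::deque<std::shared_ptr<segment>> segs_;
        // invariant: nothing else calls segs_.pop_front() directly
    };

    int main() {
        log_model log;
        log.add_segment(std::make_shared<segment>());
        auto removed = log.pop_front_locked(); // every truncation path uses this helper
        (void)removed;
        return 0;
    }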

@mmaslankaprv (Member, Author) commented Aug 16, 2024

/ci-repeat 1

2 similar comments
@mmaslankaprv (Member, Author):

/ci-repeat 1

@mmaslankaprv (Member, Author):

/ci-repeat 1

Added a log entry showing the next offset used to initialize the log appender.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
@vbotbuildovich (Collaborator) commented Nov 20, 2024

Retry command for Build#58339

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/e2e_topic_recovery_test.py::EndToEndTopicRecovery.test_restore_with_aborted_tx@{"cloud_storage_type":1,"recovery_overrides":{"redpanda.remote.read":true,"redpanda.remote.write":true,"retention.local.target.bytes":1024}}
tests/rptest/tests/partition_movement_test.py::SIPartitionMovementTest.test_cross_shard@{"cloud_storage_type":1,"num_to_upgrade":0}
tests/rptest/tests/e2e_topic_recovery_test.py::EndToEndTopicRecovery.test_restore_with_aborted_tx@{"cloud_storage_type":2,"recovery_overrides":{}}

@vbotbuildovich (Collaborator) commented Nov 22, 2024

Retry command for Build#58548

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/partition_movement_test.py::SIPartitionMovementTest.test_cross_shard@{"cloud_storage_type":1,"num_to_upgrade":0}
tests/rptest/tests/partition_movement_test.py::SIPartitionMovementTest.test_cross_shard@{"cloud_storage_type":1,"num_to_upgrade":2}

Fixed a race condition which could lead to the same offset being assigned to two
different records appended to the log (commit message identical to the PR
description above).

Fixes:

Signed-off-by: Michał Maślanka <michal@redpanda.com>
Added a test validating concurrent operations of writing and prefix
truncating a log.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
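
For illustration only, a simplified, self-contained analogue of such a test; a toy log guarded by a plain mutex stands in for the real segment locking, and the real test added by this commit exercises the actual log API. One thread appends batches while another repeatedly prefix-truncates, and at the end we assert that no offset was ever handed out twice.

    #include <atomic>
    #include <cassert>
    #include <deque>
    #include <mutex>
    #include <set>
    #include <thread>
    #include <vector>

    // Toy log: the mutex plays the role of the segment write-lock coordination
    // introduced by the fix, so offsets are never assigned twice.
    struct toy_log {
        std::mutex m;
        std::deque<long> batches;
        long next_offset = 0;

        long append() {
            std::lock_guard g(m);
            long o = next_offset++;
            batches.push_back(o);
            return o;
        }
        void truncate_prefix() {
            std::lock_guard g(m);
            if (!batches.empty()) {
                batches.pop_front();
            }
        }
    };

    int main() {
        toy_log log;
        std::vector<long> seen;
        std::mutex seen_m;
        std::atomic<bool> stop{false};

        std::thread appender([&] {
            for (int i = 0; i < 10000; ++i) {
                long o = log.append();
                std::lock_guard g(seen_m);
                seen.push_back(o);
            }
            stop = true;
        });
        std::thread truncator([&] {
            while (!stop) {
                log.truncate_prefix();
            }
        });
        appender.join();
        truncator.join();

        // With proper coordination (the analogue of the fix) every assigned
        // offset is unique.
        std::set<long> unique(seen.begin(), seen.end());
        assert(unique.size() == seen.size());
        return 0;
    }
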
Removed the line that informed the user about outstanding locks keeping the
segment alive, as the lock is now cleared before the segment is closed.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
@mmaslankaprv merged commit 1530ad0 into redpanda-data:dev on Nov 25, 2024
19 checks passed
@vbotbuildovich (Collaborator):

/backport v24.3.x

@vbotbuildovich (Collaborator):

/backport v24.2.x

@vbotbuildovich (Collaborator):

/backport v24.1.x

Comment on lines 2576 to +2589

Previous code:

    auto ptr = _segs.front();
    _segs.pop_front();
    _probe->add_bytes_prefix_truncated(ptr->file_size());
    return remove_segment_permanently(ptr, "remove_prefix_full_segments");

New code:

    // evict readers before trying to grab a write lock to prevent
    // contention
    return _readers_cache->evict_segment_readers(ptr).then(
      [this, ptr](readers_cache::range_lock_holder cache_lock) {
          /**
           * If segment has outstanding locks wait for it to be unlocked
           * before prefixing truncating it, this way the prefix
           * truncation will not overlap with appends.
           */
          return ptr->write_lock().then(
            [this, ptr, cache_lock = std::move(cache_lock)](
              ss::rwlock::holder lock_holder) {
                _segs.pop_front();

Member:

Prior to this change, ptr and the segment popped are the same.

          auto ptr = _segs.front();
          _segs.pop_front();

But now there are two interleaved scheduling points between setting ptr and popping the segment. What guarantees that they are still the same? Does it matter if they are different?

Comment on lines 2536 to -2542

Code under discussion (old lines):

    s->tombstone();
    if (s->has_outstanding_locks()) {
        vlog(
          stlog.info,
          "Segment has outstanding locks. Might take a while to close:{}",
          s->reader().filename());
    }

Member:

Quoting the commit message above: "as now the lock is cleared before the segment is closed."

But isn't this called in many contexts (e.g. truncation, compaction), not just when the segment is being closed?

Development

Successfully merging this pull request may close these issues.

CI Failure (key symptom) in PartitionMoveInterruption.test_cancelling_partition_move
7 participants