
Fixed race condition in disk_log_impl::truncate_prefix #21324

Merged — 4 commits merged into redpanda-data:dev from fix-20758 on Nov 25, 2024

Conversation

@mmaslankaprv (Member) commented Jul 10, 2024

Fixed a race condition which could lead to the same offset being assigned to two
different records appended to the log. The race occurred when one fiber was
appending batches to the log while another was prefix-truncating it (Raft does
this when taking a snapshot). In this situation the batches could be appended to
a segment that was about to be deleted in remove_segment_permanently. As a
result, an appended batch was lost and the same offset was assigned to the next
one, which in turn triggered an assertion in mux_state_machine.

Fixes: #20758
Fixes: CORE-6888
Fixes: CORE-8007
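
To make the failure mode concrete, here is a minimal, self-contained sketch of the interleaving described above. The segment/log_model types below are a hypothetical toy model, not the actual Redpanda storage classes; it only illustrates how an append that races with prefix truncation can land in a segment that is already being removed. The fix in this PR closes that window by coordinating truncation with the segment write lock, as shown in the diff further down.

    #include <cstdio>
    #include <deque>
    #include <memory>
    #include <vector>

    // Toy model of a log: a deque of segments, each recording the offsets of
    // the batches appended to it. Hypothetical, simplified for illustration.
    struct segment {
        std::vector<long> offsets;
        bool removed = false;
    };

    struct log_model {
        std::deque<std::shared_ptr<segment>> segs;
        long next_offset = 0;

        // Appender fiber: first picks the active segment...
        std::shared_ptr<segment> active() { return segs.back(); }

        // ...then appends to it, possibly after a scheduling point.
        void append_to(const std::shared_ptr<segment>& s) {
            s->offsets.push_back(next_offset++);
            if (s->removed) {
                // The batch went into a segment that is being deleted: it is
                // effectively lost, and the same offset range ends up being
                // handed out again.
                std::puts("appended to a segment scheduled for removal - batch lost");
            }
        }

        // Prefix truncation: pops the front segment and deletes it without
        // coordinating with an in-flight append (this is the race).
        void truncate_prefix() {
            auto ptr = segs.front();
            segs.pop_front();
            ptr->removed = true; // stands in for remove_segment_permanently
        }
    };

    int main() {
        log_model log;
        log.segs.push_back(std::make_shared<segment>()); // single active segment

        auto seg = log.active();   // appender fiber: picks the active segment
        log.truncate_prefix();     // truncating fiber runs at a scheduling point
        log.append_to(seg);        // appender fiber: the batch lands in a dead segment
        return 0;
    }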

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x
  • v23.3.x

Release Notes

  • none

andrwng previously approved these changes Jul 12, 2024

@andrwng (Contributor) left a comment

I had some initial concerns about whether we should be taking read or write locks on the segment earlier than we are today, but after staring at this for a bit, I think this makes sense. Great find!

Review thread on src/v/storage/disk_log_impl.cc (outdated, resolved)
@andrwng dismissed their stale review on July 12, 2024 22:51

Suspect there's still a race here

@mmaslankaprv requested review from bharathv and andrwng on July 15, 2024 09:39
@vbotbuildovich (Collaborator) commented Jul 15, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/51509#0190b603-dc86-4835-b105-6dcff19192b6:

"rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.ABS.2.virtual_host.test_case=.TS_Read==True.SegmentRolledByTimeout==True"

new failures in https://buildkite.com/redpanda/redpanda/builds/51509#0190b603-dc88-4eaa-a173-397c56268725:

"rptest.tests.partition_movement_test.SIPartitionMovementTest.test_cross_shard.num_to_upgrade=0.cloud_storage_type=CloudStorageType.ABS"

new failures in https://buildkite.com/redpanda/redpanda/builds/51641#0190c068-d198-4d25-b607-7195e6f10086:

"rptest.tests.partition_movement_test.SIPartitionMovementTest.test_cross_shard.num_to_upgrade=0.cloud_storage_type=CloudStorageType.S3"

new failures in https://buildkite.com/redpanda/redpanda/builds/51641#0190c068-d197-4263-bef1-1267ed95edc7:

"rptest.tests.partition_movement_test.SIPartitionMovementTest.test_cross_shard.num_to_upgrade=0.cloud_storage_type=CloudStorageType.ABS"

new failures in https://buildkite.com/redpanda/redpanda/builds/51858#0190d99d-8ec0-420d-ab69-1e5ad487f7af:

"rptest.tests.partition_movement_test.SIPartitionMovementTest.test_cross_shard.num_to_upgrade=0.cloud_storage_type=CloudStorageType.S3"

new failures in https://buildkite.com/redpanda/redpanda/builds/52039#0190eab0-a9be-4575-9103-3ca771c980ac:

"rptest.tests.e2e_topic_recovery_test.EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.cloud_storage_type=CloudStorageType.S3"

new failures in https://buildkite.com/redpanda/redpanda/builds/52083#0190eed6-9de0-4afc-a737-9fe9ffbfc3da:

"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_decommissioning_and_upgrade"
"rptest.tests.e2e_topic_recovery_test.EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.cloud_storage_type=CloudStorageType.ABS"

new failures in https://buildkite.com/redpanda/redpanda/builds/52191#0190fdc5-b9dd-4bbd-ae36-22ac33cfd811:

"rptest.tests.e2e_topic_recovery_test.EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.cloud_storage_type=CloudStorageType.ABS"

new failures in https://buildkite.com/redpanda/redpanda/builds/52191#0190fdc5-b9d2-4ecd-8aac-c559b586d11f:

"rptest.tests.e2e_topic_recovery_test.EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.cloud_storage_type=CloudStorageType.S3"

new failures in https://buildkite.com/redpanda/redpanda/builds/52464#0191220a-6543-432f-a7e8-e0fdc561cb2a:

"rptest.tests.e2e_topic_recovery_test.EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.retention.local.target.bytes.1024.redpanda.remote.write.True.redpanda.remote.read.True.cloud_storage_type=CloudStorageType.ABS"

new failures in https://buildkite.com/redpanda/redpanda/builds/52464#0191220a-6545-40b1-b7aa-39fe92ddf336:

"rptest.tests.e2e_topic_recovery_test.EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.retention.local.target.bytes.1024.redpanda.remote.write.True.redpanda.remote.read.True.cloud_storage_type=CloudStorageType.S3"

new failures in https://buildkite.com/redpanda/redpanda/builds/52464#0191220a-6541-4aaa-9efe-12e75975200c:

"rptest.tests.partition_movement_test.SIPartitionMovementTest.test_cross_shard.num_to_upgrade=0.cloud_storage_type=CloudStorageType.ABS"

new failures in https://buildkite.com/redpanda/redpanda/builds/52464#01912d27-fc7c-4e39-abd1-74d1d57a5563:

"rptest.tests.partition_movement_test.SIPartitionMovementTest.test_cross_shard.num_to_upgrade=0.cloud_storage_type=CloudStorageType.ABS"

new failures in https://buildkite.com/redpanda/redpanda/builds/53032#01915a52-6f84-4a3d-893a-2c54cc7fe788:

"rptest.tests.partition_movement_test.SIPartitionMovementTest.test_cross_shard.num_to_upgrade=0.cloud_storage_type=CloudStorageType.ABS"

new failures in https://buildkite.com/redpanda/redpanda/builds/53123#019169f3-9f22-4e5d-b9c3-7ce94194145b:

"rptest.tests.consumer_group_test.ConsumerGroupStaticMembersRebalance.test_static_member_rejoining_group"

new failures in https://buildkite.com/redpanda/redpanda/builds/53673#01919887-5513-4f01-9166-b1782a63aeca:

"rptest.tests.e2e_topic_recovery_test.EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.retention.local.target.bytes.1024.redpanda.remote.write.True.redpanda.remote.read.True.cloud_storage_type=CloudStorageType.S3"

new failures in https://buildkite.com/redpanda/redpanda/builds/53673#01919887-5512-4721-9e85-96bde9017bc1:

"rptest.tests.e2e_topic_recovery_test.EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.retention.local.target.bytes.1024.redpanda.remote.write.True.redpanda.remote.read.True.cloud_storage_type=CloudStorageType.ABS"
"rptest.tests.partition_movement_test.SIPartitionMovementTest.test_cross_shard.num_to_upgrade=0.cloud_storage_type=CloudStorageType.S3"

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/58339#01934a49-bc1c-4f66-a198-bcf3aff56156:

"rptest.tests.e2e_topic_recovery_test.EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.retention.local.target.bytes.1024.redpanda.remote.write.True.redpanda.remote.read.True.cloud_storage_type=CloudStorageType.S3"
"rptest.tests.partition_movement_test.SIPartitionMovementTest.test_cross_shard.num_to_upgrade=0.cloud_storage_type=CloudStorageType.S3"

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/58339#01934a49-bc1f-4f14-a66a-ce037b975712:

"rptest.tests.e2e_topic_recovery_test.EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.cloud_storage_type=CloudStorageType.ABS"

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/58548#01935389-b385-48d1-bd21-a29fc520d66b:

"rptest.tests.partition_movement_test.SIPartitionMovementTest.test_cross_shard.num_to_upgrade=0.cloud_storage_type=CloudStorageType.S3"

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/58548#01935389-b388-4f7d-90c8-9c3be969b299:

"rptest.tests.partition_movement_test.SIPartitionMovementTest.test_cross_shard.num_to_upgrade=2.cloud_storage_type=CloudStorageType.S3"

@vbotbuildovich (Collaborator) commented Jul 15, 2024

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/51509#0190b603-dc88-4eaa-a173-397c56268725

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/51668#0190c1c8-9bd0-44eb-b8c8-50ed7cb6d062

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/51668#0190c1ca-f001-4347-939f-a8584431cef5

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/51858#0190d99d-8ebe-47cd-9e1a-28715332f487

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/52039#0190eab0-a9ba-4987-9dd2-391bf49f6fe7

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/52083#0190eebe-6119-48e2-8842-05ccdb5b8fc5

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/52083#0190eed6-9de0-4afc-a737-9fe9ffbfc3da

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/52191#0190fdc5-b9db-4f93-acdd-277474441116

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/52191#0190fdc5-b9d4-42bf-957c-2ca7015f5eab

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/52191#0190fdc5-b9dd-4bbd-ae36-22ac33cfd811

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/52464#0191220a-6543-432f-a7e8-e0fdc561cb2a

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/53032#01915a51-35ed-408e-80ed-45d4b7c3d191

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/53123#019169f3-9f1e-4a79-b1b4-44fd6954a455

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/58548#0193536f-1dd3-41f3-b894-09a7f9d92910
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/58548#01935389-b384-40e1-a229-a90e52e181b5

@mmaslankaprv force-pushed the fix-20758 branch 2 times, most recently from f218661 to 3355c14, on July 17, 2024 10:04
Review thread on src/v/storage/tests/storage_e2e_test.cc (outdated, resolved)
Review thread on src/v/storage/tests/storage_e2e_test.cc (outdated, resolved)
Review thread on the following snippet:

         * naturally yield to the appender and it will be retried
         */
        if (ptr->has_outstanding_locks()) {
            return ss::sleep(1ms);

Contributor:

Why can't we just wait for the write lock here?

@mmaslankaprv (Member, Author):

In order to avoid forcing the caller to wait for the segment write lock. The write lock is blocked whenever a reader is active, hence it may take a long time to close the segment.
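
To illustrate the trade-off being discussed, below is a small self-contained sketch using std::shared_mutex as a stand-in for the segment read/write lock (an analogy, not the Seastar/Redpanda API): parking on the write lock stalls the caller for as long as any reader is active, while the back-off-and-retry shape of the snippet above returns control to the caller so the appender can make progress in the meantime.

    #include <chrono>
    #include <mutex>
    #include <shared_mutex>
    #include <thread>

    std::shared_mutex segment_lock; // hypothetical stand-in for the segment rwlock

    // Option A: block until the exclusive (write) lock is available. If a reader
    // holds the shared lock for a long time, the truncating caller is stuck here
    // for just as long.
    void truncate_blocking() {
        std::unique_lock lock(segment_lock);
        // ... remove the segment ...
    }

    // Option B (the shape of the snippet above): if the lock cannot be taken,
    // back off briefly and let the caller retry, yielding to the appender instead
    // of forcing the caller to wait behind active readers.
    bool try_truncate_with_backoff() {
        std::unique_lock lock(segment_lock, std::try_to_lock);
        if (!lock.owns_lock()) {
            std::this_thread::sleep_for(std::chrono::milliseconds(1)); // ~ ss::sleep(1ms)
            return false; // caller retries later
        }
        // ... remove the segment ...
        return true;
    }

    int main() {
        while (!try_truncate_with_backoff()) {
            // retry until no reader or writer holds the segment lock
        }
        return 0;
    }

The sleep-and-retry variant trades some truncation latency for never parking the calling fiber on a potentially long-held lock.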

@mmaslankaprv force-pushed the fix-20758 branch 2 times, most recently from 41c9f46 to cc27381, on July 22, 2024 07:30
@mmaslankaprv (Member, Author):

/ci-repeat 1

@ztlpn previously approved these changes Jul 26, 2024
Review thread on src/v/storage/disk_log_impl.cc (outdated, resolved)
Review thread on src/v/storage/disk_log_appender.cc (resolved)
andrwng previously approved these changes Aug 6, 2024

@andrwng (Contributor) left a comment

This mostly LGTM. I'm thinking that:

  • it probably makes sense to wait a bit before backporting, to see if there is additional fallout/races that pop up in CI
  • wondering if we need to be taking the write lock every time we pop from _segs. That would at least be a clear heuristic for other developers that's easily auditable as a reviewer (see the sketch below)
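
A rough sketch of what that invariant could look like, again on a hypothetical toy model rather than the real disk_log_impl (the real segment exposes a futures-based write_lock(); a std::shared_mutex stands in here): every removal from the segment deque goes through a single helper that first acquires the segment's write lock, so a reviewer only has to check that nothing pops the deque directly.

    #include <deque>
    #include <memory>
    #include <mutex>
    #include <shared_mutex>

    // Hypothetical, simplified segment with its own read/write lock.
    struct segment {
        std::shared_mutex rw;
    };

    class log_model {
    public:
        void add_segment(std::shared_ptr<segment> s) { segs_.push_back(std::move(s)); }

        // The single place that removes the front segment. Acquiring the
        // exclusive lock first guarantees no appender or reader still holds the
        // segment at the moment it leaves the deque.
        std::shared_ptr<segment> pop_front_locked() {
            auto ptr = segs_.front();
            std::unique_lock guard(ptr->rw); // wait out readers and writers
            segs_.pop_front();
            return ptr; // caller may now delete the segment's files safely
        }

    private:
        std::deque<std::shared_ptr<segment>> segs_;
        // invariant: nothing else calls segs_.pop_front() directly
    };

    int main() {
        log_model log;
        log.add_segment(std::make_shared<segment>());
        auto removed = log.pop_front_locked(); // every truncation path uses this helper
        (void)removed;
        return 0;
    }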

@mmaslankaprv (Member, Author) commented Aug 16, 2024

/ci-repeat 1

2 similar comments
@mmaslankaprv (Member, Author):

/ci-repeat 1

@mmaslankaprv (Member, Author):

/ci-repeat 1

Added a log entry showing the next offset used to initialize the log appender.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
@vbotbuildovich (Collaborator) commented Nov 20, 2024

Retry command for Build#58339

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/e2e_topic_recovery_test.py::EndToEndTopicRecovery.test_restore_with_aborted_tx@{"cloud_storage_type":1,"recovery_overrides":{"redpanda.remote.read":true,"redpanda.remote.write":true,"retention.local.target.bytes":1024}}
tests/rptest/tests/partition_movement_test.py::SIPartitionMovementTest.test_cross_shard@{"cloud_storage_type":1,"num_to_upgrade":0}
tests/rptest/tests/e2e_topic_recovery_test.py::EndToEndTopicRecovery.test_restore_with_aborted_tx@{"cloud_storage_type":2,"recovery_overrides":{}}

@vbotbuildovich (Collaborator) commented Nov 22, 2024

Retry command for Build#58548

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/partition_movement_test.py::SIPartitionMovementTest.test_cross_shard@{"cloud_storage_type":1,"num_to_upgrade":0}
tests/rptest/tests/partition_movement_test.py::SIPartitionMovementTest.test_cross_shard@{"cloud_storage_type":1,"num_to_upgrade":2}

Fixed a race condition which could lead to the same offset being assigned to two
different records appended to the log (commit message identical to the PR
description above).

Fixes:

Signed-off-by: Michał Maślanka <michal@redpanda.com>
Added a test validating concurrent operations of writing and prefix
truncating a log.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
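
For illustration only, a simplified, self-contained analogue of such a test; a toy log guarded by a plain mutex stands in for the real segment locking, and the real test added by this commit exercises the actual log API. One thread appends batches while another repeatedly prefix-truncates, and at the end we assert that no offset was ever handed out twice.

    #include <atomic>
    #include <cassert>
    #include <deque>
    #include <mutex>
    #include <set>
    #include <thread>
    #include <vector>

    // Toy log: the mutex plays the role of the segment write-lock coordination
    // introduced by the fix, so offsets are never assigned twice.
    struct toy_log {
        std::mutex m;
        std::deque<long> batches;
        long next_offset = 0;

        long append() {
            std::lock_guard g(m);
            long o = next_offset++;
            batches.push_back(o);
            return o;
        }
        void truncate_prefix() {
            std::lock_guard g(m);
            if (!batches.empty()) {
                batches.pop_front();
            }
        }
    };

    int main() {
        toy_log log;
        std::vector<long> seen;
        std::mutex seen_m;
        std::atomic<bool> stop{false};

        std::thread appender([&] {
            for (int i = 0; i < 10000; ++i) {
                long o = log.append();
                std::lock_guard g(seen_m);
                seen.push_back(o);
            }
            stop = true;
        });
        std::thread truncator([&] {
            while (!stop) {
                log.truncate_prefix();
            }
        });
        appender.join();
        truncator.join();

        // With proper coordination (the analogue of the fix) every assigned
        // offset is unique.
        std::set<long> unique(seen.begin(), seen.end());
        assert(unique.size() == seen.size());
        return 0;
    }
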
Removed the line that informed the user about outstanding locks keeping the
segment alive, as the lock is now cleared before the segment is closed.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
@mmaslankaprv merged commit 1530ad0 into redpanda-data:dev on Nov 25, 2024
19 checks passed
@vbotbuildovich (Collaborator):

/backport v24.3.x

@vbotbuildovich (Collaborator):

/backport v24.2.x

@vbotbuildovich (Collaborator):

/backport v24.1.x

Comment on lines 2576 to +2589

Previous code:

    auto ptr = _segs.front();
    _segs.pop_front();
    _probe->add_bytes_prefix_truncated(ptr->file_size());
    return remove_segment_permanently(ptr, "remove_prefix_full_segments");

New code:

    // evict readers before trying to grab a write lock to prevent
    // contention
    return _readers_cache->evict_segment_readers(ptr).then(
      [this, ptr](readers_cache::range_lock_holder cache_lock) {
          /**
           * If segment has outstanding locks wait for it to be unlocked
           * before prefixing truncating it, this way the prefix
           * truncation will not overlap with appends.
           */
          return ptr->write_lock().then(
            [this, ptr, cache_lock = std::move(cache_lock)](
              ss::rwlock::holder lock_holder) {
                _segs.pop_front();

Member:

Prior to this change, ptr and the segment popped are the same.

          auto ptr = _segs.front();
          _segs.pop_front();

But now there are two interleaved scheduling points between setting ptr and popping the segment. What guarantees that they are still the same? Does it matter if they are different?

Comment on lines 2536 to -2542

Code under discussion (old lines):

    s->tombstone();
    if (s->has_outstanding_locks()) {
        vlog(
          stlog.info,
          "Segment has outstanding locks. Might take a while to close:{}",
          s->reader().filename());
    }

Member:

Quoting the commit message above: "as now the lock is cleared before the segment is closed."

But isn't this called in many contexts (e.g. truncation, compaction), not just when the segment is being closed?

Development

Successfully merging this pull request may close these issues.

CI Failure (key symptom) in PartitionMoveInterruption.test_cancelling_partition_move
7 participants