Liveness: overflow in prepare_timeout.attempts leads to node crash #8

Closed
ThreeFx opened this issue Sep 14, 2021 · 3 comments

ThreeFx commented Sep 14, 2021

Description and Impact

Once the prepare_timeout.attempts counter (which is a u8) overflows, the assertion self.prepare_timeout.attempts > 0 fails and the node crashes with an assertion failure. This is a liveness issue, as crashing a node reduces system availability.

// in src/vsr/replica.zig

fn on_prepare_timeout(self: *Self) void {
    // [snip]

    // increases prepare_timeout.attempts
    self.prepare_timeout.backoff(&self.prng);
    
    // [snip]

    // Cycle through the list to reach live replicas and get around partitions:
    
    // This is where the overflow comes back to bite us.
    assert(self.prepare_timeout.attempts > 0);
    const replica = waiting[self.prepare_timeout.attempts % waiting_len];
    assert(replica != self.replica);


    // [snip]
}
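
To see the wraparound in isolation, here is a minimal standalone sketch (not part of the replica code; it only illustrates the u8 arithmetic). backoff() increments with a wrapping add, so the 256th increment returns the counter to 0 and the assertion above must fail:

const std = @import("std");
const assert = std.debug.assert;

test "u8 attempts counter wraps back to zero" {
    var attempts: u8 = 0;
    var i: usize = 0;
    // The same wrapping add (+%=) that Timeout.backoff() uses.
    while (i < 256) : (i += 1) attempts +%= 1;
    assert(attempts == 0);
    // At this point assert(attempts > 0), as in on_prepare_timeout, would panic.
}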

Steps to Reproduce the Bug

  1. Apply patch
  2. Run ./vopr.sh 15502461157524088066 -OReleaseSafe
    • After 25 transitions, the system will crash with the following message (line number may differ):
thread 18126 panic: reached unreachable code
./vopr.sh: line 23: 18126 Aborted                 (core dumped) zig run src/simulator.zig $BUILD_MODE -- $1
  3. Run ./vopr.sh 15502461157524088066 -ODebug 2> /tmp/logfile
    • Note: Takes a long time, about half an hour on my RX5900
    • Stacktrace:
thread 11349 panic: reached unreachable code
/usr/lib/zig/std/debug.zig:226:14: 0x22f90b in std.debug.assert (simulator)
    if (!ok) unreachable; // assertion failure
             ^
/home/bfiedler/projects/tigerbeetle-challenge/src/vsr/replica.zig:1479:19: 0x27c073 in vsr.replica.Replica(test.state_machine.StateMachine,test.message_bus.MessageBus,test.storage.Storage,test.time.Time).on_prepare_timeout (simulator)
            assert(self.prepare_timeout.attempts > 0);
                  ^
/home/bfiedler/projects/tigerbeetle-challenge/src/vsr/replica.zig:398:70: 0x2621bd in vsr.replica.Replica(test.state_machine.StateMachine,test.message_bus.MessageBus,test.storage.Storage,test.time.Time).tick (simulator)
            if (self.prepare_timeout.fired()) self.on_prepare_timeout();
                                                                     ^
/home/bfiedler/projects/tigerbeetle-challenge/src/simulator.zig:177:25: 0x258a92 in main (simulator)
            replica.tick();
                        ^
/usr/lib/zig/std/start.zig:458:37: 0x25084a in std.start.callMain (simulator)
            const result = root.main() catch |err| {
                                    ^
/usr/lib/zig/std/start.zig:400:12: 0x23431e in std.start.callMainWithArgs (simulator)
    return @call(.{ .modifier = .always_inline }, callMain, .{});
           ^
/usr/lib/zig/std/start.zig:319:17: 0x233325 in std.start.posixCallMainAndExit (simulator)
    std.os.exit(@call(.{ .modifier = .always_inline }, callMainWithArgs, .{ argc, argv, envp }));
                ^
/usr/lib/zig/std/start.zig:244:5: 0x233082 in std.start._start (simulator)
    @call(.{ .modifier = .never_inline }, posixCallMainAndExit, .{});
    ^

The patch referenced in step 1:

diff --git a/src/config.zig b/src/config.zig
index f4c3781..d976dad 100644
--- a/src/config.zig
+++ b/src/config.zig
@@ -6,14 +6,14 @@ pub const log_level = 6;
 
 /// The maximum number of replicas allowed in a cluster.
 /// This has been limited to 5 just to decrease the amount of memory required by the VOPR simulator.
-pub const replicas_max = 5;
+pub const replicas_max = 6;
 
 /// The maximum number of clients allowed per cluster, where each client has a unique 128-bit ID.
 /// This impacts the amount of memory allocated at initialization by the server.
 /// This determines the size of the VR client table used to cache replies to clients by client ID.
 /// Each client has one entry in the VR client table to store the latest `message_size_max` reply.
 /// This has been limited to 3 just to decrease the amount of memory required by the VOPR simulator.
-pub const clients_max = 3;
+pub const clients_max = 20;
 
 /// The minimum number of nodes required to form a quorum for replication:
 /// Majority quorums are only required across view change and replication phases (not within).
diff --git a/src/simulator.zig b/src/simulator.zig
index 89a7dca..210052e 100644
--- a/src/simulator.zig
+++ b/src/simulator.zig
@@ -55,14 +55,14 @@ pub fn main() !void {
     var prng = std.rand.DefaultPrng.init(seed);
     const random = &prng.random;
 
-    const replica_count = 1 + prng.random.uintLessThan(u8, config.replicas_max);
-    const client_count = 1 + prng.random.uintLessThan(u8, config.clients_max);
+    const replica_count = config.replicas_max; //1 + prng.random.uintLessThan(u8, config.replicas_max);
+    const client_count = config.clients_max; //1 + prng.random.uintLessThan(u8, config.clients_max);
     const node_count = replica_count + client_count;
 
     const ticks_max = 100_000_000;
     const transitions_max = config.journal_size_max / config.message_size_max;
-    const request_probability = 1 + prng.random.uintLessThan(u8, 99);
-    const idle_on_probability = prng.random.uintLessThan(u8, 20);
+    const request_probability = 100; //1 + prng.random.uintLessThan(u8, 99);
+    const idle_on_probability = 0; //prng.random.uintLessThan(u8, 20);
     const idle_off_probability = 10 + prng.random.uintLessThan(u8, 10);
 
     cluster = try Cluster.create(allocator, &prng.random, .{
@@ -74,13 +74,13 @@ pub fn main() !void {
             .packet_simulator_options = .{
                 .node_count = node_count,
                 .seed = prng.random.int(u64),
-                .one_way_delay_mean = 3 + prng.random.uintLessThan(u16, 10),
-                .one_way_delay_min = prng.random.uintLessThan(u16, 3),
-                .packet_loss_probability = prng.random.uintLessThan(u8, 30),
-                .path_maximum_capacity = 20 + prng.random.uintLessThan(u8, 20),
+                .one_way_delay_mean = 0, //3 + prng.random.uintLessThan(u16, 10),
+                .one_way_delay_min = 0, //prng.random.uintLessThan(u16, 3),
+                .packet_loss_probability = 0, //prng.random.uintLessThan(u8, 30),
+                .path_maximum_capacity = 100, //20 + prng.random.uintLessThan(u8, 20),
                 .path_clog_duration_mean = prng.random.uintLessThan(u16, 500),
                 .path_clog_probability = prng.random.uintLessThan(u8, 2),
-                .packet_replay_probability = prng.random.uintLessThan(u8, 50),
+                .packet_replay_probability = 80, // prng.random.uintLessThan(u8, 50),
             },
         },
         .storage_options = .{
diff --git a/src/test/storage.zig b/src/test/storage.zig
index ed3cbfa..d81270a 100644
--- a/src/test/storage.zig
+++ b/src/test/storage.zig
@@ -281,52 +281,25 @@ pub const Storage = struct {
         // We need to ensure there is message_size_max fault-free padding
         // between faulty areas of memory so that a single message
         // cannot straddle the corruptable areas of a majority of replicas.
-        switch (replica_count) {
-            1 => {
-                // If there is only one replica in the cluster, storage faults are not recoverable.
-                out[0] = .{ .first_offset = size, .period = 1 };
-            },
-            2 => {
-                //  0123456789
-                // 0X   X   X
-                // 1  X   X   X
-                out[0] = .{ .first_offset = 0 * message_size_max, .period = 4 * message_size_max };
-                out[1] = .{ .first_offset = 2 * message_size_max, .period = 4 * message_size_max };
-            },
-            3 => {
-                //  0123456789
-                // 0X     X
-                // 1  X     X
-                // 2    X     X
-                out[0] = .{ .first_offset = 0 * message_size_max, .period = 6 * message_size_max };
-                out[1] = .{ .first_offset = 2 * message_size_max, .period = 6 * message_size_max };
-                out[2] = .{ .first_offset = 4 * message_size_max, .period = 6 * message_size_max };
-            },
-            4 => {
-                //  0123456789
-                // 0X   X   X
-                // 1X   X   X
-                // 2  X   X   X
-                // 3  X   X   X
-                out[0] = .{ .first_offset = 0 * message_size_max, .period = 4 * message_size_max };
-                out[1] = .{ .first_offset = 0 * message_size_max, .period = 4 * message_size_max };
-                out[2] = .{ .first_offset = 2 * message_size_max, .period = 4 * message_size_max };
-                out[3] = .{ .first_offset = 2 * message_size_max, .period = 4 * message_size_max };
-            },
-            5 => {
-                //  0123456789
-                // 0X     X
-                // 1X     X
-                // 2  X     X
-                // 3  X     X
-                // 4    X     X
-                out[0] = .{ .first_offset = 0 * message_size_max, .period = 6 * message_size_max };
-                out[1] = .{ .first_offset = 0 * message_size_max, .period = 6 * message_size_max };
-                out[2] = .{ .first_offset = 2 * message_size_max, .period = 6 * message_size_max };
-                out[3] = .{ .first_offset = 2 * message_size_max, .period = 6 * message_size_max };
-                out[4] = .{ .first_offset = 4 * message_size_max, .period = 6 * message_size_max };
-            },
-            else => unreachable,
+        var i: usize = 0;
+        while (i < config.replicas_max) : (i += 3) {
+            //  0123456789
+            // 0X     X
+            // 1  X     X
+            // 2    X     X
+            // 3X     X
+            // 4  X     X
+            // 5    X     X
+            // ...
+            out[i] = .{ .first_offset = 0 * message_size_max, .period = 6 * message_size_max };
+
+            if (i + 1 < config.replicas_max) {
+                out[i + 1] = .{ .first_offset = 2 * message_size_max, .period = 6 * message_size_max };
+            }
+
+            if (i + 2 < config.replicas_max) {
+                out[i + 2] = .{ .first_offset = 4 * message_size_max, .period = 6 * message_size_max };
+            }
         }
 
         prng.shuffle(FaultyAreas, out[0..replica_count]);

Suggested Fix

A short-term fix is to widen the attempts counter to a type that does not overflow as quickly. However, that does not address the root cause.
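
As a sketch of that short-term change (the u32 width is only an assumption for illustration), the field in Timeout would become:

    // Hypothetical: a u32 wraps only after ~4 billion timeouts instead of 256.
    attempts: u32 = 0,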

I'm not sure why that assertion is there; I think it is a bug and should be removed, since the timeout code explicitly states that it allows overflow:

pub const Timeout = struct {
    // [snip]
    attempts: u8 = 0,
    // [snip]

    /// Increments the attempts counter and resets the timeout with exponential backoff and jitter.
    /// Allows the attempts counter to wrap from time to time.
    /// The overflow period is kept short to surface any related bugs sooner rather than later.
    /// We do not saturate the counter as this would cause round-robin retries to get stuck.
    pub fn backoff(self: *Timeout, prng: *std.rand.DefaultPrng) void {
        assert(self.ticking);

        self.ticks = 0;
        self.attempts +%= 1;

        log.debug("{}: {s} backing off", .{ self.id, self.name });
        self.set_after_for_rtt_and_attempts(prng);
    }
    // [snip]
};
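
With the assertion gone, a wrapped counter is still safe for the round-robin selection: the modulo keeps the index in bounds, and, as the existing assert(replica != self.replica) implies, waiting never contains the local replica. A sketch of the selection lines from on_prepare_timeout without the overflow assertion (not the actual fix commit):

    // attempts == 0 after a wrap simply starts the cycle at waiting[0] again.
    const replica = waiting[self.prepare_timeout.attempts % waiting_len];
    assert(replica != self.replica);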

There do not seem to be similar bugs lurking in the code (at least according to grep 'attempts > 0').

The Story Behind the Bug

Just playing around. My current idea is to stress-test the system by replaying a lot of messages and generating a high load (20 clients, each generating many requests), which led to the overflow during this prepare timeout.

Songs Enjoyed During the Production of This Issue

You guessed it: Liquicity Yearmix 2020

I'll throw out a recommendation though: "High" by Polygon and Lois Lauri

Literature

No response

The Last Word

No response

ThreeFx commented Sep 14, 2021

Note that due to the high replay rate (80% per packet), at some point during the test the network gets overloaded, and the test may still not complete after removing the assertion. But at least it doesn't panic anymore.

jorangreef (Member) commented:

Thanks @ThreeFx for a good find and yet another excellent report!

The assert was indeed overstrict and has been removed as part of the fix. The reason for all these assertions is defense-in-depth for our human fault model: we would rather be overstrict in our assumptions and learn this through a crash, than be lax and learn this through data loss.

We also really appreciate that you grepped the source for similar bugs.

This assertion failure does indeed reduce write availability and is a liveness bug. However, it's not as pernicious as the worst liveness bugs: while it affects availability, it also results in a clean crash that is more readily detected, and, as per the comment in the source: /// The overflow period is kept short to surface any related bugs sooner rather than later.

We have therefore decided to award you with a $350 liveness bounty for this issue.

You are now officially holding the Number ONE, Number TWO and Number THREE positions on the TigerBeetle leaderboard!

🥇🥈🥉


P.S. Thanks for the cool Polygon track recommendation that's jamming right now. Here's one for you: https://www.tigerbeetle.com/tiger-tracks

ThreeFx commented Sep 15, 2021

5a256b3f2354c28fbbc529a846bf597f2186e5a9

jorangreef added a commit to tigerbeetle/tigerbeetle that referenced this issue Sep 15, 2021