Liveness: overflow in prepare_timeout.attempts leads to node crash #8
Note that due to the high replay rate (80% per packet), at some point during the test the network gets overloaded, and the test may still not complete after removing the assertion. But at least it doesn't segfault anymore.
Thanks @ThreeFx for a good find and yet another excellent report!
The assert was indeed overstrict and has been removed as part of the fix. The reason for all these assertions is defense-in-depth for our human fault model: we would rather be overstrict in our assumptions and learn this through a crash, than be lax and learn this through data loss.
We also really appreciate that you grepped the source for similar bugs.
This assertion failure does indeed reduce write availability and is a liveness bug. However, it's not as pernicious as the worst liveness bugs, because while it affects availability, it also results in a clean crash that is more readily detected, and because as per the comment in the source:
We have therefore decided to award you with a $350 liveness bounty for this issue.
You are now officially holding the Number ONE, Number TWO and Number THREE positions on the TigerBeetle leaderboard! 🥇🥈🥉
P.S. Thanks for the cool Polygon track recommendation that's jamming right now. Here's one for you: https://www.tigerbeetle.com/tiger-tracks
5a256b3f2354c28fbbc529a846bf597f2186e5a9
Timeout attempt counters may in fact wrap around to zero.
Reported-by: @ThreeFx
Refs: tigerbeetle/viewstamped-replication-made-famous#8
Description and Impact
Once the on_prepare_timeout.attempts counter (which is a u8) overflows, the assertion self.prepare_timeout.attempts > 0 fails, and the node crashes with an assertion failure. This is a liveness issue, as crashing a node reduces system availability.
Steps to Reproduce the Bug
./vopr.sh 15502461157524088066 -OReleaseSafe
./vopr.sh 15502461157524088066 -ODebug 2> /tmp/logfile
Suggested Fix
A short-term fix is to widen the attempts counter to a larger integer type that doesn't overflow as quickly. However, that does not address the root cause.
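To illustrate why a wider counter only postpones the crash, here is a minimal Python model of the failure. It is not the TigerBeetle source: it assumes the counter starts at 0 and is bumped once per timeout with a wrapping add, and the function name is hypothetical.

```python
def increments_until_assert_fails(bits: int) -> int:
    """Count wrapping increments of an unsigned `bits`-wide counter,
    starting from 0, until the check `attempts > 0` first fails.

    Hypothetical helper for illustration only; it models the counter
    described in this issue, not the actual timeout code.
    """
    mask = (1 << bits) - 1  # largest value the counter can hold
    attempts = 0
    increments = 0
    while True:
        # Wrapping add: the value after `mask` is 0 again.
        attempts = (attempts + 1) & mask
        increments += 1
        if not attempts > 0:
            # This is the point where `assert attempts > 0` would fire.
            return increments


if __name__ == "__main__":
    for bits in (8, 16):
        count = increments_until_assert_fails(bits)
        print(f"u{bits}: assertion fails after {count} timeouts")
```

Under these assumptions a u8 counter trips the assertion on the 256th timeout and a u16 counter on the 65536th, so a wider type delays the failure but the assertion can still fire under sustained load.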
I'm not sure why that assertion is there. I think it is a bug and should be removed, as the timer code explicitly states that it allows overflows:
There do not seem to be similar bugs lurking in the code (at least according to grep 'attempts > 0').
The Story Behind the Bug
Just playing around. My current idea is stress-testing the system by replaying a lot of messages and generating a high load (20 clients each generating many requests), which led to the overflow during this prepare timeout.
Songs Enjoyed During the Production of This Issue
You guessed it: Liquicity Yearmix 2020
I'll throw out a recommendation though: "High" by Polygon and Lois Lauri
Literature
No response
The Last Word
No response