3.2: when ship node exits with error it usually doesn't start up again properly with snapshot #596
Comments
It appears that from 3.1 to 3.2 an unclean kill of the nodeos process is more likely to corrupt the SHiP logs. Further discussion internally is required to take a stance on what level of resiliency we want these logs to have at this time. Because this is more likely to occur, possibly due to #592, we will revisit after spending some time with that issue.
Due to ongoing changes to SHiP targeted for this next release, we will continue to hold off on this issue for now.
fwiw there appears to be a difference in behavior between 2.0 and 3.x SHiP. 2.0 flushes both the index and the log per block, whereas 3.x only flushes the log. This means a crash (in addition to a power failure, etc.) leaves the index and log in a state that triggers a recovery attempt on relaunch. It's not clear if this is a regression or simply a change in behavior. It's not clear what data file consistency the ship log intends to guarantee.
Seems like we might as well add the flush back in until a determination of the intended file consistency is made.
yeah I think it's fine to add it back |
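For reference, a minimal sketch of what "flush both per block" could look like. This is not the actual state_history_log code in Leap; the file handling, entry layout, and function name are assumptions made for illustration only.

```cpp
// Hypothetical sketch: append a block entry to the ship log and its index,
// flushing both streams per block (the 2.0 behavior described above).
#include <cstdint>
#include <fstream>
#include <vector>

void append_block_entry(std::ofstream& log, std::ofstream& index,
                        const std::vector<char>& payload) {
    // Remember where this entry begins so the index can point at it.
    const uint64_t pos = static_cast<uint64_t>(log.tellp());

    // Write the payload to the log, then record its start offset in the index.
    log.write(payload.data(), static_cast<std::streamsize>(payload.size()));
    index.write(reinterpret_cast<const char*>(&pos), sizeof(pos));

    // Flushing both files per block keeps the index consistent with the log
    // after a crash; per the comment above, 3.x only flushes the log, leaving
    // the pair in a state that needs recovery on relaunch.
    log.flush();
    index.flush();
}
```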
[3.2 -> 4.0] SHiP flush logs on write
[3.2] forkdb reset in replay since blocks are signaled
[4.0 -> main] SHiP flush logs on write
[3.2 -> 4.0] forkdb reset in replay since blocks are signaled
[4.0 -> main] forkdb reset in replay since blocks are signaled
Cat: History
Normally when a ship node crashes, you can start it again by using a snapshot, and then it will start with log messages like:
Then it runs replay and all is good.
With 3.2 it seems very often that the trace history index is somehow corrupted, and it rebuilds the entire state history from scratch, which takes an inordinate amount of time on a large blockchain (e.g. WAX), so it is better to restore from backup than to let this continue.
It has always been the case that the trace history index can be corrupt after an unclean shutdown, but there is something in 3.2 that makes it corrupt more often than in previous versions. It seems like it is never valid?
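For illustration only, a hypothetical sketch of the kind of cheap consistency check a relaunch could use to decide between trusting the index and rebuilding the state history. The entry layout (one 8-byte offset per block) and file handling here are assumptions, not Leap's actual recovery logic.

```cpp
// Hypothetical sketch: decide whether a ship index looks usable or must be
// rebuilt, by checking for whole entries and a sane final offset.
#include <cstdint>
#include <filesystem>
#include <fstream>

bool index_looks_consistent(const std::filesystem::path& log_path,
                            const std::filesystem::path& index_path) {
    namespace fs = std::filesystem;
    const auto log_size   = fs::file_size(log_path);
    const auto index_size = fs::file_size(index_path);

    // A truncated or partially written index is not a whole number of entries.
    if (index_size == 0 || index_size % sizeof(uint64_t) != 0)
        return false;

    // Read the last recorded offset; it must point inside the log file.
    std::ifstream index(index_path, std::ios::binary);
    index.seekg(static_cast<std::streamoff>(index_size - sizeof(uint64_t)));
    uint64_t last_offset = 0;
    index.read(reinterpret_cast<char*>(&last_offset), sizeof(last_offset));

    return index.good() && last_offset < log_size;
}
```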