3.2: when ship node exits with error it usually doesn't start up again properly with snapshot #596

Closed
Tracked by #1440
matthewdarwin opened this issue Dec 23, 2022 · 7 comments · Fixed by #937 or #944
Labels
bug Something isn't working discussion 👍 lgtm OCI Work exclusive to OCI team

Comments

@matthewdarwin

matthewdarwin commented Dec 23, 2022

Normally when a ship node crashes, you can start it again by using a snapshot, and then it will start with log messages like:

info  2022-12-23T20:32:22.959 nodeos    controller.cpp:494            replay               ] existing block log, attempting to replay from 218085489 to 218258039 blocks
info  2022-12-23T20:32:24.373 nodeos    log.hpp:479                   truncate             ] fork or replay: removed 172884 blocks from trace_history.log
info  2022-12-23T20:32:24.629 nodeos    log.hpp:479                   truncate             ] fork or replay: removed 172884 blocks from chain_state_history.log

Then it runs replay and all is good.

With 3.2, the trace history index very often seems to be corrupted somehow, and the node rebuilds the entire state history from scratch. That takes an inordinate amount of time on a large blockchain (e.g. WAX), so it is better to restore from backup than to let the rebuild continue.

The trace history index can always end up corrupt after an unclean shutdown, but something in 3.2 makes it corrupt more often than in previous versions. It seems like it is never valid?

@stephenpdeos
Member

It appears that, going from 3.1 to 3.2, an unclean kill of the nodeos process has become more likely to corrupt the SHiP logs. Further internal discussion is required to take a stance on what level of resiliency we want these logs to have at this time. Because the increased likelihood is possibly due to #592, we will revisit this after spending some time with that issue.

@stephenpdeos
Member

Due to ongoing changes to SHiP targeted for this next release, we will continue to hold off on this issue for now.

@spoonincode
Member

fwiw there appears to be a difference in behavior between 2.0 and 3.x SHiP. 2.0 flushes both the index and the log per block, whereas 3.x only flushes the log. This means a crash (in addition to a power failure, etc.) leaves the index+log in a state from which it will attempt a recovery upon relaunching.

It's not clear if this is a regression or simply a change in behavior. It's not clear what data file consistency the ship log intends to guarantee.
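
To make the per-block flush difference described above concrete, here is a minimal sketch, not the nodeos implementation. The class `indexed_log`, the helper `file_size`, the `flush_index` switch, and the file layout (raw payloads in the log, 8-byte offsets in the index) are all assumptions made for illustration. With `flush_index` set to `false` the index entry stays in the process's stdio buffer after each append, so a process that is killed at that point would leave the index behind the log.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <string>
#include <vector>

// Size of a file as seen by a separate reader, i.e. roughly what would
// remain if the writing process were killed at this moment.
long file_size(const std::string& path) {
    std::FILE* f = std::fopen(path.c_str(), "rb");
    if (!f) return -1;
    std::fseek(f, 0, SEEK_END);
    const long size = std::ftell(f);
    std::fclose(f);
    return size;
}

// Hypothetical append-only log plus a side index of entry offsets.
// This is NOT the nodeos state history code; names and layout are illustrative.
class indexed_log {
public:
    indexed_log(const std::string& log_path, const std::string& index_path, bool flush_index)
        : log_(std::fopen(log_path.c_str(), "ab")),
          index_(std::fopen(index_path.c_str(), "ab")),
          flush_index_(flush_index) {
        if (!log_ || !index_) {
            if (log_) std::fclose(log_);
            if (index_) std::fclose(index_);
            throw std::runtime_error("cannot open log or index file");
        }
        // New entries start at the current end of the log.
        std::fseek(log_, 0, SEEK_END);
        next_offset_ = static_cast<std::uint64_t>(std::ftell(log_));
    }

    ~indexed_log() {
        std::fclose(log_);   // fclose also flushes anything still buffered
        std::fclose(index_);
    }

    indexed_log(const indexed_log&) = delete;
    indexed_log& operator=(const indexed_log&) = delete;

    // Append one entry (think: one block) to the log and record its offset in the index.
    void append(const std::vector<char>& payload) {
        assert(log_ && index_);
        const std::uint64_t offset = next_offset_;
        if (std::fwrite(payload.data(), 1, payload.size(), log_) != payload.size())
            throw std::runtime_error("log write failed");
        if (std::fwrite(&offset, sizeof(offset), 1, index_) != 1)
            throw std::runtime_error("index write failed");
        next_offset_ += payload.size();

        std::fflush(log_);                      // the log is always flushed per entry
        if (flush_index_) std::fflush(index_);  // the index only when requested
    }

private:
    std::FILE* log_;
    std::FILE* index_;
    bool flush_index_;
    std::uint64_t next_offset_ = 0;
};
```

Constructing `indexed_log` with `flush_index` set to `true` and appending one entry makes both files grow immediately; with `false`, only the log grows until the object is destroyed. Note that `fflush` only hands data to the operating system, so this sketch covers a killed process rather than a power failure, which would additionally need something like `fsync`.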

@heifner
Member

heifner commented Feb 23, 2023

Seems like we might as well add the flush back in until the intended file consistency is determined.

@spoonincode
Member

yeah I think it's fine to add it back

@greg7mdp
Contributor

Is it this change that's missing?
Screenshot from 2023-02-28 10-46-59

@heifner heifner self-assigned this Mar 29, 2023
@heifner heifner added the OCI Work exclusive to OCI team label Mar 29, 2023
@heifner heifner moved this from Todo to In Progress in Team Backlog Mar 29, 2023
heifner added a commit that referenced this issue Mar 30, 2023
@heifner heifner moved this from In Progress to Awaiting Review in Team Backlog Mar 30, 2023
heifner added a commit that referenced this issue Mar 30, 2023
heifner added a commit that referenced this issue Mar 31, 2023
heifner added a commit that referenced this issue Mar 31, 2023
heifner added a commit that referenced this issue Mar 31, 2023
[3.2 -> 4.0] SHiP flush logs on write
heifner added a commit that referenced this issue Mar 31, 2023
[3.2] forkdb reset in replay since blocks are signaled
heifner added a commit that referenced this issue Mar 31, 2023
[4.0 -> main] SHiP flush logs on write
@github-project-automation github-project-automation bot moved this from Awaiting Review to Done in Team Backlog Mar 31, 2023
heifner added a commit that referenced this issue Mar 31, 2023
[3.2 -> 4.0] forkdb reset in replay since blocks are signaled
heifner added a commit that referenced this issue Apr 1, 2023
[4.0 -> main] forkdb reset in replay since blocks are signaled
@heifner
Member

heifner commented Jul 24, 2023

Cat: History

6 participants