Improve startup time after crash #339
Conversation
…ional blocks in block_database
Fixed tests wrt db.open
Don't remove blocks from block db when popping blocks, handle edge case in replay wrt fork_db, adapted unit tests
The code is black magic for me, but the motives are great!! Please let's have this!
Very needed. I am in the process of testing and reviewing everything Peter is submitting, including this pull; we will definitely have this included with minor or no changes.
Sounds great.
//Update: it will only flush on startup during reindex, so it's ok.
skip_tapos_check |
skip_witness_schedule_check |
skip_authority_check);
}
}
_undo_db.enable();
Does this line need to be removed? We may have enabled it earlier. It's also possible it hasn't been disabled.
Strictly speaking yes, but it only sets a flag in undo_db. Doesn't hurt if it gets set more than once.
libraries/chain/db_management.cpp
Outdated
const auto last_block_num = last_block->block_num();
uint32_t flush_point = last_block_num - 10000;
uint32_t undo_point = last_block_num - 50;
Perhaps we need to check for integer overflow.
Good point, but it doesn't matter in practice IMO. These are uint32_t values, so the subtraction wraps around instead of overflowing:
If last_block_num < 10000 then flush_point > UINT32_MAX - 10000, which means we never flush. We don't lose much though, only the state from fewer than 10k blocks.
If last_block_num < 50 then undo_point > UINT32_MAX - 50, which means the fork_database isn't used during replay. But that is only a problem if your block_database ends in a fork, which is unlikely.
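For illustration, a minimal guard would clamp the subtraction at zero so the uint32_t arithmetic cannot wrap; safe_rewind is a hypothetical helper, not code from this PR:

#include <cstdint>

// Hypothetical helper: returns 0 instead of wrapping past UINT32_MAX
// when block_num < offset.
inline uint32_t safe_rewind( uint32_t block_num, uint32_t offset )
{
   return block_num > offset ? block_num - offset : 0;
}

// Usage, mirroring the snippet above:
// uint32_t flush_point = safe_rewind( last_block_num, 10000 );
// uint32_t undo_point  = safe_rewind( last_block_num, 50 );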
libraries/chain/db_management.cpp
Outdated
_undo_db.disable();
for( uint32_t i = 1; i <= last_block_num; ++i )
if( head_block_num() >= undo_point )
   _fork_db.start_block( *fetch_block_by_number( head_block_num() ) );
This logic is a bit strange. I think we need to initialize _fork_db correctly before switching to push_block(). The start_block should be the block before the first push. Will it also be initialized in the else branch?
Similar code in steem: abitmore/steem@77aaa00
The original code doesn't call _fork_db.start_block in all cases either. AFAICS fork_db uses the first pushed block as the start block if no start block has been set explicitly.
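To make the handoff concrete, here is a sketch of the replay loop shape under discussion. apply_block, push_block, fetch_block_by_number, _fork_db and undo_point come from the diff and the comments; the exact seeding point and the skip_flags handling are assumptions, not the PR's actual code:

// Sketch only: blocks below undo_point are applied directly with undo
// disabled; at undo_point the fork database is seeded with the last
// applied block, so push_block() takes over with normal undo handling.
for( uint32_t i = head_block_num() + 1; i <= last_block_num; ++i )
{
   fc::optional<signed_block> block = fetch_block_by_number( i );
   FC_ASSERT( block.valid(), "block log ends before its reported head" );
   if( i < undo_point )
      apply_block( *block, skip_flags );   // fast path, no undo sessions
   else
   {
      if( i == undo_point )                // block before the first push
         _fork_db.start_block( *fetch_block_by_number( i - 1 ) );
      push_block( *block, skip_flags );    // full fork/undo handling
   }
}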
libraries/chain/db_management.cpp
Outdated
const auto last_block_num = last_block->block_num();
uint32_t flush_point = last_block_num - 10000;
uint32_t undo_point = last_block_num - 50;

ilog( "Replaying blocks..." );
Best if we have the start block number in the log.
Will add block number to the log message.
I cannot comment on the specifics of the code, but the logic seems sound and the importance of this cannot be overstated. I especially like that you considered adding point 4, because I've had quite a few crashes when replaying (esp. towards the end), or had to ctrl-c a node that was replaying, and having to start replaying from scratch was really annoying. Great work!
Reviewing this from the nodemaster point of view, as I think it is a kick-ass addition. In an initial test I started to sync the chain, and when it reached 2017 I stopped it with a double ctrl-c. This results in the tedious "unclean shutdown" that the patch is trying to avoid. However, when starting it back up I got an error; pasting the full output for reference:
The only option I had left was to remove the data dir and start over...
I think the previous message (#339 (comment)) may have been caused by the double ctrl-c, because I did the same thing again but killed the process from another terminal. When running the program again it went straight to where it had left off and continued syncing. More tests to come, but looking good :)
I can confirm the double ctrl-c kill causes the above problem. I think it may be worth researching, as it is kind of a natural way to kill the node if you need the terminal and can't wait for a clean shutdown, especially on a full node (no partial and max ops).
I suppose the first ctrl-c causes another flush, which is then interrupted by the second ctrl-c. The patch should notice the inconsistent object_database in that case and start a full replay, though. Apparently that doesn't work; will check. (IMO nobody should be running a node in a terminal window in the foreground, but that's no excuse of course.)
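For reference, a common way to notice an interrupted flush is a "dirty" marker that is created when the database is opened and removed only on a clean close; the sketch below illustrates the pattern only and is not the PR's actual mechanism:

#include <cstdio>
#include <fstream>
#include <string>

struct db_dirty_marker
{
   std::string path;
   explicit db_dirty_marker( const std::string& data_dir )
      : path( data_dir + "/object_database.dirty" )
   {
      std::ofstream( path ).put( '1' );                         // set on open
   }
   void clean_close() const { std::remove( path.c_str() ); }    // clear on clean shutdown
   static bool was_unclean( const std::string& data_dir )
   {
      return std::ifstream( data_dir + "/object_database.dirty" ).good();
   }
};
// At startup: if db_dirty_marker::was_unclean( dir ) holds, force a
// full replay instead of trusting the on-disk state.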
I made a few tests of this pull; the last 2 cases failed but the rest worked as expected. I am wondering if we can detect that the data is corrupted and launch a full replay for those last 2 cases.
Logs for the 2 failing cases (sorry if they are too long; I included everything from start to end of the restart):
Thanks for testing thoroughly. :-/ Will investigate more. Did you send SIGKILL / shutdown during replay, during normal operation/sync, or after initiating a normal shutdown (SIGTERM / ctrl-c)?
The problems you were seeing resulted from a corrupted block_log. I have improved the consistency checking in block_database. |
Hi Peter, I like where this is heading with the new commits; however, it still has some issues.
Another one:
Please note these occurred while the chain was syncing, and not every time; sometimes it just exits cleanly as expected and no replay is needed. It seems to depend on where in the chain the process is stopped. When it segfaulted at exit, on one restart the replay covered only the last part of the chain, while on another the data was corrupted and I had to delete the datadir and start over. During an unexpected shutdown I got a full replay; this is better than the previous version, where…
I believe the errors when pushing blocks after ctrl-c/SIGTERM are unrelated to this PR; I have seen similar messages when terminating an unmodified node. It looks like the P2P code is still active while the database is being shut down. Not sure why this happens, application::shutdown closes the P2P layer before closing the database. Perhaps a race condition.

Still looking into the data corruption. I believe I have found a bug in the undo_database that is not triggered in the original code, but can be triggered during a partial replay. Did you try a full replay instead of deleting the datadir?

In the case of an unexpected shutdown, evil things can happen, especially when you pull the power plug. Disk buffers may not be flushed etc., which means you most likely have an inconsistent on-disk state.
I am pretty sure the segfaults are related to the last commits of this pull, but I'll do some testing with nodes without these changes to confirm. The ctrl-c or SIGTERM kill is always a clean shutdown if you have enough RAM or are using RAM-reducing node options. I don't remember this kind of segfault when trying to exit cleanly from a node without the code in this pull request. Will confirm by comparing with a node without that code and get back here to update. I understand what you mean about the unexpected shutdown; let's forget that case.
Found and fixed a bug in undo_database: undo_database contains a stack of undo "frames", each of which should be tied to an undo session. The undo session can commit or undo (i.e. roll back) the top frame, or merge the two top frames on the stack.

The merge is used for a kind of sub-transaction: when a proposal is executed, a sub-session of the current session is created. If the proposal fails for some reason, the sub-session is undone. If the proposal executes successfully, the sub-session is merged with its "parent" session.

Problem: during replay, a parent session does not exist, because undo is disabled, therefore the merge fails. In a full replay, this situation does not occur, most likely because a previous proposal failed, and the resulting undo creates an empty frame on the stack! https://github.com/cryptonomex/graphene/blob/master/libraries/db/undo_database.cpp#L121-L122

The latest commit resolves this problem by treating the merge as a no-op if the undo stack is empty. This is compatible with previous behaviour.

Regarding the segfaults, I have also seen these in an unmodified node.
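The shape of that fix, as described above (a self-contained sketch of the idea, not the verbatim commit):

#include <vector>

struct undo_state { /* objects changed within one frame */ };

struct undo_database_sketch
{
   std::vector<undo_state> _stack;

   void merge()
   {
      // With undo disabled during replay, a proposal's sub-session may
      // find no parent frame on the stack; treat the merge as a no-op
      // instead of failing, which is compatible with previous behaviour.
      if( _stack.empty() )
         return;
      // ...otherwise fold the top frame into its parent (details
      // elided; real frames track old/new/removed object versions)...
      _stack.pop_back();
   }
};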
PC said: "Regarding the segfaults, I have also seen these in an unmodified node." Yes, as have I. It's actually quite frequent, even with track-account / partial operations. The likelihood of a segfault on a ctrl-c stop increases the longer the node has been running. I don't stop with ctrl-c in production; I use systemd, and coded a stop event using:

I like this work and the quality I'm seeing in testing it. Great job guys, glad to see these issues being addressed. This also provides more reliability and is of immediate benefit to all witnesses.
Good work here, team. Just like Tom, I use systemd to run the nodes and capture output into the logs. Stopping the service via systemd crashes it every time and requires a replay. The output varies; sometimes it shows a clean exit and sometimes it shows an error like this:

3107456ms th_a main.cpp:195 main ] Exiting with error:

In both cases a replay is required.
@oxarbitrage @abitmore It is important to get this into the next release too; please finish your reviews and merge.
@@ -118,15 +118,19 @@ void undo_database::undo()
       _db.insert( std::move(*item.second) );

    _stack.pop_back();
-   if( _stack.empty() )
-      _stack.emplace_back();
Curious why we had this in the first place, and why we think these two lines should be removed?
I think these two lines were added as a workaround for the problem I describe here: #339 (comment)
4822c21 contains a better fix to the underlying problem, IMO.
Not sure I understand all the corner cases, but it looks better. Thanks.
I'm OK with it.
Definitely this is an improvement: when the node is killed and segfaults (which happens often), the full chain will not be replayed, among other scenarios where the new code is helpful. It might not solve all cases, but it is still a time saver and a smart way to reuse data saved before a crash. Looks good.
Improve startup time after crash
Tl;dr: if during startup we have a (possibly old) object_database available, we should use it.
We have up to two on-disk representations of the blockchain state: the block log and the object_database. The object database can be reconstructed from the block log, this is what a "replay" does. A replay takes a long time, and this time is constantly growing. This PR is an attempt to mitigate that.
During startup, the application checks the availability and state of the on-disk representation. The old code applies some rather restrictive logic (object_database.head_block == block_log.last_block) to determine if the state from the object_database can be used. This prevents creating and using snapshots of the object_database.
The logic implemented by this PR is more natural, IMO:
This change will greatly improve startup time for nodes, which will be particularly important when we experience another blockchain halt, like the one a couple of weeks ago.
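In outline, the startup decision described above can be sketched as follows; the names are illustrative, not this PR's actual API:

#include <cstdint>
#include <cstdio>

// Illustrative inputs for the startup decision (assumed names).
struct startup_state
{
   bool     object_db_present;  // an on-disk object_database exists
   bool     object_db_dirty;    // e.g. a flush was interrupted
   uint32_t object_db_head;     // head block stored in the object_database
   uint32_t block_log_head;     // last block in the block log
};

void decide_startup( const startup_state& s )
{
   if( !s.object_db_present || s.object_db_dirty )
      std::puts( "full replay from the block log" );
   else if( s.object_db_head < s.block_log_head )
      std::printf( "partial replay from block %u\n", unsigned( s.object_db_head + 1 ) );
   else
      std::puts( "open normally, no replay needed" );
}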