-
Notifications
You must be signed in to change notification settings - Fork 212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Swing-store gets unexpectedly committed between blocks #8423
Comments
Found the issue. SQLite autocommits when releasing a savepoint that is not wrapped in a transaction. See https://stackoverflow.com/q/38622587 and https://www.sqlite.org/c3ref/get_autocommit.html |
The second tests fails if I comment out the #8423 fix, as expected.
Here is what happened:
This explains why only the first run would have been committed. As such this problem occurs when there is leftover work from the previous block AND the host is interrupted after at least one crank of that left over work has been completed, and before the block commit is reached. It does not technically matter whether that block has any other actions to process or not, however it would be very difficult to interrupt the host before reaching the commit point without such extra work. Interestingly implementing the defensive check detailed in #8424 would actually have solved the issue, and would prevent any similar issue like this since it would effectively force a block transaction to be started when beginning a block. |
More notes for future consideration:
The buggy sequence was:
If our
That defense would still depend upon having Of course, we had no idea that |
The second tests fails if I comment out the #8423 fix, as expected.
The second tests fails if I comment out the #8423 fix, as expected.
The second tests fails if I comment out the #8423 fix, as expected.
Describe the bug
Polkachu reported on Discord one of their backup node experienced a wrong AppHash after commiting block 11880370 on mainnet.
After a long analysis we have reproduced and confirmed this is caused by the swing-store apparently getting committed mid-block. In this case it got committed after run-0 (complete left-over work), run-1 (high priority push price), run-2 (timer poll), but before run-3 (first action queue item). If a restart of the node happens after such a mid-block commit, it results in an execution divergence.
Because the hangover inconsistency logic assumes that midblock commits do not happen, the saved block height only gets updated at the end of "end block", once all runs have completed. As such the mechanism to detect and handle hangovers does not realize that the current block was partially processed, and attempts re-execution, but starting from a partial block state. In this case the swing-store state ends up identical, which suggests only run-0 had been committed, but the exported data to cosmos DB (for state-sync and consistency) does not contain the same entries as a node that executed normally (it's missing the run-0 related exports), resulting in an AppHash error.
To Reproduce
Artifacts related to this issue in shared a GDrive folder
Steps to reproduce the behavior:
agoric-11880351.tgz
and extract to~/.agoric
following usual steps to restore a cosmos snapshotagoric-upgrade-11
and buildpackages/cosmic-swingset/src/launch-chain.js
with the patch in0001-Pause-during-block-11880370.patch
agd start
block 11880370 begin
wait 30s andkillall node
halt-mid-11880370.tgz
state was taken at this pointagd start
once again afterblock 11880370 commit
The state directory capture by Polkachu after the AppHash is also included in the GDrive folder, as well as a genesis export (modified to ignore missing replay artifacts) and a slog extracted from the flight-recorder.
Also included is a slog extract of the diverging block from reference follower node.
Expected behavior
The swing-store does not get committed in the middle of blocks.
Platform Environment
agoric-upgrade-11
tag.Additional context
The restart slogs show that run-0 is empty, but that run-1 (PushPrice from oracle) executes like a normal node (besides the vat replay which is expected), which means the unexpected commit happened between those 2 runs, and not after run-1 or run-2 which were also executed before the restart.
One mitigation to detect this state earlier (before commit of full block) is detailed in #8424
Screenshots
See slogs and other logs in the GDrive
The text was updated successfully, but these errors were encountered: