Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(state-sync): Remove state_fetch_horizon because it causes nodes to crash #9380

Merged
merged 19 commits into from
Aug 7, 2023

Conversation

nikurt
Copy link
Contributor

@nikurt nikurt commented Aug 2, 2023

The main issue is that find_sync_hash() is not consistent with check_state_needed().
check_state_needed compares epoch_id of header_head and head.
find_sync_hash() moves back a number of blocks before determining which epoch to sync.
It can lead to a node deciding to state sync, and downloading the same state again.

I've observed that it leads to a crash in this complicated scenario:

  • State sync to epoch X
  • Finalize the state sync and execute block at height H
  • Check orphans and unblock a block at height H+1
  • At the same time check if a node needs a state sync - determine that yes, and reset Flat Storage to prepare it to execute a block at height H.
  • The orhpaned block gets executed, but flat storage doesn't support height H+1, leading to a panic.

@nikurt nikurt changed the title fix(flat-storage): Logging fix(flat-storage): Remove state_fetch_horizon because it causes nodes to crash Aug 7, 2023
@nikurt nikurt marked this pull request as ready for review August 7, 2023 13:22
@nikurt nikurt requested a review from a team as a code owner August 7, 2023 13:22
@nikurt nikurt requested a review from Longarithm August 7, 2023 13:22
@pugachAG pugachAG requested review from pugachAG and removed request for Longarithm August 7, 2023 13:35
@nikurt nikurt requested a review from ppca August 7, 2023 14:03
Copy link
Contributor

@pugachAG pugachAG left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a small suggestion: maybe change the title to fix(state-sync) since to me it seems like this change really fixes state sync and flat storage failure was just a symptom

chain/client/src/client_actor.rs Outdated Show resolved Hide resolved
@nikurt nikurt changed the title fix(flat-storage): Remove state_fetch_horizon because it causes nodes to crash fix(state-sync): Remove state_fetch_horizon because it causes nodes to crash Aug 7, 2023
@near-bulldozer near-bulldozer bot merged commit 7a25e7d into master Aug 7, 2023
@near-bulldozer near-bulldozer bot deleted the nikurt-fs-debug branch August 7, 2023 17:16
ppca pushed a commit that referenced this pull request Aug 9, 2023
… to crash (#9380)

The main issue is that `find_sync_hash()` is not consistent with `check_state_needed()`.
`check_state_needed` compares epoch_id of header_head and head.
`find_sync_hash()` moves back a number of blocks before determining which epoch to sync.
It can lead to a node deciding to state sync, and downloading the same state again.

I've observed that it leads to a crash in this complicated scenario:
* State sync to epoch X
* Finalize the state sync and execute block at height H
* Check orphans and unblock a block at height H+1
* At the same time check if a node needs a state sync - determine that yes, and reset Flat Storage to prepare it to execute a block at height H.
* The orhpaned block gets executed, but flat storage doesn't support height H+1, leading to a panic.
nikurt added a commit that referenced this pull request Aug 24, 2023
… to crash (#9380)

The main issue is that `find_sync_hash()` is not consistent with `check_state_needed()`.
`check_state_needed` compares epoch_id of header_head and head.
`find_sync_hash()` moves back a number of blocks before determining which epoch to sync.
It can lead to a node deciding to state sync, and downloading the same state again.

I've observed that it leads to a crash in this complicated scenario:
* State sync to epoch X
* Finalize the state sync and execute block at height H
* Check orphans and unblock a block at height H+1
* At the same time check if a node needs a state sync - determine that yes, and reset Flat Storage to prepare it to execute a block at height H.
* The orhpaned block gets executed, but flat storage doesn't support height H+1, leading to a panic.
nikurt added a commit to nikurt/nearcore that referenced this pull request Aug 24, 2023
… to crash (near#9380)

The main issue is that `find_sync_hash()` is not consistent with `check_state_needed()`.
`check_state_needed` compares epoch_id of header_head and head.
`find_sync_hash()` moves back a number of blocks before determining which epoch to sync.
It can lead to a node deciding to state sync, and downloading the same state again.

I've observed that it leads to a crash in this complicated scenario:
* State sync to epoch X
* Finalize the state sync and execute block at height H
* Check orphans and unblock a block at height H+1
* At the same time check if a node needs a state sync - determine that yes, and reset Flat Storage to prepare it to execute a block at height H.
* The orhpaned block gets executed, but flat storage doesn't support height H+1, leading to a panic.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants