Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update upload_confirmed_blocks() return value when no blocks to upload #33861

Merged
merged 1 commit into from
Oct 26, 2023

Conversation

steviez
Copy link
Contributor

@steviez steviez commented Oct 25, 2023

Problem

See #33831 (comment) for a detailed walk-through.

Summary of Changes

While this problem of looping on the same slot seems to be caused by an operator restarting their warehouse node without --no-snapshot-fetch (which can cause a gap in the blockstore), we can still be friendly and break out of this loop

Fixes #33831

upload_confirmed_blocks() states that it will return the passed in
ending_slot when there are no blocks to upload. This is enforced in one
early return but not the other. The result is that BigTableUploadService
could potentially get stuck in a loop of trying to upload the same slot.

While this case seems to be caused when an operator restarts their node
with a fetched snapshot that causes a gap, we can still be friendly and
allow them to break out of this loop.
@codecov
Copy link

codecov bot commented Oct 25, 2023

Codecov Report

Merging #33861 (50eaa6f) into master (a3b0348) will decrease coverage by 0.1%.
Report is 1 commits behind head on master.
The diff coverage is 0.0%.

@@            Coverage Diff            @@
##           master   #33861     +/-   ##
=========================================
- Coverage    81.9%    81.9%   -0.1%     
=========================================
  Files         809      809             
  Lines      217717   217717             
=========================================
- Hits       178349   178317     -32     
- Misses      39368    39400     +32     

@steviez steviez marked this pull request as ready for review October 25, 2023 13:46
Copy link
Contributor

@CriesofCarrots CriesofCarrots left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like the right fix to advance past the first slot at the beginning of a gap. I confirmed using a modified solana-ledger-tool bigtable upload with a local ledger and BigTable emulator.

The modification I had to make to ledger tool points to another change we might want to make. I was thinking of asking for it here, but I think I'll just open my own PR and explain there. :)

@@ -138,7 +138,7 @@ pub async fn upload_confirmed_blocks(
"No blocks between {} and {} need to be uploaded to bigtable",
starting_slot, ending_slot
);
return Ok(last_blockstore_slot);
return Ok(ending_slot);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sigh kind of obvious, given the vars in the message immediately above. Thanks for tracking this down!

@steviez steviez merged commit a799a90 into solana-labs:master Oct 26, 2023
32 checks passed
@steviez steviez deleted the bt_loop branch October 26, 2023 08:34
@steviez
Copy link
Contributor Author

steviez commented Oct 26, 2023

What do you think about backports for this one @CriesofCarrots ? It is a bug and fairly self-contained, but also, nodes running --no-snapshot-fetch should not encounter this scenario. I could see maybe v1.17 but probably not v1.16

@CriesofCarrots
Copy link
Contributor

What do you think about backports for this one @CriesofCarrots ? It is a bug and fairly self-contained, but also, nodes running --no-snapshot-fetch should not encounter this scenario. I could see maybe v1.17 but probably not v1.16

Yeah, I'm agreed with just v1.17

@steviez steviez added the v1.17 PRs that should be backported to v1.17 label Oct 26, 2023
mergify bot pushed a commit that referenced this pull request Oct 26, 2023
#33861)

upload_confirmed_blocks() states that it will return the passed in
ending_slot when there are no blocks to upload. This is enforced in one
early return but not the other. The result is that BigTableUploadService
could potentially get stuck in a loop of trying to upload the same slot.

While this case seems to be caused when an operator restarts their node
without --no-snapshot-fetch (which can cause a gap in blockstore), we
can still be friendly and allow them to break out of this loop.

(cherry picked from commit a799a90)
steviez pushed a commit that referenced this pull request Oct 26, 2023
anwayde pushed a commit to firedancer-io/solana that referenced this pull request Nov 16, 2023
anwayde pushed a commit to firedancer-io/solana that referenced this pull request Nov 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
v1.17 PRs that should be backported to v1.17
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bigtable Upload Service Loops on Same Block Repeatedly
2 participants