
feat: move garbage collection into a separate thread #11022

Merged: 14 commits merged into master from separate-gc-thread on Apr 19, 2024

Conversation

bowenwang1996 (Collaborator)

Move garbage collection into a separate actor to prevent it from blocking synchronously inside the client actor. Fixes #10971. For testing, garbage_collection_intense.py sends many insertion and deletion transactions on the same key and makes sure that nodes do not crash after GC is done. The change has also been running on a mainnet node and so far works fine.

TODO: run all nayduck tests
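
For illustration, here is a minimal sketch of the shape such a separate GC actor can take, assuming an actix-style actor. GCActor, gc_step, the placeholder counter, and the 500 ms interval are illustrative only, not the actual contents of chain/client/src/gc_actor.rs:

```rust
use std::time::Duration;

use actix::{Actor, AsyncContext, Context};

// Hypothetical stand-alone GC actor. The real gc_actor.rs holds handles to the
// chain store and GC config; here a counter stands in for that state.
struct GCActor {
    gc_rounds: u64,
}

impl GCActor {
    // One bounded garbage-collection step, so a single tick never blocks for long.
    fn gc_step(&mut self) {
        self.gc_rounds += 1;
        // In the real actor this would clear data below the GC tail.
        println!("gc round {} done", self.gc_rounds);
    }
}

impl Actor for GCActor {
    type Context = Context<Self>;

    fn started(&mut self, ctx: &mut Self::Context) {
        // Schedule GC periodically on this actor's own arbiter, off the client actor's thread.
        ctx.run_interval(Duration::from_millis(500), |act, _ctx| act.gc_step());
    }
}

fn main() {
    let system = actix::System::new();
    system.block_on(async {
        // Spawning the actor starts the periodic GC loop independently of block processing.
        let _gc = GCActor { gc_rounds: 0 }.start();
        // Let a few GC rounds run for the sake of the example.
        actix::clock::sleep(Duration::from_secs(2)).await;
    });
}
```

The point is that the periodic GC work runs on the actor's own arbiter, so a slow GC round no longer blocks block processing inside the client actor.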

@bowenwang1996 bowenwang1996 requested a review from a team as a code owner April 11, 2024 00:08
@bowenwang1996 bowenwang1996 requested a review from wacban April 11, 2024 00:08
@bowenwang1996 bowenwang1996 changed the title from "Separate gc thread" to "feat: move garbage collection into a thread" on Apr 11, 2024
@bowenwang1996 bowenwang1996 changed the title from "feat: move garbage collection into a thread" to "feat: move garbage collection into a separate thread" on Apr 11, 2024
@bowenwang1996 (Collaborator, Author)

@wacban @posvyatokum what is the best way to test this for archival nodes with split storage?

@posvyatokum (Member)

Off the top of my head

To make sure that the cold loop is working properly:

  • Run a canary and check that the cold head is increasing and that it is smaller than the tail.

To make sure that we didn't accidentally delete important data before the backup, without impacting the cold loop:

  • Replaying history
  • Checking column contents against prod split storage nodes
  • Traversing Trie

@wacban (Contributor) left a comment


looks good overall and it's a nice idea

Please see comments - I think we need to handle archival nodes a bit differently.

Some tests are failing; please have a look at those as well.

To trigger Nayduck you can check out this branch and run ./scripts/nayduck.py from nearcore.

Review threads on chain/client/src/client_actor.rs and chain/client/src/gc_actor.rs (resolved).
Comment on lines 7 to 9
[lints]
workspace = true

Contributor

I'm not familiar with lints, what does this do and why did you delete it?

Collaborator Author

Doesn't compile otherwise

Review thread on nightly/pytest-sanity.txt (resolved).
@bowenwang1996 (Collaborator, Author)

@wacban @posvyatokum I have an idea: instead of testing whether it works for archival nodes with split storage, we don't change the logic for archival nodes, i.e., they still move data from hot storage to cold storage as part of the main thread. I think this is okay because right now the main goal of an archival node is to store historical data, and it does not need to process new blocks very fast. This would save us some trouble in testing archival nodes, and we could still make the change for archival nodes at a later time if necessary. What do you think?

@wacban (Contributor) commented Apr 12, 2024

> I have an idea: instead of testing whether it works for archival nodes with split storage, we don't change the logic for archival nodes [...] What do you think?

Personally I don't like it, as it would mean we have two entirely different code paths, and even threads, for garbage collection. The CI is failing, so it seems we have at least some test coverage. I would suggest fixing those failures first and, if coverage is missing for split storage, adding it.

If you want to test it on a real node, that shouldn't be too hard either; it takes about 10 minutes to spin one up. Then I suppose it would be sufficient to add and check some logs.

@wacban (Contributor) left a comment


LGTM

Review thread on integration-tests/src/tests/client/process_blocks.rs (resolved).
codecov bot commented Apr 18, 2024

Codecov Report

Attention: Patch coverage is 86.27451%, with 28 lines in your changes missing coverage. Please review.

Project coverage is 71.08%. Comparing base (2bbde59) to head (72c2ec4).
Report is 12 commits behind head on master.

Files                                    Patch %    Lines missing coverage
chain/client/src/gc_actor.rs             75.32%     19 missing ⚠️
chain/chain/src/garbage_collection.rs    90.74%     0 missing, 5 partials ⚠️
chain/client/src/client_actions.rs       0.00%      4 missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           master   #11022    +/-   ##
========================================
  Coverage   71.07%   71.08%            
========================================
  Files         767      769     +2     
  Lines      153367   153659   +292     
  Branches   153367   153659   +292     
========================================
+ Hits       109005   109227   +222     
- Misses      39909    39992    +83     
+ Partials     4453     4440    -13     
Flag Coverage Δ
backward-compatibility 0.24% <0.00%> (-0.01%) ⬇️
db-migration 0.24% <0.00%> (-0.01%) ⬇️
genesis-check 1.43% <0.52%> (-0.01%) ⬇️
integration-tests 36.87% <80.39%> (+0.09%) ⬆️
linux 69.48% <86.27%> (+<0.01%) ⬆️
linux-nightly 70.57% <86.27%> (+0.03%) ⬆️
macos 52.57% <66.66%> (-1.69%) ⬇️
pytests 1.66% <0.52%> (-0.01%) ⬇️
sanity-checks 1.45% <0.52%> (-0.01%) ⬇️
unittests 66.74% <66.66%> (-0.02%) ⬇️
upgradability 0.29% <0.00%> (-0.01%) ⬇️


@bowenwang1996 (Collaborator, Author)

I am going to merge this PR. For testing we have:

  • Nayduck run
  • It has been running on a regular mainnet node for over a week without any issues
  • It has been running on an archival mainnet node with split storage for 2 days and cold store head increases as expected.

@bowenwang1996 bowenwang1996 enabled auto-merge April 19, 2024 00:31
@bowenwang1996 bowenwang1996 added this pull request to the merge queue Apr 19, 2024
Merged via the queue into master with commit 2c80ee0 Apr 19, 2024
28 of 29 checks passed
@bowenwang1996 bowenwang1996 deleted the separate-gc-thread branch April 19, 2024 01:13
github-merge-queue bot pushed a commit that referenced this pull request Apr 23, 2024
Re-enable store validator tests that were disabled in #11022. It works by first sending a signal to the GC actor to stop garbage collection, then running the store validator, and finally sending a signal to re-enable GC afterwards.

Nayduck run https://nayduck.nearone.org/#/run/58
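
As a rough illustration of that pause/resume mechanism, here is a minimal sketch assuming hypothetical actix message types; SetGcEnabled, gc_enabled, and gc_tick are invented names, not the actual signal added by the follow-up change:

```rust
use actix::{Actor, Context, Handler, Message};

// Hypothetical control message; the actual signal used by the follow-up change
// may be named and wired differently.
#[derive(Message)]
#[rtype(result = "()")]
struct SetGcEnabled(bool);

struct GCActor {
    gc_enabled: bool,
}

impl GCActor {
    // Called from the periodic tick; skips all work while GC is paused so the
    // store validator can inspect a stable view of the database.
    #[allow(dead_code)]
    fn gc_tick(&mut self) {
        if !self.gc_enabled {
            return;
        }
        // ... perform one bounded GC step here ...
    }
}

impl Actor for GCActor {
    type Context = Context<Self>;
}

impl Handler<SetGcEnabled> for GCActor {
    type Result = ();

    fn handle(&mut self, msg: SetGcEnabled, _ctx: &mut Context<Self>) {
        // Tests send SetGcEnabled(false) before running the store validator
        // and SetGcEnabled(true) once it has finished.
        self.gc_enabled = msg.0;
    }
}

fn main() {
    actix::System::new().block_on(async {
        let gc = GCActor { gc_enabled: true }.start();
        // Pause GC, e.g. before running the store validator in a test.
        gc.send(SetGcEnabled(false)).await.unwrap();
        // ... run the store validator against a stable database here ...
        gc.send(SetGcEnabled(true)).await.unwrap();
    });
}
```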

Successfully merging this pull request may close these issues:

  • Garbage collection may take a long time (#10971)