feat: move garbage collection into a separate thread #11022
Conversation
@wacban @posvyatokum what is the best way to test this for archival nodes with split storage?
Off the top of my head:
To make sure that the cold loop is working properly:
To make sure that we didn't accidentally delete important data before backup, without impacting the cold loop:
Looks good overall and it's a nice idea.
Please see comments - I think we need to handle archival nodes a bit differently.
Some tests are failing, please have a look at those as well.
In order to trigger nayduck you can check out this branch and run ./scripts/nayduck.py from nearcore.
```toml
[lints]
workspace = true
```
I'm not familiar with lints, what does this do and why did you delete it?
Doesn't compile otherwise
@wacban @posvyatokum I have an idea: instead of testing whether it works for archival nodes with split storage, we don't change the logic for archival nodes, i.e., they still move data from hot storage to cold storage as part of the main thread. I think this is okay because right now the main goal of an archival node is to store historical data, and it does not need to be able to process new blocks very fast. This would save us some trouble in testing archival nodes, and we can potentially make the change for archival nodes as well at a later time, if necessary. What do you think?
Personally I don't like it, as it would mean we have two entirely different code paths and even threads for garbage collection. The CI is failing, so it seems like we have at least some test coverage. I would suggest fixing those first, and if coverage is missing for split storage we should add it. If you want to test it on a real node, that shouldn't be too hard either; it takes about 10 minutes to spin one up. Then I suppose it would be sufficient to add and check some logs.
Force-pushed from e0bd3e0 to 37b9cd8.
Force-pushed from 37b9cd8 to ac6adfe.
LGTM
Codecov Report - Attention: Patch coverage is

```
@@            Coverage Diff            @@
##           master   #11022     +/- ##
=========================================
  Coverage   71.07%   71.08%
=========================================
  Files         767      769       +2
  Lines      153367   153659     +292
  Branches   153367   153659     +292
=========================================
+ Hits       109005   109227     +222
- Misses      39909    39992      +83
+ Partials     4453     4440      -13
```

Flags with carried forward coverage won't be shown.
I am going to merge this PR. For testing we have:
Re-enable store validator tests that were disabled in #11022. It works by first sending a signal to the GC actor to stop garbage collection, then running the store validator, and finally sending a signal to re-enable GC afterwards. Nayduck run https://nayduck.nearone.org/#/run/58
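The pause/resume signaling described above can be sketched roughly as follows. This is an illustrative sketch, not the actual nearcore API; `GcPauseHandle` and its methods are hypothetical names, and nearcore's real GC actor communicates via actor messages rather than a raw atomic flag.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Hypothetical handle for pausing and resuming the GC loop.
// A clone of the inner Arc would be shared with the GC thread.
pub struct GcPauseHandle {
    paused: Arc<AtomicBool>,
}

impl GcPauseHandle {
    pub fn new() -> Self {
        Self { paused: Arc::new(AtomicBool::new(false)) }
    }

    // Sent before running the store validator so GC does not
    // delete data out from under it.
    pub fn pause(&self) {
        self.paused.store(true, Ordering::SeqCst);
    }

    // Sent afterwards to re-enable garbage collection.
    pub fn resume(&self) {
        self.paused.store(false, Ordering::SeqCst);
    }

    // The GC loop checks this flag before each collection step.
    pub fn is_paused(&self) -> bool {
        self.paused.load(Ordering::SeqCst)
    }
}

fn main() {
    let handle = GcPauseHandle::new();
    handle.pause();
    assert!(handle.is_paused()); // store validator runs while paused
    handle.resume();
    assert!(!handle.is_paused()); // GC proceeds again
}
```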
Move garbage collection into a separate actor to prevent it from blocking synchronously inside the client actor. Fixes #10971. For testing, garbage_collection_intense.py sends a lot of insertion and deletion transactions on the same key and makes sure that nodes do not crash after GC is done. The change has also been run on a mainnet node and so far it works fine. TODO: run all nayduck tests
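The core idea of the PR - taking GC work off the client's critical path - can be sketched with a channel and a dedicated thread. This is a simplified illustration, not nearcore's actual implementation (which uses an actor framework): the client sends the latest tip height to a GC thread, and the fixed horizon offset of 5 here stands in for the real epoch-based GC horizon.

```rust
use std::sync::mpsc;
use std::thread;

// Spawn a GC thread that receives tip heights from the client and
// collects blocks older than the horizon, without blocking the sender.
fn spawn_gc_thread(rx: mpsc::Receiver<u64>) -> thread::JoinHandle<u64> {
    thread::spawn(move || {
        let mut collected_up_to: u64 = 0;
        // Each message is the current chain tip height.
        while let Ok(tip_height) = rx.recv() {
            // Illustrative horizon: keep the 5 most recent heights.
            let horizon = tip_height.saturating_sub(5);
            while collected_up_to < horizon {
                collected_up_to += 1;
                // ... delete block data at height `collected_up_to` ...
            }
        }
        collected_up_to
    })
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let gc = spawn_gc_thread(rx);
    // The client actor keeps processing blocks; sends never block on GC.
    for height in 1..=20u64 {
        tx.send(height).unwrap();
    }
    drop(tx); // closing the channel lets the GC thread exit
    // With a tip of 20 and a horizon offset of 5, GC reaches height 15.
    assert_eq!(gc.join().unwrap(), 15);
}
```

Because the channel decouples the two sides, a slow GC pass only delays deletions, not block processing, which is exactly the blocking problem described in #10971.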