
feat: move garbage collection into a separate thread #11022

Merged: 14 commits merged into master from separate-gc-thread on Apr 19, 2024

Conversation

bowenwang1996 (Collaborator)

Move garbage collection into a separate actor to prevent it from blocking synchronously inside the client actor. Fixes #10971. For testing, garbage_collection_intense.py sends many insertion and deletion transactions on the same key and makes sure that nodes do not crash after GC is done. The change has also been running on a mainnet node and so far works fine.

TODO: run all nayduck tests
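
For illustration, here is a minimal sketch of the shape such a separate GC actor can take, assuming an actix-style actor. GCActor, gc_step, the placeholder counter, and the 500 ms interval are illustrative only, not the actual contents of chain/client/src/gc_actor.rs:

```rust
use std::time::Duration;

use actix::{Actor, AsyncContext, Context};

// Hypothetical stand-alone GC actor. The real gc_actor.rs holds handles to the
// chain store and GC config; here a counter stands in for that state.
struct GCActor {
    gc_rounds: u64,
}

impl GCActor {
    // One bounded garbage-collection step, so a single tick never blocks for long.
    fn gc_step(&mut self) {
        self.gc_rounds += 1;
        // In the real actor this would clear data below the GC tail.
        println!("gc round {} done", self.gc_rounds);
    }
}

impl Actor for GCActor {
    type Context = Context<Self>;

    fn started(&mut self, ctx: &mut Self::Context) {
        // Schedule GC periodically on this actor's own arbiter, off the client actor's thread.
        ctx.run_interval(Duration::from_millis(500), |act, _ctx| act.gc_step());
    }
}

fn main() {
    let system = actix::System::new();
    system.block_on(async {
        // Spawning the actor starts the periodic GC loop independently of block processing.
        let _gc = GCActor { gc_rounds: 0 }.start();
        // Let a few GC rounds run for the sake of the example.
        actix::clock::sleep(Duration::from_secs(2)).await;
    });
}
```

The point is that the periodic GC work runs on the actor's own arbiter, so a slow GC round no longer blocks block processing inside the client actor.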

@bowenwang1996 bowenwang1996 requested a review from a team as a code owner April 11, 2024 00:08
@bowenwang1996 bowenwang1996 requested a review from wacban April 11, 2024 00:08
@bowenwang1996 bowenwang1996 changed the title from "Separate gc thread" to "feat: move garbage collection into a thread" on Apr 11, 2024
@bowenwang1996 bowenwang1996 changed the title from "feat: move garbage collection into a thread" to "feat: move garbage collection into a separate thread" on Apr 11, 2024
@bowenwang1996 (Collaborator, Author)

@wacban @posvyatokum what is the best way to test this for archival nodes with split storage?

@posvyatokum (Member)

Off the top of my head

To make sure that the cold loop is working properly:

  • Run a canary and check that the cold head is increasing and that it is smaller than the tail.

To make sure that we didn't accidentally delete important data before the backup, without impacting the cold loop:

  • Replaying history
  • Checking column contents against prod split storage nodes
  • Traversing Trie

@wacban (Contributor) left a comment


looks good overall and it's a nice idea

Please see comments - I think we need to handle archival nodes a bit differently.

Some tests are failing; please have a look at those as well.

To trigger Nayduck you can check out this branch and run ./scripts/nayduck.py from nearcore.

Review threads on chain/client/src/client_actor.rs and chain/client/src/gc_actor.rs (resolved).
Comment on lines 7 to 9
[lints]
workspace = true

Contributor

I'm not familiar with lints, what does this do and why did you delete it?

Collaborator Author

Doesn't compile otherwise

Review thread on nightly/pytest-sanity.txt (resolved).
@bowenwang1996 (Collaborator, Author)

@wacban @posvyatokum I have an idea: instead of testing whether it works for archival nodes with split storage, we don't change the logic for archival nodes, i.e., they still move data from hot storage to cold storage as part of the main thread. I think this is okay because right now the main goal of an archival node is to store historical data, and it does not need to process new blocks very fast. This would save us some trouble in testing archival nodes, and we could still make the change for archival nodes at a later time if necessary. What do you think?

@wacban (Contributor) commented Apr 12, 2024

> I have an idea: instead of testing whether it works for archival nodes with split storage, we don't change the logic for archival nodes [...] What do you think?

Personally I don't like it, as it would mean we have two entirely different code paths, and even threads, for garbage collection. The CI is failing, so it seems we have at least some test coverage. I would suggest fixing those failures first and, if coverage is missing for split storage, adding it.

If you want to test it on a real node, that shouldn't be too hard either; it takes about 10 minutes to spin one up. Then I suppose it would be sufficient to add and check some logs.

@wacban (Contributor) left a comment


LGTM

Review thread on integration-tests/src/tests/client/process_blocks.rs (resolved).
codecov bot commented Apr 18, 2024

Codecov Report

Attention: Patch coverage is 86.27451%, with 28 lines in your changes missing coverage. Please review.

Project coverage is 71.08%. Comparing base (2bbde59) to head (72c2ec4).
Report is 12 commits behind head on master.

Files                                    Patch %    Lines missing coverage
chain/client/src/gc_actor.rs             75.32%     19 missing ⚠️
chain/chain/src/garbage_collection.rs    90.74%     0 missing, 5 partials ⚠️
chain/client/src/client_actions.rs       0.00%      4 missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           master   #11022    +/-   ##
========================================
  Coverage   71.07%   71.08%            
========================================
  Files         767      769     +2     
  Lines      153367   153659   +292     
  Branches   153367   153659   +292     
========================================
+ Hits       109005   109227   +222     
- Misses      39909    39992    +83     
+ Partials     4453     4440    -13     
Flag Coverage Δ
backward-compatibility 0.24% <0.00%> (-0.01%) ⬇️
db-migration 0.24% <0.00%> (-0.01%) ⬇️
genesis-check 1.43% <0.52%> (-0.01%) ⬇️
integration-tests 36.87% <80.39%> (+0.09%) ⬆️
linux 69.48% <86.27%> (+<0.01%) ⬆️
linux-nightly 70.57% <86.27%> (+0.03%) ⬆️
macos 52.57% <66.66%> (-1.69%) ⬇️
pytests 1.66% <0.52%> (-0.01%) ⬇️
sanity-checks 1.45% <0.52%> (-0.01%) ⬇️
unittests 66.74% <66.66%> (-0.02%) ⬇️
upgradability 0.29% <0.00%> (-0.01%) ⬇️


@bowenwang1996 (Collaborator, Author)

I am going to merge this PR. For testing we have:

  • Nayduck run
  • It has been running on a regular mainnet node for over a week without any issues
  • It has been running on an archival mainnet node with split storage for 2 days and cold store head increases as expected.

@bowenwang1996 bowenwang1996 enabled auto-merge April 19, 2024 00:31
@bowenwang1996 bowenwang1996 added this pull request to the merge queue Apr 19, 2024
Merged via the queue into master with commit 2c80ee0 Apr 19, 2024
28 of 29 checks passed
@bowenwang1996 bowenwang1996 deleted the separate-gc-thread branch April 19, 2024 01:13
github-merge-queue bot pushed a commit that referenced this pull request Apr 23, 2024
Re-enable store validator tests that were disabled in #11022. It works by first sending a signal to the GC actor to stop garbage collection, then running the store validator, and finally sending a signal to re-enable GC afterwards.

Nayduck run https://nayduck.nearone.org/#/run/58
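
As a rough illustration of that pause/resume mechanism, here is a minimal sketch assuming hypothetical actix message types; SetGcEnabled, gc_enabled, and gc_tick are invented names, not the actual signal added by the follow-up change:

```rust
use actix::{Actor, Context, Handler, Message};

// Hypothetical control message; the actual signal used by the follow-up change
// may be named and wired differently.
#[derive(Message)]
#[rtype(result = "()")]
struct SetGcEnabled(bool);

struct GCActor {
    gc_enabled: bool,
}

impl GCActor {
    // Called from the periodic tick; skips all work while GC is paused so the
    // store validator can inspect a stable view of the database.
    #[allow(dead_code)]
    fn gc_tick(&mut self) {
        if !self.gc_enabled {
            return;
        }
        // ... perform one bounded GC step here ...
    }
}

impl Actor for GCActor {
    type Context = Context<Self>;
}

impl Handler<SetGcEnabled> for GCActor {
    type Result = ();

    fn handle(&mut self, msg: SetGcEnabled, _ctx: &mut Context<Self>) {
        // Tests send SetGcEnabled(false) before running the store validator
        // and SetGcEnabled(true) once it has finished.
        self.gc_enabled = msg.0;
    }
}

fn main() {
    actix::System::new().block_on(async {
        let gc = GCActor { gc_enabled: true }.start();
        // Pause GC, e.g. before running the store validator in a test.
        gc.send(SetGcEnabled(false)).await.unwrap();
        // ... run the store validator against a stable database here ...
        gc.send(SetGcEnabled(true)).await.unwrap();
    });
}
```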

Successfully merging this pull request may close these issues:

  • Garbage collection may take a long time (#10971)