
Spike: Pause compaction when out of sync #10392

Closed
ZenGround0 wants to merge 5 commits into master from feat/splitstore-gc-wait-for-sync

Conversation

ZenGround0 (Contributor)

Related Issues

I ended up playing with some of the chain sync contention protection ideas from #10388

Proposed Changes

Use out-of-sync detection on head changes to signal compaction to pause, and signal it to continue when the chain is back in sync.
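A rough sketch of the intended mechanism, for illustration only. The type and function names below (`syncGate`, `headChange`) are placeholders rather than the actual splitstore code, and the condition variable is guarded by a plain mutex, as in the final version of this change:

```go
package splitstore

import (
	"sync"
	"time"
)

// syncGate sketches the pause/continue signal: the head-change handler flips
// outOfSync when the incoming head lags too far behind wall-clock time, and
// compaction goroutines block in waitForSync until it clears.
type syncGate struct {
	mx        sync.Mutex
	cond      *sync.Cond
	outOfSync bool
}

func newSyncGate() *syncGate {
	g := &syncGate{}
	g.cond = sync.NewCond(&g.mx)
	return g
}

// headChange is called from the chain head-change notifier with the timestamp
// of the new head tipset.
func (g *syncGate) headChange(headTimestamp time.Time, lagThreshold time.Duration) {
	lagging := time.Since(headTimestamp) > lagThreshold

	g.mx.Lock()
	defer g.mx.Unlock()
	if lagging == g.outOfSync {
		return // no transition, nothing to signal
	}
	g.outOfSync = lagging
	if !lagging {
		g.cond.Broadcast() // back in sync: wake any paused compaction work
	}
}

// waitForSync blocks the calling compaction goroutine while the chain is out of sync.
func (g *syncGate) waitForSync() {
	g.mx.Lock()
	defer g.mx.Unlock()
	for g.outOfSync {
		g.cond.Wait()
	}
}
```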

Additional Info

@vyzo if you can take a look at this for feasibility it would be a big help.

Checklist

Before you mark the PR ready for review, please make sure that:

  • Commits have a clear commit message.
  • PR title is in the form of <PR type>: <area>: <change being made>
    • example: fix: mempool: Introduce a cache for valid signatures
    • PR type: fix, feat, build, chore, ci, docs, perf, refactor, revert, style, test
    • area, e.g. api, chain, state, market, mempool, multisig, networking, paych, proving, sealing, wallet, deps
  • New features have usage guidelines and / or documentation updates in
  • Tests exist for new functionality or change in behavior
  • CI is green

@vyzo (Contributor) left a comment

It is not unreasonable, I think, but we should test it on a mainnet node.

@magik6k (Contributor) left a comment

Don't see anything massively wrong here, definitely would be good to test on a mainnet node.

@@ -396,6 +396,9 @@ func (b *Blockstore) doCopy(from, to *badger.DB) error {
 	if workers < 2 {
 		workers = 2
 	}
+	if workers > 8 {
+		workers = 8
+	}
Contributor

Do we have confidence in this number? Did we check that it's "good enough" on mainnet?

Contributor Author

Need to check on this. There's some evidence that it might be too high right now, resulting in the chain getting out of sync. An alternate strategy is to leave it as is and see whether chain sync issues persist, correlated with moving GC.

@vyzo for reference do you have an estimate of a reasonable time for moving GC to complete today?

cc @TippyFlitsUK you might have an even better idea.

Contributor

Maybe 5-10 min.

Contributor

Let's take a fresh measurement?

Contributor Author

For reference, I got 50m on my fairly well-resourced mainnet node.

Contributor Author

^ That was with the current default of half the CPUs.
I'll measure with 8 and compare when I get to similar garbage levels.

Contributor Author

Working without major issues on mainnet (cc @TippyFlitsUK), and this seems to be preventing out-of-sync during moving GC.

Contributor Author

One last thing to note: I'm suspicious that reducing goroutines here is the main thing causing the improvement, since moving GC is anecdotally the thing that causes chain sync to get left behind, and we are not doing any yielding during the move (yet) in this change. If we are concerned about the risk of compaction deadlock, we could include just the simple goroutine limiting of moving GC.
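For reference, a minimal sketch of the "simple goroutine limiting" mentioned above, matching the clamp in the doCopy diff earlier in this thread; the function name and package are illustrative, not the actual lotus code:

```go
package badger

import "runtime"

// copyWorkers mirrors the clamp in the doCopy diff: default to half the CPUs,
// but never fewer than 2 nor more than 8 copy goroutines, leaving headroom
// for chain sync while a moving GC runs.
func copyWorkers() int {
	workers := runtime.NumCPU() / 2
	if workers < 2 {
		workers = 2
	}
	if workers > 8 {
		workers = 8
	}
	return workers
}
```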

blockstore/badger/blockstore.go (outdated, resolved review thread)
// already out of sync, no signaling necessary

}
// TODO: ok to use hysteresis with no transitions between 30s and 1m?
Contributor

When we're syncing, the head will only be taken when we get fully in sync, so this is probably fine

(Well, really the new sync head is only selected when we complete the previous sync, but the effect is similar.)
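For illustration, a minimal sketch of the hysteresis the TODO asks about, using the 30s/1m figures from the comment (names and placement are illustrative, not the actual splitstore code): declare out of sync only once head lag exceeds the upper threshold, declare back in sync only once it drops below the lower one, and make no transition in between.

```go
package splitstore

import "time"

const (
	inSyncLag    = 30 * time.Second // below this lag, consider the chain back in sync
	outOfSyncLag = 1 * time.Minute  // above this lag, consider the chain out of sync
)

// updateSyncState applies the hysteresis: lag between the two thresholds keeps
// the previous state, so lag hovering between 30s and 1m causes no transitions.
func updateSyncState(wasOutOfSync bool, headLag time.Duration) bool {
	switch {
	case headLag > outOfSyncLag:
		return true
	case headLag < inSyncLag:
		return false
	default:
		return wasOutOfSync // in the dead band: keep whatever state we were in
	}
}
```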

ZenGround0 force-pushed the feat/splitstore-gc-wait-for-sync branch from d9acd45 to 36d274a on March 30, 2023 18:52
ZenGround0 marked this pull request as ready for review on March 30, 2023 18:53
ZenGround0 requested a review from a team as a code owner on March 30, 2023 18:53
ZenGround0 requested reviews from vyzo and magik6k on March 30, 2023 18:53
ZenGround0 force-pushed the feat/splitstore-gc-wait-for-sync branch from ad0a54b to 7a4082c on March 30, 2023 20:56
@magik6k (Contributor) commented Mar 31, 2023

Tests are mad

2023-03-30T18:59:10.196Z	INFO	chainstore	store/store.go:657	New heaviest tipset! [bafy2bzacebdx6sp2za7e6l3quy54ix5asgydurixiesjjehj6mcxclhnoeyzy bafy2bzacectqgvabrqfhsarubiqcfk3ykwmuvumwbm5vbc27ft65ia4gsrtoq] (height=50)
fatal error: sync: Unlock of unlocked RWMutex
///////////////////////////////////////////////////

goroutine 710 [running]:
sync.fatal({0x4a5ee70?, 0x57b1880?})
	/usr/local/go/src/runtime/panic.go:1031 +0x1e
sync.(*RWMutex).Unlock(0xc00e4c60fc)
	/usr/local/go/src/sync/rwmutex.go:209 +0x4a
sync.(*Cond).Wait(0xc00049a610?)
	/usr/local/go/src/sync/cond.go:69 +0x7e
github.com/filecoin-project/lotus/blockstore/splitstore.(*SplitStore).waitForSync(0xc00e4c6000)
	/home/circleci/lotus/blockstore/splitstore/splitstore_compact.go:924 +0xab
github.com/filecoin-project/lotus/blockstore/splitstore.(*SplitStore).checkYield(0xc00e4c6000)
	/home/circleci/lotus/blockstore/splitstore/splitstore_compact.go:930 +0x1e
github.com/filecoin-project/lotus/blockstore/splitstore.(*SplitStore).walkObjectIncomplete(0xc00bd1c0b0?, {{0xc00c7d56e0?, 0x0?}}, {0x7ff5be2d4388, 0xc00c7c0140}, 0xc00e304180, 0x543d498)
	/home/circleci/lotus/blockstore/splitstore/splitstore_compact.go:1225 +0x206
github.com/filecoin-project/lotus/blockstore/splitstore.(*SplitStore).walkChain(0xc00e4c6000, 0xc00e658300, 0x0, 0x33, {0x7ff5be2d4388?, 0xc00c7c0140}, 0xc00e304180, 0x543d488)
	/home/circleci/lotus/blockstore/splitstore/splitstore_compact.go:1079 +0x39b
github.com/filecoin-project/lotus/blockstore/splitstore.(*SplitStore).doWarmup(0xc00e4c6000, 0xc00e658300)
	/home/circleci/lotus/blockstore/splitstore/splitstore_warmup.go:71 +0x34e
github.com/filecoin-project/lotus/blockstore/splitstore.(*SplitStore).warmup.func1()
	/home/circleci/lotus/blockstore/splitstore/splitstore_warmup.go:38 +0x106
created by github.com/filecoin-project/lotus/blockstore/splitstore.(*SplitStore).warmup
	/home/circleci/lotus/blockstore/splitstore/splitstore_warmup.go:32 +0x98


Conditions always call Unlock, so we can't safely use the condition
with both the read and write sides of the lock. So we might as well
revert back to a regular lock.

fixes #10616
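For context on the fatal error above: sync.Cond.Wait always calls Unlock on the Locker it was constructed with, so a Cond built over one side of an RWMutex blows up if the waiter holds only the other side. A minimal reproduction, not lotus code:

```go
package main

import "sync"

func main() {
	var mx sync.RWMutex
	cond := sync.NewCond(&mx) // the Cond's Locker is the write side of the RWMutex

	mx.RLock()  // the waiter holds only the read lock...
	cond.Wait() // ...but Wait calls mx.Unlock() internally:
	            // fatal error: sync: Unlock of unlocked RWMutex
	mx.RUnlock()
}
```

The commit above avoids this by guarding the condition with a plain sync.Mutex, whose single Unlock matches what Cond expects.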
@Stebalien (Member)

@arajasek I don't think you meant to close this, right?

rjan90 mentioned this pull request on Apr 10, 2023
ZenGround0 added a commit that referenced this pull request on Apr 10, 2023
@ZenGround0 (Contributor, Author)

Closed by #10641, which was backported and then merged back to master.
