Splitstore: Online Garbage Collection for the Coldstore #6577
Comments
Problem: manually copying blocks from one badger datastore to another is slow (we need to re-build indexes, etc.). However, to do this "generically", we'll likely need to extend the blockstore interface (or implement optional extensions).

Solution 1: Remove everything not matching the given filter function:

```go
type BlockstoreFilter interface {
	Filter(ctx context.Context, cb func(context.Context, func(multihash.Multihash, []byte)) (bool, error))
}
```

In practice, the underlying blockstore would either delete the non-matching blocks in place or copy the matching blocks into a fresh store and swap it in. For badger-backed blockstores, we'd likely do the latter. If we want to get fancy, we could do a bit of random sampling and pick copy vs. in-place depending on the amount of data we expect to delete.

Solution 2: More generally, we could implement a

```go
type BlockstoreCopy interface {
	CopyTo(ctx context.Context, target Blockstore, cb func(context.Context, func(multihash.Multihash, []byte)) (bool, error))
}
```

This is significantly more general purpose (would work for estuary as well), but it has downsides. These solutions aren't exclusive, so we could just implement the solution best for this case (likely solution 1) and implement the other one later if it turns out to be needed.
Has any consideration been given to time bounds of the GC? What is the target timeframe for a GC to complete? A moving collector is essentially unbounded in time if the collection rate is lower than the rate of growth of the chain and state tree. This may not be a problem at present, but exponential growth in deals and capacity is going to rapidly increase the amount of garbage.

An alternate approach is to keep two cold stores, green and blue. Initially all compaction writes go to the green store. Reads go to both. When its size exceeds a threshold, all writes are switched to the blue store. Records are kept of which blocks are dead in the green store as they become unreachable due to new blocks being written to blue. When the number of live blocks in the green store falls below a fixed threshold, they are all copied to the blue store, the green store is replaced with an empty one, and the roles are switched. The fixed threshold at which the copy takes place gives a time bound to the operation.
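For illustration, here is a minimal Go sketch of the green/blue idea described above: reads check both stores, and a fixed live-block threshold bounds the copy that happens at swap time. The types and threshold here are assumptions for the sketch, not anything from the Lotus codebase.

```go
package coldstore

import "errors"

var errNotFound = errors.New("block not found")

// store is a minimal stand-in for one coldstore.
type store struct {
	blocks map[string][]byte
	dead   map[string]bool // blocks that have become unreachable
}

// greenBlue keeps two coldstores; reads go to both.
type greenBlue struct {
	green, blue   *store
	liveThreshold int // fixed amount of copy work that bounds the swap
}

// Get checks both stores, as described above.
func (gb *greenBlue) Get(key string) ([]byte, error) {
	if v, ok := gb.green.blocks[key]; ok {
		return v, nil
	}
	if v, ok := gb.blue.blocks[key]; ok {
		return v, nil
	}
	return nil, errNotFound
}

// maybeSwap copies the surviving blocks out of green once their count drops
// below the fixed threshold, replaces green with an empty store, and switches
// the roles of the two stores.
func (gb *greenBlue) maybeSwap() {
	live := 0
	for k := range gb.green.blocks {
		if !gb.green.dead[k] {
			live++
		}
	}
	if live > gb.liveThreshold {
		return // copying would still be too much work
	}
	for k, v := range gb.green.blocks {
		if !gb.green.dead[k] {
			gb.blue.blocks[k] = v
		}
	}
	empty := &store{blocks: map[string][]byte{}, dead: map[string]bool{}}
	gb.green, gb.blue = gb.blue, empty
}
```

The fixed threshold is what gives the copy step a predictable upper bound, regardless of how large the stores themselves grow.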
> Has any consideration been given to time bounds of the GC? What is the target timeframe for a GC to complete? A moving collector is essentially unbounded in time if the collection rate is lower than the rate of growth of the chain and state tree. This may not be a problem at present but exponential growth in deals and capacity is going to rapidly increase the amount of garbage.
The state is growing, but the _churn_ (the state that would need to be archived every day) isn't; it's a pretty consistent 21GiB/day.
> An alternate approach is to keep two cold stores, green and blue. Initially all compaction writes go to the green store. Reads go to both. When its size exceeds a threshold all writes are switched to the blue store. Records are kept of which blocks are dead in the green store as they become unreachable due to new blocks being written to blue. When the number of live blocks in the green store falls below a fixed threshold they are all copied to the blue store, the green store is replaced with an empty one and the roles are switched. The fixed threshold at which the copy takes place gives a time bound to the operation.
I have some concerns, but this is probably worth exploring more. Concerns:

1. In the short term, we can set a threshold of something like 20% live and be fine. But over time, "permanent state" will grow and "20%" could be hundreds of gigabytes (meaning we wouldn't GC until we hit statesize/0.2).
2. This could increase latency/memory usage. Ideally, the coldstore is pretty much _never_ accessed, but this would effectively double everything.
My intention was that the threshold would be a fixed amount of work, not a fraction of the total.
Unfortunately, that would end up with lots of copying once the base state-tree starts approaching the threshold.
I don't follow. It's a fixed amount of work (a target number of bytes as a function of IO capacity), so it would be set to a level that is an acceptable amount of copying.
At some point, we'll stop dropping below that threshold, because the state-tree will grow to hundreds of GiB.
I.e., the "green" store will have all the sector infos, and those sector infos will remain live for 6-18 months. |
Implementation in #6728
Another way would be to force badger to completely rewrite the value log by setting a very low value for …
Need for Space
Once the splitstore has been deployed (see also #6474) we have the ability to perform online garbage collection for the coldstore, as we control writes.
Specifically, we only write to the coldstore during compaction, when we move newly cold objects, protected by the compaction lock.
That means we can perform gc on the coldstore without disrupting regular node operations or requiring downtime.
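To illustrate the point about controlled writes, here is a minimal sketch, assuming a single mutex shared by compaction (the only coldstore writer) and gc; the type and field names are placeholders, not the actual splitstore code.

```go
package splitstore

import "sync"

// SplitStore sketch: compaction is the only path that writes to the
// coldstore, so holding the same lock during gc keeps the two from racing.
type SplitStore struct {
	compactionLk sync.Mutex
	cold         map[string][]byte // stand-in for the cold blockstore
}

// compact moves newly cold objects into the coldstore under the lock.
func (s *SplitStore) compact(newlyCold map[string][]byte) {
	s.compactionLk.Lock()
	defer s.compactionLk.Unlock()
	for k, v := range newlyCold {
		s.cold[k] = v
	}
}

// gcColdstore takes the same lock, so it never observes a half-finished
// compaction write, and compaction never writes into a store being replaced.
func (s *SplitStore) gcColdstore(live func(key string) bool) {
	s.compactionLk.Lock()
	defer s.compactionLk.Unlock()
	fresh := make(map[string][]byte)
	for k, v := range s.cold {
		if live(k) {
			fresh[k] = v
		}
	}
	s.cold = fresh
}
```

Because regular node reads hit the hotstore, the coldstore can be locked for the duration of gc without disrupting normal operation.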
Garbage collecting the coldstore is an essential operation for keeping space usage bounded in non-archival nodes -- see also #4701.
Design Considerations
Garbage collection must effectively reclaim space; hence we can't use badger's native gc, which is horrible at reclaiming space and requires hacks to convince it to reclaim as much space as possible.
Furthermore, even if we do manage to reclaim all space possible, the gc'ed blockstore has the tendency to quickly balloon back up in size.
Instead, we propose a moving garbage collector for the coldstore, which also allows us to tune the gc process to the user's needs.
Fundamentally, the gc operation will instantiate a new (empty) coldstore, walk the chain for live objects in the coldstore according to user retention policies, and then move live objects to the new coldstore.
Once the move is complete, the new coldstore becomes the actual coldstore and the old coldstore is deleted.
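A minimal sketch of the moving collection described above, with made-up types; the real implementation (see #6728) will differ, and the walk callback here is only an assumption about how the chain walk and retention policy could be plugged in.

```go
package movinggc

import "context"

// blockstore is a pared-down stand-in for the coldstore.
type blockstore struct {
	blocks map[string][]byte
}

// movingGC instantiates a fresh coldstore, asks walkLive to visit every live
// key (the chain walk, bounded by the user's retention policy), copies those
// objects over, and returns the new store. The caller swaps it in as the
// actual coldstore and deletes the old one once the move completes.
func movingGC(ctx context.Context, old *blockstore, walkLive func(visit func(key string)) error) (*blockstore, error) {
	fresh := &blockstore{blocks: make(map[string][]byte)}
	err := walkLive(func(key string) {
		if v, ok := old.blocks[key]; ok {
			fresh.blocks[key] = v
		}
	})
	if err != nil {
		return nil, err
	}
	if err := ctx.Err(); err != nil {
		return nil, err // bail out if the caller cancelled the move
	}
	return fresh, nil
}
```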
Retention Policies:
At a minimum, we must ensure that we retain chain headers (all the way to genesis), as it is not currently safe to discard them due to unlimited randomness lookback (this may change in the future, but we still want to keep them in order to be able to navigate the chain).
Apart from that, it is up to the user:
So the garbage collection interface must allow the user to specify preferences/policies that match their own needs.
A possible sane default:
Additional considerations:
By default, the coldstore lives in the `.lotus/datastore/chain` path. In order to address these issues, we propose that the user supplies the new coldstore path at the time of the move, which is then symlinked by the system itself into `~/.lotus/datastore/chain`.
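A sketch of the proposed path handling using standard library calls; the function name and exact sequencing (remove old path, then symlink) are assumptions for illustration.

```go
package gcutil

import (
	"fmt"
	"os"
	"path/filepath"
)

// linkColdstore points ~/.lotus/datastore/chain at the user-supplied
// directory that now holds the new coldstore.
func linkColdstore(lotusPath, newColdPath string) error {
	link := filepath.Join(lotusPath, "datastore", "chain")
	// remove the old store (or a previous symlink) before re-linking
	if err := os.RemoveAll(link); err != nil {
		return fmt.Errorf("removing old coldstore path: %w", err)
	}
	return os.Symlink(newColdPath, link)
}
```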
Interface
We propose to introduce a new (v1) API which can be invoked to trigger gc on demand, perhaps through a cron job.
The API handler will try to cast the blockstore to the splitstore, and if successful invoke the relevant interface with the options supplied by the user.
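A sketch of what that handler could look like; the option struct, method name, and interface here are placeholders, not the actual Lotus v1 API.

```go
package gcapi

import (
	"context"
	"errors"
)

// ColdGCOpts are illustrative user-supplied gc options (retention policy, target path).
type ColdGCOpts struct {
	RetainEpochs int64  // how much recent state to keep beyond chain headers
	ColdPath     string // where the new coldstore should live
}

// coldGC is what a gc-capable coldstore-owning blockstore could implement.
type coldGC interface {
	GCColdstore(ctx context.Context, opts ColdGCOpts) error
}

// ChainGCColdstore casts the node's blockstore to the splitstore interface
// and, if the cast succeeds, triggers gc with the user's options.
func ChainGCColdstore(ctx context.Context, bs interface{}, opts ColdGCOpts) error {
	ss, ok := bs.(coldGC)
	if !ok {
		return errors.New("blockstore is not a splitstore; nothing to gc")
	}
	return ss.GCColdstore(ctx, opts)
}
```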
The cli frontend can be either a lotus command or a lotus-shed command; it doesn't really matter.
We might want to use a lotus-shed command while the splitstore remains experimental and later migrate to the lotus binary when it becomes the default.
What About the Hotstore?
The hotstore is gc'ed online with badger's gc after every compaction.
This doesn't reclaim all the space that it can, but over time it does reclaim enough space to not balloon out of control.
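The hotstore's online gc amounts to repeatedly asking badger to rewrite value-log files until nothing is left worth rewriting; a short sketch using badger's public API (the discard ratio value is a placeholder, not the one used by the splitstore).

```go
package hotgc

import (
	badger "github.com/dgraph-io/badger/v2"
)

// gcHotstore runs badger's value-log gc in a loop until badger reports that
// no file has enough garbage to be worth rewriting. This reclaims some, but
// not all, of the reclaimable space.
func gcHotstore(db *badger.DB) error {
	for {
		err := db.RunValueLogGC(0.2) // rewrite files with >=20% garbage (placeholder ratio)
		if err == badger.ErrNoRewrite {
			return nil // nothing more to reclaim right now
		}
		if err != nil {
			return err
		}
	}
}
```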
If uncontrolled growth of the hotstore is observed, we can add a lotus-shed command to implement moving gc for the hotstore itself.
This will require node downtime however.