implement DAGStore#GC. #52
Conversation
//
// However, the event loop checks for safety prior to deletion, so it will skip
// over shards that are no longer safe to delete.
func (d *DAGStore) GC(ctx context.Context) (map[shard.Key]error, error) {
@raulk Please can we launch a goroutine to do this periodically (with a configurable period) in the DAGStore? Otherwise, we'll have to place this burden on all clients.
I'm not convinced periodic GC is the best approach, although I filed it here: #56. For now, we probably want to keep it manual, but in the future we will want to monitor disk usage of the transients dir and perform GC when it reaches a watermark, not periodically.
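The watermark idea above can be sketched as follows. This is a hypothetical illustration, not the dagstore API: trigger GC only when transient usage crosses a high watermark, then reclaim until it drops below a low watermark.

```go
package main

import "fmt"

// watermarkGC sketches watermark-driven reclamation: a no-op below the
// high watermark; otherwise it reclaims transients one at a time until
// usage falls to the low watermark or nothing safe remains. All names
// and thresholds are illustrative.
func watermarkGC(usedBytes, highWater, lowWater int64, reclaimOne func() int64) int64 {
	if usedBytes < highWater {
		return usedBytes // below the watermark: nothing to do
	}
	for usedBytes > lowWater {
		freed := reclaimOne()
		if freed == 0 {
			break // nothing left that is safe to reclaim
		}
		usedBytes -= freed
	}
	return usedBytes
}

func main() {
	// Simulate three transients of 100 bytes each; high watermark 250,
	// low watermark 100.
	remaining := int64(3)
	reclaim := func() int64 {
		if remaining == 0 {
			return 0
		}
		remaining--
		return 100
	}
	fmt.Println(watermarkGC(300, 250, 100, reclaim)) // usage drops to the low watermark
}
```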
for _, s := range d.shards {
	s.lk.RLock()
	if s.state == ShardStateAvailable || s.state == ShardStateErrored {
		reclaim = append(reclaim, s)
@raulk Since we have the Shard lock here, why not delete the transient here?
Deleting a transient needs the Upgrader lock and it doesn't mutate any Shard state. Why should the deletion of the transient happen in the event loop if it doesn't lead to updating the Shard state?
Because there could be acquire operations queued at this point that still haven't been reflected on the state.
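A tiny model of this point, with illustrative names rather than the dagstore types: shard operations are applied strictly in order by the event loop, so a GC op enqueued after an acquire op always observes the acquire's effect on shard state, whereas deleting outside the loop would not see queued acquires.

```go
package main

import "fmt"

// shardOp transforms shard state; the event loop applies queued ops in
// FIFO order. Names are illustrative.
type shardOp func(state string) string

func acquireOp(state string) string { return "Serving" }

func gcOp(state string) string {
	// The event loop's safety check: never delete the transient of a
	// shard that is serving an acquirer.
	if state == "Serving" {
		return state // skip
	}
	return "Reclaimed"
}

// runLoop models the event loop: one goroutine, in-order application.
func runLoop(state string, ops []shardOp) string {
	for _, op := range ops {
		state = op(state)
	}
	return state
}

func main() {
	// An acquire was queued before the GC op, so GC skips the shard.
	fmt.Println(runLoop("Available", []shardOp{acquireOp, gcOp}))
	// With no queued acquire, GC reclaims it.
	fmt.Println(runLoop("Available", []shardOp{gcOp}))
}
```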
Note: deleting a transient doesn't mutate the shard state directly in the sense that it doesn't mutate the direct fields of the Shard
object, but it does mutate the state in a more abstract/conceptual manner, so it's cleaner and easier to reason about if handled inside the event loop.
While I agree with the conceptual clarity point, please note that there is still a bug here:
- We spin up a goroutine to acquire a shard. It fetches the transient.
- A GC runs -> enters the event loop -> deletes the transient, as the shard state is still Available and not Serving.
- The initial acquire goroutine completes and we return the broken accessor to the client.

The fix here is to optimistically mark the Shard state as Serving before we spin up a goroutine to acquire a shard, since we also optimistically increment the shard access refcount there.
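The suggested fix can be sketched like this (illustrative names, not the dagstore implementation): mark the shard as Serving and bump the refcount *before* spawning the fetch goroutine, so a GC that runs mid-fetch already sees the shard as unsafe to reclaim.

```go
package main

import (
	"fmt"
	"sync"
)

// shard is a simplified stand-in; field names are illustrative.
type shard struct {
	mu       sync.Mutex
	state    string
	refcount int
}

// acquire optimistically marks the shard Serving and increments the
// refcount before the async fetch starts, closing the GC window.
func (s *shard) acquire(fetch func()) {
	s.mu.Lock()
	s.state = "Serving"
	s.refcount++
	s.mu.Unlock()
	go fetch()
}

// safeToReclaim is the check a GC would perform.
func (s *shard) safeToReclaim() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.state != "Serving" && s.refcount == 0
}

func main() {
	sh := &shard{state: "Available"}
	done := make(chan struct{})
	sh.acquire(func() { close(done) }) // fetch still in flight...
	fmt.Println(sh.safeToReclaim())    // ...but GC already sees the shard as unsafe
	<-done
}
```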
Filed #59.
There's a tiny window, close to impossible, but it's still there:
- Async fetch finished and updated the transient path in the Upgrader.
- Async fetch sends `OpShardMakeAvailable` to the event loop.
- In parallel, GC marked this shard for reclaim.
- The `OpShardGC` arrives at the event loop before the `OpShardMakeAvailable`. Thus the shard hasn't been moved to `OpShardServing` yet, and its transient gets deleted.

The way to solve this is for `OpShardGC` to refuse to delete the transient if there are pending acquirers.
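The proposed guard can be sketched as follows, with illustrative field names: when the GC op is processed, it refuses to delete the transient if any acquirers are still parked waiting for the shard, regardless of the shard's current state.

```go
package main

import "fmt"

// shard is a simplified stand-in; wAcquire models the list of parked
// acquirers awaiting the shard. Names are illustrative.
type shard struct {
	state    string
	wAcquire []chan struct{}
}

// gcSafe is the check the GC op would perform in the event loop.
func gcSafe(s *shard) bool {
	if len(s.wAcquire) > 0 {
		return false // pending acquirers: the transient is still needed
	}
	return s.state == "Available" || s.state == "Errored"
}

func main() {
	pending := &shard{
		state:    "Available",
		wAcquire: []chan struct{}{make(chan struct{})},
	}
	idle := &shard{state: "Available"}
	fmt.Println(gcSafe(pending), gcSafe(idle))
}
```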
If we are okay acquiring the refcount optimistically, why not mark the shard state as Serving optimistically? The shard state will be changed back to Available later, when the refcount drops to zero.
We actually do set the shard state to "Serving" optimistically, if it's in "Available" state. If this happens while initializing, and an acquirer comes midway, we park it but we don't update the state to "Serving", because the true main state is "Initializing".
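The refcount-driven transitions described here can be sketched as a small state machine (illustrative names only): Serving is entered optimistically on acquire when the shard is Available, and the shard returns to Available when the last accessor is released; an Initializing shard stays put.

```go
package main

import "fmt"

// shard is a simplified stand-in; names are illustrative.
type shard struct {
	state    string
	refcount int
}

// acquire optimistically moves an Available shard to Serving; any
// other state (e.g. Initializing) is left unchanged, matching the
// "true main state" point above.
func (s *shard) acquire() {
	s.refcount++
	if s.state == "Available" {
		s.state = "Serving"
	}
}

// release returns the shard to Available once the refcount hits zero.
func (s *shard) release() {
	s.refcount--
	if s.refcount == 0 && s.state == "Serving" {
		s.state = "Available"
	}
}

func main() {
	s := &shard{state: "Available"}
	s.acquire()
	s.acquire()
	s.release()
	fmt.Println(s.state, s.refcount) // still Serving: one accessor remains
	s.release()
	fmt.Println(s.state, s.refcount) // refcount hit zero: back to Available
}
```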
Implements the `GC()` method outlined in #26. Closes #26.