
implement DAGStore#GC. #52

Merged: raulk merged 2 commits into master from raulk/implement-gc on Jul 12, 2021
Conversation

raulk (Member) commented on Jul 10, 2021:

Implements the GC() method outlined in #26.
Closes #26.

//
// However, the event loop checks for safety prior to deletion, so it will skip
// over shards that are no longer safe to delete.
func (d *DAGStore) GC(ctx context.Context) (map[shard.Key]error, error) {
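
For context, here is a minimal sketch of how a caller might consume the method whose signature is quoted above. The module path and the meaning of a non-nil map value are assumptions on my part, not something this PR states.

package gcexample

import (
	"context"
	"fmt"
	"log"

	"github.com/filecoin-project/dagstore" // assumed module path; not stated in this PR
)

// runGC is a hypothetical caller of the new method.
func runGC(ctx context.Context, ds *dagstore.DAGStore) {
	// GC returns a per-shard error map plus an overall error.
	results, err := ds.GC(ctx)
	if err != nil {
		log.Printf("GC failed to run: %s", err)
		return
	}
	for key, shardErr := range results {
		if shardErr != nil {
			// Assumption: a non-nil map value means this shard's transient
			// could not be reclaimed (e.g. it was no longer safe to delete).
			fmt.Printf("could not reclaim shard %v: %s\n", key, shardErr)
		}
	}
}
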
aarshkshah1992 (Contributor) commented:

@raulk Can we please launch a goroutine to do this periodically (with a configurable period) in the DAGStore? Otherwise, we'll have to place this burden on all clients.

raulk (Member, Author) replied:

I'm not convinced periodic GC is the best approach, although I filed it here: #56. For now, we probably want to keep it manual, but in the future we will want to monitor disk usage of the transients dir and perform GC when it reaches a watermark, not periodically.
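
To illustrate the watermark idea, here is a hypothetical client-side helper (not part of this PR or of the dagstore API): it polls the size of the transients directory and only triggers GC once usage crosses a threshold.

package gcexample

import (
	"context"
	"log"
	"time"

	"github.com/filecoin-project/dagstore" // assumed module path; not stated in this PR
)

// gcAtWatermark is a hypothetical helper: it measures the transients
// directory on every tick, but only triggers GC when usage crosses the
// watermark, rather than GC'ing on a fixed period.
func gcAtWatermark(ctx context.Context, ds *dagstore.DAGStore, dirSize func() (int64, error), watermark int64, poll time.Duration) {
	ticker := time.NewTicker(poll)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			size, err := dirSize()
			if err != nil {
				log.Printf("failed to measure transients dir: %s", err)
				continue
			}
			if size < watermark {
				continue // below the watermark; nothing to reclaim yet
			}
			if _, err := ds.GC(ctx); err != nil {
				log.Printf("GC failed: %s", err)
			}
		}
	}
}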

for _, s := range d.shards {
s.lk.RLock()
if s.state == ShardStateAvailable || s.state == ShardStateErrored {
reclaim = append(reclaim, s)
aarshkshah1992 (Contributor) commented on Jul 12, 2021:

@raulk Since we have the Shard lock here, why not delete the transient here?

Deleting a transient needs the Upgrader lock and it doesn't mutate any Shard state. Why should the deletion of the transient happen in the event loop if it doesn't lead to updating the Shard state?

raulk (Member, Author) replied:

Because there could be acquire operations queued at this point that still haven't been reflected in the state.

raulk (Member, Author) commented on Jul 12, 2021:

Note: deleting a transient doesn't mutate the shard state directly, in the sense that it doesn't touch the fields of the Shard object, but it does mutate state in a more abstract/conceptual sense. It's therefore cleaner and easier to reason about if handled inside the event loop.

aarshkshah1992 (Contributor) commented on Jul 12, 2021:

@raulk

While I agree with the conceptual-clarity point, please note that there is still a bug here:

  • We spin up a goroutine to acquire a shard. It fetches the transient.
  • A GC runs, enters the event loop, and deletes the transient, because the shard state is still Available and not Serving.
  • The initial acquire goroutine completes and we return a broken accessor to the client.

The fix is to optimistically mark the shard state as Serving before we spin up the goroutine to acquire the shard, since we also optimistically increment the shard access refcount there; see the sketch below.
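
A rough sketch of that fix, using simplified stand-in types; the field and state names are assumptions drawn from this thread, not the actual dagstore internals.

package sketch

import "sync"

type shardState int

const (
	ShardStateAvailable shardState = iota
	ShardStateServing
)

// Shard is a simplified stand-in for the real shard record.
type Shard struct {
	lk    sync.Mutex
	state shardState
	refs  int // accessor refcount
}

// acquireOptimistically flips the shard to Serving and bumps the refcount
// before spawning the async acquire, mirroring the existing optimistic
// refcount increment, so that a concurrent GC pass (which only reclaims
// Available or Errored shards) skips this shard while its transient is
// still being fetched.
func acquireOptimistically(s *Shard, fetchTransient func()) {
	s.lk.Lock()
	s.refs++
	s.state = ShardStateServing
	s.lk.Unlock()

	go fetchTransient() // GC can no longer delete this transient mid-fetch
}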

aarshkshah1992 (Contributor) replied:

Filed #59.

raulk (Member, Author) replied:

There's a tiny window, close to impossible to hit, but it's still there:

  1. The async fetch finishes and updates the transient path in the Upgrader.
  2. The async fetch sends OpShardMakeAvailable to the event loop.
  3. In parallel, GC marks this shard for reclaim.
  4. The OpShardGC arrives at the event loop before the OpShardMakeAvailable, so the shard hasn't been moved to the Serving state yet and its transient gets deleted.

The way to solve this is for OpShardGC to refuse to delete the transient if there are pending acquirers; a sketch of that guard follows.
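
A minimal sketch of that guard, again with stand-in field names (refs, wAcquire) that are assumptions based on this thread rather than the real dagstore fields.

package sketch

// reclaimCandidate is a simplified stand-in for the shard record the event
// loop inspects while handling a GC op.
type reclaimCandidate struct {
	refs     int        // active accessors currently holding the shard
	wAcquire []struct{} // acquirers parked or in flight for this shard
}

// safeToReclaim refuses to delete the transient while any acquire is in
// flight, closing the window where OpShardGC is processed before
// OpShardMakeAvailable and deletes a transient an acquirer is about to use.
func safeToReclaim(s *reclaimCandidate) bool {
	return s.refs == 0 && len(s.wAcquire) == 0
}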

aarshkshah1992 (Contributor) commented on Jul 12, 2021:

If we are okay with acquiring the refcount optimistically, why not mark the shard state as Serving optimistically? The shard state will be changed back to Available later, when the refcount drops to zero.

raulk (Member, Author) replied:

We actually do set the shard state to "Serving" optimistically if it's in the "Available" state. If the shard is still initializing and an acquirer comes in midway, we park the acquirer but we don't update the state to "Serving", because the true main state is "Initializing".

raulk merged commit 939c620 into master on Jul 12, 2021, and deleted the raulk/implement-gc branch (July 12, 2021, 10:38).
Merging this pull request may close: implement transient file tracking and cleanup (#26).

2 participants: raulk and aarshkshah1992