
implement DAGStore#GC. #52

Merged: raulk merged 2 commits into master from raulk/implement-gc on Jul 12, 2021
Conversation

raulk (Member) commented on Jul 10, 2021:

Implements the GC() method outlined in #26.
Closes #26.

//
// However, the event loop checks for safety prior to deletion, so it will skip
// over shards that are no longer safe to delete.
func (d *DAGStore) GC(ctx context.Context) (map[shard.Key]error, error) {
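
For context, here is a minimal sketch of how a caller might consume the method whose signature is quoted above. The module path and the meaning of a non-nil map value are assumptions on my part, not something this PR states.

package gcexample

import (
	"context"
	"fmt"
	"log"

	"github.com/filecoin-project/dagstore" // assumed module path; not stated in this PR
)

// runGC is a hypothetical caller of the new method.
func runGC(ctx context.Context, ds *dagstore.DAGStore) {
	// GC returns a per-shard error map plus an overall error.
	results, err := ds.GC(ctx)
	if err != nil {
		log.Printf("GC failed to run: %s", err)
		return
	}
	for key, shardErr := range results {
		if shardErr != nil {
			// Assumption: a non-nil map value means this shard's transient
			// could not be reclaimed (e.g. it was no longer safe to delete).
			fmt.Printf("could not reclaim shard %v: %s\n", key, shardErr)
		}
	}
}
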
aarshkshah1992 (Contributor) commented:

@raulk Can we please launch a goroutine to do this periodically (with a configurable period) in the DAGStore? Otherwise, we'll have to place this burden on all clients.

raulk (Member, Author) replied:

I'm not convinced periodic GC is the best approach, although I filed it here: #56. For now, we probably want to keep it manual, but in the future we will want to monitor disk usage of the transients dir and perform GC when it reaches a watermark, not periodically.
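
To illustrate the watermark idea, here is a hypothetical client-side helper (not part of this PR or of the dagstore API): it polls the size of the transients directory and only triggers GC once usage crosses a threshold.

package gcexample

import (
	"context"
	"log"
	"time"

	"github.com/filecoin-project/dagstore" // assumed module path; not stated in this PR
)

// gcAtWatermark is a hypothetical helper: it measures the transients
// directory on every tick, but only triggers GC when usage crosses the
// watermark, rather than GC'ing on a fixed period.
func gcAtWatermark(ctx context.Context, ds *dagstore.DAGStore, dirSize func() (int64, error), watermark int64, poll time.Duration) {
	ticker := time.NewTicker(poll)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			size, err := dirSize()
			if err != nil {
				log.Printf("failed to measure transients dir: %s", err)
				continue
			}
			if size < watermark {
				continue // below the watermark; nothing to reclaim yet
			}
			if _, err := ds.GC(ctx); err != nil {
				log.Printf("GC failed: %s", err)
			}
		}
	}
}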

for _, s := range d.shards {
s.lk.RLock()
if s.state == ShardStateAvailable || s.state == ShardStateErrored {
reclaim = append(reclaim, s)
aarshkshah1992 (Contributor) commented on Jul 12, 2021:

@raulk Since we have the Shard lock here, why not delete the transient here?

Deleting a transient needs the Upgrader lock and it doesn't mutate any Shard state. Why should the deletion of the transient happen in the event loop if it doesn't lead to updating the Shard state?

raulk (Member, Author) replied:

Because there could be acquire operations queued at this point that still haven't been reflected in the state.

raulk (Member, Author) commented on Jul 12, 2021:

Note: deleting a transient doesn't mutate the shard state directly, in the sense that it doesn't touch the fields of the Shard object, but it does mutate state in a more abstract/conceptual sense. It's therefore cleaner and easier to reason about if handled inside the event loop.

aarshkshah1992 (Contributor) commented on Jul 12, 2021:

@raulk

While I agree with the conceptual-clarity point, please note that there is still a bug here:

  • We spin up a goroutine to acquire a shard. It fetches the transient.
  • A GC runs, enters the event loop, and deletes the transient, because the shard state is still Available and not Serving.
  • The initial acquire goroutine completes and we return a broken accessor to the client.

The fix is to optimistically mark the shard state as Serving before we spin up the goroutine to acquire the shard, since we also optimistically increment the shard access refcount there; see the sketch below.
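
A rough sketch of that fix, using simplified stand-in types; the field and state names are assumptions drawn from this thread, not the actual dagstore internals.

package sketch

import "sync"

type shardState int

const (
	ShardStateAvailable shardState = iota
	ShardStateServing
)

// Shard is a simplified stand-in for the real shard record.
type Shard struct {
	lk    sync.Mutex
	state shardState
	refs  int // accessor refcount
}

// acquireOptimistically flips the shard to Serving and bumps the refcount
// before spawning the async acquire, mirroring the existing optimistic
// refcount increment, so that a concurrent GC pass (which only reclaims
// Available or Errored shards) skips this shard while its transient is
// still being fetched.
func acquireOptimistically(s *Shard, fetchTransient func()) {
	s.lk.Lock()
	s.refs++
	s.state = ShardStateServing
	s.lk.Unlock()

	go fetchTransient() // GC can no longer delete this transient mid-fetch
}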

aarshkshah1992 (Contributor) replied:

Filed #59.

raulk (Member, Author) replied:

There's a tiny window, close to impossible to hit, but it's still there:

  1. The async fetch finishes and updates the transient path in the Upgrader.
  2. The async fetch sends OpShardMakeAvailable to the event loop.
  3. In parallel, GC marks this shard for reclaim.
  4. The OpShardGC arrives at the event loop before the OpShardMakeAvailable, so the shard hasn't been moved to the Serving state yet and its transient gets deleted.

The way to solve this is for OpShardGC to refuse to delete the transient if there are pending acquirers; a sketch of that guard follows.
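
A minimal sketch of that guard, again with stand-in field names (refs, wAcquire) that are assumptions based on this thread rather than the real dagstore fields.

package sketch

// reclaimCandidate is a simplified stand-in for the shard record the event
// loop inspects while handling a GC op.
type reclaimCandidate struct {
	refs     int        // active accessors currently holding the shard
	wAcquire []struct{} // acquirers parked or in flight for this shard
}

// safeToReclaim refuses to delete the transient while any acquire is in
// flight, closing the window where OpShardGC is processed before
// OpShardMakeAvailable and deletes a transient an acquirer is about to use.
func safeToReclaim(s *reclaimCandidate) bool {
	return s.refs == 0 && len(s.wAcquire) == 0
}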

aarshkshah1992 (Contributor) commented on Jul 12, 2021:

If we are okay with acquiring the refcount optimistically, why not mark the shard state as Serving optimistically? The shard state will be changed back to Available later, when the refcount drops to zero.

raulk (Member, Author) replied:

We actually do set the shard state to "Serving" optimistically if it's in the "Available" state. If the shard is still initializing and an acquirer comes in midway, we park the acquirer but we don't update the state to "Serving", because the true main state is "Initializing".

raulk merged commit 939c620 into master on Jul 12, 2021, and deleted the raulk/implement-gc branch (July 12, 2021, 10:38).
Merging this pull request may close: implement transient file tracking and cleanup (#26).

2 participants: raulk and aarshkshah1992