keep retrying the proof until we run out of sectors to skip #4633

Stebalien · 2020-10-29T01:15:24Z

If we have a bunch of corrupted but not missing sectors on disk, we may need to
retry many times before we get a proof to pass. Simply giving up doesn't help anyone.
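
The retry behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in `storage/wdpost_run.go`: `sectorID`, `proveFunc`, `proveWithSkips`, and the demo prover are hypothetical names, and the sketch assumes the prover reports which sectors it could not prove so they can be skipped on the next attempt.

```go
package main

import (
	"errors"
	"fmt"
)

// sectorID is a hypothetical stand-in for a sector number.
type sectorID uint64

// proveFunc is a hypothetical prover. It tries to prove every sector that is
// not in skipped and, on failure, reports the sectors it found to be bad.
type proveFunc func(skipped map[sectorID]bool) (bad []sectorID, err error)

var errNothingLeft = errors.New("no sectors left to prove")

// proveWithSkips keeps retrying until the proof passes or every sector has
// been skipped. It assumes all holds unique IDs and bad is a subset of all.
func proveWithSkips(all []sectorID, prove proveFunc) (map[sectorID]bool, int, error) {
	skipped := make(map[sectorID]bool)
	attempts := 0
	for len(skipped) < len(all) {
		attempts++
		bad, err := prove(skipped)
		if err == nil {
			return skipped, attempts, nil
		}
		// Skip the newly reported bad sectors and try again.
		progress := false
		for _, s := range bad {
			if !skipped[s] {
				skipped[s] = true
				progress = true
			}
		}
		if !progress {
			// The failure named no new sectors, so retrying cannot help.
			return skipped, attempts, err
		}
	}
	return skipped, attempts, errNothingLeft
}

// demoProveWithSkips runs the loop against a prover that reports one
// corrupted sector per failed attempt.
func demoProveWithSkips() (int, int, error) {
	all := []sectorID{1, 2, 3, 4, 5}
	corrupted := map[sectorID]bool{2: true, 4: true}
	prove := func(skipped map[sectorID]bool) ([]sectorID, error) {
		for _, s := range all {
			if corrupted[s] && !skipped[s] {
				return []sectorID{s}, fmt.Errorf("sector %d failed verification", s)
			}
		}
		return nil, nil
	}
	skipped, attempts, err := proveWithSkips(all, prove)
	return attempts, len(skipped), err
}

func main() {
	attempts, skipped, err := demoProveWithSkips()
	fmt.Println(attempts, skipped, err)
}
```

With two corrupted sectors that surface one at a time, a fixed small retry limit could give up before the proof has a chance to pass; looping until there is nothing left to skip avoids that.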

Stebalien · 2020-10-30T21:05:47Z

storage/wdpost_run.go

+			if ctx.Err() != nil {
+				log.Warnw("aborting PoSt due to context cancellation", "error", ctx.Err(), "deadline", di.Index)
+				return nil, ctx.Err()
+			}


This will explicitly check the context. We should cancel in lotus/storage/wdpost_changehandler.go (lines 412 to 426 in 077bc83):

	// Replace the aborted postWindow with a new one so that we can
	// submit again at any time without the state getting clobbered
	// when the abort completes
	abort := pw.abort
	if abort != nil {
		pw = &postWindow{
			di:          pw.di,
			ts:          advance,
			submitState: SubmitStateStart,
		}
		s.postWindows[pw.di.Open] = pw

		// Abort the current submit
		abort()
	}
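
To show how the two pieces interact, here is a small self-contained sketch. It is not the lotus code: `window`, `runProof`, and `demoAbort` are hypothetical, simplified stand-ins, and the sketch assumes the abort function is a `context.CancelFunc` for the context the proof loop runs under. The change handler replaces the window first and then aborts, and the retry loop notices the cancellation because it checks `ctx.Err()` on every iteration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// window is a hypothetical, simplified stand-in for postWindow.
type window struct {
	open  int
	abort context.CancelFunc
}

// runProof retries attempt until it succeeds, checking the context
// explicitly before every attempt so that an abort ends the loop.
func runProof(ctx context.Context, attempt func() bool) error {
	for {
		if err := ctx.Err(); err != nil {
			return err
		}
		if attempt() {
			return nil
		}
	}
}

// demoAbort simulates a change handler aborting a proof that never passes.
// It returns the number of attempts made, whether the loop stopped because
// of cancellation, and whether the window was replaced by a fresh one.
func demoAbort() (int, bool, bool) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel() // calling cancel more than once is harmless

	windows := map[int]*window{}
	pw := &window{open: 10, abort: cancel}
	windows[pw.open] = pw

	attempts := 0
	err := runProof(ctx, func() bool {
		attempts++
		if attempts == 3 {
			// Replace the window before aborting, so the new state is
			// not clobbered when the abort completes.
			abort := pw.abort
			pw = &window{open: pw.open}
			windows[pw.open] = pw
			abort()
		}
		return false // this proof attempt never succeeds
	})

	cancelled := errors.Is(err, context.Canceled)
	replaced := windows[10].abort == nil
	return attempts, cancelled, replaced
}

func main() {
	attempts, cancelled, replaced := demoAbort()
	fmt.Println(attempts, cancelled, replaced)
}
```

Without the explicit `ctx.Err()` check, a loop that retries until it runs out of sectors could keep going after the abort.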

I could also stop retrying once we get, e.g., two-thirds of the way through the proof time, but I'm not sure that really makes sense. I guess sectors assigned to a single partition are somewhat correlated in time, so their failures may be correlated too. But I don't want to:

1. Spend a lot of time trying to prove one partition.
2. Give up on that partition because we're running out of time.
3. Spend a little time trying to prove all the other partitions in the deadline and fail because we have a lot of faulty sectors.

All this when we could have eventually submitted a valid proof for the first partition, if we had simply stuck with it.
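
For reference, the cutoff alternative that was considered here (and not adopted in this PR) would amount to a check like the one below. This is only a sketch: `pastCutoff` is a hypothetical helper, and the 30-minute window used in the demo is an assumed value, not one taken from this PR.

```go
package main

import (
	"fmt"
	"time"
)

// pastCutoff reports whether elapsed is more than num/den of total.
// It uses integer arithmetic to avoid floating-point rounding.
func pastCutoff(elapsed, total time.Duration, num, den int64) bool {
	return elapsed*time.Duration(den) > total*time.Duration(num)
}

// demoCutoff evaluates a two-thirds cutoff just before and just after the
// threshold, for an assumed 30-minute proving window.
func demoCutoff() (bool, bool) {
	window := 30 * time.Minute // assumption for illustration only
	before := pastCutoff(19*time.Minute, window, 2, 3)
	after := pastCutoff(21*time.Minute, window, 2, 3)
	return before, after
}

func main() {
	before, after := demoCutoff()
	fmt.Println(before, after)
}
```

The concern raised above is that acting on such a check means abandoning a partition that might still have been provable, only to fail on the remaining partitions for the same reason.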

Stebalien · 2020-10-30T21:07:20Z

Note: I agree this isn't the optimal solution; it's just strictly better than what we're doing now. Once we merge this, I plan on:

- Performing a more thorough check before recovering.
- Running the recovery check on a partition batch if we fail PoSt 2 times in a row.

Stebalien requested review from Kubuxu, magik6k and whyrusleeping as code owners October 29, 2020 01:15

Stebalien mentioned this pull request Oct 29, 2020

Additional checks required when recovering sectors in checkNextRecoveries #4634

Open

whyrusleeping approved these changes Oct 30, 2020

magik6k reviewed Oct 30, 2020

Stebalien added 2 commits October 30, 2020 13:21

keep retrying the proof until we run out of sectors to skip

6985af5

If we have a bunch of corrupted but not missing sectors on disk, we may need to retry many times before we get a proof to pass. Simply giving up doesn't help anyone.

explicitly abort PoSt on context cancellation

077bc83

Stebalien force-pushed the steb/rick-roll branch from 6bf6c3b to 077bc83 Compare October 30, 2020 21:02

Stebalien commented Oct 30, 2020

magik6k approved these changes Oct 30, 2020

magik6k merged commit 8eae921 into master Oct 30, 2020

magik6k deleted the steb/rick-roll branch October 30, 2020 23:15
