Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

keep retrying the proof until we run out of sectors to skip #4633

Merged
merged 2 commits into from
Oct 30, 2020

Conversation

Stebalien
Copy link
Member

If we have a bunch of corrupted but not missing sectors on disk, we may need to
retry many times before we get a proof to pass. Simply giving up doesn't help anyone.

Copy link
Contributor

@magik6k magik6k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 Comment

storage/wdpost_run.go Show resolved Hide resolved
If we have a bunch of corrupted but not missing sectors on disk, we may need to
retry many times before we get a proof to pass. Simply giving up doesn't help anyone.
if ctx.Err() != nil {
log.Warnw("aborting PoSt due to context cancellation", "error", ctx.Err(), "deadline", di.Index)
return nil, ctx.Err()
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will explicitly check the context. We should cancel in

// Replace the aborted postWindow with a new one so that we can
// submit again at any time without the state getting clobbered
// when the abort completes
abort := pw.abort
if abort != nil {
pw = &postWindow{
di: pw.di,
ts: advance,
submitState: SubmitStateStart,
}
s.postWindows[pw.di.Open] = pw
// Abort the current submit
abort()
}
.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could also stop retrying once we get, e.g., 2/3rds of the way through the proof time but I'm not sure if that really makes sense. I guess sectors assigned to a single partition are somewhat correlated in time so their failure may be correlated? But I don't wan to:

  1. Spend a lot of time trying to prove one partition.
  2. Give up on that partition because we're running out of time.
  3. Spend a little time trying to prove all other partitions in the deadline and fail because we have a lot of faulty sectors.

When we could have eventually submitted a valid proof for the first partition, if we had simply stuck with it.

@Stebalien
Copy link
Member Author

Note: I agree this isn't the optimal solution, it's just strictly better than what we're doing now. Once we merge this, I plan on:

  • performing a more thorough check before recovering.
  • running the recovery check on a partition batch if we fail PoSt 2 times in a row.

@magik6k magik6k merged commit 8eae921 into master Oct 30, 2020
@magik6k magik6k deleted the steb/rick-roll branch October 30, 2020 23:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants