keep retrying the proof until we run out of sectors to skip #4633
If we have a bunch of corrupted but not missing sectors on disk, we may need to retry many times before we get a proof to pass. Simply giving up doesn't help anyone.
6bf6c3b to 077bc83
if ctx.Err() != nil {
	log.Warnw("aborting PoSt due to context cancellation", "error", ctx.Err(), "deadline", di.Index)
	return nil, ctx.Err()
}
This will explicitly check the context. We should cancel in lotus/storage/wdpost_changehandler.go, lines 412 to 426 in 077bc83:
// Replace the aborted postWindow with a new one so that we can
// submit again at any time without the state getting clobbered
// when the abort completes
abort := pw.abort
if abort != nil {
	pw = &postWindow{
		di:          pw.di,
		ts:          advance,
		submitState: SubmitStateStart,
	}
	s.postWindows[pw.di.Open] = pw
	// Abort the current submit
	abort()
}
I could also stop retrying once we get, e.g., two thirds of the way through the proof time, but I'm not sure that really makes sense. I guess sectors assigned to a single partition are somewhat correlated in time, so their failures may be correlated? But I don't want to:
- Spend a lot of time trying to prove one partition.
- Give up on that partition because we're running out of time.
- Spend a little time trying to prove all other partitions in the deadline and fail because we have a lot of faulty sectors.
All of that when we could have eventually submitted a valid proof for the first partition, if we had simply stuck with it.
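For reference, the two-thirds cutoff mentioned above (which this PR does not adopt) could look something like the sketch below. The function name, the use of plain `int64` epochs, and the example numbers are all assumptions made for illustration.

```go
package main

import "fmt"

// pastRetryCutoff reports whether the current epoch is at or beyond
// two thirds of the way through the window [openEpoch, closeEpoch).
// All names here are hypothetical.
func pastRetryCutoff(openEpoch, closeEpoch, current int64) bool {
	cutoff := openEpoch + (closeEpoch-openEpoch)*2/3
	return current >= cutoff
}

func main() {
	// With a window of 60 epochs starting at 100, the cutoff is epoch 140.
	fmt.Println(pastRetryCutoff(100, 160, 139)) // prints "false"
	fmt.Println(pastRetryCutoff(100, 160, 140)) // prints "true"
}
```

The downside described in the list above still applies: a fixed cutoff may abandon a partition that would have eventually proven successfully.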
Note: I agree this isn't the optimal solution, it's just strictly better than what we're doing now. Once we merge this, I plan on: