Improve error handling for threaded verification errors #1748
Conversation
Looks like all the CI failures are 'Infrastructure failures' and I noticed I can no longer restart failed jobs. Everything passes locally.
Can't `comm_d_proof_inner.validate` and the last `comm_r` asserts above fail in a similar way if the trees are corrupted?
Yes, there are other cases. One thing (for testing) at a time. Once wired through FFI and I'm sure it's propagated properly, adding in the others is straightforward.
Good news is that everything worked when wired through locally. I'll look into the other cases.
The current code reports the most recent error only. A possible change would be to collect the errors into a vector and, once done, print the contents of the whole vector if it's non-empty.
That's incorrect. Look closer.
Oh right. It stops on the first error it encounters. That's probably good enough.
I re-read the code again. Now I think it collects all errors. But wow, there must be a way to make all this easier to follow.
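To illustrate the "collect all errors" pattern being discussed, here is a minimal, hypothetical sketch (std-only, not the actual `proof.rs` code; all names are invented): each worker thread sends any verification failure through a channel, and the parent gathers the full list instead of reporting only the first one.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical sketch: workers report failures over a channel; the parent
// collects every error rather than stopping at the first.
fn verify_all(inputs: Vec<u64>) -> Result<(), Vec<String>> {
    let (tx, rx) = mpsc::channel();
    let handles: Vec<_> = inputs
        .into_iter()
        .map(|n| {
            let tx = tx.clone();
            thread::spawn(move || {
                // Stand-in for a real proof check: even values "verify".
                if n % 2 != 0 {
                    tx.send(format!("input {n} failed verification")).unwrap();
                }
            })
        })
        .collect();
    drop(tx); // close our sender so the receiver iteration can terminate
    for h in handles {
        h.join().expect("worker thread panicked");
    }
    let errors: Vec<String> = rx.into_iter().collect();
    if errors.is_empty() {
        Ok(())
    } else {
        Err(errors)
    }
}

fn main() {
    assert!(verify_all(vec![2, 4, 6]).is_ok());
    let errs = verify_all(vec![1, 2, 3]).unwrap_err();
    assert_eq!(errs.len(), 2);
}
```

This is only one possible shape; the actual code's behavior (first error vs. all errors) is exactly what the thread above is trying to pin down.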
Updated the error handling on all 4 of the threaded verifiers added for synth porep proofs. Wasted some cycles looking into it: of the 4 verifiers that were made threaded for consistency, only 1 (or 2) had the most perf impact. I recall this now, but again, the parallelism was added across the board for consistency, and now we consistently propagate errors across them all. There's also still a FIXME in place (on purpose) and some additional testing is needed (via my local
fix: remove invalid assertion (resolves #1749)
@@ -140,8 +152,7 @@ impl<'a, Tree: 'static + MerkleTreeTrait, G: 'static + Hasher> StackedDrg<'a, Tr
     Challenges::Synth(synth_challenges) => {
         // If there are no synthetic vanilla proofs stored on disk yet, generate them.
         if pub_inputs.seed.is_none() {
             info!("generating synthetic vanilla proofs in a single partition");
-            assert_eq!(partition_count, 1);
Looks good to me! I'm not seeing why this one `assert` was removed, but I won't let it delay approval.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nvm, I'm looking at #1750 now.
Adding it was an error. The partition count is only 1 for test sectors in porep, and the logging comment implied the wrong thing (i.e., that a single pass of synth porep proof generation implies the partition count must be 1, which is incorrect; it just meant we were not generating synth porep proofs with challenges being partition-specific, as in other places).
Fortunately, it was never released with that assert. It came up quickly in overall testing while looking into this issue, though.
Instead of using asserts, return proper errors. As this is non-trivial with yastl, rely solely on rayon for parallelism. The performance characteristics are the same (it's perhaps negligibly faster). Replaces #1748.
Asserts in a yastl thread pool can lead to things hanging instead of properly panicking. Spawning the validation was a performance optimization, but it turns out it's not needed, hence we can just remove the thread pool there. This also has the advantage that we don't have two different thread pools (from Rayon and yastl) fighting for the same resources. In addition, the number of threads for the Rayon thread pool can be bounded with an environment variable (the yastl one cannot), so there's more control over running the operations.

There was another instance of using the yastl thread pool: verifying the data before writing the synthetic PoRep proofs to disk. It has the same problem with the assert. By using Rayon instead, we again get the advantage of a more controllable thread pool, and it's also slightly faster. Replaces #1748.
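The "errors instead of asserts" shape described above can be sketched roughly as follows (a std-only illustration with invented names, not the actual PR code): each check returns a `Result`, and `try_for_each` short-circuits on the first failure, propagating it to the caller instead of panicking inside a pool thread. With rayon, swapping `iter()` for `par_iter()` keeps the same shape while running the checks in parallel.

```rust
// Hypothetical stand-in for a per-challenge verification step.
fn verify_challenge(c: u64) -> Result<(), String> {
    if c < 100 {
        Ok(())
    } else {
        Err(format!("invalid challenge coordinate: {c}"))
    }
}

// Instead of `assert!` inside the loop body, the whole loop returns a
// Result; the first failing check stops iteration and surfaces the error.
fn verify_proofs(challenges: &[u64]) -> Result<(), String> {
    challenges.iter().try_for_each(|&c| verify_challenge(c))
}

fn main() {
    assert!(verify_proofs(&[1, 2, 3]).is_ok());
    assert!(verify_proofs(&[1, 200, 3]).is_err());
}
```

With anyhow, the inner check could equally be written with `ensure!(c < 100, "invalid challenge coordinate: {c}")`, as the commit message suggests.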
#1748 fixed a problem where errors (asserts) were not properly propagated upwards: there were threads just hanging without making any progress. I usually follow the development paradigm of "make it work, make it right, make it fast" [1], where I interpret "make it right" as "make it simple". By "simple" I mean things like clear code that is easy to follow, no new paradigms, and following the style of the existing code base. I'd like to be more concrete about why I find this version "simpler".

The core issue was an assert within a spawned thread in a yastl pool. Someone new to the code base could easily re-introduce such a problem, hence it would be best if we can actually prevent that. By removing the yastl thread pool where it's not really needed for better performance, we can easily do so.

We see three different uses of the yastl thread pool within `proof.rs`. One uses it to pipeline operations; another is for highly parallelizing an operation. Usually for data parallelism we use Rayon in this code base. In the pipelining use case, Rayon is used within the pipeline; the yastl thread pool will spawn only a limited set of threads, so this looks alright. For the highly parallelized case, we only use yastl and not Rayon (although we should look into using just Rayon there instead). The third use, where we mix yastl and Rayon for highly parallel operations, is removed with this PR. This is intentional: having two thread pools, each of which by default uses as many threads as there are cores, could easily overprovision a system and potentially lead to unintended slowdowns. Another benefit of using just Rayon is that the number of threads can be controlled with an environment variable. This gives more control when several instances of this code run in parallel, which is the case for some storage providers.

Switching back from errors to asserts: I don't know why the asserts were changed to errors, hence I'm switching them back to asserts, as that is now easily possible. If one looks at the diff between the version prior to #1748 and this, the diff is pretty minimal and straightforward. Also, one verification again runs only in debug mode and not also in release mode. Though if errors are desired, they can easily be introduced by switching the asserts to anyhow's `ensure!()` macro.

Following the error handling now needs less context. With this change, the asserts happen right in the code, the way people coding Rust are used to. Prior to this change, more context was needed. Take the "invalid comm_d" error as an example. It happens on line 456 [2]. Reading through the code to see what it means: first look for the `invalid_comm_d` variable. It defines an instance of the `InvalidChallengeCoordinate` struct, which we take a quick look at and see is a local one, specifically for error handling. That instance is wrapped in an `Arc`, which is interesting at first glance, as in Rust shared state is usually avoided where possible. When looking into the usage of `invalid_comm_d`, it becomes clear that we need the `Arc`, as the threads might assign a different value to it in the error case. We need a mutable reference here, but again, in Rust immutable types are usually preferred. So if we can avoid the `Arc` as well as some of the mutability, it's a win for being more Rust idiomatic, hence easier for people familiar with Rust. For me that falls under the "code is written once, but read many times" category, so making it easy to read is a win.

In the lower part of the change, the yastl usage for high data parallelism is removed in favour of Rayon; see above for some of the reasons. Also, using Rayon here seems to use the thread pool more efficiently, at least on the machine I've tested it on (with 64 threads). When looking at the diff between this change and the commit prior to #1748 (`git diff 8f5bd86.. -w -- storage-proofs-porep/src/stacked/vanilla/proof.rs`), the changes are very minimal, which I also count as a sign of being "simpler".

[1]: https://wiki.c2.com/?MakeItWorkMakeItRightMakeItFast
[2]: https://github.com/filecoin-project/rust-fil-proofs/blob/3f018b51b6327b135830899d237a7ba181942d7e/storage-proofs-porep/src/stacked/vanilla/proof.rs#L456-L457
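As a minimal, self-contained illustration of why a panic inside a spawned thread needs deliberate handling (this is plain `std::thread`, not yastl or the PR's code): a panic in a worker does not abort the parent; it only surfaces if the parent joins the handle and inspects the result. A pool that neither joins nor forwards panics can leave the parent waiting while the failure goes unreported.

```rust
use std::thread;

fn main() {
    // The assert fails, so this thread panics internally.
    let handle = thread::spawn(|| {
        assert_eq!(1 + 1, 3, "verification failed");
    });
    // The panic is only observed because we join and check the result.
    match handle.join() {
        Ok(()) => println!("worker succeeded"),
        Err(_) => println!("worker panicked; error surfaced via join"),
    }
}
```

Returning `Result` values from the workers instead, as this PR does, makes the failure an ordinary value that the caller must handle, rather than a panic whose visibility depends on pool internals.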
For synth porep, we switch to threaded/parallel proof verification as an optimization. This takes it a step further and now allows errors to be propagated/returned properly.
Note: There is a FIXME in the code that will be removed after it's properly tested via filecoin-ffi testing.