Improve error handling for threaded verification errors #1748
Conversation
Looks like all the CI failures are 'Infrastructure failures' and I noticed I can no longer restart failed jobs. Everything passes locally.
Can't `comm_d_proof_inner.validate` and the last `comm_r` asserts above fail in a similar way if the trees are corrupted?
Yes, there are other cases. One thing (for testing) at a time. Once wired through FFI and I'm sure it's propagated properly, adding in the others is straightforward.
Good news is that everything worked when wired through locally. I'll look into the other cases.
The current code reports the most recent error only. A possible change would be to collect the errors into a vector and, once done, print the contents of the whole vector if it's non-empty.
That's incorrect. Look closer.
Oh right. It stops on the first error it encounters. That's probably good enough.
I re-read the code again. Now I think it collects all errors. But wow, there must be a way to make all this easier to follow.
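To illustrate the "collect all errors" pattern being discussed, here is a minimal, hypothetical sketch (std-only, not the actual `proof.rs` code; all names are invented): each worker thread sends any verification failure through a channel, and the parent gathers the full list instead of reporting only the first one.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical sketch: workers report failures over a channel; the parent
// collects every error rather than stopping at the first.
fn verify_all(inputs: Vec<u64>) -> Result<(), Vec<String>> {
    let (tx, rx) = mpsc::channel();
    let handles: Vec<_> = inputs
        .into_iter()
        .map(|n| {
            let tx = tx.clone();
            thread::spawn(move || {
                // Stand-in for a real proof check: even values "verify".
                if n % 2 != 0 {
                    tx.send(format!("input {n} failed verification")).unwrap();
                }
            })
        })
        .collect();
    drop(tx); // close our sender so the receiver iteration can terminate
    for h in handles {
        h.join().expect("worker thread panicked");
    }
    let errors: Vec<String> = rx.into_iter().collect();
    if errors.is_empty() {
        Ok(())
    } else {
        Err(errors)
    }
}

fn main() {
    assert!(verify_all(vec![2, 4, 6]).is_ok());
    let errs = verify_all(vec![1, 2, 3]).unwrap_err();
    assert_eq!(errs.len(), 2);
}
```

This is only one possible shape; the actual code's behavior (first error vs. all errors) is exactly what the thread above is trying to pin down.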
Updated the error handling on all 4 of the threaded verifiers added for synth porep proofs. Wasted some cycles looking into it: of the 4 verifiers that were made threaded for consistency, only 1 (or 2) had the most perf impact. I recall this now, but again, the parallelism was added across the board for consistency, and now we consistently propagate errors across them all. There's also still a FIXME in place (on purpose) and some additional testing is needed (via my local
fix: remove invalid assertion (resolves #1749)
@@ -140,8 +152,7 @@ impl<'a, Tree: 'static + MerkleTreeTrait, G: 'static + Hasher> StackedDrg<'a, Tr
     Challenges::Synth(synth_challenges) => {
         // If there are no synthetic vanilla proofs stored on disk yet, generate them.
         if pub_inputs.seed.is_none() {
             info!("generating synthetic vanilla proofs in a single partition");
-            assert_eq!(partition_count, 1);
Looks good to me! I'm not seeing why this one `assert` was removed, but I won't let it delay approval.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nvm, I'm looking at #1750 now.
Adding it was an error. The partition count is only 1 for test sectors in porep, and the logging comment implied the wrong thing (i.e., that a single pass of synth porep proof generation implies the partition count must be 1, which is incorrect; it just meant we were not generating synth porep proofs with challenges being partition-specific, as in other places).
Fortunately, it was never released with that assert. It came up quickly in overall testing while looking into this issue, though.
Instead of using asserts, return proper errors. As this is non-trivial with yastl, rely solely on rayon for parallelism. The performance characteristics are the same (it's perhaps negligibly faster). Replaces #1748.
Asserts in a yastl thread pool can lead to things hanging instead of properly panicking. Spawning the validation was a performance optimization, but it turns out it's not needed, hence we can just remove the thread pool there. This also has the advantage that we don't have two different thread pools (from Rayon and yastl) fighting for the same resources. In addition, the number of threads for the Rayon thread pool can be bounded with an environment variable (the yastl one cannot), so there's more control over running the operations.

There was another instance of using the yastl thread pool: verifying the data before writing the synthetic PoRep proofs to disk. It has the same problem with the assert. By using Rayon instead, we again get the advantage of a more controllable thread pool, and it's also slightly faster. Replaces #1748.
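The "errors instead of asserts" shape described above can be sketched roughly as follows (a std-only illustration with invented names, not the actual PR code): each check returns a `Result`, and `try_for_each` short-circuits on the first failure, propagating it to the caller instead of panicking inside a pool thread. With rayon, swapping `iter()` for `par_iter()` keeps the same shape while running the checks in parallel.

```rust
// Hypothetical stand-in for a per-challenge verification step.
fn verify_challenge(c: u64) -> Result<(), String> {
    if c < 100 {
        Ok(())
    } else {
        Err(format!("invalid challenge coordinate: {c}"))
    }
}

// Instead of `assert!` inside the loop body, the whole loop returns a
// Result; the first failing check stops iteration and surfaces the error.
fn verify_proofs(challenges: &[u64]) -> Result<(), String> {
    challenges.iter().try_for_each(|&c| verify_challenge(c))
}

fn main() {
    assert!(verify_proofs(&[1, 2, 3]).is_ok());
    assert!(verify_proofs(&[1, 200, 3]).is_err());
}
```

With anyhow, the inner check could equally be written with `ensure!(c < 100, "invalid challenge coordinate: {c}")`, as the commit message suggests.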
#1748 fixed a problem where errors (asserts) were not properly propagated upwards: there were threads just hanging without making any progress. I usually follow the development paradigm of "make it work, make it right, make it fast" [1], where I interpret "make it right" as "make it simple". By "simple" I mean things like clear code that is easy to follow, no new paradigms, and following the style of the existing code base. I'd like to be more concrete about why I find this version "simpler".

The core issue was an assert within a spawned thread in a yastl pool. Someone new to the code base could easily re-introduce such a problem, hence it would be best if we can actually prevent that. By removing the yastl thread pool where it's not really needed for better performance, we can easily do so.

We see three different uses of the yastl thread pool within `proof.rs`. One uses it to pipeline operations; another is for highly parallelizing an operation. Usually for data parallelism we use Rayon in this code base. In the pipelining use case, Rayon is used within the pipeline; the yastl thread pool will spawn only a limited set of threads, so this looks alright. For the highly parallelized case, we only use yastl and not Rayon (although we should look into using just Rayon there instead). The third use, where we mix yastl and Rayon for highly parallel operations, is removed with this PR. This is intentional: having two thread pools, each of which by default uses as many threads as there are cores, could easily overprovision a system and potentially lead to unintended slowdowns. Another benefit of using just Rayon is that the number of threads can be controlled with an environment variable. This gives more control when several instances of this code run in parallel, which is the case for some storage providers.

Switching back from errors to asserts: I don't know why the asserts were changed to errors, hence I'm switching them back to asserts, as that is now easily possible. If one looks at the diff between the version prior to #1748 and this, the diff is pretty minimal and straightforward. Also, one verification again runs only in debug mode and not also in release mode. Though if errors are desired, they can easily be introduced by switching the asserts to anyhow's `ensure!()` macro.

Following the error handling now needs less context. With this change, the asserts happen right in the code, the way people coding Rust are used to. Prior to this change, more context was needed. Take the "invalid comm_d" error as an example. It happens on line 456 [2]. Reading through the code to see what it means: first look for the `invalid_comm_d` variable. It defines an instance of the `InvalidChallengeCoordinate` struct, which we take a quick look at and see is a local one, specifically for error handling. That instance is wrapped in an `Arc`, which is interesting at first glance, as in Rust shared state is usually avoided where possible. When looking into the usage of `invalid_comm_d`, it becomes clear that we need the `Arc`, as the threads might assign a different value to it in the error case. We need a mutable reference here, but again, in Rust immutable types are usually preferred. So if we can avoid the `Arc` as well as some of the mutability, it's a win for being more Rust idiomatic, hence easier for people familiar with Rust. For me that falls under the "code is written once, but read many times" category, so making it easy to read is a win.

In the lower part of the change, the yastl usage for high data parallelism is removed in favour of Rayon; see above for some of the reasons. Also, using Rayon here seems to use the thread pool more efficiently, at least on the machine I've tested it on (with 64 threads). When looking at the diff between this change and the commit prior to #1748 (`git diff 8f5bd86.. -w -- storage-proofs-porep/src/stacked/vanilla/proof.rs`), the changes are very minimal, which I also count as a sign of being "simpler".

[1]: https://wiki.c2.com/?MakeItWorkMakeItRightMakeItFast
[2]: https://github.com/filecoin-project/rust-fil-proofs/blob/3f018b51b6327b135830899d237a7ba181942d7e/storage-proofs-porep/src/stacked/vanilla/proof.rs#L456-L457
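As a minimal, self-contained illustration of why a panic inside a spawned thread needs deliberate handling (this is plain `std::thread`, not yastl or the PR's code): a panic in a worker does not abort the parent; it only surfaces if the parent joins the handle and inspects the result. A pool that neither joins nor forwards panics can leave the parent waiting while the failure goes unreported.

```rust
use std::thread;

fn main() {
    // The assert fails, so this thread panics internally.
    let handle = thread::spawn(|| {
        assert_eq!(1 + 1, 3, "verification failed");
    });
    // The panic is only observed because we join and check the result.
    match handle.join() {
        Ok(()) => println!("worker succeeded"),
        Err(_) => println!("worker panicked; error surfaced via join"),
    }
}
```

Returning `Result` values from the workers instead, as this PR does, makes the failure an ordinary value that the caller must handle, rather than a panic whose visibility depends on pool internals.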
For synth porep, we switch to threaded/parallel proof verification as an optimization. This takes it a step further and now allows errors to be propagated/returned properly.
Note: There is a FIXME in the code that will be removed after it's properly tested via filecoin-ffi testing.