feat: add parent cache verification and a setting to enable it #1265
Conversation
@@ -33,7 +35,7 @@ pub struct ParentCache {
 #[derive(Debug)]
 struct CacheData {
     /// This is a large list of fixed (parent) sized arrays.
-    data: memmap::Mmap,
+    pub data: memmap::Mmap,
no need to make this pub, I believe
    .with_context(|| format!("could not mmap path={}", path.display()))?
};
hasher.update(&data);
drop(data);
It is likely more efficient to use `std::io::ReadBuf` with a regular file open than mmap for just hashing the data.
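For reference, a minimal sketch of that buffered-read approach, assuming the `sha2` and `anyhow` crates this PR already uses (the function name and buffer size are illustrative, not part of the PR):

```rust
use std::fs::File;
use std::io::Read;
use std::path::Path;

use anyhow::Context;
use sha2::{Digest, Sha256};

/// Hash a cache file by streaming it through a fixed-size buffer instead of
/// mmapping the whole file.
fn hash_cache_file(path: &Path) -> anyhow::Result<Vec<u8>> {
    let mut file = File::open(path)
        .with_context(|| format!("could not open path={}", path.display()))?;
    let mut hasher = Sha256::new();
    // Reuse one large buffer so each read covers a sizeable chunk.
    let mut buf = vec![0u8; 16 * 1024 * 1024];
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break;
        }
        hasher.update(&buf[..n]);
    }
    Ok(hasher.finalize().to_vec())
}
```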
Out of curiosity, do you know of any fast rust hashers/crates made specifically for large data/files that might fit this use? I was wondering if something like that may work out all around better, but figured I'd stick with what we already have support for in the first cut.
Given the systems we run on, sha2 is likely good enough; the other fast alternative I could suggest is https://github.com/BLAKE3-team/BLAKE3
Ok thanks, I tried blake2s since it was already supported, but switched back to sha2 ;-) I'll give the ReadBuf approach a go.
Was hoping for more of a speed-up, but if we loop over full buffers, it's slightly faster. I'll update shortly.
Very odd, on second test it appears much slower. Hrm ...
Dig, could you take a look at the commented out attempt at doing this? I tested several times and it appears much slower, so I'm probably doing something wrong.
Am I right to conclude dig's suggestion didn't turn out to be faster? (It seems not to have been taken in the current version.)
As added, it was far slower in my testing, so I reverted. I suspect dig may know a faster method or way of applying it, so it can be improved over time as needed.
Although this validates the cache file against a checksum created locally, it doesn't check for absolute correctness. It therefore does not in any way address the situation in which bad hardware (especially RAM) leads to a wrongly-generated local cache. This is the primary use case for this integrity check. We need to calculate and distribute canonical checksums and use those for the validation.
One possible solution is to extend `parameters.json` to also handle this more general case. Another would be to create a new manifest file for the caches.
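For the manifest-file option, something in the spirit of the following could work (purely illustrative; the field names are assumptions and this assumes serde/serde_json, nothing in this PR):

```rust
use std::collections::HashMap;

use serde::Deserialize;

/// Hypothetical manifest entry mapping a parent-cache file name to its
/// canonical digest, analogous to the entries in parameters.json.
#[derive(Debug, Deserialize)]
pub struct CacheManifestEntry {
    /// Hex-encoded digest of the canonical cache file.
    pub digest: String,
    /// Sector size the cache corresponds to, in bytes.
    pub sector_size: u64,
}

/// Keyed by cache file name, e.g. "v28-sdr-parent-<id>.cache".
pub type CacheManifest = HashMap<String, CacheManifestEntry>;

pub fn parse_cache_manifest(json: &str) -> serde_json::Result<CacheManifest> {
    serde_json::from_str(json)
}
```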
Sure, I'm open to all of that. However, since re-generating the parent cache 'fixes' the problem in all reported cases, this change is meant to address that aspect.
I might be misunderstanding. Consider the case in which the user has generated a bad cache locally and created a checksum memorializing that bad cache as 'correct'. When the cache is checked for correctness later, it will be consistent with its original value and therefore won't be regenerated, right? If so, I don't see how this will fix the problem. The bad cache will still be used, and the seal results will be wrong.
There are two separate cases that I'm aware of, one of which can be solved with a general software solution.
I believe that #1 is primarily the one we're interested in, because all reports I'm aware of are solved by either 1) re-generating the parent cache (by deleting it), or 2) testing RAM and physically replacing the hardware when needed (or adjusting speed settings, etc.).
I am not aware of disk corruption being a significant risk — although it's good if we protect against it.
The problem is that if a user has previously had bad hardware and generated a bad cache, then even if they replace the hardware and are now capable of generating a correct cache (or would be capable of generating good proofs with downloaded cache), they won't detect that they are using a bad cache. That was what happened in the report here: #1264 (comment) — which was the primary motivation for #1264. That is the motivation for the acceptance criteria: 'A bad graph cache doesn't lead to failure if current hardware produces a correct graph. Instead, the bad cache is replaced by a newly-generated good one.'
The problem is that even if bad hardware is detected and replaced, if an old bad cache is still used, bad proofs will still be generated. At the very least, this condition needs to be detected. If bad hardware is the root cause and has not been replaced, the best we can do is fail — but we certainly should. If the underlying problem has been fixed, we should repair the cache and continue.
We can agree to disagree about whether the reports that have been solved by re-downloading parameters or re-generating the parent cache are likely due to disk corruption. Those errors would certainly persist beyond that with bad RAM.
If a user replaced known bad hardware, the current recommendation here is still to regenerate the cache by deleting it, or to enable the verification afterward, which would catch this error case. As I said, I'm open to some "known good hash" list somewhere, but it doesn't fix anything; it reports something and then triggers the actual solution. Current best practice for a known bad disk is to replace the disk and re-download/re-generate the parameters. I don't see how replacing RAM is any different, as probably a lot more than the parent cache is already corrupt, including sealed sectors, etc. In any case, an extra hash comparison is easy and can help trigger that (i.e. the check for the consistency check), but maintaining that list is potentially cumbersome.
I'm pausing the ongoing discussion, since we have largely resolved these issues offline. Separately, cache keys need to include the `porep_id`.
Fortunately, the porep_id appears to already be included in the current cache path/id, so the cache will be invalidated automatically.
Ah right, good. Thanks for checking.
Looks good, assuming fully tested.
fn gen_graph_cache<Tree: 'static + MerkleTreeTrait>(
    sector_size: usize,
    challenge_count: usize,
This should not affect the graph.
fn gen_graph_cache<Tree: 'static + MerkleTreeTrait>(
    sector_size: usize,
    challenge_count: usize,
    layers: usize,
This should not affect the graph.
let nodes = (sector_size / 32) as usize;
let drg_degree = filecoin_proofs::constants::DRG_DEGREE;
let expansion_degree = filecoin_proofs::constants::EXP_DEGREE;
let layer_challenges = LayerChallenges::new(layers, challenge_count);
We should consider using a default value for `layer_challenges` if we need it only to instantiate `SetupParams`. If the value affects the generated graph, something unexpected is happening. If it doesn't, then it's misleading and unnecessary to pass it to `gen_graph_cache`. (However, if it's not awkward to do so, maybe it's not a big deal. In that case, I would at least add a comment to the effect that these values don't affect the graph.)
Added comment and replaced with fixed values.
// filecoin-proofs-api:src/registry [porep_id()] that matches the specified
// sector size and must be updated when that value is updated for the proper
// graph cache generation/validation.
//
This duplication is annoying, though maybe the right tradeoff here/now.
    .read()
    .expect("LAYERS read failure")
    .get(&sector_size)
    .expect("unknown sector size") as usize;
As noted above, we could dispense with `challenge_count` and `layers` since they don't affect the graph.
@@ -34,6 +35,7 @@ impl Default for Settings {
     fn default() -> Self {
         Settings {
             maximize_caching: true,
+            verify_cache: false,
How expensive is cache verification? Defaulting to `true` is likely a MUCH better UX in many cases where this matters.
It may cost miners hours of time per sector, so instead of defaulting to true, it should be advised to check when the error cases arise. My fastest machine can verify all sizes in 2.5 hours, which is more than is needed (if enabled), but almost all of the time is spent on the 32 and 64 GiB verifications, so it'll add up quickly.
Got it. Let's make sure to document clearly and with sufficient visibility that those who will benefit from this know when they need to enable it.
Also, related to my suggestion elsewhere, I think we can hide all or most of this cost by generating the digest incrementally (in parallel with graph computation) while generating the cache?
If we can do that, we could afford to always check integrity on verification (best!). That would make the supplementary check here only truly relevant in the case of disk corruption, and a better candidate for being enabled only when debugging.
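One way to hide most of that cost would be to fold the hashing into the write path. A minimal sketch, assuming the cache is written through a single sequential writer (which, as noted below, the current parallel generation may not allow):

```rust
use std::io::{self, Write};

use sha2::{Digest, Sha256};

/// A writer that feeds every byte it writes into a running SHA-256 digest,
/// so the cache digest is built while the cache file is written instead of
/// by re-reading the whole file afterwards.
struct HashingWriter<W: Write> {
    inner: W,
    hasher: Sha256,
}

impl<W: Write> HashingWriter<W> {
    fn new(inner: W) -> Self {
        HashingWriter {
            inner,
            hasher: Sha256::new(),
        }
    }

    /// Returns the wrapped writer and the digest of everything written.
    fn finish(self) -> (W, Vec<u8>) {
        (self.inner, self.hasher.finalize().to_vec())
    }
}

impl<W: Write> Write for HashingWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        let n = self.inner.write(buf)?;
        // Only hash the bytes that were actually written.
        self.hasher.update(&buf[..n]);
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}
```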
The more I think about it, the more I think this optimization is important for maximizing usability of this integrity check. If it's intrusively expensive, people won't enable it by default, and we'll miss the assurance it provides. Unless it proves impossible to keep this fast (as above), I think the integrity check at generation time should be mandatory.
I like the idea, but I'm not sure how to do it given how the cache is created in parallel at the moment. Generation on my fastest machine takes closer to 24+ hours IIRC (and closer to 40 or so on my normal dev machine); I was referring to the verification only in the timing above (i.e. assuming you have the files on disk, how long it takes to verify they are good).
EDIT: The timings were not using release mode, which makes them an order of magnitude less reliable.
If cache-generation is already parallelized, then this won't help, but maybe with release mode it's not such an issue.
➜ filecoin-parents /usr/bin/time sha256sum v28-sdr-parent-d5500bc0dddadb609f867d94da1471ecbaac3fe6f8ac68a4705cebde04a765b8.cache
f6deb8b7804ad7327ab9eb86b6b66f7dbb7dfc370cac54e1b5267984105d3611 v28-sdr-parent-d5500bc0dddadb609f867d94da1471ecbaac3fe6f8ac68a4705cebde04a765b8.cache
27.56user 7.10system 0:34.73elapsed 99%CPU (0avgtext+0avgdata 3140maxresident)k
5272inputs+0outputs (1major+157minor)pagefaults 0swaps
➜ filecoin-parents ll v28-sdr-parent-d5500bc0dddadb609f867d94da1471ecbaac3fe6f8ac68a4705cebde04a765b8.cache
-rw-r--r-- 1 xxx xxx 56G Aug 28 00:36 v28-sdr-parent-d5500bc0dddadb609f867d94da1471ecbaac3fe6f8ac68a4705cebde04a765b8.cache
The above assumes a machine with fast SHA256, but I think that holds for miners who will be generating the parents cache anyway. ~30 seconds is not so bad.
I just ran the verify on my mining rig for 32GiB (in release mode) and it took just under 32 seconds, so we seem to agree there.
I approved before noticing that there is no final check after re-generating a bad cache, so I'm requesting changes to make sure my latest comment doesn't get missed.
Immediately after generating the graph, it's checked.
I see, yes, thank you. We could reduce the time overhead by building up the digest in parallel, while the cache is being generated rather than waiting until the end. That would remove the cost of sequential hashing from the total generation time.
@porcuquine I just pushed a change that only calculates the digest after generation if it's a production one that we recognize. Previously, it was calculated and wasn't compared to a known value in that case, so this fixes that (carried over from when it was always calculated and then persisted to disk).
@cryptonemo As we discussed recently, we should also have an option to check the integrity of proofs. When enabled, it should be impossible to generate or verify proofs except with the parameters published in `parameters.json`. This should probably eventually be 'opt out', but the expedient path may be for it to initially be opt in. That might simplify any extra work required to ensure testing and dev workflows remain ergonomic — without blocking the current work on those considerations.
Updated for early review, but not complete. This update moves parameters.json over to storage-proofs to avoid duplication, now that it is used in storage-proofs as well as filecoin-proofs. If this looks to be the right direction, I can add the (ideally one-time-per-parameter) hashing verification, but that is not ready at the moment.
@cryptonemo If we do move …
8d5bb1e
to
04dfadb
Compare
Hash the parameters file in order to check if it matches the digest given in the `parameters.json` file.
Hashing the parameter files can be expensive, hence hash them only once to verify that they are good.
And once again a new range: https://github.com/filecoin-project/rust-fil-proofs/pull/1265/files/04dfadb60a142a62bf7788c163cf33dafcfb4d04..a8584308c06a97ad50bc1b2d58dfb54c5de7b102 (I shouldn't have started that "link to commit ranges" thing ;)
It's ready for review.
Looks good, as discussed elsewhere. I can't approve my own PRs, so it's up to @porcuquine or @dignifiedquire now.
Looks good, almost there. A few things:
- Has this been manually tested? (It should be, to ensure the parameter checking works with published params and fails without them.)
- README needs update for both (and we should make sure Lotus is aware of the changes when released).
- I think `parameters.json` currently lives in `filecoin_proofs`, with a symlink to it in `storage_proofs`. I think we want the opposite, right?
// The hash in the parameters file is truncated to 256 bits.
let digest_hex = &hash.to_hex()[..32];

ensure!(digest_hex == data.digest, "Parameters are invalid");
We should probably log the file name, too.
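For example, something like this (illustrative only; it assumes `path` is in scope in the surrounding function, as in the mmap call quoted earlier):

```rust
ensure!(
    digest_hex == data.digest,
    "parameter file {} is invalid: digest mismatch",
    path.display()
);
```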
Yes, it's been manually tested as described by @vmx, but I've been using the `check_parameters` tool in `fil_proofs_tooling`, which is a direct way to check a specific parameter file.
storage-proofs/parameters.json
Outdated
@@ -0,0 +1 @@
+../filecoin-proofs/parameters.json
I think we want the file to live in `storage_proofs`, with the symlink in `filecoin_proofs`. The idea is that there should only be dependencies from `filecoin_proofs` on `storage_proofs` and not vice versa. Does that match your understanding, @cryptonemo?
That's how this should be, or at least was. I had git mv'd it over, then added a symlink there.
It does seem backwards. That was the intent anyway, will try again to fix that.
I've tested it locally and manually. If anyone wants to test it, just enable the verification via the env var and run the tests. @cryptonemo It would be cool if you could take this over the finish line. If you haven't by my tomorrow, I will.
I tested the groth parameter verification enough to see that it's not behaving as expected, and a quick look at the code showed some of the reasons.
My expectation: if `verify_production_params` is true but a file with the correct name and wrong contents exists, then attempts to use the parameters will fail with an error describing the cause of the problem (a failed integrity check).
What actually happens (for me):
- with a very short junk file: an error from trying to read/parse the file.
- with a longer junk file: an (inaccurate) error about 'failure finding <FILENAME>'.
I think there are two broad issues causing this:
- the integrity check should come first (if requested), before trying to build the mapped parameters.
- when integrity check fails, this is not being distinguished from a 'cache miss', so we're falling back to generating parameters. Then an error is being reported because (in the case of the lifecycle tests I checked with) we choose not to generate parameters and instead return an error (here misidentified as a missing file). Instead, we really need to isolate and report the actual cause of the error. Even though the specific combination of constraints in our current code base doesn't make this catastrophic, it's definitely 'wrong', 'bad', and 'scary' to fall back to generating parameters in the case of a failed production-parameter integrity check.
I'll look into those.
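A sketch of the intended control flow (all names here are hypothetical, not the PR's actual functions): check integrity first, and never let an integrity failure fall through to parameter generation as if it were a cache miss.

```rust
use std::path::Path;

use anyhow::{bail, Context, Result};

fn load_production_params(path: &Path, verify: bool) -> Result<Vec<u8>> {
    if !path.exists() {
        // A genuine cache miss: the caller may fetch or generate parameters.
        bail!("parameter file {} is missing", path.display());
    }
    if verify {
        let ok = digest_matches_manifest(path)
            .with_context(|| format!("could not verify {}", path.display()))?;
        if !ok {
            // A failed integrity check is a hard error, reported as such.
            bail!("parameter file {} failed its integrity check", path.display());
        }
    }
    std::fs::read(path).with_context(|| format!("could not read {}", path.display()))
}

// Stub standing in for the digest comparison against parameters.json.
fn digest_matches_manifest(_path: &Path) -> Result<bool> {
    Ok(true)
}
```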
Verify the parameters before mmapping them.
Strongly agree here, as that's what I had personally tested.
If parameter verification is enabled, panic if the parameters are invalid.
@porcuquine ready for review again, all your comments should be addressed.
This seems to work now. The only thing left is that it doesn't distinguish between an absent parameter file and one that has the wrong content. So if the user simply doesn't have the required parameters at all, they will get a somewhat confusing `InvalidParameters` ('Parameter verification failed') message. This is probably fine, and the ideal solution might just be a change of name/message rather than an even more fine-grained error. If we did go that route, maybe `ValidParametersMissing` would be true in both the absent and present-but-wrong cases.
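A minimal sketch of that naming idea, assuming a thiserror-style error enum (the variant name and message are only illustrative):

```rust
use thiserror::Error;

#[derive(Debug, Error)]
pub enum ParamError {
    /// Covers both the absent and the present-but-wrong-content cases.
    #[error("valid parameters are missing for {0} (file absent or failed integrity check)")]
    ValidParametersMissing(String),
}
```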
    Ok(mapped_params)
})?;
The binding to `params` is not really necessary here (though I don't mind if you leave it).
pub sector_size: u64,
}

pub const PARAMETERS_DATA: &str = include_str!("../..//parameters.json");
Is the double slash really necessary here?
Ok(x) => Ok(x),
Err(_) => {
    read_cached_params(&cache_path).or_else(|err| match err.downcast::<Error>() {
        Ok(Error::InvalidParameters(msg)) => panic!("Parameter verification failed: {}", msg),
We probably should return an error rather than panic here, same as with the parents cache.
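That could look roughly like this (a fragment mirroring the match shown above; the added arms are illustrative and assume the enclosing closure returns a `Result`):

```rust
read_cached_params(&cache_path).or_else(|err| match err.downcast::<Error>() {
    // Propagate a distinct error instead of panicking, as with the parents cache.
    Ok(Error::InvalidParameters(msg)) => Err(Error::InvalidParameters(msg).into()),
    Ok(e) => Err(e.into()),
    Err(e) => Err(e),
})
```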
@porcuquine With the most recent commit, when I run it:
If it is an invalid file (checksum doesn't match):
If the file is missing:
When the cache files are generated, they are hashed and that hash is
persisted. When the cache file is opened, it is optionally verified
for integrity via the 'verify_cache' setting (disabled by default).
When this code is run, if the checksum files do not exist, we
re-generate the parent cache file in order to write out that checksum
file since what's already on disk may be corrupted.
Resolves #1264