feat: add parent cache verification and a setting to enable it #1265
Conversation
@@ -33,7 +35,7 @@ pub struct ParentCache {
 #[derive(Debug)]
 struct CacheData {
     /// This is a large list of fixed (parent) sized arrays.
-    data: memmap::Mmap,
+    pub data: memmap::Mmap,
no need to make this pub, I believe
    .with_context(|| format!("could not mmap path={}", path.display()))?
};
hasher.update(&data);
drop(data);
It is likely more efficient to use `std::io::ReadBuf` with a regular file open than mmap for just hashing the data.
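For reference, a minimal sketch of that buffered-read approach, assuming the `sha2` and `anyhow` crates this PR already uses (the function name and buffer size are illustrative, not part of the PR):

```rust
use std::fs::File;
use std::io::Read;
use std::path::Path;

use anyhow::Context;
use sha2::{Digest, Sha256};

/// Hash a cache file by streaming it through a fixed-size buffer instead of
/// mmapping the whole file.
fn hash_cache_file(path: &Path) -> anyhow::Result<Vec<u8>> {
    let mut file = File::open(path)
        .with_context(|| format!("could not open path={}", path.display()))?;
    let mut hasher = Sha256::new();
    // Reuse one large buffer so each read covers a sizeable chunk.
    let mut buf = vec![0u8; 16 * 1024 * 1024];
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break;
        }
        hasher.update(&buf[..n]);
    }
    Ok(hasher.finalize().to_vec())
}
```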
Out of curiosity, do you know of any fast rust hashers/crates made specifically for large data/files that might fit this use? I was wondering if something like that may work out all around better, but figured I'd stick with what we already have support for in the first cut.
Given the systems we run on, sha2 is likely good enough; the other fast alternative I could suggest is https://github.com/BLAKE3-team/BLAKE3
Ok thanks, I tried blake2s since it was already supported, but switched back to sha2 ;-) I'll give the ReadBuf approach a go.
Was hoping for more of a speed-up, but if we loop over full buffers, it's slightly faster. I'll update shortly.
Very odd, on second test it appears much slower. Hrm ...
Dig, could you take a look at the commented out attempt at doing this? I tested several times and it appears much slower, so I'm probably doing something wrong.
Am I right to conclude dig's suggestion didn't turn out to be faster? (It seems not to have been taken in the current version.)
As added, it was far slower in my testing, so I reverted. I suspect dig may know a faster method or way of applying it, so it can be improved over time as needed.
Although this validates the cache file against a checksum created locally, it doesn't check for absolute correctness. It therefore does not in any way address the situation in which bad hardware (especially RAM) leads to a wrongly-generated local cache. This is the primary use case for this integrity check. We need to calculate and distribute canonical checksums and use those for the validation.
One possible solution is to extend `parameters.json` to also handle this more general case. Another would be to create a new manifest file for the caches.
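For the manifest-file option, something in the spirit of the following could work (purely illustrative; the field names are assumptions and this assumes serde/serde_json, nothing in this PR):

```rust
use std::collections::HashMap;

use serde::Deserialize;

/// Hypothetical manifest entry mapping a parent-cache file name to its
/// canonical digest, analogous to the entries in parameters.json.
#[derive(Debug, Deserialize)]
pub struct CacheManifestEntry {
    /// Hex-encoded digest of the canonical cache file.
    pub digest: String,
    /// Sector size the cache corresponds to, in bytes.
    pub sector_size: u64,
}

/// Keyed by cache file name, e.g. "v28-sdr-parent-<id>.cache".
pub type CacheManifest = HashMap<String, CacheManifestEntry>;

pub fn parse_cache_manifest(json: &str) -> serde_json::Result<CacheManifest> {
    serde_json::from_str(json)
}
```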
Sure, I'm open to all of that. However, since re-generating the parent cache 'fixes' the problem in all reported cases, this change is meant to address that aspect.
I might be misunderstanding. Consider the case in which the user has generated a bad cache locally and created a checksum memorializing that bad cache as 'correct'. When the cache is checked for correctness later, it will be consistent with its original value and therefore won't be regenerated, right? If so, I don't see how this will fix the problem. The bad cache will still be used, and the seal results will be wrong.
There are two separate cases that I'm aware of, one of which can be solved with a general software solution.
I believe that #1 is primarily the one we're interested in, because all reports I'm aware of are solved by either 1) re-generating the parent cache (by deleting it), or 2) testing RAM and physically replacing the hardware when needed (or adjusting speed settings, etc.).
I am not aware of disk corruption being a significant risk — although it's good if we protect against it.
The problem is that if a user has previously had bad hardware and generated a bad cache, then even if they replace the hardware and are now capable of generating a correct cache (or would be capable of generating good proofs with downloaded cache), they won't detect that they are using a bad cache. That was what happened in the report here: #1264 (comment) — which was the primary motivation for #1264. That is the motivation for the acceptance criteria: 'A bad graph cache doesn't lead to failure if current hardware produces a correct graph. Instead, the bad cache is replaced by a newly-generated good one.'
The problem is that even if bad hardware is detected and replaced, if an old bad cache is still used, bad proofs will still be generated. At the very least, this condition needs to be detected. If bad hardware is the root cause and has not been replaced, the best we can do is fail — but we certainly should. If the underlying problem has been fixed, we should repair the cache and continue.
We can agree to disagree about whether the reports that have been solved by re-downloading parameters or re-generating the parent cache are likely due to disk corruption. Those errors would certainly persist beyond that with bad RAM.
If a user replaced known bad hardware, the current recommendation here is still to regenerate the cache by deleting it, or to enable the verification afterward, which would catch this error case. As I said, I'm open to some "known good hash" list somewhere, but it doesn't fix anything; it reports something and then triggers the actual solution. Current best practice for a known bad disk is to replace the disk and re-download/re-generate the parameters. I don't see how replacing RAM is any different, as probably a lot more than the parent cache is already corrupt, including sealed sectors, etc. In any case, an extra hash comparison is easy and can help trigger that (i.e. the check for the consistency check), but maintaining that list is potentially cumbersome.
I'm pausing the ongoing discussion, since we have largely resolved these issues offline. Separately, cache keys need to include the `porep_id`.
Fortunately, the porep_id appears to already be included in the current cache path/id, so the cache will be invalidated automatically.
Ah right, good. Thanks for checking.
Looks good, assuming fully tested.
fn gen_graph_cache<Tree: 'static + MerkleTreeTrait>(
    sector_size: usize,
    challenge_count: usize,
This should not affect the graph.
fn gen_graph_cache<Tree: 'static + MerkleTreeTrait>(
    sector_size: usize,
    challenge_count: usize,
    layers: usize,
This should not affect the graph.
let nodes = (sector_size / 32) as usize;
let drg_degree = filecoin_proofs::constants::DRG_DEGREE;
let expansion_degree = filecoin_proofs::constants::EXP_DEGREE;
let layer_challenges = LayerChallenges::new(layers, challenge_count);
We should consider using a default value for `layer_challenges` if we need it only to instantiate `SetupParams`. If the value affects the generated graph, something unexpected is happening. If it doesn't, then it's misleading and unnecessary to pass it to `gen_graph_cache`. (However, if it's not awkward to do so, maybe it's not a big deal. In that case, I would at least add a comment to the effect that these values don't affect the graph.)
Added comment and replaced with fixed values.
// filecoin-proofs-api:src/registry [porep_id()] that matches the specified
// sector size and must be updated when that value is updated for the proper
// graph cache generation/validation.
//
This duplication is annoying, though maybe the right tradeoff here/now.
    .read()
    .expect("LAYERS read failure")
    .get(&sector_size)
    .expect("unknown sector size") as usize;
As noted above, we could dispense with `challenge_count` and `layers` since they don't affect the graph.
@@ -34,6 +35,7 @@ impl Default for Settings {
     fn default() -> Self {
         Settings {
             maximize_caching: true,
+            verify_cache: false,
How expensive is cache verification? Defaulting to `true` is likely a MUCH better UX in many cases where this matters.
It may cost miners hours of time per sector, so instead of defaulting to true, it should be advised to check when the error cases arise. My fastest machine can verify all sizes in 2.5 hours, which is more than is needed (if enabled), but almost all of the time is spent on the 32 and 64 GiB verifications, so it'll add up quickly.
Got it. Let's make sure to document clearly and with sufficient visibility that those who will benefit from this know when they need to enable it.
Also, related to my suggestion elsewhere, I think we can hide all or most of this cost by generating the digest incrementally (in parallel with graph computation) while generating the cache?
If we can do that, we could afford to always check integrity on verification (best!). That would make the supplementary check here only truly relevant in the case of disk corruption, and a better candidate for being enabled only when debugging.
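One way to hide most of that cost would be to fold the hashing into the write path. A minimal sketch, assuming the cache is written through a single sequential writer (which, as noted below, the current parallel generation may not allow):

```rust
use std::io::{self, Write};

use sha2::{Digest, Sha256};

/// A writer that feeds every byte it writes into a running SHA-256 digest,
/// so the cache digest is built while the cache file is written instead of
/// by re-reading the whole file afterwards.
struct HashingWriter<W: Write> {
    inner: W,
    hasher: Sha256,
}

impl<W: Write> HashingWriter<W> {
    fn new(inner: W) -> Self {
        HashingWriter {
            inner,
            hasher: Sha256::new(),
        }
    }

    /// Returns the wrapped writer and the digest of everything written.
    fn finish(self) -> (W, Vec<u8>) {
        (self.inner, self.hasher.finalize().to_vec())
    }
}

impl<W: Write> Write for HashingWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        let n = self.inner.write(buf)?;
        // Only hash the bytes that were actually written.
        self.hasher.update(&buf[..n]);
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}
```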
The more I think about it, the more I think this optimization is important for maximizing usability of this integrity check. If it's intrusively expensive, people won't enable it by default, and we'll miss the assurance it provides. Unless it proves impossible to keep this fast (as above), I think the integrity check at generation time should be mandatory.
I like the idea, but I'm not sure how to do it given how the cache is created in parallel at the moment. Generation on my fastest machine takes closer to 24+ hours IIRC (and closer to 40 or so on my normal dev machine); I was referring to the verification only in the timing above (i.e. assuming you have the files on disk, how long it takes to verify they are good).
EDIT: The timings were not using release mode, which makes them an order of magnitude less reliable.
If cache-generation is already parallelized, then this won't help, but maybe with release mode it's not such an issue.
➜ filecoin-parents /usr/bin/time sha256sum v28-sdr-parent-d5500bc0dddadb609f867d94da1471ecbaac3fe6f8ac68a4705cebde04a765b8.cache
f6deb8b7804ad7327ab9eb86b6b66f7dbb7dfc370cac54e1b5267984105d3611 v28-sdr-parent-d5500bc0dddadb609f867d94da1471ecbaac3fe6f8ac68a4705cebde04a765b8.cache
27.56user 7.10system 0:34.73elapsed 99%CPU (0avgtext+0avgdata 3140maxresident)k
5272inputs+0outputs (1major+157minor)pagefaults 0swaps
➜ filecoin-parents ll v28-sdr-parent-d5500bc0dddadb609f867d94da1471ecbaac3fe6f8ac68a4705cebde04a765b8.cache
-rw-r--r-- 1 xxx xxx 56G Aug 28 00:36 v28-sdr-parent-d5500bc0dddadb609f867d94da1471ecbaac3fe6f8ac68a4705cebde04a765b8.cache
The above assumes a machine with fast SHA256, but I think that holds for miners who will be generating the parents cache anyway. ~30 seconds is not so bad.
I just ran the verify on my mining rig for 32GiB (in release mode) and it took just under 32 seconds, so we seem to agree there.
I approved before noticing that there is no final check after re-generating a bad cache, so I'm requesting changes to make sure my latest comment doesn't get missed.
Immediately after generating the graph, it's checked.
I see, yes, thank you. We could reduce the time overhead by building up the digest in parallel, while the cache is being generated rather than waiting until the end. That would remove the cost of sequential hashing from the total generation time.
@porcuquine I just pushed a change that only calculates the digest after generation if it's a production one that we recognize. Previously, it was calculated and wasn't compared to a known value in that case, so this fixes that (carried over from when it was always calculated and then persisted to disk).
@cryptonemo As we discussed recently, we should also have an option to check the integrity of proofs. When enabled, it should be impossible to generate or verify proofs except with the parameters published in `parameters.json`. This should probably eventually be 'opt out', but the expedient path may be for it to initially be opt in. That might simplify any extra work required to ensure testing and dev workflows remain ergonomic — without blocking the current work on those considerations.
Updated for early review, but not complete. This update moves parameters.json over to storage-proofs to avoid duplication, now that it is used in storage-proofs as well as filecoin-proofs. If this looks to be the right direction, I can add the (ideally one-time-per-parameter) hashing verification, but that is not ready at the moment.
@cryptonemo If we do move …
8d5bb1e
to
04dfadb
Compare
Hash the parameters file in order to check if it matches the digest given in the `parameters.json` file.
Hashing the parameter files can be expensive, hence hash them only once to verify that they are good.
And once again a new range: https://github.com/filecoin-project/rust-fil-proofs/pull/1265/files/04dfadb60a142a62bf7788c163cf33dafcfb4d04..a8584308c06a97ad50bc1b2d58dfb54c5de7b102 (I shouldn't have started that "link to commit ranges" thing ;)
It's ready for review.
Looks good, as discussed elsewhere. I can't approve my own PRs, so it's up to @porcuquine or @dignifiedquire now.
Looks good, almost there. A few things:
- Has this been manually tested? (It should be, to ensure the parameter checking works with published params and fails without them.)
- README needs update for both (and we should make sure Lotus is aware of the changes when released).
- I think `parameters.json` currently lives in `filecoin_proofs`, with a symlink to it in `storage_proofs`. I think we want the opposite, right?
// The hash in the parameters file is truncated to 256 bits.
let digest_hex = &hash.to_hex()[..32];

ensure!(digest_hex == data.digest, "Parameters are invalid");
We should probably log the file name, too.
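For example, something like this (illustrative only; it assumes `path` is in scope in the surrounding function, as in the mmap call quoted earlier):

```rust
ensure!(
    digest_hex == data.digest,
    "parameter file {} is invalid: digest mismatch",
    path.display()
);
```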
Yes, it's been manually tested as described by @vmx, but I've been using the `check_parameters` tool in `fil_proofs_tooling`, which is a direct way to check a specific parameter file.
storage-proofs/parameters.json
Outdated
@@ -0,0 +1 @@
+../filecoin-proofs/parameters.json
I think we want the file to live in `storage_proofs`, with the symlink in `filecoin_proofs`. The idea is that there should only be dependencies from `filecoin_proofs` on `storage_proofs` and not vice versa. Does that match your understanding, @cryptonemo?
That's how this should be, or at least was. I had git mv'd it over, then added a symlink there.
It does seem backwards. That was the intent anyway, will try again to fix that.
I've tested it locally and manually. If anyone wants to test it, just enable the verification via the env var and run the tests. @cryptonemo It would be cool if you could take this over the finish line. If you haven't by my tomorrow, I will.
I tested the groth parameter verification enough to see that it's not behaving as expected, and a quick look at the code showed some of the reasons.
My expectation: if `verify_production_params` is true but a file with the correct name and wrong contents exists, then attempts to use the parameters will fail with an error describing the cause of the problem (a failed integrity check).
What actually happens (for me):
- with a very short junk file: an error from trying to read/parse the file.
- with a longer junk file: an (inaccurate) error about 'failure finding <FILENAME>'.
I think there are two broad issues causing this:
- the integrity check should come first (if requested), before trying to build the mapped parameters.
- when integrity check fails, this is not being distinguished from a 'cache miss', so we're falling back to generating parameters. Then an error is being reported because (in the case of the lifecycle tests I checked with) we choose not to generate parameters and instead return an error (here misidentified as a missing file). Instead, we really need to isolate and report the actual cause of the error. Even though the specific combination of constraints in our current code base doesn't make this catastrophic, it's definitely 'wrong', 'bad', and 'scary' to fall back to generating parameters in the case of a failed production-parameter integrity check.
I'll look into those.
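A sketch of the intended control flow (all names here are hypothetical, not the PR's actual functions): check integrity first, and never let an integrity failure fall through to parameter generation as if it were a cache miss.

```rust
use std::path::Path;

use anyhow::{bail, Context, Result};

fn load_production_params(path: &Path, verify: bool) -> Result<Vec<u8>> {
    if !path.exists() {
        // A genuine cache miss: the caller may fetch or generate parameters.
        bail!("parameter file {} is missing", path.display());
    }
    if verify {
        let ok = digest_matches_manifest(path)
            .with_context(|| format!("could not verify {}", path.display()))?;
        if !ok {
            // A failed integrity check is a hard error, reported as such.
            bail!("parameter file {} failed its integrity check", path.display());
        }
    }
    std::fs::read(path).with_context(|| format!("could not read {}", path.display()))
}

// Stub standing in for the digest comparison against parameters.json.
fn digest_matches_manifest(_path: &Path) -> Result<bool> {
    Ok(true)
}
```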
Verify the parameters before mmapping them.
Strongly agree here, as that's what I had personally tested.
If parameter verification is enabled, panic if the parameters are invalid.
@porcuquine ready for review again, all your comments should be addressed.
This seems to work now. The only thing left is that it doesn't distinguish between an absent parameter file and one that has the wrong content. So if the user simply doesn't have the required parameters at all, they will get a somewhat confusing `InvalidParameters` ('Parameter verification failed') message. This is probably fine, and the ideal solution might just be a change of name/message rather than an even more fine-grained error. If we did go that route, maybe `ValidParametersMissing` would be true in both the absent and present-but-wrong cases.
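A minimal sketch of that naming idea, assuming a thiserror-style error enum (the variant name and message are only illustrative):

```rust
use thiserror::Error;

#[derive(Debug, Error)]
pub enum ParamError {
    /// Covers both the absent and the present-but-wrong-content cases.
    #[error("valid parameters are missing for {0} (file absent or failed integrity check)")]
    ValidParametersMissing(String),
}
```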
    Ok(mapped_params)
})?;
The binding to `params` is not really necessary here (though I don't mind if you leave it).
pub sector_size: u64,
}

pub const PARAMETERS_DATA: &str = include_str!("../..//parameters.json");
Is the double slash really necessary here?
Ok(x) => Ok(x),
Err(_) => {
    read_cached_params(&cache_path).or_else(|err| match err.downcast::<Error>() {
        Ok(Error::InvalidParameters(msg)) => panic!("Parameter verification failed: {}", msg),
We probably should return an error rather than panic here, same as with the parents cache.
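That could look roughly like this (a fragment mirroring the match shown above; the added arms are illustrative and assume the enclosing closure returns a `Result`):

```rust
read_cached_params(&cache_path).or_else(|err| match err.downcast::<Error>() {
    // Propagate a distinct error instead of panicking, as with the parents cache.
    Ok(Error::InvalidParameters(msg)) => Err(Error::InvalidParameters(msg).into()),
    Ok(e) => Err(e.into()),
    Err(e) => Err(e),
})
```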
@porcuquine With the most recent commit, when I run it:
If it is an invalid file (checksum doesn't match):
If the file is missing:
When the cache files are generated, they are hashed and that hash is
persisted. When the cache file is opened, it is optionally verified
for integrity via the 'verify_cache' setting (disabled by default).
When this code is run, if the checksum files do not exist, we
re-generate the parent cache file in order to write out that checksum
file since what's already on disk may be corrupted.
Resolves #1264