in #100 I added a "data validation" task, but I'm breaking it out into a bigger issue here (#23 is also related) to capture my current thinking on data consistency and recovery from disk corruption or loss. I think that we should skip data validation for now, but have some specific recommendations at the end.
Some things to think about:
- the chain tail has corrupted data for an offset, but other chain members don't. this blocks reads of that offset even though it could be corrected.
- a non-tail node has corrupted data, which doesn't block reads, but silently reduces replication for long-term storage.
- a node comes up clean and must recover. we need to do this ASAP, but I think it's not bad to write with existing primitives once we add a 'verify CRC mode' to the client.
- nodes may have different index and log size settings, making bulk comparison difficult or impossible. that is, we can't simply md5 the whole file (which is fast), but must scan and compare per-value (and possibly across the whole chain), which can be slow.
- nodes may be at different stages of garbage collection, which also makes bulk comparison difficult or impossible, for the same reason as above. strongly consistent metadata could help here.
- it's not clear to me where the right place to check the CRC is. checking on the server is expensive and centralized, but checking at the client means the error is detected far from its source, which can make fixing it unclear.
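to make the per-value scan concrete, here's a minimal sketch of what a client-side 'verify CRC mode' pass could look like. none of these names or the record layout come from the actual storage format; they're placeholders assuming each value is stored next to the CRC32 computed at write time:

```python
import zlib

def verify_record(offset, value, stored_crc):
    """Return True if the value at this offset matches its stored CRC."""
    return zlib.crc32(value) == stored_crc

def scan_for_corruption(records):
    """Scan per-value, as a client in 'verify CRC mode' might.

    `records` is an iterable of (offset, value, stored_crc) tuples.
    Yields the offsets whose payloads fail their checksum.
    """
    for offset, value, stored_crc in records:
        if not verify_record(offset, value, stored_crc):
            yield offset

# Example: one deliberately corrupted record out of three.
good = b"hello"
records = [
    (0, good, zlib.crc32(good)),
    (1, b"wxrld", zlib.crc32(b"world")),  # bit rot: payload no longer matches
    (2, good, zlib.crc32(good)),
]
print(list(scan_for_corruption(records)))  # → [1]
```

this is exactly the slow path described above: linear in the number of values, rather than one hash over the whole file.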
Plan of action?
Things to do:
- clean node recovery
- single offset repair
- AAE (active anti-entropy)
clean node recovery is a bit of a project, but fairly pressing if we ever plan to grow chains. it might be possible to sidestep it by never growing chains and instead draining them into bigger chains, but for disaster recovery we should have it anyway. I think fully specifying the project is outside the scope of this issue, but the primary questions are: detecting the need for recovery, how to report to clients while we're repairing, and how we decide which value is the correct one to repair from.
simple single offset repair would be nice to have for "read-repair" style issues, but I think an initial version could be designed to be triggered manually. we also need to decide how to choose the right value, as above.
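one possible policy for "choosing the right value" is sketched below: gather the copies of the offset held across the chain, keep only the ones whose CRC still verifies, and prefer the most common surviving copy. this is an assumption about how repair could work, not the project's actual design; every name here is hypothetical:

```python
import zlib
from collections import Counter

def choose_repair_value(replica_values, stored_crc):
    """Pick a repair value for one offset from the copies across the chain.

    Sketch only: keep the copies whose CRC32 still verifies, then take
    the most common one. Returns None if every copy is corrupt, in which
    case the offset is unrecoverable from within the chain.
    """
    verified = [v for v in replica_values if zlib.crc32(v) == stored_crc]
    if not verified:
        return None
    return Counter(verified).most_common(1)[0][0]

# The tail holds a corrupted copy; two upstream nodes still agree.
good = b"payload"
crc = zlib.crc32(good)
print(choose_repair_value([good, good, b"paylxad"], crc))  # → b'payload'
```

note that when a stored CRC is available, any single verifying copy is sufficient; the majority vote only matters if repair ever has to run without trusted checksums.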
AAE is a big feature but important over the long run. standard merkle-tree implementations seem to work well enough; keeping them from thrashing the cache or CPU too badly is the primary concern here, I think.
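for reference, the standard merkle-tree approach amounts to hashing fixed ranges of offsets into leaves and comparing trees between nodes, so two replicas only need to exchange a handful of hashes to localize a divergent range. a minimal sketch (illustrative only; a real implementation would walk the tree top-down instead of comparing the whole leaf level):

```python
import hashlib

def leaf_hashes(values, chunk=4):
    """Hash fixed-size runs of offsets; each run becomes one merkle leaf."""
    leaves = []
    for i in range(0, len(values), chunk):
        h = hashlib.sha256()
        for v in values[i:i + chunk]:
            h.update(v)
        leaves.append(h.digest())
    return leaves

def merkle_root(leaves):
    """Fold leaves pairwise up to a single root hash."""
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the odd leaf out
        level = [hashlib.sha256(a + b).digest()
                 for a, b in zip(level[::2], level[1::2])]
    return level[0]

def divergent_ranges(local, remote, chunk=4):
    """Compare two nodes' logs; return the indexes of mismatched chunks."""
    la, lb = leaf_hashes(local, chunk), leaf_hashes(remote, chunk)
    return [i for i, (a, b) in enumerate(zip(la, lb)) if a != b]

# Two replicas that differ in exactly one value (offset 5).
local = [f"v{i}".encode() for i in range(8)]
remote = list(local)
remote[5] = b"corrupt"
print(divergent_ranges(local, remote))  # → [1]
```

if the roots match, the comparison stops after one hash exchange; only mismatching subtrees get walked further, which is what keeps AAE cheap in the common no-corruption case. the thrashing concern above is about how aggressively the leaf hashes are (re)computed over cold data.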
I don't know how to prioritize these. recovery can be done semi-manually by copying files and then letting write repair catch the chain up; single offset repair similarly. and it's not clear whether there is demand for AAE: does the data we care about live long enough, or are people mostly aging out of their retention window before data corruption is likely?