littlefs checksum behaviour #843
Hi @ebasle, I can see where this can be confusing, and in a dangerous way. To be clear, littlefs does not provide any form of error detection at either the data or metadata level. It does checksum the metadata, but in a way that only provides power-loss guarantees. It doesn't provide protection against general bitrot like metadata checksumming in other filesystems. On a checksum failure, littlefs assumes a power-loss must have occurred and rolls back that metadata transaction. But since littlefs is built out of many independent metadata logs, this can put littlefs into an unstable state.
Unfortunately, the only way to provide reliable error detection with littlefs right now would need to be under the filesystem, so at the block-device layer. This may not be that difficult, though it is extra work. One scheme: reserve a few bytes of each program unit for a checksum in the block device, and expose only the remaining data bytes to littlefs.
This can be extended to full ECC, but that's another can of worms. The good news is I do have some features in the works that should improve the state of things: metadata redundancy and global checksums.
Unfortunately these are only in the design phase and will take some time to implement. |
Hello @geky, thank you for your answer! In my case:
And this is my current littlefs configuration:
So if I follow your instructions and set up 256-byte slices, it would result in:
I am really not sure about the block_size/block_count/cache_size values. |
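For illustration, here is a minimal sketch of what such a configuration might look like with 256-byte slices split 224/32. The block_count, lookahead_size, and block_cycles values are placeholders, and the bd_* callbacks stand in for the block-device wrappers sketched in the reply below:

```c
#include "lfs.h"

const struct lfs_config cfg = {
    // block device callbacks (bd_read/bd_prog/bd_erase as sketched below;
    // bd_sync can simply return 0 if the driver has nothing to flush)
    .read  = bd_read,
    .prog  = bd_prog,
    .erase = bd_erase,
    .sync  = bd_sync,

    // each 256-byte slice holds 224 data bytes + 32 hmac bytes, so littlefs
    // sees 224-byte read/prog units and 16*224 = 3584 usable bytes out of
    // each 4KiB erase block
    .read_size      = 224,
    .prog_size      = 224,
    .block_size     = 3584,
    .block_count    = 1024,  // placeholder: number of 4KiB blocks on the part
    .cache_size     = 224,   // must be a multiple of read_size/prog_size
    .lookahead_size = 128,   // placeholder
    .block_cycles   = 500,   // placeholder
};
```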
That configuration looks correct to me. From littlefs's perspective, it really is only managing 3.5KiB sectors, since that's the amount of usable space. The trick is that the block device is the one doing the work to map the 3.5KiB sectors -> 4KiB sectors. This does make implementing the block device a bit trickier. Here's some pseudocode of what a block device might look like, sorry if there's bugs:

```
PROG_SIZE = 224
HMAC_SIZE = 32

int bd_read(block, off, buffer, size):
    assert(off % PROG_SIZE == 0)
    assert(size % PROG_SIZE == 0)

    # block address is unchanged, but we need to map off -> raw_off
    raw_off = (off / PROG_SIZE) * (PROG_SIZE+HMAC_SIZE)

    # littlefs may read in _multiples_ of read_size, handle this here
    while size > 0:
        # buffer may not have enough space for the hmac, can either allocate
        # another buffer here, or call rawbd_read twice
        uint8_t raw_buffer[PROG_SIZE+HMAC_SIZE];
        rawbd_read(block, raw_off, raw_buffer, PROG_SIZE+HMAC_SIZE)

        # find corrupt blocks here
        if !hmac_validate(raw_buffer):
            return LFS_ERR_CORRUPT

        memcpy(buffer, raw_buffer, PROG_SIZE)
        buffer += PROG_SIZE
        raw_off += PROG_SIZE+HMAC_SIZE
        size -= PROG_SIZE

    return 0

int bd_prog(block, off, buffer, size):
    assert(off % PROG_SIZE == 0)
    assert(size % PROG_SIZE == 0)

    # block address is unchanged, but we need to map off -> raw_off
    raw_off = (off / PROG_SIZE) * (PROG_SIZE+HMAC_SIZE)

    # littlefs may prog in _multiples_ of prog_size, handle this here
    while size > 0:
        # buffer has no room for the hmac, so copy into a larger buffer,
        # compute the hmac, and write both out together
        uint8_t raw_buffer[PROG_SIZE+HMAC_SIZE];
        memcpy(raw_buffer, buffer, PROG_SIZE)
        hmac_hmac(raw_buffer)
        rawbd_prog(block, raw_off, raw_buffer, PROG_SIZE+HMAC_SIZE)

        buffer += PROG_SIZE
        raw_off += PROG_SIZE+HMAC_SIZE
        size -= PROG_SIZE

    return 0

int bd_erase(block):
    # block address is unchanged
    return rawbd_erase(block)
```

It's also worth noting littlefs reads back every piece of data written to see if it was written correctly. So the above indirectly validates writes even though it doesn't appear that way. |
Hello @geky, thanks for the detailed response. So I implemented the HMAC at the block device layer as you described. However, when I corrupt a file (as explained in my initial post), I witness the following behavior when the block device layer returns LFS_ERR_CORRUPT to littlefs:
Additionally, it looks like having a corrupted sector/slice causes a side effect when opening other files (lfs_opencfg).
In this test, I was expecting:
In the end, I would just like littlefs to report to the application that a file has been corrupted. |
Ah, you are right. Sorry, I made a mistake; I'm not sure what I was thinking. littlefs can't tell the difference between a bad HMAC and power-loss, for the same reason it can't rely on its own checksum to detect bit-errors. Both incomplete writes and bit-errors look "bad", so littlefs assumes a power-loss occurred. I'm trying to think of a workaround but currently turning up empty. Returning LFS_ERR_IO gets reported directly to the user without any special handling in littlefs, but it's probably not what you want since it would also cause power-loss to be reported as LFS_ERR_IO. The same goes for reporting errors out-of-band. I'll let you know if I think of anything, but at the moment it may not be possible for littlefs to handle both power-loss and bit-errors until either metadata redundancy or global checksums are added.
In littlefs, files share metadata logs. Most likely all three of these files are in the same metadata log, and when file 2 is corrupted, littlefs stops parsing the metadata log before it finds file 3. Over time, metadata logs are eventually compacted, so what gets rolled back may be different. This makes metadata rollbacks particularly nasty. |
Thank you for your time @geky, let me know if you think of anything! |
Happened to read this discussion. Do I understand correctly that a single bit error in metadata could cause a revert back to old file contents (if they exist)? If so, then it doesn't sound good, and I'm wondering if it makes sense to write critical files, such as configuration files, at least twice? |
@mikk-leini-krakul, it's a bit worse in that a single bit error can corrupt the filesystem's metadata itself, unless you have some form of error correction in the block-device layer. There are two features in the works to improve this, metadata redundancy and global checksums, but these are not ready yet. The best option right now is to correct bit errors in the block device layer using a Reed-Solomon/BCH code or something similar. Some block devices even provide error correction in hardware. I realize this is not a great answer, but it is what it is. |
@ebasle, it's a bit late and I realize you're probably not looking at this problem anymore, but for posterity it's probably worth noting you can turn any checksum/hash/hmac into a limited error correcting code via brute force. If the hash fails, you can try recalculating the hash with the first bit flipped, then with the second bit flipped, then the third, and so on, to see if any single bit flip would result in a valid hash. You could do the same for 2 bit flips, but the cost grows quadratically with the size of the data. And depending on how expensive your hash/hmac is to calculate, this approach may not be tenable... Nesting your hmac inside an RS/BCH/CRC error correcting code would probably be much more efficient, at the cost of some extra storage. |
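For posterity as well, here is a minimal sketch of that brute-force approach, assuming a hypothetical hmac() helper; it only attempts single-bit repairs:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HMAC_SIZE 32

// hypothetical helper: computes the hmac of a chunk of data
extern void hmac(const uint8_t *data, size_t size, uint8_t out[HMAC_SIZE]);

// returns true if the chunk is valid or a single-bit error was repaired in place
bool hmac_correct_1bit(uint8_t *data, size_t size,
        const uint8_t expected[HMAC_SIZE]) {
    uint8_t calc[HMAC_SIZE];
    hmac(data, size, calc);
    if (memcmp(calc, expected, HMAC_SIZE) == 0) {
        return true; // already valid
    }

    // try every single-bit flip and see if any produces a matching hmac
    for (size_t i = 0; i < size*8; i++) {
        data[i/8] ^= 1 << (i%8);
        hmac(data, size, calc);
        if (memcmp(calc, expected, HMAC_SIZE) == 0) {
            return true; // found and repaired a single-bit error
        }
        data[i/8] ^= 1 << (i%8); // undo and try the next bit
    }

    return false; // more than a single-bit error, give up
}
```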
I realized it needs quite a lot of time to do a full impact analysis of soft errors on LFS, and that time I don't have. So I did what the LFS design documentation suggests and used software ECC. Unfortunately there are very few FOSS options. I found a single-bit error correcting ECC implementation in the Linux kernel: On every 256-byte page of the external QSPI flash I gave LFS 252 bytes of payload and used 3 bytes for ECC, leaving 1 byte for a CRC-8 (although a 16 or 32-bit CRC would be much safer). In addition, every write is done with verify (read-compare). Can't say if this is sufficient or not, but it certainly feels better than no ECC/CRC. |
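A minimal sketch of the page layout described above: 252 payload bytes, 3 bytes of single-bit-correcting Hamming ECC, and 1 byte of CRC-8 per 256-byte flash page, with a read-back verify after programming. The ecc252_calculate(), crc8(), and qspi_* helpers are hypothetical stand-ins for whatever implementations are actually used:

```c
#include <stdint.h>
#include <string.h>
#include "lfs.h"

#define PAGE_SIZE    256
#define PAYLOAD_SIZE 252
#define ECC_SIZE     3

// hypothetical helpers
extern void ecc252_calculate(const uint8_t *data, uint8_t ecc[ECC_SIZE]);
extern uint8_t crc8(const uint8_t *data, uint32_t len);
extern int qspi_prog(uint32_t addr, const uint8_t *buf, uint32_t len);
extern int qspi_read(uint32_t addr, uint8_t *buf, uint32_t len);

// program one page, then read it back to verify it was written correctly
int page_prog(uint32_t page, const uint8_t payload[PAYLOAD_SIZE]) {
    uint8_t raw[PAGE_SIZE];
    memcpy(raw, payload, PAYLOAD_SIZE);
    ecc252_calculate(raw, &raw[PAYLOAD_SIZE]);
    raw[PAYLOAD_SIZE + ECC_SIZE] = crc8(raw, PAYLOAD_SIZE + ECC_SIZE);

    int err = qspi_prog(page*PAGE_SIZE, raw, PAGE_SIZE);
    if (err) {
        return err;
    }

    // verify by reading back and comparing; a mismatch tells littlefs the
    // block is bad so it can relocate
    uint8_t check[PAGE_SIZE];
    err = qspi_read(page*PAGE_SIZE, check, PAGE_SIZE);
    if (err) {
        return err;
    }
    return (memcmp(raw, check, PAGE_SIZE) == 0) ? 0 : LFS_ERR_CORRUPT;
}
```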
I'm interested in doing something similar, but on a small eeprom or fram using crc16 with small blocks (150-180 byte LFS blocks, 30-byte LFS prog size, 32-byte eeprom pages split into 30 bytes of data and 2 bytes of crc). I am hoping to catch bus-transfer errors more than flipped bits in flash, so my block device layer would be able to retry the read a few times (e.g. before lfs starts trying to do power-loss recovery). Previously I was treating erase as a no-op and it has been working fine - #77 (comment). With block-level CRCs, if I don't scrub the eeprom and pre-fill all blocks with valid crcs, lfs_mount fails after format (which succeeds). It seems it is trying to read a block it has not written yet, so I return LFS_ERR_CORRUPT and it fails during lfs_mount. Is it expected that all blocks need to be made valid for LFS first when using block-level crcs? I'd like to avoid the extra flash program cycles of an explicit erase if I can. It is also possible my block layer still has bugs, I just hacked it together for a quick test. I'll play with it some more. |
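A minimal sketch of the read path described above: 32-byte eeprom pages carrying 30 data bytes plus a CRC-16, with a few retries to ride out bus-transfer errors before reporting LFS_ERR_CORRUPT. The eeprom_read() and crc16() helpers are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include "lfs.h"

#define PAGE_SIZE 32  // physical eeprom page
#define DATA_SIZE 30  // bytes of littlefs data per page
#define RETRIES   3

// hypothetical helpers standing in for the real eeprom driver and crc
extern int eeprom_read(uint32_t addr, uint8_t *buf, uint32_t len);
extern uint16_t crc16(const uint8_t *buf, uint32_t len);

int bd_read(const struct lfs_config *c, lfs_block_t block,
        lfs_off_t off, void *buffer, lfs_size_t size) {
    // read_size/prog_size are 30, so off and size are multiples of DATA_SIZE
    uint32_t page = block*(c->block_size/DATA_SIZE) + off/DATA_SIZE;

    for (lfs_size_t i = 0; i < size/DATA_SIZE; i++, page++) {
        uint8_t raw[PAGE_SIZE];
        bool ok = false;

        // retry a few times, mostly to ride out bus-transfer errors
        for (int attempt = 0; attempt < RETRIES; attempt++) {
            if (eeprom_read(page*PAGE_SIZE, raw, PAGE_SIZE) != 0) {
                continue;
            }
            uint16_t stored = ((uint16_t)raw[DATA_SIZE] << 8) | raw[DATA_SIZE+1];
            if (crc16(raw, DATA_SIZE) == stored) {
                ok = true;
                break;
            }
        }

        if (!ok) {
            // persistent mismatch, let littlefs know the block looks bad
            return LFS_ERR_CORRUPT;
        }

        memcpy((uint8_t *)buffer + i*DATA_SIZE, raw, DATA_SIZE);
    }

    return 0;
}
```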
That is interesting, thanks for sharing. I've gotten kind of stuck in the theory side of things, so don't have the best understanding of what's out there practically. Using Hamming codes is an interesting choice because I believe they're limited to single-bit error correction? It seems like you could do something better with crc32s, but maybe I'm missing something...
Quite an interesting use case, should work fine.
No, you should be able to return LFS_ERR_CORRUPT and littlefs will treat it as not written yet. Though you do need to return LFS_ERR_CORRUPT consistently until the block is written.
This is especially curious because littlefs tries to mount the superblock in lfs_format before returning success (lines 4385 to 4386 in d01280e).
Is it possible there's some issue on the device's RAM side of things? |
Looks like it was just a bug in my code: I was not correctly handling certain read-modify-write cases spanning pages, nor what to do about RMW cycles on blocks that were never erased but also not fully written (I now lazily zero-fill them if there is a partial page write and the page had a bad crc, and assume the block was unallocated). After fixing that, both my block layer and LFS are happy with or without erasing. I'll play with this some more. Thanks! |
Regarding software ECC, there are a few things you can do. While Hamming codes are the easiest, there are also BCH, Viterbi, and Golay codes for bit errors, and Reed-Solomon for burst errors. There are implementations available for all of these (I've written one myself for each). If on the bus side you were worried about missing data blocks (erasures), you could even use fountain codes like RaptorQ (although that one is a hassle to get working without |
I figured I'd also take a crack at it, since these seem genuinely useful for littlefs users: Not exactly drop-in solutions, because of existing API issues, but they shouldn't be too difficult to adapt to your own use cases. I'll probably add them to littlefs's README.md as examples at some point. Feedback welcome. |
Hello,
From previous reading, I understand that the littlefs CRC is calculated only on file metadata and not on the actual file data.
To ensure file data integrity in my application, I would like to implement my own integrity check at file level using file attributes.
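A minimal sketch of what such a file-level check could look like, using littlefs's lfs_setattr/lfs_getattr custom attributes. The crc32() helper and the attribute type 0x42 are assumptions chosen for illustration:

```c
#include <stdint.h>
#include "lfs.h"

// hypothetical crc helper and an arbitrarily chosen attribute type
extern uint32_t crc32(uint32_t crc, const void *buf, lfs_size_t len);
#define CRC_ATTR 0x42

// compute the CRC of a file's current contents
static int file_crc(lfs_t *lfs, const char *path, uint32_t *crc) {
    lfs_file_t file;
    int err = lfs_file_open(lfs, &file, path, LFS_O_RDONLY);
    if (err) {
        return err;
    }
    *crc = 0xffffffff;
    uint8_t buf[64];
    lfs_ssize_t n;
    while ((n = lfs_file_read(lfs, &file, buf, sizeof(buf))) > 0) {
        *crc = crc32(*crc, buf, n);
    }
    lfs_file_close(lfs, &file);
    return (n < 0) ? (int)n : 0;
}

// after writing a file, store its CRC as a custom attribute
int file_seal(lfs_t *lfs, const char *path) {
    uint32_t crc;
    int err = file_crc(lfs, path, &crc);
    if (err) {
        return err;
    }
    return lfs_setattr(lfs, path, CRC_ATTR, &crc, sizeof(crc));
}

// before trusting a file's contents, recompute its CRC and compare
int file_verify(lfs_t *lfs, const char *path) {
    uint32_t stored, actual;
    lfs_ssize_t res = lfs_getattr(lfs, path, CRC_ATTR, &stored, sizeof(stored));
    if (res < 0) {
        return (int)res;
    }
    int err = file_crc(lfs, path, &actual);
    if (err) {
        return err;
    }
    return (stored == actual) ? 0 : LFS_ERR_CORRUPT;
}
```

Note this only helps when the file can still be opened and read; it does not address the rollback behavior discussed in the test below.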
I wanted to check what happens when file data integrity is altered so I made the following test:
note: littlefs 2.5 is integrated and works otherwise fine in my environment.
A few questions/remarks:
Memory state before step5
Memory state after step5
Is this the expected behavior or am I missing something?
With this behavior I don't see how I can perform my custom integrity check since file data is lost when trying to open the file.
Thank you