lfs assert from lfs_ctz_find (head >= 2 && head <= lfs->cfg->block_count) w/ 1.7.1 #245
Could this issue be caused by the lookahead being 1024 (128 * 8) bits in length while the flash is only 512 blocks in size?
Thanks for creating an issue, I'll have to look into this when I have the chance. Do you know if this issue is present in v2?
This should be fine; however, I have just noticed there is no longer a test case covering this situation.
Hi,
I don't have a specific reproducible sequence that causes the failure yet, though I suspect it's related to completely filling the entire file system. I've just seen the error reports come in when trying to read the files afterwards.
In my case the system worked for some time without any problems, but eventually it got stuck with this error, so I had to reformat the filesystem. I thought it was a bug in my firmware code, and only later did I find this topic with the same symptom. If it really depends on the lookahead parameter, then it should be checked inside the initialization method.
Hello! I have the same issue, but I can reproduce it. The scenario is:
Do you have any idea what the source of this issue might be? I'm using v2.
I've been fuzzing littlefs and I have a fairly simple test case. The sequence of operations is:
The mount call simulates a power loss by just remounting the flash (in this case emubd) without unmounting.
I did find one issue with the CTZ skip-list while working on #372 (fix no. 4 in 517d341, these lines). However, it only occurs if the filesystem hits ENOSPC, so I'm not sure it's related. Thanks @pjsg for the great test case! I will need to reproduce this next chance I get.
Hi @pjsg, I'm running into some issues reproducing this test failure. Are we sure this bug hasn't been fixed as a part of #372? I've tried to plug it into the test framework here, but can't get it to fail. Let me know if you see anything obvious I'm doing wrong: As more of a curiosity question, do you know if AFL could be configured to generate test cases in that sort of format? (Or converted via a post-processing script?)
I have a hacked-up version that can generate (in some cases) toml files. However, it turned out that my test cases were actually being triggered by a change that I made to fix (in my branch) a much more common issue. I'm going to back that out and generate a new test case. And, of course, the easy test cases are the ones that involve power failures during writing, and I haven't implemented that bit yet. I'll get to it tomorrow.
I have made a branch in my repo: https://github.com/pjsg/littlefs/tree/test-revamp-ctz-fuzz which improves the testbd and rambd drivers and adds a test_n1.toml (which was automatically generated). The improvements add more support for power fails at bad times. It also enhances the rambd driver to be able to use an mmap'ed file as the backing buffer. This obviates the need for using the filebd. The test_n1.toml is incredibly simple and it fails. This failure can be fixed by adding one line to lfs.c (but this causes its own, more subtle, problems):
It turns out that I had made this "fix" to my test branch and this "fix" was causing the other errors that I had reported. I made the fix as otherwise the fuzzer got stuck finding ever more complex ways to trigger this fault.
Looking into this bug, I'm not sure this situation should be allowed?
So if I understand correctly, during a power failure the block device wrote 0b10010101 instead of 0b00010000. Is that expected or an arithmetic mistake? At each commit, littlefs looks at the first bit of the next tag to try to figure out if the block has already been erased. littlefs always inverts the first bit of every tag (with some exceptions), so if it hasn't been touched, littlefs assumes it can write to the block without an erase. If it has been touched, littlefs checks the commit's CRC, either accepting the tag as valid or marking the directory as "needing an erase". The temporary fix you suggested will force the directory to erase the block on every commit, which can get very expensive. Let me know if this doesn't work with an expected failure case. Also, sorry I haven't been able to look into the AFL work you've done yet; it looks very promising. At the moment I'm thinking I will prioritize bringing in #372 as is, and hopefully fix these bugs in the next release.
Your interpretation of the message is correct -- I'm modeling a program operation as follows:
My diagnosis was that after the power fail, there is a bad commit CRC -- but littlefs does not handle it correctly. As I see it, the only option is to erase the block and start again. You can't program over the bad commit, and you can't skip over the bad commit. I think that the code currently tries to program over the bad commit, and this is why it fails. I'll generate another case where it doesn't partially program a byte...
I just pushed another version of my https://github.com/pjsg/littlefs/tree/test-revamp-ctz-fuzz branch with a tests/test_corrupt.toml in it. This just gets a corrupted disk without any power fail interrupting anything -- there is a remount in the middle, but it is between calls. I'm running fuzzing on my system to see if I can trigger the same failure without partial byte writes. No luck so far...
I'm still not sure I understand how the bit pattern 0b10010101 could be created. Are the bits being written in a random order?
There is a 3rd case, which is erased storage. Because it's erased (and littlefs doesn't know the erase value), we can't rely on a CRC. Instead littlefs uses the first bit of every commit to indicate whether an attempt to write the commit has been made. Is there a case where this design is flawed?
Actually, I think that the problem is worse than I thought. According to https://community.cypress.com/docs/DOC-10507, if power is lost during programming or erase, then the resulting locations can have any value (and worse, I think, may return different values on different reads). That is, my model of writing bytes in order is wrong. It actually appears that all the bytes are written in parallel (up to the page programming size). I'm using W25Q Winbond flash chips and they seem to behave like the Cypress chips from the application note. The note on their datasheet makes it sound as though writing a single byte on a page which then gets interrupted could corrupt the entire page. This is horrible.
Hmm, interesting. It seems like the only correct solution for avoiding an erase-per-commit is to read and check the erase-value of the underlying storage. Thinking about this problem, it's possible to solve this without erase-value knowledge by storing a CRC of the erase-state off the page:
This is possible to add to littlefs (the commit tag is extendable), though it will need a bit of work.
You could just store the erase_value in the commit -- unless there are chips that don't have uniform erase values. What happens today when the readback check after write detects an error in programming (which might be because a value was written into a not-erased cell)?
littlefs will force an erase (by garbage collecting all metadata into the next erase block), though as you noted this isn't perfect, as a malformed program may return different values on sequential read calls.
A very specific case littlefs enables is an encrypted block device. Under decryption, the erase value may (should?) be nonuniform.
Welp, I know it's been a while, but I have been able to make progress on this. I have a branch here with the above forward-looking CRC proposal: One change is that it only looks forward 1 prog_size and stores the current prog_size along with the CRC, so it should be the same cost as the current implementation. This change can also be brought in in a backwards-compatible manner. It is working as a proof of concept, though it is failing some tests at the moment (the NOSPC tests are failing?). Will work on these and then bring this in.
Thank you for this -- I'm looking forward to the results! |
A Hacker News thread brought this to my attention. I think it's an interesting situation, and it's nice to see a potential solution. Hope this gets somewhere.
We're using LFS 1.7.1 in our system and recently had an issue on a single device where an LFS_ASSERT was triggered when lfs_file_read was called. We're wondering if this is a flaw in 1.7.1 or possibly related to something we might be doing incorrectly in our use of the lfs API?
The LFS_ASSERT occurred in lfs_ctz_find in lfs.c line 1144.
The specific assertion says: LFS_ASSERT(head >= 2 && head <= lfs->cfg->block_count);
This is the relevant code:
The git hash we are using:
SHA-1: d3a2cf4
Our configuration is: