
lfs assert from lfs_ctz_find (head >= 2 && head <= lfs->cfg->block_count) w/ 1.7.1 #245

Open
asteriskSF opened this issue Jul 25, 2019 · 22 comments
Labels
needs investigation no idea what is wrong

Comments

@asteriskSF

asteriskSF commented Jul 25, 2019

We're using LFS 1.7.1 in our system and recently had an issue on a single device where an LFS_ASSERT was triggered when lfs_file_read was called. We're wondering whether this is a flaw in 1.7.1 or possibly related to something we're doing incorrectly in our use of the lfs API.

The LFS_ASSERT occurred in lfs_ctz_find in lfs.c line 1144.

The specific assertion says: LFS_ASSERT(head >= 2 && head <= lfs->cfg->block_count);

This is the relevant code:

static int lfs_ctz_find(lfs_t *lfs,
        lfs_cache_t *rcache, const lfs_cache_t *pcache,
        lfs_block_t head, lfs_size_t size,
        lfs_size_t pos, lfs_block_t *block, lfs_off_t *off) {
    if (size == 0) {
        *block = 0xffffffff;
        *off = 0;
        return 0;
    }

    lfs_off_t current = lfs_ctz_index(lfs, &(lfs_off_t){size-1});
    lfs_off_t target = lfs_ctz_index(lfs, &pos);

    while (current > target) {
        lfs_size_t skip = lfs_min(
                lfs_npw2(current-target+1) - 1,
                lfs_ctz(current));

        int err = lfs_cache_read(lfs, rcache, pcache, head, 4*skip, &head, 4);
        head = lfs_fromle32(head);
        if (err) {
            return err;
        }

        LFS_ASSERT(head >= 2 && head <= lfs->cfg->block_count);
        current -= 1 << skip;
    }

    *block = head;
    *off = pos;
    return 0;
}
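For field debugging, one option is to mirror the assert with a recoverable check, so a corrupted skip-list pointer surfaces as an error return instead of a hard assert. A minimal sketch; the helper name and error value are illustrative, not littlefs API:

```c
#include <stdint.h>

// Hypothetical error value, not part of the littlefs v1 API.
#define EXAMPLE_ERR_CORRUPT (-52)

// Validate a CTZ skip-list block pointer before following it, using the
// same bounds as the LFS_ASSERT above (blocks 0 and 1 hold the
// superblock, so a data block must be at least 2).
static int example_check_head(uint32_t head, uint32_t block_count) {
    if (head < 2 || head > block_count) {
        return EXAMPLE_ERR_CORRUPT;  // report corruption instead of asserting
    }
    return 0;
}
```

In lfs_ctz_find this would stand in for the LFS_ASSERT, e.g. `int cerr = example_check_head(head, lfs->cfg->block_count); if (cerr) { return cerr; }`.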

The git hash we are using:

SHA-1: d3a2cf4

Our configuration is:

// Statically allocated read buffer. Must be read sized.
static uint8_t read_buffer[256];

// Statically allocated program buffer. Must be program sized.
static uint8_t prog_buffer[256];

// Statically allocated lookahead buffer. Must be 1 bit per lookahead block.
static uint8_t lookahead_buffer[128];

// configuration of the filesystem is provided by this struct
static struct lfs_config m_lfsCfg = {
    // block device operations
    .read  = lfs_read,
    .prog  = lfs_prog,
    .erase = lfs_erase,
    .sync  = lfs_sync,

    // block device configuration
#define FLASH_SIZE_BYTES ((128*1024*1024)/8)
#define SERIAL_FLASH_BLOCK_ERASE_BYTES (32768)
#define SERIAL_FLASH_READ_SIZE (256)
#define SERIAL_FLASH_WRITE_SIZE (256)
    .read_size = SERIAL_FLASH_READ_SIZE,
    .prog_size = SERIAL_FLASH_WRITE_SIZE,
    .block_size = SERIAL_FLASH_BLOCK_ERASE_BYTES,
    .block_count = FLASH_SIZE_BYTES/SERIAL_FLASH_BLOCK_ERASE_BYTES,
    .lookahead = 128, // Should be a multiple of 32.

    // Statically allocated read buffer. Must be read sized.
    .read_buffer = read_buffer,

    // Statically allocated program buffer. Must be program sized.
    .prog_buffer = prog_buffer,

    // Statically allocated lookahead buffer. Must be 1 bit per lookahead block.
    .lookahead_buffer = lookahead_buffer,

    //Statically allocated buffer for files. Must be program sized.
    // If enabled, only one file may be opened at a time.
    //.file_buffer = file_buffer,
};
@asteriskSF
Author

Could this issue be caused by the lookahead being 1024 (128 * 8) bits in length while the flash is only 512 blocks in size?
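If the mismatch does matter, one defensive option is to clamp the configured lookahead to the block count at init time, while keeping the multiple-of-32 requirement from the v1 documentation. A sketch with a hypothetical helper name:

```c
#include <stdint.h>

// Hypothetical helper: clamp a lookahead size (in blocks) so the
// lookahead window never exceeds the number of blocks on the device,
// rounded down to the multiple of 32 that littlefs v1 requires.
static uint32_t example_clamp_lookahead(uint32_t lookahead,
        uint32_t block_count) {
    if (lookahead > block_count) {
        lookahead = block_count;
    }
    return lookahead - (lookahead % 32);
}
```

With the configuration above, a window covering the full 128-byte buffer (1024 blocks) would be clamped to the 512 blocks actually present.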

@geky geky added the needs investigation no idea what is wrong label Jul 26, 2019
@geky
Member

geky commented Jul 26, 2019

Thanks for creating an issue, I'll have to look into this when I have the chance.

Do you know if this issue is present in v2?

Could this issue be caused by the lookahead being 1024 (128 * 8) bits in length while the flash is only 512 blocks in size?

This should be fine; however, I have just noticed there is no longer a test case covering this situation.

@biovoid

biovoid commented Jul 29, 2019

Hi,
I'm not the original poster, but I'm using v2 and hit the same assert. I can't reproduce it because I had to reformat the filesystem. I did notice that in my code the lookahead parameter was 1024 instead of 128 (block_count / 8).

@asteriskSF
Author

I don't have a specific reproducible sequence that causes the failure yet, though I suspect it's related to completely filling the filesystem. I've just seen the errors reported when trying to read the files afterwards.

@biovoid

biovoid commented Jul 30, 2019

In my case the system worked for some time without any problems, but eventually it got stuck with this error, so I had to reformat the filesystem. I thought it was a bug in my firmware code, and only later found this topic with the same symptom. If it really depends on the lookahead parameter, then that should be checked inside the initialization method.

@lsilvaalmeida

Hello! I have the same issue, but I can reproduce it. The scenario is:

  • Open a file, write some data, and close it;
  • Unmount the filesystem;
  • Mount the filesystem;
  • Write into the middle of the same file (open, seek, and write);
  • Try to close it; the assert is triggered when littlefs tries to flush the cache (and read back).

Do you have an idea for the source of this issue? I'm using v2.

@pjsg

pjsg commented Feb 15, 2020

I've been fuzzing littlefs and I have a fairly simple test case. The sequence of operations is:

  open(1, "5file5.xxxxxxxxxxxx", 0x503) -> 0
  write(1, , 2007)[^ 1499 us] -> 2007
  write(1, , 2007)[^ 1411 us] -> 2007
  write(1, , 2007)[^ 1390 us] -> 2007
  write(1, , 2007)[^ 1401 us] -> 2007
  close(1) -> 0
  open(1, "1file1.xxxx", 0x503) -> 0
  mount
  open(0, "5file5.xxxxxxxxxxxx", 0x3) -> 0
  open(1, "5file5.xxxxxxxxxxxx", 0x503) -> 0
  close(1) -> 0
  open(1, "1file1.xxxx", 0x2) -> 0
  write(0, , 63) -> 63
a.out: lfs.c:2169: lfs_ctz_find: Assertion `head >= 2 && head <= lfs->cfg->block_count' failed.
  close(0)Aborted

The mount call simulates a power loss by just remounting the flash (in this case emubd) without unmounting.

@geky
Member

geky commented Feb 24, 2020

I did find one issue with the CTZ skip-list while working on #372: fix no. 4 in 517d341, these lines.

However it only occurs if the filesystem hits ENOSPC, so I'm not sure it's related.

Thanks @pjsg for the great test case! I will need to reproduce this next chance I get.

@geky
Member

geky commented Mar 26, 2020

Hi @pjsg, I'm running into some issues reproducing this test failure. Are we sure this bug hasn't been fixed as a part of #372?

I've tried to plug it into the test framework here, but can't get it to fail. Let me know if you see anything obvious I'm doing wrong:
https://github.com/ARMmbed/littlefs/blob/test-revamp-ctz-fuzz/tests/test_new.toml

As more of a curiosity question, do you know if AFL could be configured to generate test cases in that sort of format? (or converted via post-script?)

@pjsg

pjsg commented Mar 27, 2020

I have a hacked-up version that can generate (in some cases) toml files. However, it turned out that my test cases were actually being triggered by a change I made (in my branch) to fix a much more common issue. I'm going to back that out and generate a new test case.

And, of course, the easy test cases are the ones that involve power failures during writing, and I haven't implemented that bit. I'll get to it tomorrow.

@pjsg

pjsg commented Mar 27, 2020

I have made a branch in my repo: https://github.com/pjsg/littlefs/tree/test-revamp-ctz-fuzz which improves the testbd and rambd drivers and adds a test_n1.toml (which was automatically generated). The improvements add more support for power fails at bad times. It also enhances the rambd driver to be able to use an mmap'ed file as the backing buffer. This obviates the need for using the filebd.

The test_n1.toml is incredibly simple and it fails.

This failure can be fixed by adding one line to lfs.c (but this causes its own, more subtle, problems):

            // next commit not yet programmed or we're not in valid range
            if (!lfs_tag_isvalid(tag)) {
                dir->erased = (lfs_tag_type1(ptag) == LFS_TYPE_CRC &&
                        dir->off % lfs->cfg->prog_size == 0);
                dir->erased = false;    // this line fixes the test_n1 crash, but it isn't right
                break;
            } else if (off + lfs_tag_dsize(tag) > lfs->cfg->block_size) {

It turns out that I had made this "fix" to my test branch and this "fix" was causing the other errors that I had reported. I made the fix as otherwise the fuzzer got stuck with finding ever more complex ways to trigger this fault.

@geky
Member

geky commented Mar 29, 2020

Looking into this bug, I'm not sure this situation should be allowed?

Powerfail during write of 48 bytes at offset 0x70 in block 1. Wrote 0 bytes.                                           
Byte at offset 0x70 should have been 0x10, but wrote 0x95 instead.                                                     
                                                                                                                       
Trying to program 0x70 into location with value 0x95 (ought to be 0xff) [at offset 0x70 (in block 1) in a length of 16]

So if I understand correctly, during a power failure the block device wrote 0b10010101 instead of 0b00010000. Is that expected or an arithmetic mistake?


At each commit, littlefs looks at the first bit of the next tag to try to figure out if the block has already been erased. littlefs always inverts the first bit of every tag (with some exceptions), so if it hasn't been touched, littlefs assumes it can write to the block without an erase.

If it has been touched, littlefs checks the commit's CRC, either accepting the tag as valid or marking the directory as "needing an erase". The temporary fix you suggested will force the directory to erase the block on every commit, which can get very expensive.

Let me know if this doesn't work with an expected failure case.
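The erased-vs-written distinction described above can be sketched as a check on the tag's top bit. This is a simplification of the real v2 code (which also XORs each tag against the previous one); the helper name is illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

// Simplified sketch of the valid-bit idea: tags store their top bit
// inverted, so a word read from never-programmed NOR flash (all ones)
// decodes as "invalid", meaning no commit was attempted there yet.
static bool example_tag_isvalid(uint32_t tag) {
    return !(tag & 0x80000000u);
}
```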


Also sorry I haven't been able to look into the AFL work you've done yet, it looks very promising. At the moment I'm thinking I will prioritize bringing in #372 as is, and hopefully fix these bugs on the next release.

@pjsg

pjsg commented Mar 29, 2020

Your interpretation of the message is correct -- I'm modeling a program operation as follows:

  • Some number of bytes (starting at the beginning) get programmed correctly.
  • The next byte is partially programmed (I assume that the actual programming operation is to clear bits in the byte to get to the correct value) -- i.e. only some of the bits are programmed correctly.
  • The rest of the bytes are left untouched (presumably at the erase value)
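That three-part model can be written out directly. A sketch of a simulated torn program operation, assuming a 0xff-erased NOR device where programming can only clear bits (names are mine, not from the fuzzing branch):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Simulate a power failure during a program operation: the first `done`
// bytes land correctly, the next byte has only the bits selected by
// `partial_mask` actually programmed (bits can only be cleared), and
// everything after it is left untouched.
static void example_failed_prog(uint8_t *dst, const uint8_t *src,
        size_t len, size_t done, uint8_t partial_mask) {
    memcpy(dst, src, done);  // fully programmed prefix
    if (done < len) {
        // for masked bits, take src; for the rest, keep dst as-is
        dst[done] &= (uint8_t)(src[done] | (uint8_t)~partial_mask);
    }
    // bytes beyond `done` keep whatever value dst already held
}
```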

My diagnosis was that after the power fail, there is a bad commit CRC -- but littlefs does not handle it correctly. As I see it, the only option is to erase the block and start again. You can't program over the bad commit, and you can't skip over the bad commit. I think that the code currently tries to program over the bad commit and this is why it gets the Trying to program message.

I'll generate another case where it doesn't partially program a byte...

@pjsg

pjsg commented Mar 29, 2020

I just pushed another version of my https://github.com/pjsg/littlefs/tree/test-revamp-ctz-fuzz branch with a tests/test_corrupt.toml in it. This just gets a corrupted disk without any power fail interrupting anything -- there is a remount in the middle, but it is between calls.

I'm running fuzzing on my system to see if I can trigger the same failure without partial byte writes. No luck so far....

@geky
Member

geky commented Mar 30, 2020

I'm still not sure I understand how the bit pattern 0b10010101 could be created. Are the bits being written in a random order?

method:      0xff be -> 0x00 be -> 0xff le <- 0x00 le <-
erase value: 0b11111111 0b00000000 0b11111111 0b00000000
             0b01111111 0b00000000 0b11111110 0b00000000
             0b00111111 0b00000000 0b11111100 0b00000000
             0b00011111 0b00000000 0b11111000 0b00000000
             0b00011111 0b00000000 0b11110000 0b00000000
             0b00010111 0b00010000 0b11110000 0b00010000
             0b00010011 0b00010000 0b11010000 0b00010000
             0b00010001 0b00010000 0b10010000 0b00010000
result:      0b00010000 0b00010000 0b00010000 0b00010000
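Which intermediate values are possible can be checked mechanically. Starting from a 0xff erase value, programming only clears bits, so under an arbitrary bit order an intermediate value must retain every set bit of the target; a strict MSB-first order allows only the states the table above enumerates. A sketch with hypothetical helper names:

```c
#include <stdbool.h>
#include <stdint.h>

// Reachable from 0xff toward `target` if bits may clear in any order:
// no bit that must end up 1 can ever have been cleared.
static bool example_reachable_any_order(uint8_t v, uint8_t target) {
    return (v & target) == target;
}

// Reachable if bits clear strictly MSB-first: the top k bits already
// match the target and the remaining low bits are still 1.
static bool example_reachable_msb_first(uint8_t v, uint8_t target) {
    for (int k = 0; k <= 8; k++) {
        uint8_t low = (uint8_t)(0xffu >> k);  // bits not yet visited
        if (v == (uint8_t)((target & ~low) | low)) {
            return true;
        }
    }
    return false;
}
```

By this reasoning 0b10010101 is consistent with bits clearing in an arbitrary order (it still contains the target's set bit), but not with the sequential orders the table enumerates.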

My diagnosis was that after the power fail, there is a bad commit CRC -- but littlefs does not handle it correctly. As I see it, the only option is to erase the block and start again. You can't program over the bad commit, and you can't skip over the bad commit.

There is a 3rd case, which is erased storage. Because it's erased (and littlefs doesn't know the erase value), we can't rely on a CRC. Instead littlefs uses the first bit of every commit to indicate whether an attempt to write the commit has been made. Is there a case where this design is flawed?

@pjsg

pjsg commented Mar 30, 2020

Actually, I think that the problem is worse than I thought. According to https://community.cypress.com/docs/DOC-10507 if power is lost during programming or erase, then the resulting locations can have any value (and worse, I think, may return different values on different reads). I.e. my model of writing bytes in order is wrong. It actually appears that all the bytes are written in parallel (up to the page programming size).

I'm using W25Q Winbond flash chips and they seem to be like the Cypress chips from the application note. The note on their datasheet reads:

[image: note from the Winbond W25Q datasheet]

which makes it sound as though writing a single byte on a page which then gets interrupted could corrupt the entire page.

This is horrible.

@geky
Member

geky commented Apr 6, 2020

Hmm, interesting. It seems like the only correct solution for avoiding an erase-per-commit is to read and check the erase-value of the underlying storage.

Thinking about this problem, it's possible to solve it without erase-value knowledge by storing a CRC of the erase state of the page:

.-------------------. \
|  revision count   | |
|-------------------| |
|   metadata tags   | |
|                   | |
|                   | |
|                   | +---. commit
|                   | |   |
|-------------------| |   |
| CRC of erasestate---|-. |
|-------------------| | | |
|   CRC of commit  ---|-|-'
|-------------------| / |
|      padding      |   | padding (doesn't need CRC)
|                   |   |
|-------------------| \ |
|   erased storage  | | |
|        |          | | |
|        v          | | |
|                   | | |
|                   | | |
|                   | | |
|                   | | |
|                   | +-' erased
|                   | |
|                   | |
|                   | |
|                   | |
|                   | |
|                   | |
|                   | |
'-------------------' /

This is possible to add to littlefs (the commit tag is extendable), though it will need a bit of work.
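The proposal can be sketched as: at commit time, also CRC the region that should still be erased and store that CRC alongside the commit; on a later mount, recompute it, and any mismatch means a torn program disturbed the "erased" region, so the block needs an erase before reuse. A simplified sketch, not the littlefs implementation (littlefs uses the same reflected 0x04c11db7 generator polynomial, though its exact init/finalize convention differs, and the framing here is invented):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Bitwise CRC-32 (reflected 0x04c11db7).
static uint32_t example_crc32(uint32_t crc, const void *buf, size_t len) {
    const uint8_t *p = buf;
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int b = 0; b < 8; b++) {
            crc = (crc >> 1) ^ ((crc & 1) ? 0xedb88320u : 0);
        }
    }
    return ~crc;
}

// On mount: if the recorded erase-state CRC no longer matches, the
// supposedly-erased region was touched by a torn program, and the
// block must be erased before the next commit.
static bool example_erasestate_ok(const uint8_t *region, size_t len,
        uint32_t stored_crc) {
    return example_crc32(0, region, len) == stored_crc;
}
```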

@pjsg

pjsg commented Apr 6, 2020

You could just store the erase_value in the commit -- unless there are chips that don't have uniform erase_values.

What happens today when the readback check after write detects an error in programming (which might be because a value was written into a not-erased cell)?

@geky
Member

geky commented Apr 6, 2020

What happens today when the readback check after write detects an error in programming (which might be because a value was written into a not-erased cell)?

littlefs will force an erase (by garbage collecting all metadata into the next erase block). Though as you noted this isn't perfect as a malformed program may return different values on sequential read calls.

You could just store the erase_value in the commit -- unless there are chips that don't have uniform erase_values.

A very specific case littlefs enables is an encrypted block device. Under decryption, the erase value may (should?) be nonuniform (0xffffffffffffffff -> 0xab47cd083011bdef), but it can still be used to determine if the block device was erased.

@geky
Member

geky commented Dec 7, 2020

Welp, I know it's been a while, but I have been able to make progress on this. I have a branch here with the above forward-looking CRC proposal:
https://github.com/littlefs-project/littlefs/commits/crc-rework-2

One change is that it only looks forward 1 prog_size and stores the current prog_size along with the CRC, so it should have the same cost as the current implementation. This change can also be brought in in a backwards-compatible manner.

It is working as a proof-of-concept, though failing some tests at the moment (NOSPC tests are failing?). Will work on these and then bring this in.

@pjsg

pjsg commented Dec 7, 2020

Thank you for this -- I'm looking forward to the results!

@Inviz

Inviz commented Feb 19, 2022

A Hacker News thread brought this to my attention. I think it's an interesting situation. It is nice to see a potential solution to this. Hope this gets somewhere.
