
lfs assert from lfs_ctz_find (head >= 2 && head <= lfs->cfg->block_count) w/ 1.7.1 #245

Open
asteriskSF opened this issue Jul 25, 2019 · 22 comments
Labels
needs investigation no idea what is wrong

Comments

@asteriskSF

asteriskSF commented Jul 25, 2019

We're using LFS 1.7.1 in our system and recently had an issue on a single device where an LFS_ASSERT was triggered when lfs_file_read was called. We're wondering whether this is a flaw in 1.7.1 or possibly related to something we're doing incorrectly in our use of the lfs API.

The LFS_ASSERT occurred in lfs_ctz_find in lfs.c line 1144.

The specific assertion says: LFS_ASSERT(head >= 2 && head <= lfs->cfg->block_count);

This is the relevant code:

static int lfs_ctz_find(lfs_t *lfs,
        lfs_cache_t *rcache, const lfs_cache_t *pcache,
        lfs_block_t head, lfs_size_t size,
        lfs_size_t pos, lfs_block_t *block, lfs_off_t *off) {
    if (size == 0) {
        *block = 0xffffffff;
        *off = 0;
        return 0;
    }

    lfs_off_t current = lfs_ctz_index(lfs, &(lfs_off_t){size-1});
    lfs_off_t target = lfs_ctz_index(lfs, &pos);

    while (current > target) {
        lfs_size_t skip = lfs_min(
                lfs_npw2(current-target+1) - 1,
                lfs_ctz(current));

        int err = lfs_cache_read(lfs, rcache, pcache, head, 4*skip, &head, 4);
        head = lfs_fromle32(head);
        if (err) {
            return err;
        }

        LFS_ASSERT(head >= 2 && head <= lfs->cfg->block_count);
        current -= 1 << skip;
    }

    *block = head;
    *off = pos;
    return 0;
}
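For field debugging, one option is to mirror the assert with a recoverable check, so a corrupted skip-list pointer surfaces as an error return instead of a hard assert. A minimal sketch; the helper name and error value are illustrative, not littlefs API:

```c
#include <stdint.h>

// Hypothetical error value, not part of the littlefs v1 API.
#define EXAMPLE_ERR_CORRUPT (-52)

// Validate a CTZ skip-list block pointer before following it, using the
// same bounds as the LFS_ASSERT above (blocks 0 and 1 hold the
// superblock, so a data block must be at least 2).
static int example_check_head(uint32_t head, uint32_t block_count) {
    if (head < 2 || head > block_count) {
        return EXAMPLE_ERR_CORRUPT;  // report corruption instead of asserting
    }
    return 0;
}
```

In lfs_ctz_find this would stand in for the LFS_ASSERT, e.g. `int cerr = example_check_head(head, lfs->cfg->block_count); if (cerr) { return cerr; }`.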

The git hash we are using:

SHA-1: d3a2cf4

Our configuration is:

// Statically allocated read buffer. Must be read sized.
static uint8_t read_buffer[256];

// Statically allocated program buffer. Must be program sized.
static uint8_t prog_buffer[256];

// Statically allocated lookahead buffer. Must be 1 bit per lookahead block.
static uint8_t lookahead_buffer[128];

// configuration of the filesystem is provided by this struct
static struct lfs_config m_lfsCfg = {
    // block device operations
    .read  = lfs_read,
    .prog  = lfs_prog,
    .erase = lfs_erase,
    .sync  = lfs_sync,

    // block device configuration
#define FLASH_SIZE_BYTES ((128*1024*1024)/8)
#define SERIAL_FLASH_BLOCK_ERASE_BYTES (32768)
#define SERIAL_FLASH_READ_SIZE (256)
#define SERIAL_FLASH_WRITE_SIZE (256)
    .read_size = SERIAL_FLASH_READ_SIZE,
    .prog_size = SERIAL_FLASH_WRITE_SIZE,
    .block_size = SERIAL_FLASH_BLOCK_ERASE_BYTES,
    .block_count = FLASH_SIZE_BYTES/SERIAL_FLASH_BLOCK_ERASE_BYTES,
    .lookahead = 128, // Should be a multiple of 32.

    // Statically allocated read buffer. Must be read sized.
    .read_buffer = read_buffer,

    // Statically allocated program buffer. Must be program sized.
    .prog_buffer = prog_buffer,

    // Statically allocated lookahead buffer. Must be 1 bit per lookahead block.
    .lookahead_buffer = lookahead_buffer,

    //Statically allocated buffer for files. Must be program sized.
    // If enabled, only one file may be opened at a time.
    //.file_buffer = file_buffer,
};
@asteriskSF
Author

Could this issue be caused by the lookahead being 1024 (128 * 8) bits in length while the flash is only 512 blocks in size?
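If the mismatch does matter, one defensive option is to clamp the configured lookahead to the block count at init time, while keeping the multiple-of-32 requirement from the v1 documentation. A sketch with a hypothetical helper name:

```c
#include <stdint.h>

// Hypothetical helper: clamp a lookahead size (in blocks) so the
// lookahead window never exceeds the number of blocks on the device,
// rounded down to the multiple of 32 that littlefs v1 requires.
static uint32_t example_clamp_lookahead(uint32_t lookahead,
        uint32_t block_count) {
    if (lookahead > block_count) {
        lookahead = block_count;
    }
    return lookahead - (lookahead % 32);
}
```

With the configuration above, a window covering the full 128-byte buffer (1024 blocks) would be clamped to the 512 blocks actually present.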

@geky geky added the needs investigation no idea what is wrong label Jul 26, 2019
@geky
Member

geky commented Jul 26, 2019

Thanks for creating an issue, I'll have to look into this when I have the chance.

Do you know if this issue is present in v2?

Could this issue be caused by the lookahead being 1024 (128 * 8) bits in length while the flash is only 512 blocks in size?

This should be fine; however, I have just noticed there is no longer a test case covering this situation.

@biovoid

biovoid commented Jul 29, 2019

Hi,
I'm not the original poster, but I'm using v2 and hit the same assert. I can't reproduce it because I had to reformat the filesystem. I did notice that in my code the lookahead parameter was 1024 instead of 128 (block_count / 8).

@asteriskSF
Author

I don't have a specific reproducible sequence that causes the failure yet, though I suspect it's related to completely filling the filesystem. I've just seen the errors reported when trying to read the files afterwards.

@biovoid

biovoid commented Jul 30, 2019

In my case the system worked for some time without any problems, but eventually it got stuck with this error, so I had to reformat the filesystem. I thought it was a bug in my firmware code, and only later found this topic with the same symptom. If it really depends on the lookahead parameter, then that should be checked inside the initialization method.

@lsilvaalmeida

Hello! I have the same issue, but I can reproduce it. The scenario is:

  • Open a file, write some data, and close it;
  • Unmount the filesystem;
  • Mount the filesystem;
  • Write into the middle of the same file (open, seek, and write);
  • Try to close it; the assert is triggered when littlefs tries to flush the cache (and read back).

Do you have an idea for the source of this issue? I'm using v2.

@pjsg

pjsg commented Feb 15, 2020

I've been fuzzing littlefs and I have a fairly simple test case. The sequence of operations is:

  open(1, "5file5.xxxxxxxxxxxx", 0x503) -> 0
  write(1, , 2007)[^ 1499 us] -> 2007
  write(1, , 2007)[^ 1411 us] -> 2007
  write(1, , 2007)[^ 1390 us] -> 2007
  write(1, , 2007)[^ 1401 us] -> 2007
  close(1) -> 0
  open(1, "1file1.xxxx", 0x503) -> 0
  mount
  open(0, "5file5.xxxxxxxxxxxx", 0x3) -> 0
  open(1, "5file5.xxxxxxxxxxxx", 0x503) -> 0
  close(1) -> 0
  open(1, "1file1.xxxx", 0x2) -> 0
  write(0, , 63) -> 63
a.out: lfs.c:2169: lfs_ctz_find: Assertion `head >= 2 && head <= lfs->cfg->block_count' failed.
  close(0)Aborted

The mount call simulates a power loss by just remounting the flash (in this case emubd) without unmounting.

@geky
Member

geky commented Feb 24, 2020

I did find one issue with the CTZ skip-list while working on #372: fix no. 4 in 517d341, these lines.

However it only occurs if the filesystem hits ENOSPC, so I'm not sure it's related.

Thanks @pjsg for the great test case! I will need to reproduce this next chance I get.

@geky
Member

geky commented Mar 26, 2020

Hi @pjsg, I'm running into some issues reproducing this test failure. Are we sure this bug hasn't been fixed as a part of #372?

I've tried to plug it into the test framework here, but can't get it to fail. Let me know if you see anything obvious I'm doing wrong:
https://github.com/ARMmbed/littlefs/blob/test-revamp-ctz-fuzz/tests/test_new.toml

As more of a curiosity question, do you know if AFL could be configured to generate test cases in that sort of format? (or converted via post-script?)

@pjsg

pjsg commented Mar 27, 2020

I have a hacked-up version that can generate (in some cases) toml files. However, it turned out that my test cases were actually being triggered by a change I made (in my branch) to fix a much more common issue. I'm going to back that out and generate a new test case.

And, of course, the easy test cases are the ones that involve power failures during writing, and I haven't implemented that bit. I'll get to it tomorrow.

@pjsg

pjsg commented Mar 27, 2020

I have made a branch in my repo: https://github.com/pjsg/littlefs/tree/test-revamp-ctz-fuzz which improves the testbd and rambd drivers and adds a test_n1.toml (which was automatically generated). The improvements add more support for power fails at bad times. It also enhances the rambd driver to be able to use an mmap'ed file as the backing buffer. This obviates the need for using the filebd.

The test_n1.toml is incredibly simple and it fails.

This failure can be fixed by adding one line to lfs.c (but this causes its own, more subtle, problems):

            // next commit not yet programmed or we're not in valid range
            if (!lfs_tag_isvalid(tag)) {
                dir->erased = (lfs_tag_type1(ptag) == LFS_TYPE_CRC &&
                        dir->off % lfs->cfg->prog_size == 0);
                dir->erased = false;    // this line fixes the test_n1 crash, but it isn't right
                break;
            } else if (off + lfs_tag_dsize(tag) > lfs->cfg->block_size) {

It turns out that I had made this "fix" to my test branch and this "fix" was causing the other errors that I had reported. I made the fix as otherwise the fuzzer got stuck with finding ever more complex ways to trigger this fault.

@geky
Member

geky commented Mar 29, 2020

Looking into this bug, I'm not sure this situation should be allowed?

Powerfail during write of 48 bytes at offset 0x70 in block 1. Wrote 0 bytes.                                           
Byte at offset 0x70 should have been 0x10, but wrote 0x95 instead.                                                     
                                                                                                                       
Trying to program 0x70 into location with value 0x95 (ought to be 0xff) [at offset 0x70 (in block 1) in a length of 16]

So if I understand correctly, during a power failure the block device wrote 0b10010101 instead of 0b00010000. Is that expected or an arithmetic mistake?


At each commit, littlefs looks at the first bit of the next tag to try to figure out if the block has already been erased. littlefs always inverts the first bit of every tag (with some exceptions), so if it hasn't been touched, littlefs assumes it can write to the block without an erase.

If it has been touched, littlefs checks the commit's CRC, either accepting the tag as valid or marking the directory as "needing an erase". The temporary fix you suggested will force the directory to erase the block on every commit, which can get very expensive.

Let me know if this doesn't work with an expected failure case.
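The erased-vs-written distinction described above can be sketched as a check on the tag's top bit. This is a simplification of the real v2 code (which also XORs each tag against the previous one); the helper name is illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

// Simplified sketch of the valid-bit idea: tags store their top bit
// inverted, so a word read from never-programmed NOR flash (all ones)
// decodes as "invalid", meaning no commit was attempted there yet.
static bool example_tag_isvalid(uint32_t tag) {
    return !(tag & 0x80000000u);
}
```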


Also sorry I haven't been able to look into the AFL work you've done yet, it looks very promising. At the moment I'm thinking I will prioritize bringing in #372 as is, and hopefully fix these bugs on the next release.

@pjsg

pjsg commented Mar 29, 2020

Your interpretation of the message is correct -- I'm modeling a program operation as follows:

  • Some number of bytes (starting at the beginning) get programmed correctly.
  • The next byte is partially programmed (I assume that the actual programming operation is to clear bits in the byte to get to the correct value) -- i.e. only some of the bits are programmed correctly.
  • The rest of the bytes are left untouched (presumably at the erase value)
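That three-part model can be written out directly. A sketch of a simulated torn program operation, assuming a 0xff-erased NOR device where programming can only clear bits (names are mine, not from the fuzzing branch):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Simulate a power failure during a program operation: the first `done`
// bytes land correctly, the next byte has only the bits selected by
// `partial_mask` actually programmed (bits can only be cleared), and
// everything after it is left untouched.
static void example_failed_prog(uint8_t *dst, const uint8_t *src,
        size_t len, size_t done, uint8_t partial_mask) {
    memcpy(dst, src, done);  // fully programmed prefix
    if (done < len) {
        // for masked bits, take src; for the rest, keep dst as-is
        dst[done] &= (uint8_t)(src[done] | (uint8_t)~partial_mask);
    }
    // bytes beyond `done` keep whatever value dst already held
}
```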

My diagnosis was that after the power fail, there is a bad commit CRC -- but littlefs does not handle it correctly. As I see it, the only option is to erase the block and start again. You can't program over the bad commit, and you can't skip over the bad commit. I think that the code currently tries to program over the bad commit and this is why it gets the Trying to program message.

I'll generate another case where it doesn't partially program a byte...

@pjsg

pjsg commented Mar 29, 2020

I just pushed another version of my https://github.com/pjsg/littlefs/tree/test-revamp-ctz-fuzz branch with a tests/test_corrupt.toml in it. This just gets a corrupted disk without any power fail interrupting anything -- there is a remount in the middle, but it is between calls.

I'm running fuzzing on my system to see if I can trigger the same failure without partial byte writes. No luck so far....

@geky
Member

geky commented Mar 30, 2020

I'm still not sure I understand how the bit pattern 0b10010101 could be created. Are the bits being written in a random order?

method:      0xff be -> 0x00 be -> 0xff le <- 0x00 le <-
erase value: 0b11111111 0b00000000 0b11111111 0b00000000
             0b01111111 0b00000000 0b11111110 0b00000000
             0b00111111 0b00000000 0b11111100 0b00000000
             0b00011111 0b00000000 0b11111000 0b00000000
             0b00011111 0b00000000 0b11110000 0b00000000
             0b00010111 0b00010000 0b11110000 0b00010000
             0b00010011 0b00010000 0b11010000 0b00010000
             0b00010001 0b00010000 0b10010000 0b00010000
result:      0b00010000 0b00010000 0b00010000 0b00010000
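Which intermediate values are possible can be checked mechanically. Starting from a 0xff erase value, programming only clears bits, so under an arbitrary bit order an intermediate value must retain every set bit of the target; a strict MSB-first order allows only the states the table above enumerates. A sketch with hypothetical helper names:

```c
#include <stdbool.h>
#include <stdint.h>

// Reachable from 0xff toward `target` if bits may clear in any order:
// no bit that must end up 1 can ever have been cleared.
static bool example_reachable_any_order(uint8_t v, uint8_t target) {
    return (v & target) == target;
}

// Reachable if bits clear strictly MSB-first: the top k bits already
// match the target and the remaining low bits are still 1.
static bool example_reachable_msb_first(uint8_t v, uint8_t target) {
    for (int k = 0; k <= 8; k++) {
        uint8_t low = (uint8_t)(0xffu >> k);  // bits not yet visited
        if (v == (uint8_t)((target & ~low) | low)) {
            return true;
        }
    }
    return false;
}
```

By this reasoning 0b10010101 is consistent with bits clearing in an arbitrary order (it still contains the target's set bit), but not with the sequential orders the table enumerates.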

My diagnosis was that after the power fail, there is a bad commit CRC -- but littlefs does not handle it correctly. As I see it, the only option is to erase the block and start again. You can't program over the bad commit, and you can't skip over the bad commit.

There is a 3rd case, which is erased storage. Because it's erased (and littlefs doesn't know the erase value), we can't rely on a CRC. Instead littlefs uses the first bit of every commit to indicate whether an attempt to write the commit has been made. Is there a case where this design is flawed?

@pjsg

pjsg commented Mar 30, 2020

Actually, I think that the problem is worse than I thought. According to https://community.cypress.com/docs/DOC-10507 if power is lost during programming or erase, then the resulting locations can have any value (and worse, I think, may return different values on different reads). I.e. my model of writing bytes in order is wrong. It actually appears that all the bytes are written in parallel (up to the page programming size).

I'm using W25Q Winbond flash chips and they seem to be like the Cypress chips from the application note. The note on their datasheet reads:

[image: note from the Winbond W25Q datasheet]

which makes it sound as though writing a single byte on a page which then gets interrupted could corrupt the entire page.

This is horrible.

@geky
Member

geky commented Apr 6, 2020

Hmm, interesting. It seems like the only correct solution for avoiding an erase-per-commit is to read and check the erase-value of the underlying storage.

Thinking about this problem, it's possible to solve it without erase-value knowledge by storing a CRC of the erase state of the page:

.-------------------. \
|  revision count   | |
|-------------------| |
|   metadata tags   | |
|                   | |
|                   | |
|                   | +---. commit
|                   | |   |
|-------------------| |   |
| CRC of erasestate---|-. |
|-------------------| | | |
|   CRC of commit  ---|-|-'
|-------------------| / |
|      padding      |   | padding (doesn't need CRC)
|                   |   |
|-------------------| \ |
|   erased storage  | | |
|        |          | | |
|        v          | | |
|                   | | |
|                   | | |
|                   | | |
|                   | | |
|                   | +-' erased
|                   | |
|                   | |
|                   | |
|                   | |
|                   | |
|                   | |
|                   | |
'-------------------' /

This is possible to add to littlefs (the commit tag is extendable), though it will need a bit of work.
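The proposal can be sketched as: at commit time, also CRC the region that should still be erased and store that CRC alongside the commit; on a later mount, recompute it, and any mismatch means a torn program disturbed the "erased" region, so the block needs an erase before reuse. A simplified sketch, not the littlefs implementation (littlefs uses the same reflected 0x04c11db7 generator polynomial, though its exact init/finalize convention differs, and the framing here is invented):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Bitwise CRC-32 (reflected 0x04c11db7).
static uint32_t example_crc32(uint32_t crc, const void *buf, size_t len) {
    const uint8_t *p = buf;
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int b = 0; b < 8; b++) {
            crc = (crc >> 1) ^ ((crc & 1) ? 0xedb88320u : 0);
        }
    }
    return ~crc;
}

// On mount: if the recorded erase-state CRC no longer matches, the
// supposedly-erased region was touched by a torn program, and the
// block must be erased before the next commit.
static bool example_erasestate_ok(const uint8_t *region, size_t len,
        uint32_t stored_crc) {
    return example_crc32(0, region, len) == stored_crc;
}
```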

@pjsg

pjsg commented Apr 6, 2020

You could just store the erase_value in the commit -- unless there are chips that don't have uniform erase_values.

What happens today when the readback check after write detects an error in programming (which might be because a value was written into a not-erased cell)?

@geky
Member

geky commented Apr 6, 2020

What happens today when the readback check after write detects an error in programming (which might be because a value was written into a not-erased cell)?

littlefs will force an erase (by garbage collecting all metadata into the next erase block). Though as you noted this isn't perfect as a malformed program may return different values on sequential read calls.

You could just store the erase_value in the commit -- unless there are chips that don't have uniform erase_values.

A very specific case littlefs enables is an encrypted block device. Under decryption, the erase value may (should?) be nonuniform (0xffffffffffffffff -> 0xab47cd083011bdef), but it can still be used to determine if the block device was erased.

@geky
Member

geky commented Dec 7, 2020

Welp, I know it's been a while, but I have been able to make progress on this. I have a branch here with the above forward-looking CRC proposal:
https://github.com/littlefs-project/littlefs/commits/crc-rework-2

One change is that it only looks forward 1 prog_size and stores the current prog_size along with the CRC, so it should have the same cost as the current implementation. This change can also be brought in in a backwards-compatible manner.

It is working as a proof-of-concept, though failing some tests at the moment (NOSPC tests are failing?). Will work on these and then bring this in.

@pjsg

pjsg commented Dec 7, 2020

Thank you for this -- I'm looking forward to the results!

@Inviz

Inviz commented Feb 19, 2022

A Hacker News thread brought this to my attention. I think it's an interesting situation. It is nice to see a potential solution to this. Hope this gets somewhere.
