Question about scalability and large files #75
Great analysis! Definitely several interesting observations here. I had to double check, but there is in fact a quadratic operation to writing files that takes over as the file grows. It's not an operation related to the file structure, but actually the block allocator, which, due to COW files, gets called on every block write, even if we're modifying a file in place. The block allocator by itself is O(n), but it's O(size of the filesystem, including the file we're writing to). So as our file grows, the cost to allocate a block grows linearly with it, resulting indirectly in a full cost of O(n^2) for a file write. However, littlefs has a lot of tricks in place to try to keep this from becoming completely terrible!

Ok, so I ended up with a lot of details below. Sorry about the wall of text. It's a relatively subtle problem, so hopefully it helps to know how everything works.
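To make the cost model concrete, here's a heavily simplified sketch of a lookahead-style allocator (illustrative names and structure only, not littlefs's actual code): refilling the window means traversing every block the filesystem references, so one refill is O(blocks in use), and a file that keeps allocating blocks as it grows ends up paying roughly O(n^2) overall.

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

// Simplified lookahead allocator sketch (not littlefs internals).
#define LOOKAHEAD_BLOCKS 512u

typedef struct {
    const uint32_t *in_use;   // every block currently referenced by the filesystem
    size_t in_use_count;
} fs_t;

typedef struct {
    uint32_t start;                        // first block covered by the window
    uint8_t map[LOOKAHEAD_BLOCKS / 8];     // 1 bit per block, set = taken
} lookahead_t;

// Refill the window: cost grows with the total amount of data stored.
static void scan_lookahead(const fs_t *fs, lookahead_t *la, uint32_t start)
{
    la->start = start;
    memset(la->map, 0, sizeof(la->map));
    for (size_t i = 0; i < fs->in_use_count; i++) {
        uint32_t off = fs->in_use[i] - start;
        if (off < LOOKAHEAD_BLOCKS)
            la->map[off / 8] |= 1u << (off % 8);
    }
}

// Hand out the next free block in the window, or return -1 to force a rescan.
static int32_t alloc_block(lookahead_t *la)
{
    for (uint32_t off = 0; off < LOOKAHEAD_BLOCKS; off++) {
        if (!(la->map[off / 8] & (1u << (off % 8)))) {
            la->map[off / 8] |= 1u << (off % 8);
            return (int32_t)(la->start + off);
        }
    }
    return -1;   // window exhausted: caller must call scan_lookahead() again
}
```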
So, long story short, it's a big tradeoff of runtime for code size/complexity. One nice thing about this approach is that because the allocator is only stored in RAM, it's very easy to modify the allocator without worrying about compatibility. So it may be possible to introduce a more complex/RAM-intensive/fast allocator in a version of littlefs designed for more powerful systems.

So what can you do to improve the speed?
Unfortunately, these are only constant improvements, but that's what we get for dealing with a quadratic function. It'd be interesting to see if there is a way to fix this cost specifically for writing to a file. I wonder if we could store the bounds of all blocks in a file to keep the allocator from checking the file unless it needs to. Hmmm. May be worth investigating, though I'm not sure if fragmentation then becomes a concern.
I need to run, but there are still some interesting points in your second observation around different write costs that you may start being able to piece together. I'll update with another comment about that in a bit.
The comparison of successive writes at different sizes is interesting and something I haven't seen before. I'm still trying to piece together exactly what's going on. As you guessed, the lookahead would explain a lot of the variation. Most notably the 128KB files are interesting, as they alternate between files with a scan and files without (128KB = 256 blocks = 1/2 lookahead). As soon as that grows to 256KB files, the variation is gone, as each file needs to do a scan. The very small files only need to scan the lookahead once for the entire lifetime of the mount, so that would explain the initial cost, whereas the larger files need to scan multiple times for every file, so the initial scan (when the filesystem is empty) is actually relatively cheap.

Although I have no idea about the second spike for the 2MB file. (It's far too small to wrap around on your 32GB SD card.) I'm wondering if it's actually caused by the FTL logic inside the SD card. Unlike the raw writes, the littlefs writes are moving across the entire range of the SD card. It's possible that after a certain range of addresses the SD card slows down or needs to allocate/erase new blocks. I'm not entirely sure. I think I may try to build a model and see what the theoretical performance is for these operations and see how they compare, though it may be a bit before I can put this together.
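For reference, the back-of-the-envelope arithmetic behind that, using the 512 B blocks and lookahead of 512 from your setup (my rough numbers, not measurements):

```
block size   : 512 B
lookahead    : 512 blocks  ->  one window covers 512 * 512 B = 256 KB
128 KB file  : 128 KB / 512 B = 256 blocks = half a window
               -> only every other file exhausts the window and triggers a rescan
256 KB file  : 512 blocks = one full window
               -> every file exhausts the window, so every file pays for a rescan
```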
Thanks for your explanations! We'll see how well my application and littlefs will go along, but I hope that they will work together smoothly (;

Even though SD cards may not be a perfect fit for littlefs, I hope they are not a bad fit (; After all, these cards are huge, cheap and pretty fast. I think that once you need more than several hundred MB (and when production volume is rather small), SD cards are the only thing left on the market with a reasonable price and good availability.

As for ideas to amortize the cost of the allocator, with a big chip and an RTOS you could dedicate a low-priority thread to some background work related to the file system. Maybe it would be possible to expose a function, which the user could call repeatedly when there's nothing else to do, which could perform some of the work that would otherwise be done when the next write is executed? Something like a background defrag/file-system-scan? Is that possible and worthwhile?

From my perspective I guess it could be nice, as it seems that when there's a big file written on the file system, using small files (I've implemented a counter which gets incremented all the time) gets slow too - once in a few dozen/hundred writes there's a huge delay, which I assume is the allocator scan - the same delay can be observed when actually mounting the file system.

My use case is obviously not dumping several MB of raw data to files instantly, but generally I expect the application to also have huge and long files with logs and recorded measurements. These would be updated incrementally (for example adding several hundred bytes every minute) and I'm wondering whether in such a scenario (appending a "small" piece to the end of a "large" file) I should expect that the writes and seeks would get slower with increasing file size. I guess I have to check that too (;
Or here's another idea to mitigate the problem that once in a few hundred writes the allocation scan can take a really long time, which makes your "write a few bytes" operation take much more than expected (for example a minute instead of a fraction of a second). Maybe whenever one block from the allocator is used, the allocator would move on to find one new block? So instead of trying to find a few thousand blocks every few thousand writes, it would find just one with each write? Or would this basically be using a lookahead value of 1?

Combining that with my idea for exposing a public function which could be run from a low-priority thread or when there's nothing better to do - this function would try to find one more free block with each call. I'm not sure whether these ideas are possible (my understanding of file system internals is very low (; ), but I guess they would be very good additions, as right now the write time is far from deterministic.

I also see that the performance decreases with time - my counter could be updated 20x per second when it was around 10000, but after ~20000 writes it was updated only 3 times per second. After a reset the performance is back to ~20x per second.
Ah, you're right, SD/eMMC should be a priority for littlefs (though second to raw NOR/NAND flash). At least for now they're the main form of storage where the scalability becomes a real concern.
That's not a bad idea at all. The only downside I can think of is that right now littlefs is RTOS agnostic, so it would probably need to be synchronized externally. And also the progress only affects RAM, so it wouldn't be useful for speeding up mount-to-write times.
Yep, unfortunately the cost of the allocator is spread over all files, including the small ones.
Ah! This is actually a slightly different operation, the deorphan step, which I have a resolution for and am currently working on.
Ah! Actually you should find that seek and append are very efficient : )
It's just the allocation that is costly...
Yeah it'd be a lookahead of size 1. The main trick here is that scanning for 1 block is the same cost as scanning for any number of blocks, you just need the RAM to store them. If you had 8MB to spare (enough lookahead for 32GB at 512B blocks), you could scan once and file writes would be just the base O(1).
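For reference, that 8MB figure is just the size of a one-bit-per-block map over the whole card:

```
32 GB / 512 B per block = 64M blocks
64M blocks * 1 bit each = 64 Mbit = 8 MB of lookahead buffer
```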
This is probably the best idea for an "lfs_fs_lookahead" function. That is, try to find as many blocks as we have RAM for if we haven't already. Though the weird thing here is that rescanning free blocks we have already scanned would be free anyways.
Ok, actually that's a weird one. I suspect it's actually the SD card's FTL being problematic here. It probably ran out of erased blocks and needed to start allocating more (FTLs are actually as complex as filesystems). Though littlefs probably isn't helping by writing sequentially, which isn't common for traditional file systems.
Also, you may be able to speed this up quite a bit by querying the card, I think they usually support around 25MHz. May be worth a try.
Do try this if you can! I suspect bumping up the erase size is the best answer to the speed limitation for now.
My general idea was for this function to be called only optionally. You don't call it - you get exactly what littlefs is doing right now. If you call it - some operations get a speed-up. Unless that complicates the code too much?

Moreover - the operation of this function should be "short", as it will require locking the file system. If it did something very long while holding the lock, the situation wouldn't be improved that much anyway. That's why I'm talking just about "one call - one new block" instead of "... - whole new scan".
This is something I don't fully understand, unless you are saying that having such a function will not reduce the time it takes to mount. This is of course expected, however I think that these tasks could be offloaded to this function too (; Unless someone is really doing "mount-write-unmount" all the time, but in that case it would be a problem of the application <:
The important thing here is what I wrote in the second paragraph. This function should be kept as short (in time) as possible, otherwise it wouldn't really help that much. The general assumption would be that to really help, the user should call the function repeatedly over and over again - for example several hundred times. Or you could actually expose an argument in this function - either a bool (full-and-long scan or as-short-as-possible scan) or just a number (the amount of blocks to "update" in this call, or anything which would allow selecting the length of the operation) - so that an RTOS concerned with locking would select for this function to be short and call it 100x per second, while a bare-metal application could choose to do a full-and-long rescan once in a while.
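To make this concrete, here is roughly what I have in mind - the name, signature and helper functions are completely made up for illustration, not an existing littlefs API:

```c
#include <stdint.h>
#include "lfs.h"   // littlefs public header, for lfs_t

// Hypothetical function, NOT part of littlefs: advance the free-block scan by
// at most max_blocks blocks, returning how many were actually processed
// (negative on error). A small budget keeps each call - and the lock around
// it - short.
int lfs_fs_scan_step(lfs_t *lfs, uint32_t max_blocks);

// Hypothetical RTOS hooks assumed by this sketch.
void filesystem_lock(void);
void filesystem_unlock(void);
void sleep_ms(unsigned ms);

// A low-priority thread could then nibble at the scan in small, bounded steps.
void background_scan_task(lfs_t *lfs)
{
    for (;;) {
        filesystem_lock();
        lfs_fs_scan_step(lfs, 16);   // short critical section: ~16 blocks per call
        filesystem_unlock();
        sleep_ms(10);                // yield to more important work
    }
}
```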
If I had 8 MB to spare, life would be much nicer (; It's just an STM32F4 <:
If that were the case, I suspect that resetting the device wouldn't change anything (resetting without any power-cycle to the SD card, though it would of course be reinitialized again). I'll investigate that a bit more, as this seems interesting too.
Yes, that would of course be a good solution, but the SPI driver is interrupt-based, which turned out to be a very bad idea for an RTOS. With high frequency and some other activity going on (other interrupts especially), it easily "chokes" and detects data overruns /; So either there's some bug in the driver, or using SPI that way in an RTOS is just stupid. Most likely the latter, so I have to rework that to use DMA some day... Heh, so much to do, so little time (;
Currently my biggest concern is not the speed, but the variation of speed - usually writing a small file takes a fraction of a second, but sometimes it may take several dozen seconds, which may make the user of the device a bit nervous...
Any chance we could have the idea presented above (the incremental lookahead function) implemented then?

Actually I think that the whole-system-scan which is done after the first write is a wrong idea and has to be removed (or maybe made optional). Even with something like the function above, the first scan would still be there. Or maybe this is actually related to the "deorphan step" you mentioned earlier?

Please let me know whether you'd prefer to continue discussing this issue here or maybe you'd prefer to have it as a new issue (this one is pretty long and "multithreaded"). Thanks in advance!
Sorry @FreddieChopin, currently my priority is to get the work around metadata logging, custom attributes, and inline files done (#23, more info, dir-log branch), since this will have a big impact on the rest of the filesystem.
Ouch! Have you tried increasing the block_size?

Additionally, with the dir-log work, relatively small files (up to 4KB) can be stored inside the directory blocks. So in the near future large block sizes won't be that bad for small files.
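For example, something roughly along these lines - the sd_* callbacks and the exact sizes are just placeholders for your 32GB card, and you should double-check the field names against struct lfs_config in your copy of lfs.h:

```c
#include "lfs.h"

// Placeholder block device callbacks (your SD/SPI driver goes here).
extern int sd_read(const struct lfs_config *c, lfs_block_t block,
                   lfs_off_t off, void *buffer, lfs_size_t size);
extern int sd_prog(const struct lfs_config *c, lfs_block_t block,
                   lfs_off_t off, const void *buffer, lfs_size_t size);
extern int sd_erase(const struct lfs_config *c, lfs_block_t block);
extern int sd_sync(const struct lfs_config *c);

const struct lfs_config cfg = {
    .read  = sd_read,
    .prog  = sd_prog,
    .erase = sd_erase,
    .sync  = sd_sync,

    // keep reads/programs at the SD sector size...
    .read_size = 512,
    .prog_size = 512,

    // ...but allocate in much bigger units, so there are far fewer blocks to scan
    .block_size  = 32768,     // 64 sectors per littlefs block
    .block_count = 1048576,   // ~32 GB / 32 KB

    .lookahead = 512,         // in blocks
};
```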
Yeah, I suspect running the lookahead in a background thread won't be that useful for embedded systems. There's still the first-time cost.

The reason for the whole-system-scan is that we need to keep track of free blocks. By not storing free blocks in the filesystem we can decrease the code size of the implementation significantly. Though I'm open to different ideas if there's another solution that works well.
So, we actually only need to find each block address. The issue is that because we're using a linked-list, if our read_size is large we end up reading the nearby data. This is why increasing the block_size (but not the read_size) will reduce the scan cost. One long-term option may be to add support for "indirect ctz-lists" (thought up here), where each block is actually just an array of pointers to other blocks. This would make the most use out of large reads, though would require a bit of work to see if the data structure could reuse the same logic as the normal ctz-lists.
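Just to sketch the shape of that idea (a purely speculative layout, nothing implemented):

```c
#include <stdint.h>

// Speculative "indirect block": instead of chaining data blocks in a linked
// list, one block holds nothing but pointers to data blocks. With 512 B blocks
// and 4 B block addresses, a single indirect block covers 128 data blocks
// (64 KB of file data), so a traversal reads 1 block instead of 128.
#define BLOCK_SIZE      512u
#define PTRS_PER_BLOCK  (BLOCK_SIZE / sizeof(uint32_t))   // 128 pointers

typedef struct {
    uint32_t data_blocks[PTRS_PER_BLOCK];   // addresses of the file's data blocks
} indirect_block_t;
```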
So the deorphan step is this big O(n^2) scan for orphans, where n is the number of metadata blocks. It doesn't need to look at the file list at all, so large files don't matter, but it becomes an issue if you have a large number of files. It's actually already been removed in the dir-log branch : )
Would it be possible to make the issue of very-long-first-write-after-mount a second priority then? (;
I'll try that tomorrow, but I suspect that this will only decrease the time proportionally. We'll see...
Would that be "no more than 1 file, no more than ~4 kB, per 1 directory" or maybe "any number of files as long as combined size is less than ~4 kB, per 1 directory"?
I'm in no position to suggest alternatives, as my knowledge about file systems is pretty low. I'm just trying to understand what is going on and whether it's a fundamental flaw which cannot be removed without redesigning the whole thing, or maybe something that can actually be fixed in the future. As with everything else, it always comes down to the balance of different aspects (like complexity, speed, RAM usage, wear, features, ...).

Don't get me wrong, I really like the overall feature set of littlefs, and power resilience is a very important feature. However - as it is now - the solution just doesn't scale well... People will use this project with SD cards (I suspect much more often than with raw memory chips) and will put a lot of data in the file system, which is when they will hit the same issue that I'm facing now. The first operation may take a bit longer than usual, but in my opinion anything above ~10 seconds (assuming that this first operation is not "dump 100 MB of data") is unacceptable and cannot be reliably used in a real-time environment );

If you ask me, the file system may use 2x as much flash and 2x as much RAM as long as the whole-system-scan can be omitted. And from what you wrote, the only option to actually omit it would be to store the info about free blocks in the storage. Whether you store this as a tree or a list or something in-between doesn't really matter, as long as this info is available right away after mount. All attempts to just make this faster (instead of omitting it) only postpone the moment the problem starts to be noticeable and starts causing trouble. Assuming the theoretical limit of 3 MB/s read speed, it takes only ~60000 used blocks to reach 10 seconds (assuming a read block size of 512 B).

I'm not trying to complain just for the sake of complaining, just looking for the perfect reason that would convince you (;
It is more a short-term solution than anything, though may work well enough for your use case.
Ah, 4KB per file and 1024 files per directory block, unlimited directory blocks in a directory.
I didn't think you were complaining, I owe you sincere thanks for the constructive criticism, it is very valuable : ) Just wanted to post the quick responses, I'll be able to respond more in a bit. (One note from my perspective: I've actually seen more consumers of small SPI flash chips than of SD cards. Though this doesn't change the fact that scalability is still a concern. Storage never seems to get smaller.)
I was wrong in saying this; the main issue isn't code size cost, but the fact that littlefs is built entirely around the idea that it doesn't need to track free blocks. Changing this will require a rewrite of most of the code and will be a significant effort. Before shifting focus I want to make sure the existing work is in (#85).
One thing I just realized actually, the lookahead scan is only incurred when a block is allocated. With #85, if your files fit in inline files, you can avoid this cost. Again this is just a work-around if anything.
Excellent observation, I believe you're right about this. So idea! I've been thinking about this issue and think I have a good solution:
Anyways, just a thought. There are a lot of details not fleshed out, and we'll need a prototype before we can really see how it behaves. I'm not going to lie, it's probably going to take at minimum a month or two (or three? I'm not good at predicting time) before this can get in. I want to make sure this is the right design decision, because right now the design is in a relatively safe position. By not having a free-list, we have the opportunity to add one in a backwards compatible manner. But once we add one we're stuck with it unless we want to break disk compatibility : )
If I understood you correctly (I'm not 100% certain about a few things), you would like a list where each node is exactly "1 block". This has a potential problem: a format operation of an X GB storage would actually need to write X GB of data, so it would be pretty slow. Maybe it would be possible to store the list as nodes with variable size? So initially you would have a free list with 1 element which covers the whole storage. If you want to allocate a block (or a few) you just cut it from the beginning of this huge element. After deallocation you could (but don't actually have to) merge adjacent free blocks into one. This would be basically a 1:1 copy of how
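Something like this, just to illustrate what I mean (made-up names and on-disk layout, nothing littlefs-specific):

```c
#include <stdint.h>

// Sketch of an extent-style free list entry: each entry describes a contiguous
// run of free blocks, so a freshly formatted X GB card needs exactly one entry
// covering the whole device instead of one node per block.
typedef struct {
    uint32_t first_block;   // first free block in this run
    uint32_t count;         // number of contiguous free blocks
    uint32_t next_block;    // on-disk address of the next entry (0xffffffff = end)
} free_extent_t;

// Allocating n blocks just trims the front of the first extent:
//   allocated            = extent.first_block;
//   extent.first_block  += n;
//   extent.count        -= n;
// Freed blocks are appended as new extents and may later be merged with neighbours.
```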
I wouldn't go that way personally. This way you introduce some nondeterministic behaviour and need to have 2 code paths running simultaneously. With slow storage the size of fs where things get slow is not that significant anyway. With good design the stored free-list can be robust (it will be faster, no matter how small the storage or fs is). With only one possible allocator the code will be smaller & simpler, you'd have only 1 combination to test and a few tweakable parameters less. If the list can contain variable-sized entries, the initial size of the list is extremely small.

Additionally, the stored free-list would improve wear levelling too. Currently after each mount the writes start from the free block with the lowest address. On huge storage (or with an unwise application which unmounts after each write) this means that the blocks near the end get no writes, while the blocks at the beginning get almost all of them (excluding the case where files are only appended, never modified in place). With a FIFO list the writes would be levelled in all cases.

And this is something I don't fully understand too:
What do you mean by this? Each existing file would have an address of the head of the list?
Such a time-frame is perfectly fine, so let me thank you in advance for considering this change! I can obviously help with testing (my 32 GB card is a willing test subject [; ), and maybe I can help with some decisions or ideas too - if needed.
I'm sure this direction is correct and this will make littlefs much more scalable! Implementation details will surely need some work and thought. We can for sure help you out with thinking and reviewing the ideas. I've read the description of the littlefs design over the weekend, so I should not write stupid things anymore (or at least not so often [; ).

BTW - about the cost to seek to the beginning of a file. Wouldn't it be possible to just have a pointer to the head of the file in the metadata? I know you cannot go anywhere from the beginning anyway (because files are stored in a unidirectional "reversed" list), but if you want to store some useful metadata at the beginning then this could be beneficial. Or maybe just store this useful metadata in metadata blocks, not in the file's data blocks?
I would suggest considering the implementation of some fixed-size write-through cache on top of the block device. Perhaps different cache algorithms/sizes for different use cases.
First of all, this is not an "issue", just a question, as I'm not sure whether what I see is normal/expected or not.
I've ported littlefs to my C++ RTOS - https://github.com/DISTORTEC/distortos - and I'm trying it out with an SDHC 32 GB card on an STM32F429. My test more or less goes like this:
These 3 steps are repeated for write sizes from the following sequence: 512 B, 1 kB, 2 kB, ... 512 kB, 1 MB, 2 MB.
After some runs I noticed that the difference between the speed of raw sequential writes to the SD card (with no file system) and the speed of writes through the file system is quite huge, but only for "large" files. The difference grows as the file size grows - for example when writing just 1 kB the ratio is only ~2.7, for 128 kB it is already ~7.2, while for 2 MB it's ~14.6 (these are all average values).
Below is a chart which shows my measurements:
[chart: x axis - write size, y axis - ratio of average fs write speed to average raw write speed]
Here's another interesting observation - in my test I do 10 runs of raw writes, format the card with littlefs and do 10 runs of file system writes. While the speeds of the raw writes are mostly identical, the speeds of the file system writes vary:
Each write goes to the same file, which I truncate to zero on creation.
Here's the chart of these findings:
[chart: x axis - write number, y axis - ratio of this write's speed to the speed of the first write]
I guess it's also worth noting that I use 512 bytes for read, program and erase block size, and 512 for "lookahead" value.
To summarise:
Maybe this is all related to the value of lookahead? I have not tested that yet, as the test takes quite some time (a write of a 2 MB file takes 60-100 seconds). 512 blocks (each 512 B long) of lookahead would correspond roughly to a file size of around 256 kB, which is the size where the mentioned effects start to appear...
Thanks in advance for your insight!
--
Below I paste my test code for reference, in case it is useful.