Loki (in Docker) reports "no space left on device" but there's plenty of space/inodes #1502
Comments
Sorry, nothing obvious sticks out to me here. The 5 million chunks is certainly a lot... Loki is just using Go's You could maybe try asking in the Go Slack or on GitHub? |
Other things I can think of: chunk names are pretty long. What happens if you try to create a file in that directory with a really long name (instead of write_test)? It's hard for me to pin down details on this, but there is a size associated with file names, and I think this has a limit as well, so too many long file names might be causing this. I have no idea how inodes work in relation to the host volume and from within Docker. It seems like your dd test would indicate there are enough inodes, but it may be worth checking inside and outside the container. What filesystem are you using? |
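The long-name write test suggested above could be sketched roughly like this in Python. This is only an illustration, not anything Loki ships: `try_create` is a hypothetical helper, and the idea is simply to attempt creating a file with a given name and report the error code the kernel returns, so a failing chunk name can be compared against a similar one that works.

```python
import errno
import os


def try_create(directory, name):
    """Try to create an empty file called `name` in `directory`.

    Returns None on success (the probe file is removed again), or the
    symbolic errno name (for example "ENOSPC") if creation fails.
    """
    path = os.path.join(directory, name)
    try:
        # O_EXCL makes sure we never touch a file that already exists.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    os.close(fd)
    os.remove(path)
    return None
```

Calling it once with a short name like `write_test` and once with one of the exact chunk names from the error log would show whether the failure depends on the name itself rather than on free space.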
Looks like you can do an
On one of my Raspberry Pi test Lokis there are a couple hundred thousand chunks in the directory, and that corresponds to a directory size of 20M |
Thanks for taking a look. Filesystem is ext4 mounted as a Docker volume.
|
It looks like all of the "failed to flush user" errors refer to three files, none of which actually exist:
These are different chunk names than I originally reported, so these might change over time. I also restarted the container since the original report. I'll continue to monitor and report back if the invalid chunks change. Can also provide an strace of the error. |
Files with those filenames in the above errors can't be created manually, but filenames with very similar names in the same directory can be created.
|
Ok, found it. I was also unable to create that file inside the directory mounted to
I'll investigate if using a filesystem other than ext4 would allow for more files in the |
Disabled
That caused I/O errors and a read-only file system, and after rebooting dropped into an initramfs prompt with a corrupt volume. After running |
This is really great work @shane-axiom ! Sorry you had a scare there on getting your data. I found this blog, which you may have seen, that talks about the problem being a hash collision. This would make sense of why only some names fail to write but not others. The disappointing information seems to be there isn't much that you can do aside from just disabling the It wasn't clear to me if the It wasn't obvious to me in any docs if this would have any effect on the b-tree file hashing, or if it's just related to the max files which can be stored. I'm also curious if disabling There are plans in the works to overhaul how local filesystem storage works; we want to find a way to combine chunks into larger files to reduce the file count and help improve performance. I'm afraid this work is quite a few months from being done though. In the meantime, you could try the new flag #1406 which lets you cut bigger chunks (although this will use more memory). FYI this is not released yet; you would need to use a master-xxxx image or Or increase the Or reduce the number of labels or cardinality on labels to reduce the number of streams. |
@slim-bean Yep, looks like
|
This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions. |
Thank you stale bot. @slim-bean Disabling
We don't have any performance metrics running for loki other than loki-canary, but it seems fine after this change, and subjectively I haven't noticed any significant difference. Just upgraded to loki 1.3.0 and haven't tried Before this gets closed, do you think it's worth distilling this |
Thanks for the update @shane-axiom yes this should definitely make it into the docs somewhere I haven't looked to see where. If you have a chance to add something that would be awesome, else I will try to add something too, for now we'll just kick stale bot and see where we end up in another 30 days :) |
This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions. |
The chunks created by Loki were stored in a persistent volume. This does not scale well, since volumes cannot easily be resized in Kubernetes. Also, at least the ext4-filesystem had issues, when large numbers of logs were saved. These issues are due to the dir_index as discussed in [1]. An object store provides a more scalable and cheaper solution. Loki supports S3 as an object storage and also other object stores that understand the S3 API like Ceph or OpenStack Swift. [1] grafana/loki#1502 Change-Id: Id55095c3b6659f40708712c1a494753dbcab7686
Hey everybody: This is still happening, we just ran into this. Having it closed by the stale bot is a bit disappointing, can this be reopened? As a quick fix, we'll probably just migrate to a more battle-tested chunk store backend (S3), but having issues like this in the fs backend is still annoying, as that's what most people probably start with. |
Even closed, this issue is still relevant. I had several issues with |
Same here; still very much an issue with latest Loki; it's a very fresh installation with pulling logs from maybe 20-30 hosts and already hitting >3M chunks in a single directory. It's just not feasible with a local filesystem; is it not typical to hash files in subfolder structure? |
@weakcamel it seems that the current posture on this issue is: I think there is also a lack of documentation about this, let's wait and see if it's fixed. |
Thanks @theonlydoo! Interesting to see that the discussion thread points to Cortex which does not seem to support filesystem storage at all: https://cortexmetrics.io/docs/chunks-storage/ Seems that at this point it's better to look at alternative storage options. |
You can enable the |
|
You need to be running kernel 4.13 or newer. |
Thanks for the tip. I will try it out when I get a chance!
On Apr 19, 2021, at 2:59 PM, Kristian Klausen ***@***.***> wrote:
# tune2fs -O large_dir /dev/nvme0n1
tune2fs 1.42.9 (28-Dec-2013)
Setting filesystem feature ‘large_dir’ not supported.
You need to be running kernel 4.13 or newer<torvalds/linux@e08ac99>.
|
Enabling Running on |
Enabled |
We had the same problem in a Rook/Ceph cluster with RBD images (ceph-block storage).
What did work was to execute the same command from the node where the volume was mounted. |
Another ❤️ for the |
It worked for me too but it took some time to fix |
I have also been troubled by this issue. It can only query data for up to 1 hour, and it is particularly slow. Later on, the storage was switched to minio s3, and the performance is no longer an issue |
Setting |
As @wschoot points out EXERCISE CAUTION before you run Anything using < grub-2.12 will fail to boot and report an unknown filesystem error similar to:
And any grub rescue operations or ISO live / recovery using any grub2 utils, e.g. grub2-install, grub2-mkconfig will report the same error. AND .... IT GETS WORSE once |
Oof, that's harsh. Hopefully they'll get in that newer grub version! Just to elaborate on what this all means: I have a cache where sometimes I get like 200,000 files in a single directory and it's rather slow, other times it's like 1500 larger files instead and it's a fair bit faster, and I can place a limit on the number of files in the cache settings. So I decided to research what limits would likely result in differences in directory lookup time. This is all in the documentation technically (since the ext4 disk structure is documented) but I wanted to figure out reasonably concrete numbers. I found the classic linear directory has an overhead of 8 bytes per entry plus the file name itself. With a 4KB block, short names of 10-12 characters let you hold about 200 entries. For long names like this program is using, the names are around 72 characters, so 80 bytes per directory entry, and you'll end up with about 50 files per 4KB block. If the directory gets bigger than a single 4KB block, it's converted to using dir_index, which uses an htree. With dir_index, each htree index block holds about 500 entries, each pointing to a block holding conventional directory entries, so 500*50 would be about 25,000 entries. Now the "birthday paradox" suggests you could hit the limits of a single-layer htree at like 10,000-15,000 entries instead, as some block fills before the rest. If a "leaf" gets over 4KB again, a second layer is added to the tree. With a 2-level htree, 500*500*50 is 12,500,000 directory entries. A 3rd layer was not allowed, and the hashes didn't hash out quite evenly, so some directory block got its 50 entries and filled up a little ahead of schedule, and the directory failed with 5.5 million entries instead. The large_dir option allows a 3-level htree (125,000,000 potential leaves with about 50 files apiece), so 6.25 billion files (or really, since you might hit the limit a little below that due to uneven hashing, like 3 or 4 billion).
I'll note the hash value used is truncated to 32 bits (around 4 billion entries) anyway, so having a 4-level htree wouldn't really do much unless you were using extraordinarily long file names. Obviously I would seriously consider a redesign if you were going to hit billions of items in a single directory. (For my purposes, my file names are more like 10-12 characters, so I'd hit about 200 files before going from a classic directory to an htree, and a 100,000 limit on a 1-level htree, so probably keep the file limit to 50,000 or so to avoid significant amounts of 2-level tree action.) |
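The arithmetic in the comment above can be written down as a small Python sketch. The constants are the commenter's approximations (4KB blocks, 8 bytes of per-entry overhead, roughly 500 entries per htree index block), not values read from a real filesystem, and the function names are made up for illustration. The results are upper bounds; as noted above, uneven hashing means a directory can fail well before reaching them.

```python
BLOCK_SIZE = 4096       # bytes per directory block (assumed 4KB)
DIRENT_OVERHEAD = 8     # bytes of fixed overhead per directory entry
INDEX_FANOUT = 500      # approximate entries per htree index block


def entries_per_leaf(name_len):
    """How many directory entries with names of `name_len` bytes fit in one block."""
    return BLOCK_SIZE // (DIRENT_OVERHEAD + name_len)


def htree_capacity(name_len, levels):
    """Upper bound on entries in a directory whose htree has `levels` index levels."""
    return INDEX_FANOUT ** levels * entries_per_leaf(name_len)


if __name__ == "__main__":
    for levels in (1, 2, 3):
        print(levels, htree_capacity(72, levels))
```

With 72-character names the exact division gives 51 entries per block rather than the rounded 50, so the computed bounds come out slightly above the 25,000 / 12.5 million / 6.25 billion figures quoted in the comment.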
Yes. The issue is common among services like email servers or others storing large number of files.
and then store the files in there. This reduces the files per directory by orders of magnitude. See here how this is done for Dovecot: https://doc.dovecot.org/configuration_manual/mail_location/#directory-hashing |
^^ @slim-bean could you reopen this one maybe? I strongly believe Loki should "learn" to store its chunks / files in a way that does not require filesystems to scale to millions of files in a single directory. |
sorry, I haven't checked on this issue in quite some time.... I will say it wasn't ever really our intention for folks to take the filesystem this far, Loki was really meant to be used with object storage... But I know there are lots and lots of folks out there who are running with filesystems and I'd love for y'all to be happy and not having issues. I wonder what we should rename this issue to. |
Maybe something like "Support large numbers of loki chunks on filesystems using prefix directory hierarchy"? |
I believe the word "hashed" or "hashing" is used for this kind of directory hierarchy / structure:
If I may quote what our dear friend ChatGPT has to say:
|
The chunk filenames in |
Hello, just to provide input for the discussion. A proper solution for this would be appreciated. I intend to roll out Loki to customers soon. Using an object store is not feasible for any of those customers: due to the very sensitive nature of the data that these systems will be carrying, they are usually air gapped. Just to give you an idea, nobody will ever analyze the logs on prem. I intend to just have the customer put the entire Docker mount of Loki onto (ideally) a tape drive, for example, and send it to me for analysis. I will then have my (once again air gapped) computer where I will have my Grafana dashboards to analyze everything. While I don't doubt that tar can handle an infinite number of files in a directory, I can already see a customer just trying to copy the directory onto an external storage disk with, god forbid, NTFS or any other random file system before sending it to me. So yeah, a solution where Loki does not create all chunks in a single directory would indeed be appreciated. Any of the methodologies described above would solve those problems. Literally just creating 4 levels of directories, for example taking 2 letters of a hash of the chunk file name, would probably do the job. The main reason why I personally think that Loki is awesome is precisely because of the file system storage and the fact that it's a single binary that just works. (Plus the push REST API is easy to use, but that's not relevant here.) I don't think you should laser focus on just being yet another cloud logging tool. It can so easily be perfect for large on prem solutions too. I would not have minded personally using Apache Cassandra instead of the fs, but for some reason you guys deprecated that. All the other storage options are non-options for air gapped systems, especially if you have somewhat (justifiably) paranoid customers. I hope you guys manage to find an acceptable solution to this problem. Sincerely |
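The "4 levels of directories, 2 letters of a hash each" idea from the comment above might look something like the following Python sketch. This is purely illustrative and not how Loki stores chunks: `hashed_path` is a hypothetical helper, SHA-256 is an arbitrary choice of hash, and `/tmp/loki` is only used because it is the data directory mentioned in the issue.

```python
import hashlib
import os


def hashed_path(root, chunk_name, levels=4, width=2):
    """Map a chunk file name to a nested path derived from a hash of the name.

    Each directory level uses `width` hex characters of the digest, so the
    same name always lands in the same place and names spread evenly.
    """
    digest = hashlib.sha256(chunk_name.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(root, *parts, chunk_name)


if __name__ == "__main__":
    # "example-chunk-name" is a placeholder, not a real Loki chunk name.
    print(hashed_path("/tmp/loki", "example-chunk-name"))
```

One design note: 4 levels of 2 hex characters gives 256^4 (about 4.3 billion) possible leaf directories, which is far more than a few million chunks need, so 2 levels (65,536 directories) would probably already keep each directory small.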
Fair point.
|
Is it known whether this issue applies when using the |
Hi, I looked at implementing hashed storage and it seems the core functionality is pretty trivial to write thanks to the existing architecture. Most of the work would be writing tests and the contribution flow. With the intention of reducing that second part, could someone from Loki chime in on these points:
My personal use case is setting up Loki with data storage on a NAS. The alternative I considered is setting up a S3 compatible server, but they all come with a lot of complexity and overhead for data distribution and integrity that is useless for me since my filesystem already provides those. |
I think this should just be a config option in the file system part of the yaml? I personally think mixing dir and non-dir chunks is not a good idea. Either bail out or migrate the data (as in move the files into directories or out of them) if the setting changes. Migration would have the charm of not requiring any modification of existing configs if the dir store mode was made the default... I could however also very well live with a separate v2 filesystem storage. I'd go with whatever is easiest to maintain long term. For my intended use case the overhead of S3 implementations is not needed. I intend to only store the last 30 days, and even if all logs in Loki went poof because a drive failed, it would not be an issue. As you mentioned, one can easily solve this by using an appropriate file system (if needed). I don't have any such integrity requirements, so good old ext4 would do the job for me. |
+1 for this feature, please! |
I'll just note, since I've used a filesystem that throws things into many blocks and then stores them locally, here are some performance notes from my experience. The s3ql filesystem (a compressing and deduplicating file system) supports both remote (S3 etc.) and local storage. For local storage it does this pretty simply: there's an s3ql data directory whose top level stores a copy of the database plus a few recent snapshots of it (just in case), so you can unmount it from one location and mount it somewhere else if you wish, as well as the first 1000 (well, 900) data blocks. They started numbering their blocks from 100 (0-99 are reserved; I think block 0 is a block with entirely binary 0s and 1 might be a block of all 0xFF, so for some special cases it doesn't have to compress/decompress or download anything if it is using cloud storage). The first blocks are stored at the top level; when it gets to 999, for block 1000 it makes a directory "100" and puts file 1000 in there. If you get past 999999 blocks (directory 999 with file 999999 in it), then 1000000 goes in directory 100, then directory 000, with file 1000000 in there. This limits a directory to under 2000 entries (1000 files and 1000 subdirectories). If you're using hex hashes, using 2 hex digits would get you 256 files and 256 subdirectories per level, using 3 hex digits would get 4096+4096, or even 4 digits (65536+65536, so 131072 entries per directory), depending on whether it proves faster to have many files per directory and a much shallower tree, compared to having a tree that could get quite deep when you've got many blocks in there (like over 5-6 million, if you're currently needing large_dir to begin with). I'll note (along with 2 smaller file systems on other disks) I currently have a 16TB ext4 filesystem with an s3ql store on it, with 14.3TB of data using only 8.7TB of disk space.
This has a bit over 6 million data blocks in it, and performance is great; I can run VirtualBox VMs and such out of it fine, let alone using it for more normal storage and retrieval. So at least at that scale and a maximum of 1000 blocks per directory (some have very few if something was written then deleted, since new block numbers increase monotonically), the directory tree doesn't get deep enough to bog things down, at least with ext4. It does mean that when blocks are deleted (like when you create and delete many files, or delete large files, which I have definitely done on there) you can end up with all these empty subdirectories. I got a very large number on mine and submitted a patch (which is in s3ql now): when you run the fsck it looks for any extraneous files in the storage (for cloud-type storage it uses S3 or whatever queries to get the list of blocks; locally it walks the directory tree), so it now removes empty directories while it's walking the tree anyway. (You get extraneous blocks in pretty much the same conditions you would for ext4: you're in the middle of deleting files or writing new ones and the power goes out, you manage to crash your system, I was using it on a USB disk and the cable popped out, etc.) I DID start having my filesystems bog down before I did this. The first time I removed empty directories they'd been building up for quite a while, and I had something ridiculous like 50,000 or maybe even 80,000 empties. I was running some compiles out of there, so I'm sure tons of temp files and junk were created and deleted. |
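The empty-directory cleanup described above is something any hashed directory layout would probably need once old chunks are deleted. Here is a minimal Python sketch of the idea, assuming a plain local directory tree; it is not the s3ql patch, and `prune_empty_dirs` is a hypothetical name.

```python
import os


def prune_empty_dirs(root):
    """Remove empty subdirectories below `root`, deepest first.

    Returns the number of directories removed. `root` itself is kept.
    """
    removed = 0
    # topdown=False visits children before their parents, so a parent that
    # only contained empty directories is itself empty by the time we reach it.
    for dirpath, _dirnames, _filenames in os.walk(root, topdown=False):
        if dirpath == root:
            continue
        # Re-list instead of trusting os.walk's cached names, since child
        # directories may already have been removed in this pass.
        if not os.listdir(dirpath):
            os.rmdir(dirpath)
            removed += 1
    return removed
```

Running something like this periodically, or during whatever pass already walks the tree for retention, would keep the empties from piling up the way the comment describes.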
We encountered this problem, too. Mounted a second ext4 filesystem (because of #1502 (comment)) just for loki data dir with |
If you are using the The new TSDB store will create two more levels of directories. E.g.
this will at least delay the problem. Maybe fix the problem, depending on your chunk count in your setup. If you want to switch be sure to read and understand this: |
Describe the bug
When running the Loki 1.2.0 Docker image, Loki is reporting that it can't write chunks to disk because there is "no space left on device", although there appears to be plenty of space.
Plenty of space and inodes available on disk where the /tmp/loki volume lives:
/tmp/loki named volume mount from docker inspect:
Execing into the loki container and doing manual write tests to verify that files can be written
I haven't been able to find any disk limitation in the Docker container, and the fact that I can still manually write files to the volume inside the container makes me suspect the bug is in the loki code, but I could definitely be wrong!
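For reference, the space and inode check described above can also be done programmatically. The following Python sketch uses `os.statvfs` (Unix only); `fs_headroom` is a hypothetical helper name. Both numbers can look perfectly healthy while creating a file in one particular directory still fails, which is the situation reported here.

```python
import os


def fs_headroom(path):
    """Report free space and free inodes for the filesystem containing `path`."""
    st = os.statvfs(path)
    return {
        "free_bytes": st.f_bavail * st.f_frsize,  # space available to unprivileged users
        "free_inodes": st.f_favail,               # inodes available to unprivileged users
        "total_inodes": st.f_files,
    }


if __name__ == "__main__":
    # /tmp/loki is the data directory from this report; adjust as needed.
    print(fs_headroom("/tmp/loki" if os.path.isdir("/tmp/loki") else "."))
```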
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Loki continues to successfully write chunks to /tmp/loki while disk space and inodes are available.
Environment:
/etc/loki/local-config.yaml)