Loki (in Docker) reports "no space left on device" but there's plenty of space/inodes #1502

Open
srstsavage opened this issue Jan 11, 2020 · 58 comments
Labels
keepalive An issue or PR that will be kept alive and never marked as stale.

Comments

@srstsavage
Contributor

Describe the bug

When running the Loki 1.2.0 Docker image, Loki is reporting that it can't write chunks to disk because there is "no space left on device", although there appears to be plenty of space.

level=error ts=2020-01-11T19:13:11.822567024Z caller=flush.go:178 org_id=fake msg="failed to flush user" err="open /tmp/loki/chunks/ZmFrZS84NDBiODY0MTMwOWFkOTZlOjE2Zjk1ZWNjNmU1OjE2Zjk1ZWNkM2JjOmRkMWUwMjUx: no space left on device"
level=error ts=2020-01-11T19:13:11.851323284Z caller=flush.go:178 org_id=fake msg="failed to flush user" err="open /tmp/loki/chunks/ZmFrZS82ZDNlZmFhODk1OWZiYjQxOjE2Zjk1ZTgzOTI4OjE2Zjk1ZmMyNzRiOjg3MTQ1OTkw: no space left on device"

Plenty of space and inodes available on disk where /tmp/loki volume lives:

$ df -h /dev/sda1
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       915G  223G  646G  26% /
$ df -i /dev/sda1
Filesystem       Inodes   IUsed    IFree IUse% Mounted on
/dev/sda1      60981248 5473071 55508177    9% /

/tmp/loki named volume mount from docker inspect

        "Mounts": [
            {
                "Type": "volume",
                "Name": "loki",
                "Source": "/var/lib/docker/volumes/loki/_data",
                "Destination": "/tmp/loki",
                "Driver": "local",
                "Mode": "rw",
                "RW": true,
                "Propagation": ""
            }
$ docker volume inspect loki
[
    {
        "CreatedAt": "2020-01-11T10:37:39-08:00",
        "Driver": "local",
        "Labels": null,
        "Mountpoint": "/var/lib/docker/volumes/loki/_data",
        "Name": "loki",
        "Options": null,
        "Scope": "local"
    }
]

Execing into the loki container and doing manual write tests to verify that files can be written

$ docker exec -it loki sh
/ # cd /tmp/loki
/tmp/loki # ls -l
total 596644
drwxr-xr-x    2 root     root     610926592 Jan 11 19:24 chunks
drwxr-xr-x    2 root     root          4096 Jan  9 00:01 index
/tmp/loki # cd chunks/
/tmp/loki/chunks # ls -l | wc -l
5286025
/tmp/loki/chunks # dd if=/dev/zero of=write_test count=1024 bs=1048576
1024+0 records in
1024+0 records out
/tmp/loki/chunks # ls -l write_test
-rw-r--r--    1 root     root     1073741824 Jan 11 19:27 write_test
/tmp/loki/chunks # rm write_test
/tmp/loki/chunks # dd if=/dev/urandom of=write_test count=1024 bs=1048576
1024+0 records in
1024+0 records out
/tmp/loki/chunks # ls -l write_test
-rw-r--r--    1 root     root     1073741824 Jan 11 19:28 write_test
/tmp/loki/chunks # rm write_test

I haven't been able to find any disk limitation in the Docker container, and the fact that I can still manually write files to the volume inside the container makes me suspect the bug is in the loki code, but I could definitely be wrong!

To Reproduce
Steps to reproduce the behavior:

  1. Run Loki (1.2.0, commit ccef3da) Docker image with Docker 18.09.6
  2. ???

Expected behavior
Loki continues to successfully write chunks to /tmp/loki while disk space and inodes are available.

Environment:

  • Infrastructure: Docker 18.09.6, Debian 9.9, kernel 4.9.0-9-amd64
  • Deployment tool: Ansible (using default Loki config file in Docker image at /etc/loki/local-config.yaml)
@slim-bean
Collaborator

slim-bean commented Jan 14, 2020

sorry, nothing obvious sticks out to me here. The 5 million chunks is certainly a lot...

Loki is just using Go's ioutil.WriteFile so maybe this uses different syscalls than dd does which is why one works and the other doesn't?

https://github.com/cortexproject/cortex/blob/ff6fc0a47f6716fdd23188faa729f42c04d26565/pkg/chunk/local/fs_object_client.go#L58

You could maybe try asking in the Go slack or github?

Edited typos

@slim-bean
Collaborator

Other things I can think of:

Chunk names are pretty long. What happens if you try to create a file in that directory with a really long name (instead of write_test)? It's hard for me to pin down details on this, but there is a size associated with file names, and I think this has a limit as well, so too many long file names might be causing this.

I have no idea how inodes work in relation to the host volume and from within Docker. It seems like your dd test would indicate there are enough inodes, but it may be worth checking both inside and outside the container?

What filesystem are you using?

@slim-bean
Collaborator

Looks like you can do an ls on the parent directory to see the size of it (which includes the size of the file names)

~/loki $ ls -alh
total 20M
drwxr-xr-x  4 pi   pi   4.0K Sep  9 19:59 .
drwxr-xr-x 11 pi   pi   4.0K Jan  6 22:35 ..
drwxr-xr-x  2 root root  20M Jan 13 20:22 chunks
drwxr-xr-x  2 root root 4.0K Jan  8 19:00 index

One of my Raspberry Pi test Lokis has a couple hundred thousand chunks in the directory, and that corresponds to a directory size of 20M.

@srstsavage
Contributor Author

Thanks for taking a look. Filesystem is ext4 mounted as a Docker volume.

/tmp/loki/chunks is definitely large: 606M directory, 5.5 million chunk files, 227 million file system blocks.

/tmp/loki # ls -lh
total 620440
drwxr-xr-x    2 root     root      605.9M Jan 14 06:25 chunks
drwxr-xr-x    2 root     root        4.0K Jan  9 00:01 index
/tmp/loki # find chunks/ | wc -l
5497294
/tmp/loki # ls -l chunks/ | head -n 1
total 226881340

@srstsavage
Contributor Author

It looks like all of the "failed to flush user" errors refer to three files, none of which actually exist:

sudo docker logs loki 2>&1 | grep "failed to flush user" | grep -o "open [^:]*" | awk '{print $2}' | sort | uniq -c
     48 /tmp/loki/chunks/ZmFrZS83MzU5OTJkNzAzOWM0MDU5OjE2ZmEyYmJmZTIxOjE2ZmEyYmRkMzYzOmNjMjZlNTA1
     64 /tmp/loki/chunks/ZmFrZS83NWQ5NjliMjUzNDhlYjM1OjE2ZmEyYjVjMzIwOjE2ZmEyYjVkNjU0OjM5YzY3OWM=
     58 /tmp/loki/chunks/ZmFrZS9jOWNjNDUxMWZmZjUxODg2OjE2ZmEyYjkxZWRiOjE2ZmEyYjkxZWRjOjYzZTBhZGM1

These are different chunk names than I originally reported, so these might change over time. I also restarted the container since the original report.

I'll continue to monitor and report back if the invalid chunks change. Can also provide an strace of the error.

@srstsavage
Contributor Author

srstsavage commented Jan 14, 2020

The files named in the above errors can't be created manually, but files with very similar names can be created in the same directory.

/tmp/loki # touch /tmp/loki/chunks/ZmFrZS83MzU5OTJkNzAzOWM0MDU5OjE2ZmEyYmJmZTIxOjE2ZmEyYmRkMzYzOmNjMjZlNTA1
touch: /tmp/loki/chunks/ZmFrZS83MzU5OTJkNzAzOWM0MDU5OjE2ZmEyYmJmZTIxOjE2ZmEyYmRkMzYzOmNjMjZlNTA1: No space left on device
/tmp/loki # touch /tmp/loki/chunks/ZmFrZS83MzU5OTJkNzAzOWM0MDU5OjE2ZmEyYmJmZTIxOjE2ZmEyYmRkMzYzOmNjMjZlNTA1-2
/tmp/loki # touch /tmp/loki/chunks/ZmFrZS83MzU5OTJkNzAzOWM0MDU5OjE2ZmEyYmJmZTIxOjE2ZmEyYmRkMzYzOmNjMjZlNTA2
/tmp/loki # ls -l /tmp/loki/chunks/ZmFrZS83MzU5OTJkNzAzOWM0MDU5OjE2ZmEyYmJmZTIxOjE2ZmEyYmRkMzYzOmNjMjZlNTA*
-rw-r--r--    1 root     root             0 Jan 14 07:05 /tmp/loki/chunks/ZmFrZS83MzU5OTJkNzAzOWM0MDU5OjE2ZmEyYmJmZTIxOjE2ZmEyYmRkMzYzOmNjMjZlNTA1-2
-rw-r--r--    1 root     root             0 Jan 14 07:05 /tmp/loki/chunks/ZmFrZS83MzU5OTJkNzAzOWM0MDU5OjE2ZmEyYmJmZTIxOjE2ZmEyYmRkMzYzOmNjMjZlNTA2
/tmp/loki # strace touch /tmp/loki/chunks/ZmFrZS83MzU5OTJkNzAzOWM0MDU5OjE2ZmEyYmJmZTIxOjE2ZmEyYmRkMzYzOmNjMjZlNTA1
execve("/bin/touch", ["touch", "/tmp/loki/chunks/ZmFrZS83MzU5OTJ"...], 0x7ffeb5249058 /* 7 vars */) = 0
arch_prctl(ARCH_SET_FS, 0x7f51a4152b68) = 0
set_tid_address(0x7f51a4152ba8)         = 139
mprotect(0x7f51a414f000, 4096, PROT_READ) = 0
mprotect(0x556cd17d7000, 16384, PROT_READ) = 0
getuid()                                = 0
utimensat(AT_FDCWD, "/tmp/loki/chunks/ZmFrZS83MzU5OTJkNzAzOWM0MDU5OjE2ZmEyYmJmZTIxOjE2ZmEyYmRkMzYzOmNjMjZlNTA1", NULL, 0) = -1 ENOENT (No such file or directory)
open("/tmp/loki/chunks/ZmFrZS83MzU5OTJkNzAzOWM0MDU5OjE2ZmEyYmJmZTIxOjE2ZmEyYmRkMzYzOmNjMjZlNTA1", O_RDWR|O_CREAT, 0666) = -1 ENOSPC (No space left on device)
write(2, "touch: /tmp/loki/chunks/ZmFrZS83"..., 122touch: /tmp/loki/chunks/ZmFrZS83MzU5OTJkNzAzOWM0MDU5OjE2ZmEyYmJmZTIxOjE2ZmEyYmRkMzYzOmNjMjZlNTA1: No space left on device
) = 122
exit_group(1)                           = ?
+++ exited with 1 +++

@srstsavage
Contributor Author

Ok, found it. I was also unable to create that file inside the directory mounted to /tmp/loki from the host (outside of the container). Finally looked in dmesg and found lots of:

[624434.242593] EXT4-fs warning (device sda1): ext4_dx_add_entry:2236: inode #58458946: comm loki: Directory index full!

I'll investigate if using a filesystem other than ext4 would allow for more files in the chunks directory, but this might show a need for loki to organize chunk files into subdirectories.
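
A minimal sketch of scanning kernel log text for this warning, so the condition can be spotted before Loki starts failing flushes. The regular expression is modelled only on the single dmesg line quoted above; other kernel versions may word the message differently, and the function name is hypothetical.

```python
import re

# Pattern modelled on the one dmesg line quoted in this thread (assumption:
# other kernels may format the warning differently).
DIR_INDEX_FULL = re.compile(
    r"EXT4-fs warning \(device (?P<device>[^)]+)\): "
    r"ext4_dx_add_entry:\d+: inode #(?P<inode>\d+): "
    r"comm (?P<comm>\S+): Directory index full!"
)


def find_dir_index_full(log_text):
    """Return (device, inode, comm) for every matching warning in log_text."""
    return [
        (m["device"], int(m["inode"]), m["comm"])
        for m in DIR_INDEX_FULL.finditer(log_text)
    ]


sample = (
    "[624434.242593] EXT4-fs warning (device sda1): ext4_dx_add_entry:2236: "
    "inode #58458946: comm loki: Directory index full!"
)
print(find_dir_index_full(sample))
```

The inode number in the warning is that of the directory whose index is full, so it can be matched against the chunks directory with ls -id.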

@srstsavage
Contributor Author

Disabled dir_index on the ext4 volume using

sudo tune2fs -O "^dir_index" /dev/sda1

That caused I/O errors and a read-only file system, and after rebooting dropped into an initramfs prompt with a corrupt volume. After running fsck.ext4 -y on the volume, the system booted successfully and files which couldn't be created before seem to be able to be created now. I'll let it run and see if there are any more errors.

@slim-bean
Collaborator

This is really great work @shane-axiom ! Sorry you had a scare there on getting your data.

I found this blog, which you may have seen, that talks about the problem being a hash collision. That would explain why only some names fail to write but not others.

The disappointing information seems to be there isn't much that you can do aside from just disabling the dir_index as you did. Even using a different hash algorithm or maybe a longer hash ends up getting truncated.

It wasn't clear to me if the 64bit feature might change this, out of curiosity if you run tune2fs -l do you see 64bit in the output?

It wasn't obvious to me from any docs if this would have any effect on the b-tree file hashing, or if it's just related to the max files which can be stored.

I'm also curious if disabling dir_index will have a performance impact on Loki.

There are plans in the works to overhaul how local filesystem storage works, we want to find a way to combine chunks into larger files to reduce the file count and help improve performance. I'm afraid this work is quite a few months from being done though.

In the meantime, you could try the new flag from #1406, which lets you cut bigger chunks (although this will use more memory). FYI, this is not released yet; you would need to use a master-xxxx image or the latest or master tags.

Or increase the chunk_idle_period if you have some slowly writing log streams (again, this uses more memory but cuts fewer chunks).

Or reduce the number of labels or cardinality on labels to reduce number of streams.

@srstsavage
Contributor Author

@slim-bean Yep, looks like 64bit is on. Here's the whole tune2fs output:

$ sudo tune2fs -l /dev/sda1
tune2fs 1.43.4 (31-Jan-2017) 
Filesystem volume name:   <none>
Last mounted on:          /                       
Filesystem UUID:          acfb753d-0109-4646-88d4-90ae17ff5978
Filesystem magic number:  0xEF53                  
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean        
Errors behavior:          Continue      
Filesystem OS type:       Linux
Inode count:              60981248
Block count:              243924480
Reserved block count:     12196224
Free blocks:              180478308
Free inodes:              55291940
First block:              0                                   
Block size:               4096        
Fragment size:            4096  
Group descriptor size:    64        
Reserved GDT blocks:      1024  
Blocks per group:         32768     
Fragments per group:      32768                 
Inodes per group:         8192           
Inode blocks per group:   512
Flex block group size:    16    
Filesystem created:       Fri Jan  4 15:10:28 2019
Last mount time:          Mon Jan 13 23:43:57 2020            
Last write time:          Mon Jan 13 23:43:57 2020
Mount count:              1          
Maximum mount count:      -1                                                                                                                                   
Last checked:             Mon Jan 13 23:43:20 2020
Check interval:           0 (<none>)           
Lifetime writes:          980 GB        
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11   
Inode size:               256     
Required extra isize:     32       
Desired extra isize:      32      
Journal inode:            8        
Default directory hash:   half_md4
Directory Hash Seed:      8941b98b-4c9e-4b8d-a84b-ba72b466191d
Journal backup:           inode blocks
Checksum type:            crc32c
Checksum:                 0x4bdf5115
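
A small sketch for checking which of the features discussed here are present in tune2fs -l output. It only parses text; the sample is trimmed from the output above, and the function name is hypothetical. Note that dir_index does not appear in the pasted feature list, which is consistent with it having been disabled earlier in the thread.

```python
def ext4_features(tune2fs_output):
    """Extract the feature names from the text printed by `tune2fs -l`."""
    for line in tune2fs_output.splitlines():
        if line.startswith("Filesystem features:"):
            return set(line.split(":", 1)[1].split())
    return set()


# Trimmed copy of the output pasted above.
sample = """\
Filesystem volume name:   <none>
Filesystem features:      has_journal ext_attr resize_inode filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Default directory hash:   half_md4
"""

features = ext4_features(sample)
for name in ("dir_index", "large_dir", "64bit"):
    print(name, "enabled" if name in features else "not listed")
```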

@stale

stale bot commented Feb 13, 2020

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale A stale issue or PR that will automatically be closed. label Feb 13, 2020
@srstsavage
Contributor Author

Thank you stale bot.

@slim-bean Disabling dir_index while using loki 1.2.0 seemed to solve this issue for us.

> I'm also curious if disabling dir_index will have a performance impact on Loki.

We don't have any performance metrics running for loki other than loki-canary, but it seems fine after this change, and subjectively I haven't noticed any significant difference.

Just upgraded to loki 1.3.0 and haven't tried target-chunk-size or chunk_idle_period yet, so I can't comment there.

Before this gets closed, do you think it's worth distilling this dir_index config tweak into documentation somewhere?

@stale stale bot removed the stale A stale issue or PR that will automatically be closed. label Feb 13, 2020
@slim-bean
Collaborator

Thanks for the update @shane-axiom. Yes, this should definitely make it into the docs somewhere; I haven't looked to see where.

If you have a chance to add something that would be awesome, else I will try to add something too, for now we'll just kick stale bot and see where we end up in another 30 days :)

@stale

stale bot commented Mar 15, 2020

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale A stale issue or PR that will automatically be closed. label Mar 15, 2020
@stale stale bot closed this as completed Mar 22, 2020
lucamilanesio pushed a commit to GerritCodeReview/gerrit-monitoring that referenced this issue Aug 13, 2020
The chunks created by Loki were stored in a persistent volume. This
does not scale well, since volumes cannot easily be resized in
Kubernetes. Also, at least the ext4-filesystem had issues, when large
numbers of logs were saved. These issues are due to the dir_index as
discussed in [1].

An object store provides a more scalable and cheaper solution. Loki
supports S3 as an object storage and also other object stores that
understand the S3 API like Ceph or OpenStack Swift.

[1] grafana/loki#1502

Change-Id: Id55095c3b6659f40708712c1a494753dbcab7686
@jcgruenhage

Hey everybody: this is still happening, we just ran into it. Having it closed by the stale bot is a bit disappointing. Can this be reopened?

As a quick fix, we'll probably just migrate to a more battle-tested chunk store backend (S3), but having issues like this in the fs backend is still annoying, as that's what most people probably start with.

@Doooooo0o

> Before this gets closed, do you think it's worth distilling this dir_index config tweak into documentation somewhere?

Even closed, this issue is still relevant. I had several issues with dir_index, using minio and local storage.

@weakcamel

Same here; still very much an issue with the latest Loki. It's a very fresh installation pulling logs from maybe 20-30 hosts and already hitting >3M chunks in a single directory.

It's just not feasible with a local filesystem; is it not typical to hash files in subfolder structure?

@Doooooo0o

@weakcamel it seems that the current posture on this issue is:
#3324 (comment)

I think there is also a lack of documentation about this. Let's wait and see if it's fixed.

@weakcamel

Thanks @theonlydoo!

Interesting to see that the discussion thread points to Cortex which does not seem to support filesystem storage at all: https://cortexmetrics.io/docs/chunks-storage/

Seems that at this point it's better to look at alternative storage options.

@klausenbusk
Contributor

klausenbusk commented Mar 31, 2021

> The disappointing information seems to be there isn't much that you can do aside from just disabling the dir_index as you did. Even using a different hash algorithm or maybe a longer hash ends up getting truncated.

You can enable the large_dir feature, which is better than disabling dir_index (which speeds up name lookups).

@duhang

duhang commented Apr 19, 2021

> The disappointing information seems to be there isn't much that you can do aside from just disabling the dir_index as you did. Even using a different hash algorithm or maybe a longer hash ends up getting truncated.

> You can enable the large_dir feature, which is better than disabling dir_index (which speeds up name lookups).

# tune2fs -O large_dir /dev/nvme0n1
tune2fs 1.42.9 (28-Dec-2013)
Setting filesystem feature ‘large_dir’ not supported.

@klausenbusk
Contributor

> # tune2fs -O large_dir /dev/nvme0n1
> tune2fs 1.42.9 (28-Dec-2013)
> Setting filesystem feature ‘large_dir’ not supported.

You need to be running kernel 4.13 or newer.
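
A rough sketch of checking a kernel release string against the 4.13 minimum mentioned in this reply. The threshold is taken from the comment above rather than verified independently, the helper names are hypothetical, and the tune2fs version may matter as well; this only covers the kernel side.

```python
import platform
import re

MIN_LARGE_DIR_KERNEL = (4, 13)  # threshold quoted in the reply above


def kernel_version(release):
    """Parse the leading major.minor from a string like '4.9.0-9-amd64'."""
    m = re.match(r"(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(m.group(1)), int(m.group(2))


def kernel_meets_large_dir_minimum(release):
    """True if the release is at least the quoted minimum for large_dir."""
    return kernel_version(release) >= MIN_LARGE_DIR_KERNEL


# Kernel from the original report (Debian 9.9).
print(kernel_meets_large_dir_minimum("4.9.0-9-amd64"))

# Only inspect the local machine when it is actually running Linux.
if platform.system() == "Linux":
    print(kernel_meets_large_dir_minimum(platform.release()))
```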

@duhang

duhang commented Apr 19, 2021 via email

@edernucci

edernucci commented Jun 14, 2022

Enabling large_dir as you guys suggested solved the problem immediately.

Running on Azure Kubernetes Service on Ubuntu.

@tazhate

tazhate commented Aug 18, 2022

Enabling large_dir on an mdadm RAID 0 array on bare-metal NVMe solved it for me too.

@wjentner

We had the same problem in a Rook/Ceph cluster with RBD images (ceph-block storage).
Running tune2fs -O large_dir /dev/rbdX did not work from inside a container; it returned:

tune2fs: No such file or directory while trying to open /dev/rbd0
Couldn't find valid filesystem superblock.

What did work was to execute the same command from the node where the volume was mounted.

@fopina

fopina commented Jul 2, 2023

Another ❤️ for the tune2fs -O large_dir /dev/... and another +1 for the restructure of chunks in the fs 🙏

@hofarah

hofarah commented Sep 21, 2023

large_dir

It worked for me too but it took some time to fix

@glyslxq

glyslxq commented Sep 22, 2023

I have also been troubled by this issue. It could only query data for up to 1 hour, and it was particularly slow. Later on, the storage was switched to MinIO (S3), and performance is no longer an issue.

@wschoot

wschoot commented Dec 7, 2023

Setting large_dir on your ext4 filesystem also seems to render it unable to boot whenever the same drive is also your boot device.

@earthgecko

As @wschoot points out, EXERCISE CAUTION before you run tune2fs -O large_dir /dev/...: if the device uses grub to boot, it will fail to boot on the next reboot. This is true for any system using < grub-2.12, which was only released on Wed, 20 Dec 2023 and will take a while to roll out via the distros. Support for large_dir on ext4 was added to grub2 on Fri, 03 Dec 2021, but since the last release before grub-2.12 was grub-2.06 (released on Tue, 8 Jun 2021), support for large_dir on ext4 in released grub versions has lagged.

Anything using < grub-2.12 will fail to boot and report an unknown filesystem error similar to:

error: ../../grub-core/kern/fs.c:120:unknown filesystem

And any grub rescue operations or ISO live / recovery using any grub2 utils, e.g. grub2-install, grub2-mkconfig will report the same error.

AND... IT GETS WORSE: once large_dir has been set on a device, it cannot be unset. Unlike many other features that can be added and removed with tune2fs, large_dir is a one-time operation. This means that the volume can never be booted from again, even though it is perfectly fine. The partition can be mounted and all the data is there; it just cannot be booted from, due to this limitation with most running versions of grub.

@hwertz

hwertz commented Feb 21, 2024

Oof that's harsh. Hopefully they'll get in that newer grub version!

Just to elaborate on what this all means (I have a cache where sometimes I get like 200,000 files in a single directory and it's rather slow, other times it's like 1500 larger files instead and it's a fair bit faster, and I can place a limit on number of files in the cache settings. So I decided to research what limits would likely result in differences in directory lookup time.)

This is all in the documentation technically (since the ext4 disk structure is documented) but I wanted to figure out reasonably concrete numbers.

I found the classic linear directory has an overhead of 8 bytes per entry plus the file name itself. With a 4KB block, for short names of around 10-12 characters you can hold a couple hundred entries. For long names like this program is using, the names are around 72 characters, so 80 bytes per directory entry, and you'll end up with about 50 files per 4KB block.

If the directory gets bigger than a single 4KB block, it's converted to using dir_index, which uses an htree. With dir_index, each htree index block holds about 500 entries, each pointing to a block holding conventional directory entries, so 500*50 would be about 25,000 entries. Now, the "birthday paradox" suggests you could hit the limits of a single-layer htree at more like 10,000-15,000 entries instead, as some block fills before the rest.

If a "leaf" gets over 4KB again, a second layer is added to the tree. With a 2-level htree, 500*500*50 is 12,500,000 directory entries. A 3rd layer was not allowed, and the hashes didn't hash out quite evenly, so some directory block got its 50 entries and filled up a little ahead of schedule, and the directory failed with 5.5 million entries instead. The large_dir option allows a 3-level htree (125,000,000 potential leaves with about 50 files apiece), or 6.25 billion files (or really, since you might hit the limit a little below that due to uneven hashing, more like 3 or 4 billion). I'll note the hash value used is truncated to 32 bits (around 4 billion values) anyway, so having a 4-level htree wouldn't really do much unless you were using extraordinarily long file names. Obviously I would seriously consider a redesign if you were going to hit billions of items in a single directory.

(For my purposes, my file names are more like 10-12 characters, so I'd hit about 200 files before going from a classic directory to an htree, with a limit of about 100,000 on a 1-level htree, so I'll probably keep the file limit to 50,000 or so to avoid significant amounts of 2-level tree action.)
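
The arithmetic in this comment can be reproduced with a short sketch. The constants (4KB blocks, 8 bytes of per-entry overhead, roughly 500 entries per index block) are the approximations quoted above; the calculation ignores entry padding and block headers, so treat the results as rough upper bounds rather than exact ext4 limits.

```python
BLOCK_SIZE = 4096      # bytes per directory block, as in the tune2fs output
ENTRY_OVERHEAD = 8     # approximate per-entry header size quoted above
INDEX_FANOUT = 500     # approximate entries per htree index block


def entries_per_leaf(name_len):
    """Directory entries that fit in one leaf block for a given name length."""
    return BLOCK_SIZE // (ENTRY_OVERHEAD + name_len)


def htree_capacity(levels, per_leaf):
    """Rough upper bound on entries for an htree with `levels` index levels."""
    return INDEX_FANOUT ** levels * per_leaf


per_leaf = entries_per_leaf(72)  # 72-character chunk names
for levels in (1, 2, 3):
    print(levels, "level(s):", htree_capacity(levels, per_leaf))
```

With the rounded figure of 50 entries per leaf, htree_capacity(2, 50) gives the 12,500,000 quoted above and htree_capacity(3, 50) gives 6.25 billion.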

@frittentheke
Contributor

> It's just not feasible with a local filesystem; is it not typical to hash files in subfolder structure?

Yes. The issue is common among services like email servers or others storing large numbers of files.
The solution is usually to create a few layers (2-4) of hashed directories, e.g.

/
  /a
    /a
    /b
    ...
  /b
    /a
    /b
    ...
  /c
    /a
    /b
    ...
...

and then store the files in there. This reduces the files per directory by orders of magnitude.

See here how this is done for Dovecot: https://doc.dovecot.org/configuration_manual/mail_location/#directory-hashing
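
A minimal sketch of such a layout, assuming SHA-256 of the file name as the hash and two levels of two hex characters each. The function name and parameters are hypothetical; this is not how Loki or Dovecot actually implement it.

```python
import hashlib
from pathlib import PurePosixPath


def hashed_path(root, name, levels=2, width=2):
    """Place `name` under `levels` subdirectories cut from its SHA-256 digest."""
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return PurePosixPath(root, *parts, name)


# Chunk file name taken from earlier in this thread.
chunk = "ZmFrZS83MzU5OTJkNzAzOWM0MDU5OjE2ZmEyYmJmZTIxOjE2ZmEyYmRkMzYzOmNjMjZlNTA1"
print(hashed_path("/tmp/loki/chunks", chunk))
```

Because the directory is derived from the name alone, a reader can recompute the path without any extra index, and two levels of two hex characters spread the files over 65,536 leaf directories.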

@frittentheke
Contributor

frittentheke commented Apr 4, 2024

^^ @slim-bean could you reopen this one maybe? I strongly believe Loki should "learn" to store its chunks / files in a way that does not require filesystems to scale to millions of files in a single directory.

@slim-bean
Collaborator

sorry, I haven't checked on this issue in quite some time....

I will say it wasn't ever really our intention for folks to take the filesystem this far, Loki was really meant to be used with object storage...

But I know there are lots and lots of folks out there who are running with filesystems and I'd love for y'all to be happy and not having issues.

I wonder what we should rename this issue to.

@slim-bean slim-bean reopened this Apr 11, 2024
@stale stale bot removed stale A stale issue or PR that will automatically be closed. labels Apr 11, 2024
@slim-bean slim-bean added the keepalive An issue or PR that will be kept alive and never marked as stale. label Apr 11, 2024
@srstsavage
Contributor Author

Maybe something like "Support large numbers of loki chunks on filesystems using prefix directory hierarchy"?

@frittentheke
Contributor

> I wonder what we should rename this issue to.

> Maybe something like "Support large numbers of loki chunks on filesystems using prefix directory hierarchy"?

I believe the word "hashed" or "hashing" is used for this kind of directory hierarchy / structure:

If I may quote what our dear friend ChatGPT has to say:

Directory hashing is a technique used to store and organize data in a hashed directory structure. In this approach, data is distributed across a large number of directories using a hash function. The hash function takes the data's key as input and generates a unique hash value, which is used to determine the directory where the data will be stored.

The main advantage of directory hashing is that it can help distribute data evenly across directories, which can improve performance and scalability. It also allows for efficient retrieval of data, as the hash function can be used to quickly locate the directory where the data is stored.

Directory hashing is commonly used in database systems, distributed file systems, and other storage systems to efficiently organize and access large volumes of data. It is an important technique for managing data in a way that optimizes storage and retrieval operations.

@srstsavage
Contributor Author

The chunk filenames in /tmp/loki/chunks seem to be base64 encodings of some other information so I don't think storing them in a prefix derived directory hierarchy would exactly qualify as directory hashing, but similar approach, yes.
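
To illustrate the observation, here is a small sketch that decodes one of the chunk file names from the first log lines in this thread. It assumes standard base64 and that the decoded text has the shape tenant/field:field:field:field; the meaning of the individual fields is not confirmed anywhere in this thread, and the function name is hypothetical.

```python
import base64


def decode_chunk_name(filename):
    """Base64-decode a chunk file name and split it into its visible fields."""
    decoded = base64.b64decode(filename).decode("utf-8")
    tenant, _, rest = decoded.partition("/")
    return tenant, rest.split(":")


# File name copied from the first error line in the original report.
name = "ZmFrZS84NDBiODY0MTMwOWFkOTZlOjE2Zjk1ZWNjNmU1OjE2Zjk1ZWNkM2JjOmRkMWUwMjUx"
tenant, fields = decode_chunk_name(name)
print(tenant, fields)
```

The tenant part decodes to "fake", matching the org_id=fake in the log lines, which also means every file name in the directory starts with the same base64 prefix.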

@AlexanderSchuetz97

AlexanderSchuetz97 commented Apr 15, 2024

Hello,

just to provide input for the discussion.

A proper solution for this would be appreciated. I intend to roll out Loki to customers soon. Using an object store is not feasible for any of those customers: due to the very sensitive nature of the data that these systems will be carrying, they are usually air gapped.

Just to give you an idea, nobody will ever analyze the logs on prem. I intend to just have the customer put the entire Docker mount of Loki onto (ideally) a tape drive, for example, and send it to me for analysis. I will then have my (once again air gapped) computer where I will have my Grafana dashboards to analyze everything.

While I don't doubt that tar can handle an infinite number of files in a directory, I can already see a customer just trying to copy the directory onto an external storage disk with, god forbid, NTFS or any other random file system before sending it to me. So yeah, a solution where Loki does not create all chunks in a single directory would indeed be appreciated.

Any of the methodologies described above would solve those problems. Literally just creating 4 levels of directories, for example taking 2 letters of a hash of the chunk file name, would probably do the job.

The main reason why I personally think that Loki is awesome is precisely because of the file system storage and the fact that it's a single binary that just works. (Plus the push REST API is easy to use, but that's not relevant here.)

I don't think you should laser-focus on just being yet another cloud logging tool. It can so easily be perfect for large on-prem solutions too.

I would not have minded personally using Apache Cassandra instead of the fs, but for some reason you guys deprecated that. All the other storage options are non-options for air gapped systems, especially if you have somewhat (justifiably) paranoid customers.

I hope you guys manage to find an acceptable solution to this problem.

Sincerely
Alexander Schütz

@frittentheke
Contributor

frittentheke commented Apr 16, 2024

> The chunk filenames in /tmp/loki/chunks seem to be base64 encodings of some other information so I don't think storing them in a prefix derived directory hierarchy would exactly qualify as directory hashing, but similar approach, yes.

Fair point.

  1. In the end, hashing in this context is just about projecting a larger namespace onto a finite number of buckets (e.g. by using substring ranges of the hash value) to which you then distribute data, possibly with multiple levels of hierarchy.
  2. Hashing is used to ensure the distribution of data is even, compared to e.g. using two-character pairs of some bare hostnames or other potentially non-evenly distributed data, which would likely cause lots of chunks to end up in the same directory.
  3. Hashing is also nice because you can map the identifiers used (so the "base64 encodings of some other information") in O(1) to the directory which holds the chunks, without requiring any overview files or additional metadata.
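
The difference between the two bucketing strategies can be seen on the five chunk file names quoted earlier in this thread. This is only an illustrative sketch: the bucket functions are hypothetical, and SHA-256 stands in for whatever hash an implementation might pick.

```python
import hashlib
from collections import Counter

# Chunk file names copied from earlier in this thread.
names = [
    "ZmFrZS84NDBiODY0MTMwOWFkOTZlOjE2Zjk1ZWNjNmU1OjE2Zjk1ZWNkM2JjOmRkMWUwMjUx",
    "ZmFrZS82ZDNlZmFhODk1OWZiYjQxOjE2Zjk1ZTgzOTI4OjE2Zjk1ZmMyNzRiOjg3MTQ1OTkw",
    "ZmFrZS83MzU5OTJkNzAzOWM0MDU5OjE2ZmEyYmJmZTIxOjE2ZmEyYmRkMzYzOmNjMjZlNTA1",
    "ZmFrZS83NWQ5NjliMjUzNDhlYjM1OjE2ZmEyYjVjMzIwOjE2ZmEyYjVkNjU0OjM5YzY3OWM=",
    "ZmFrZS9jOWNjNDUxMWZmZjUxODg2OjE2ZmEyYjkxZWRiOjE2ZmEyYjkxZWRjOjYzZTBhZGM1",
]


def prefix_bucket(name, width=2):
    """Bucket by the first characters of the raw file name."""
    return name[:width]


def hash_bucket(name, width=2):
    """Bucket by the first hex characters of the SHA-256 of the file name."""
    return hashlib.sha256(name.encode("utf-8")).hexdigest()[:width]


print("prefix:", Counter(prefix_bucket(n) for n in names))
print("hash:  ", Counter(hash_bucket(n) for n in names))
```

All five names begin with the same encoded tenant prefix, so the raw-prefix scheme puts them in a single bucket, while the hashed scheme spreads them out.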

@jfontsaballs

Is it known whether this issue applies when using the xfs file system?

@ThinkChaos

ThinkChaos commented May 7, 2024

Hi, I looked at implementing hashed storage and it seems the core functionality is pretty trivial to write thanks to the existing architecture. Most of the work would be writing tests and the contribution flow.

With the intention of reducing that second part, could someone from Loki chime in on these points:

  1. How would you prefer this to be implemented?

    • Add a v2 FS storage
    • Modify the existing one so new objects get written in hash-based subdirs and reads try that pattern first, but fall back to the old one. It has a small performance penalty for pre-existing objects, or lookups of non-existing objects (if that's even possible), but that seems better to me than adding a new storage engine.
    • Something else?
  2. Do you want a LID for this?
    I'd love to avoid it since I expect writing a LID + the review would take more time than writing the code + tests.
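
A rough sketch of the second option (write to hashed subdirectories, read with a fallback to the flat layout). The class and method names are hypothetical and do not correspond to Loki's actual storage client; SHA-256 and the two-level layout are assumptions for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


class HashedFSStore:
    """Hypothetical store: hashed subdirs for writes, flat-layout fallback for reads."""

    def __init__(self, root):
        self.root = Path(root)

    def _hashed(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.root / digest[:2] / digest[2:4] / key

    def put(self, key, data):
        path = self._hashed(key)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key):
        try:
            return self._hashed(key).read_bytes()
        except FileNotFoundError:
            # Object written before the change, still in the flat directory.
            return (self.root / key).read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    store = HashedFSStore(tmp)
    (Path(tmp) / "old-chunk").write_bytes(b"legacy")  # simulate a pre-existing flat file
    store.put("new-chunk", b"fresh")
    new_data, old_data = store.get("new-chunk"), store.get("old-chunk")
    print(new_data, old_data)
```

The extra cost mentioned above shows up as the failed first lookup for every pre-existing object, and as two failed lookups for a key that does not exist at all.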

My personal use case is setting up Loki with data storage on a NAS. The alternative I considered is setting up an S3-compatible server, but they all come with a lot of complexity and overhead for data distribution and integrity that is useless for me since my filesystem already provides those.

@AlexanderSchuetz97

AlexanderSchuetz97 commented May 7, 2024

I think this should just be a config option in the file system part of the YAML? I personally think mixing dir and non-dir chunks is not a good idea. Either bail out or migrate the data (as in, move the files into directories or out of them) if the setting changes. Migration would have the charm of not requiring any modification of existing configs if the dir store mode was made the default...

I could, however, also very well live with a separate v2 filesystem storage. I'd go with whatever is easiest to maintain long term.

For my intended use case, the overhead of S3 implementations is not needed. I intend to only store the last 30 days, and even if all logs in Loki went poof because a drive failed, it would not be an issue. As you mentioned, one can easily solve this by using an appropriate file system (if needed). I don't have any such integrity requirements, so good old ext4 would do the job for me.

@salacr

salacr commented Jun 7, 2024

+1 for this feature, please!

@hwertz

hwertz commented Jun 7, 2024

I'll just note, since I've used a filesystem that splits data into many blocks and then stores them locally, here are some performance notes from my experience.

The s3ql filesystem (a compressing and deduplicating file system) supports both remote (S3 etc.) and local storage. For local storage it does this pretty simply: there's an s3ql data directory whose top level stores a copy of the database plus a few recent snapshots of it (just in case), so you can unmount it from one location and mount it somewhere else if you wish. The top level also holds the first 1000 (well, 900) data blocks. Block numbering starts from 100 (0-99 are reserved; I think block 0 is a block of entirely binary 0s and 1 might be a block of all 0xFF, so that in some special cases it doesn't have to compress/decompress or download anything when using cloud storage).

The first blocks are stored at the top level; when it gets to 999, for block 1000 it makes a directory "100" and puts file 1000 in there. If you get past 999999 blocks (directory 999 with file 999999 in it), then 1000000 goes in directory 100, then directory 000, with file 1000000 in there. This limits a directory to under 2000 entries (1000 files and 1000 subdirectories).
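
The numbering scheme described above can be sketched as follows. The grouping rule (directories are the leading three-digit groups, leaving at least one digit over) is inferred from the three examples in this comment and is not taken from s3ql's source; `block_path` is a hypothetical name.

```python
def block_path(block_no: int, group: int = 3) -> str:
    """Relative path for a block number, following the examples in the comment above.

    Assumption: each leading group of `group` digits becomes a directory,
    as long as at least one digit remains after it.
    """
    digits = str(block_no)
    n_dirs = (len(digits) - 1) // group
    dirs = [digits[i * group:(i + 1) * group] for i in range(n_dirs)]
    # The file itself keeps the full block number as its name.
    return "/".join(dirs + [digits])
```

With this rule, `block_path(1000)` is `100/1000` and `block_path(1000000)` is `100/000/1000000`, matching the examples given.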

If you're using hex hashes, 2 hex digits would get you 256 files and 256 subdirectories per level, 3 hex digits would get 4096+4096, or even 4 digits (65536+65536, so 131072 entries per directory), depending on whether it proves faster to have many files per directory and a much shallower tree, or a tree that could get quite deep when you've got many blocks in there (like over 5-6 million, if you currently need large_dir to begin with).
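
For a feel of the trade-off, here is a small back-of-the-envelope helper. It assumes a slightly different layout from the one just described: files live only in the leaf directories and the hash spreads them evenly. The function name `fanout` is made up for the example.

```python
def fanout(hex_digits: int, levels: int, total_files: int):
    """Return (subdirs per directory, leaf directories, average files per leaf).

    Assumes files are stored only at the deepest level and are evenly
    distributed by the hash.
    """
    dirs_per_level = 16 ** hex_digits
    leaf_dirs = dirs_per_level ** levels
    avg_files_per_leaf = total_files / leaf_dirs
    return dirs_per_level, leaf_dirs, avg_files_per_leaf
```

For 6 million chunks, two levels of 2 hex digits give 65536 leaf directories with about 92 files each, while a single level of 3 hex digits gives 4096 directories with about 1465 files each.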

I'll note that (along with 2 smaller file systems on other disks) I currently have a 16TB ext4 filesystem with an s3ql store on it, holding 14.3TB of data in only 8.7TB of disk space. It has a bit over 6 million data blocks in it, and performance is great: I can run VirtualBox VMs and such out of it fine, let alone use it for more normal storage and retrieval. So at least at that scale, with a maximum of 1000 blocks per directory (some have very few if something was written then deleted, since new block numbers increase monotonically), the directory tree doesn't get deep enough to bog things down, at least with ext4.

It does mean that when blocks are deleted (like when you create and delete many files, or delete large files, which I have definitely done on there) you can end up with lots of empty subdirectories. I got a very large number on mine and submitted a patch (which is in s3ql now). When you run the fsck, it looks for any extraneous files in the storage (for cloud-type storage it uses S3 or whatever queries to get the list of blocks; locally it walks the directory tree), so it now removes empty directories while it's walking the tree anyway. (You get extraneous blocks in pretty much the same conditions you would for ext4: you're in the middle of deleting files or writing new ones and the power goes out, you manage to crash your system, I was using it on a USB disk and the cable popped out, etc.)

I DID start having my filesystems bog down before I did this. The first time I removed empty directories, they'd been building up for quite a while and I had something ridiculous like 50,000 or maybe even 80,000 empties. I was running some compiles out of there, so I'm sure tons of temp files and junk were created and deleted.
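
The empty-directory cleanup mentioned above could look roughly like this for any sharded on-disk layout. This is a generic sketch, not the s3ql patch; `prune_empty_dirs` is a hypothetical helper.

```python
import os


def prune_empty_dirs(root: str) -> int:
    """Remove empty directories below `root` (never `root` itself); return how many were removed."""
    removed = 0
    # Walk bottom-up so a parent is visited after its children have been pruned.
    for dirpath, _dirnames, _filenames in os.walk(root, topdown=False):
        if dirpath == root:
            continue
        # Re-list at visit time: the lists yielded by os.walk predate any removals.
        if not os.listdir(dirpath):
            os.rmdir(dirpath)
            removed += 1
    return removed
```

Walking bottom-up means a chain of nested empty directories is removed in a single pass.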

@michaelkebe

michaelkebe commented Jun 25, 2024

We encountered this problem, too.

We mounted a second ext4 filesystem with large_dir enabled (because of #1502 (comment)), just for the Loki data dir.

@michaelkebe

michaelkebe commented Jul 2, 2024

If you are using the boltdb-shipper store, switching to the "new" TSDB store (https://grafana.com/docs/loki/latest/operations/storage/tsdb/) will help a bit.

The new TSDB store will create two more levels of directories. E.g.

fake/f23e5e184e31a720/MTkwNzExYjBhYTA6MTkwNzFkZGE2ZjA6ODRjMTYyY2E=

This will at least delay the problem, and may fix it, depending on the chunk count in your setup.
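
To see what the three path components in that example contain, here is a small sketch that splits the path and decodes the base64 file name. Reading the decoded fields as hex "from" and "through" timestamps followed by a checksum is my interpretation of the decoded string, not something stated in this thread; `parse_chunk_path` is a hypothetical helper.

```python
import base64


def parse_chunk_path(path: str) -> dict:
    """Split '<tenant>/<fingerprint>/<base64 name>' and decode the file name."""
    tenant, fingerprint, encoded = path.split("/")
    decoded = base64.b64decode(encoded).decode("ascii")
    # Assumption: the decoded name is '<from>:<through>:<checksum>', all in hex.
    start_hex, end_hex, checksum = decoded.split(":")
    return {
        "tenant": tenant,
        "fingerprint": fingerprint,
        "from": int(start_hex, 16),
        "through": int(end_hex, 16),
        "checksum": checksum,
    }


info = parse_chunk_path("fake/f23e5e184e31a720/MTkwNzExYjBhYTA6MTkwNzFkZGE2ZjA6ODRjMTYyY2E=")
```

The file name decodes to `190711b0aa0:19071dda6f0:84c162ca`, so the tenant and the stream fingerprint have moved out of the file name and into directories.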

If you want to switch, be sure to read and understand this:
https://grafana.com/docs/loki/latest/operations/storage/schema/
