
fs: get inodes via pure go instead of find #1576

Closed
wants to merge 5 commits

Conversation

@euank (Collaborator) commented Jan 16, 2017

This also makes the count more accurate since hardlinks no longer
double-count.
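The review excerpts below only show fragments; a rough sketch of the approach (assembled from those fragments, with a hypothetical function name, so not the exact diff) looks like:

```go
package fs

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// getDirInodeUsage is a sketch (hypothetical name) of the pure-Go approach:
// walk the tree and count distinct inode numbers, so hardlinked files are
// counted once instead of once per link as with `find | wc`.
func getDirInodeUsage(dir string) (uint64, error) {
	inodes := map[uint64]struct{}{}
	err := filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return fmt.Errorf("unable to count inodes for part of dir %s: %s", dir, err)
		}
		s, ok := info.Sys().(*syscall.Stat_t)
		if !ok {
			return fmt.Errorf("unsupported fileinfo; could not convert to stat_t")
		}
		inodes[s.Ino] = struct{}{}
		return nil
	})
	return uint64(len(inodes)), err
}
```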

fs/fs.go Outdated

err := filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
if err != nil {
return fmt.Errorf("unable to count inodes for part of dir %s: %s", dir, err)
euank (Collaborator, Author):

This matches the previous behavior: find would exit non-zero on most types of errors you might run into here (notably permission denied), so the error path would have been hit before as well.

There might be some cases where this doesn't hold, though.

fs/fs.go Outdated

inodes := map[uint64]struct{}{}

err := filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
Contributor:

Won't this traverse devices?

euank (Collaborator, Author):

Good call, I'll skip em.
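(Roughly what that check could look like, in the spirit of find's -xdev; the function name and the rootStat comparison are illustrative, not necessarily the exact change that was pushed.)

```go
package fs

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// Sketch of an -xdev-style check: stat the root once, then refuse to
// descend into directories whose device differs. Illustrative only.
func getDirInodeUsageOneDev(dir string) (uint64, error) {
	var rootStat syscall.Stat_t
	if err := syscall.Stat(dir, &rootStat); err != nil {
		return 0, err
	}
	inodes := map[uint64]struct{}{}
	err := filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		s, ok := info.Sys().(*syscall.Stat_t)
		if !ok {
			return fmt.Errorf("unsupported fileinfo; could not convert to stat_t")
		}
		if info.IsDir() && s.Dev != rootStat.Dev {
			return filepath.SkipDir // stay on one filesystem, like find -xdev
		}
		inodes[s.Ino] = struct{}{}
		return nil
	})
	return uint64(len(inodes)), err
}
```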

fs/fs.go Outdated
if !ok {
return fmt.Errorf("unsupported fileinfo; could not convert to stat_t")
}
inodes[s.Ino] = struct{}{}
Contributor:

I understand that this increases accuracy, but what's the performance impact on a relatively full FS? Would be good to do a comparison to the find method too.

@euank (Collaborator, Author) commented Jan 19, 2017

Numbers are a good idea, so I gave it a quick go on a few sample directories.
I wrote a go program (test-get-inode-usage) that does nothing but call the package-level function in fs to compare with.

My methodology was to call each command 3 times to warm the cache, and then run it once more and record the output.
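That helper isn't checked in anywhere; it was a throwaway along these lines (GetDirInodeUsage matches what's referenced later in the thread, but the exact exported signature here is an assumption):

```go
// Hypothetical reconstruction of the test-get-inode-usage helper: a tiny
// main that calls cadvisor's package-level inode counter and prints the
// result. The GetDirInodeUsage name/signature is an assumption here.
package main

import (
	"fmt"
	"os"

	"github.com/google/cadvisor/fs"
)

func main() {
	dir := "."
	if len(os.Args) > 1 {
		dir = os.Args[1]
	}
	count, err := fs.GetDirInodeUsage(dir)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(count)
}
```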

When run on cadvisor itself:

$ time test-get-inode-usage
2331                                                  
test-get-inode-usage  0.01s user 0.00s system 21% cpu 0.047 total
$ time sh -c 'find -xdev -printf '.' | wc -c'
2331
sh -c 'find -xdev -printf '.' | wc -c'  0.00s user 0.00s system 0% cpu 0.010 total

When run on a 100G directory with quite a few files, recursing, and all that jazz

$ time sh -c 'find -xdev -printf '.' | wc -c'
1461051

real	0m1.713s
user	0m0.560s
sys	0m1.140s


$ time test-get-inode-usage 
1459087

real	0m5.102s
user	0m2.480s
sys	0m2.810s

On /var/lib/docker with a modest set of containers lying around

$ time sh -c 'find -xdev -printf '.' | wc -c'
3764597

real	0m3.678s
user	0m1.030s
sys	0m2.620s


$ time test-get-inode-usage 
1147504

real	0m13.063s
user	0m6.500s
sys	0m6.990s


$ time sh -c 'find . -printf "%i\n" | sort | uniq | wc -l'
1147504

real	0m12.189s
user	0m10.180s
sys	0m2.760s

It's interesting to note that for /var/lib/docker, the new code is roughly 3x more correct than the old code!

I checked what a random set of files with the same inode looked like (via find -links +2 and find -samefile); an example of the sort of duplicates is:

./overlay/fc737a3304e188bc985d07d7e5896393e1c7c97c712d83317c8107cddc5ff58b/root/tmp/openssl-1.0.2e/doc/ssl/SSL_set_verify_result.pod
./overlay/9660ad5520bb82e857bcc326fb1bb9469f0cd0b10bf7dada0c3f81bb6af5c944/root/tmp/openssl-1.0.2e/doc/ssl/SSL_set_verify_result.pod
...

@timstclair (Contributor):
The accuracy increase is great, and we should definitely push forward with this, but I'm a little concerned about the slowdown. If I'm reading the cAdvisor code correctly, it looks like we might be running a separate FSHandler per container, all scanning the same rootfs. If we only ran a single handler scanning the rootfs, that would easily overshadow any performance downside of this change (although it's still worth benchmarking). Would you be willing to pursue this optimization?

@euank (Collaborator, Author) commented Jan 31, 2017

@timstclair Mind explaining what you mean more clearly?

My reading of the code is that the fshandlers are created per-container and apply to the read/write upper layer, so the directory is unique per container and can't be so easily deduped.

Maybe I'm missing something, and there certainly are optimizations that could be made by being smarter, but it seems like a change to make in the future if we're okay with the decrease in performance of this change.

@dashpole (Collaborator) commented Feb 1, 2017

My reading of the code is that the fshandlers are created per-container and apply to the read/write upper layer, so the directory is unique per container and can't be so easily deduped.

I believe @euank is correct in that we have one FsHandler per container, and we are only counting inodes using find on the container's writeable layer.

Given that we only currently use this on the container's r/w upper layer, will we actually see much of an accuracy increase? It looks like the find command has issues counting inodes used by an overlay filesystem, but does a decent job with "regular" files and folders.

The latency increase definitely concerns me, as we believe we have some issues related to evictions using data that isn't recent enough. I am also curious how the latency would change if we were running more than one command at a time (which is the case with multiple containers).

If a couple extra days of delay on this is alright, I would like to collect some data on latency in use myself.

@euank (Collaborator, Author) commented Feb 1, 2017

@dashpole if you're willing to do the footwork to get better numbers, that'd be awesome!

@dashpole (Collaborator) commented Feb 1, 2017

I mostly just want to get numbers in the context of our e2e eviction tests.

@euank (Collaborator, Author) commented Feb 1, 2017

@dashpole @timstclair it's also an option to keep the pure Go but remove the map (and thus the memory allocation), which I think would make this only slightly slower than the find method.

I'll also point out that on slower disks (e.g. EBS or spinning rust), both find and this will be roughly equally slow. On my laptop's SSD the difference is starker than it would be in environments with worse disk I/O.

And finally, I think it's still worth doing hardlink deduplication if we can in container rootfs too. There aren't too many applications that use hardlinks at runtime, but they're out there, and it would be great to not evict applications that are doing the right thing in terms of saving inodes.

I have some quite hacky rsync scripts on my local kubernetes cluster that do use hardlinks and would probably benefit from this change.

@dashpole (Collaborator) commented Feb 1, 2017

If there aren't any downsides to removing the map, and the change isn't too hard to make, then I'll do my testing after that.

@euank (Collaborator, Author) commented Feb 1, 2017

@dashpole The downside is that hardlinks get double-counted, which is why I'd prefer to leave it if possible.

@dashpole (Collaborator) commented Feb 1, 2017

Oh, I see. I'll just run it on this, then.

@timstclair (Contributor):
My reading of the code is that the fshandlers are created per-container and apply to the read/write upper layer, so the directory is unique per container and can't be so easily deduped.

You're right, I misread. Thanks for clearing it up!

How about only using the map if Nlink > 1?
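(A sketch of that idea, not the merged code: files with a single link are counted straight into a counter, and the map is only touched for files that actually have multiple hardlinks.)

```go
package fs

import (
	"os"
	"path/filepath"
	"syscall"
)

// Sketch of the "only use the map if Nlink > 1" tweak. Most files have a
// single link, so they skip the map entirely; only multi-link files pay
// for dedup. Names are illustrative.
func countInodes(dir string) (uint64, error) {
	var count uint64
	multiLink := map[uint64]struct{}{}
	err := filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		s, ok := info.Sys().(*syscall.Stat_t)
		if !ok {
			return nil // sketch: ignore entries without a Stat_t
		}
		if s.Nlink > 1 {
			if _, seen := multiLink[s.Ino]; seen {
				return nil // another link to an inode we already counted
			}
			multiLink[s.Ino] = struct{}{}
		}
		count++
		return nil
	})
	return count, err
}
```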

@dashpole (Collaborator) commented Feb 2, 2017

I ran the inode eviction test with and without this change.
With this change, the test flaked once, out of 4 runs, and each run had a call to GetDirInodeUsage that took > 10s.
Without this change, the test did not flake in 9 runs, and had NO calls to GetDirInodeUsage that took > 1s. The highest recorded was ~700ms.

I'll run some more tests on this change to make sure that these stats are representative, but this version may pose too large a latency increase.

I spoke with @vishh, and he suggested that we could possibly optimize this in two ways:
1. parallelize walking the tree, with a limit on the max number of threads
2. combine this with GetDirDiskUsage: if we are making the effort of walking the tree, we might as well collect both disk space and inode stats at the same time

@dashpole (Collaborator) commented Feb 3, 2017

I have done a couple more runs with the change, and I haven't seen any that were as slow as the first. The highest seems to be around 2.5 seconds.
I tried it with the change timstclair suggested, and the results were about the same.

I realized something while running my tests: some of the rarely seen, extremely long times (> 10s) may be due to the limit on the number of concurrent exec calls; those calls could be waiting for a slot before they can run.
Combining the GetDirInodeUsage and GetDirDiskUsage functions may alleviate this, since we would have half the total number of calls.
The disk-usage half can be computed with info.Size().
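A sketch of what that combined walk could look like (the function name, return shape, and use of info.Size() are assumptions; info.Size() is the apparent size, not allocated blocks like du reports, so the byte numbers wouldn't match du exactly):

```go
package fs

import (
	"os"
	"path/filepath"
	"syscall"
)

// Sketch of collecting disk usage and inode count in a single traversal,
// instead of separate GetDirDiskUsage/GetDirInodeUsage passes.
func getDirUsage(dir string) (bytes uint64, inodes uint64, err error) {
	seen := map[uint64]struct{}{}
	err = filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if s, ok := info.Sys().(*syscall.Stat_t); ok && s.Nlink > 1 {
			if _, dup := seen[s.Ino]; dup {
				return nil // hardlink: already counted this inode and its size
			}
			seen[s.Ino] = struct{}{}
		}
		inodes++
		bytes += uint64(info.Size())
		return nil
	})
	return bytes, inodes, err
}
```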

I think the latency increase is still too large ATM, and I would love to try some optimizations before merging anything.

@k8s-ci-robot (Collaborator):
@euank: The following test(s) failed:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| Jenkins GCE e2e | 6a36b4c | link | @k8s-bot test this |


@euank (Collaborator, Author) commented Mar 10, 2017

I did some more fiddling with optimization here. Tim's suggestion to reduce the map's usage was a good one, and that did help some (though as @dashpole points out, it's still slower).

At this point, the slowness largely comes from Go's implementation of Walk in the stdlib, not from the code in my walkFn. Exchanging the walkFn for an empty body results in hardly any difference.

I tried parallelizing via powerwalk (https://github.com/euank/cadvisor/tree/find-inodes-2) and tweaked the parallelism a bit, but that was slower for all the parallelism values I tried (10, 20, 100, 200).
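For reference, the experiment was roughly this shape (powerwalk runs the walk function concurrently, so the shared map needs a mutex; the WalkLimit call is based on my reading of powerwalk's documented API and the exact branch code may differ):

```go
package main

import (
	"fmt"
	"os"
	"sync"
	"syscall"

	"github.com/stretchr/powerwalk"
)

func main() {
	var (
		mu     sync.Mutex
		inodes = map[uint64]struct{}{}
	)
	// WalkLimit calls the walk function from multiple goroutines, so the
	// inode set has to be protected; that locking alone adds some overhead.
	err := powerwalk.WalkLimit(".", func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if s, ok := info.Sys().(*syscall.Stat_t); ok {
			mu.Lock()
			inodes[s.Ino] = struct{}{}
			mu.Unlock()
		}
		return nil
	}, 20) // one of the parallelism values tried: 10, 20, 100, 200
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(len(inodes))
}
```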

I didn't really expect it to be, since sequential access of files (the walk's access followed immediately by the fstat from the walkFn) on a (fairly unfragmented) filesystem is usually faster.

I'm happy to merge inode and dir usage into one walk operation. I'm not sure that'll end up quicker than the two current C implementations combined, though.

@timstclair (Contributor):
Take a look at golang/go#16399, in particular the solution for goimports: https://go-review.googlesource.com/c/25001/11/imports/fastwalk.go. Unfortunately that's not exported, but we could just clone it if it's sufficiently better. Another possible improvement here: https://gist.github.com/YuriyNasretdinov/a68b6a997216103b13ea0baa4204230f

@euank (Collaborator, Author) commented Mar 14, 2017

@timstclair The primary issue mentioned in the linked Go thread is that, for a find-style traversal, the caller doesn't need the result of fstat.

However, we do use the fstat result to get the inode, so it's not totally wasted for us. If we're going to merge this with the filesystem size code, then we'll actually need the full fstat.

The goimports fastwalk function doesn't need the inode, but getdents does include it, so most of fastwalk's optimizations could apply here if we want the inode but don't want to merge in the directory-size code.

If we do plan to merge inode and size accounting, then we won't want any of the optimizations linked since all of them optimize away the ability to get filesize.

Which direction do you think I should head down?

Optimizing inode counting in isolation, or unifying size/inode counting code to amortize the time that way.

@dashpole (Collaborator):
@euank I'll try to get some numbers on du latency. It sounds like that may be slower than find, since it needs fstat. If we can collect both inodes and disk usage in time comparable to du, then this obviously would be a great thing to add.

@vishh (Contributor) commented Mar 14, 2017 via email

@euank (Collaborator, Author) commented Mar 28, 2019

Merged as part of #2171; this one can be closed

euank closed this Mar 28, 2019