Per-directory file counts in the Filecount plugin #4752

Merged (8 commits) on Oct 5, 2018


Jaeyo
Contributor

@Jaeyo Jaeyo commented Sep 26, 2018

Required for all PRs:

  • Signed CLA.
  • Associated README.md updated.
  • Has appropriate unit tests.

close #4749

@glinton glinton added the feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin label Sep 26, 2018
@danielnelson danielnelson added this to the 1.9.0 milestone Sep 27, 2018
@Jaeyo
Contributor Author

Jaeyo commented Sep 27, 2018

@glinton @danielnelson How about using "directories" rather than "directory", similar to the file input?

@glinton
Contributor

glinton commented Sep 27, 2018

Would directories take an array of glob strings? Similar to the exec input commands vs command?

@sometimesfood
Contributor

Regarding the docs:

In this context, "only count the /var/log dir" is a bit misleading. By default, using /var/log counts all files in /var/log and all of its subdirectories.

In my opinion we should distinguish between counting files in a directory and producing separate stats (fields) for a directory. These are not the same.

@sometimesfood
Contributor

sometimesfood commented Sep 27, 2018

> Would directories take an array of glob strings? Similar to the exec input commands vs command?

Wouldn't it be more consistent to drop "directory" in favour of "directories" (array of globs) altogether? (I think @Jaeyo suggested the same thing in #4749, but I might have misunderstood his comment.)

If a glob expression is used, the filecount plugin has to deal with multiple base directories and report stats for all matching directories accordingly, therefore "directories" seems to be a better name than "directory" to me.
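
To illustrate why a glob expression implies several base directories, here is a minimal Go sketch that expands a list of patterns into the set of matching directories. It is not the plugin's code: `expandDirs` and `demoFS` are hypothetical names, and it uses the standard library's `fs.Glob` (which does not understand `**`) over an in-memory file system instead of Telegraf's globpath package.

```go
package main

import (
	"fmt"
	"io/fs"
	"sort"
	"testing/fstest"
)

// expandDirs (hypothetical helper) expands each glob pattern and returns the
// sorted, de-duplicated list of matches that are directories.
func expandDirs(fsys fs.FS, patterns []string) ([]string, error) {
	seen := map[string]bool{}
	dirs := []string{}
	for _, p := range patterns {
		matches, err := fs.Glob(fsys, p)
		if err != nil {
			return nil, err // fs.Glob only fails on a malformed pattern
		}
		for _, m := range matches {
			info, err := fs.Stat(fsys, m)
			if err != nil || !info.IsDir() || seen[m] {
				continue // skip plain files, unreadable entries and duplicates
			}
			seen[m] = true
			dirs = append(dirs, m)
		}
	}
	sort.Strings(dirs)
	return dirs, nil
}

// demoFS builds a small in-memory tree so the example does not touch the disk.
func demoFS() fstest.MapFS {
	f := &fstest.MapFile{Data: []byte("x")}
	return fstest.MapFS{
		"var/log/syslog":             f,
		"var/log/apache2/access.log": f,
		"var/log/nginx/access.log":   f,
	}
}

func main() {
	dirs, err := expandDirs(demoFS(), []string{"var/log/*", "var/log/apache2"})
	if err != nil {
		panic(err)
	}
	fmt.Println(dirs)
}
```

Each returned directory would then need its own set of stats, which is the argument for the plural option name.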

@sometimesfood
Contributor

sometimesfood commented Sep 27, 2018

@Jaeyo Since you removed the countFn argument, you should also get rid of the countFunc type.

@sometimesfood
Contributor

Also, wouldn't it be a good idea to add some updated unit tests to test the new behaviour?

I hope my comments are not coming across as too nitpicky. I actually quite like the approach taken in this PR, I just think some minor details could be improved a bit.


filtered := []string{}
for path, file := range g.Match() {
if file.IsDir() == true {

This should read if file.IsDir() { instead of if file.IsDir() == true {.
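
A small self-contained sketch of the idiom under review, assuming a map of matched paths similar to the one returned in the snippet above. The names `filterDirs`, `isDirer` and `fakeInfo` are hypothetical and exist only so the example runs without touching the file system.

```go
package main

import (
	"fmt"
	"sort"
)

// isDirer is the single os.FileInfo method this sketch relies on.
type isDirer interface{ IsDir() bool }

// fakeInfo is a stand-in for os.FileInfo (assumption, for illustration only).
type fakeInfo bool

func (f fakeInfo) IsDir() bool { return bool(f) }

// filterDirs returns the sorted paths whose entry is a directory.
func filterDirs(matches map[string]isDirer) []string {
	filtered := []string{}
	for path, file := range matches {
		if file.IsDir() { // IsDir already yields a bool, so "== true" is redundant
			filtered = append(filtered, path)
		}
	}
	sort.Strings(filtered) // map iteration order is random; sort for stable output
	return filtered
}

func main() {
	matches := map[string]isDirer{
		"/var/log":        fakeInfo(true),
		"/var/log/syslog": fakeInfo(false),
		"/var/log/nginx":  fakeInfo(true),
	}
	fmt.Println(filterDirs(matches))
}
```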

@glinton
Contributor

glinton commented Sep 27, 2018

> Wouldn't it be more consistent to drop "directory" in favour of "directories" (array of globs) altogether?

I agree, however one major drawback with that is breaking compatibility with config files from older versions.

@Jaeyo
Contributor Author

Jaeyo commented Sep 27, 2018

I think my comment lacked some explanation.
My suggestion is:

  1. Add a "directories" option that takes an array of glob strings.
  2. Mark "directory" as deprecated.

As @sometimesfood commented, the docs should also be updated.

Thanks for all the comments.
@glinton
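
The two-step suggestion above could be sketched roughly as follows. This is an illustration, not the merged code: the `FileCount` struct layout and the `effectiveDirs` method are assumptions, showing one way to keep honouring the deprecated single-value option alongside the new list.

```go
package main

import "fmt"

// FileCount mirrors the two options discussed in the thread (field names are
// assumptions for this sketch).
type FileCount struct {
	Directory   string   // deprecated: single directory, kept for old configs
	Directories []string // new: list of glob patterns
}

// effectiveDirs (hypothetical helper) merges both options so that existing
// configuration files keep working while new ones use the list form.
func (fc FileCount) effectiveDirs() []string {
	dirs := append([]string{}, fc.Directories...) // copy, never alias the config slice
	if fc.Directory != "" {
		dirs = append(dirs, fc.Directory)
	}
	return dirs
}

func main() {
	old := FileCount{Directory: "/var/log"}
	mixed := FileCount{Directory: "/var/log", Directories: []string{"/tmp/*"}}
	fmt.Println(old.effectiveDirs(), mixed.effectiveDirs())
}
```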

@sometimesfood
Contributor

sometimesfood commented Sep 28, 2018

> I agree, however one major drawback with that is breaking compatibility with config files from older versions.

Marking "directory" as deprecated in favour of "directories" sounds like a good approach in order to avoid breaking existing configurations.

@Jaeyo Jaeyo changed the title Per-directory file counts in the Filecount plugin [WIP] Per-directory file counts in the Filecount plugin Sep 28, 2018
@Jaeyo Jaeyo changed the title [WIP] Per-directory file counts in the Filecount plugin Per-directory file counts in the Filecount plugin Sep 29, 2018
@Jaeyo
Contributor Author

Jaeyo commented Sep 29, 2018

directories added

@Samuel-BF
Contributor

Reading the code, it seems to read the same directories several times when recursion is requested:
if you have the hierarchy /var/log/{apache2/{access/,error/},nginx/{access/,error/},mysql/}, then asking for [/var/log/**] counts the elements in the /var/log/apache2/access directory 3 times (once for /var/log/apache2/access, once for /var/log/apache2 and once for /var/log).

IMHO, it would be more efficient to parse all directories once rather than launching walkFn for each directory found. I may propose a fix, but not before this Friday.
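
A minimal sketch of the single-pass idea, under stated assumptions: each directory is read exactly once, and its total is propagated up to its ancestors through the return value. `countPerDir` and `demoFS` are hypothetical names, "count" here means non-directory entries at or below a directory, and the tree is held in memory via `fstest.MapFS` so the example is reproducible.

```go
package main

import (
	"fmt"
	"io/fs"
	"path"
	"testing/fstest"
)

// countPerDir reads dir once, recurses into each subdirectory, and stores in
// counts[dir] the number of non-directory entries at or below dir.
func countPerDir(fsys fs.FS, dir string, counts map[string]int) (int, error) {
	entries, err := fs.ReadDir(fsys, dir)
	if err != nil {
		return 0, err
	}
	total := 0
	for _, e := range entries {
		if e.IsDir() {
			n, err := countPerDir(fsys, path.Join(dir, e.Name()), counts)
			if err != nil {
				return 0, err
			}
			total += n // reuse the child's result instead of walking it again
			continue
		}
		total++
	}
	counts[dir] = total
	return total, nil
}

// demoFS mimics the hierarchy from the comment above with a few sample files.
func demoFS() fstest.MapFS {
	f := &fstest.MapFile{Data: []byte("x")}
	emptyDir := &fstest.MapFile{Mode: fs.ModeDir}
	return fstest.MapFS{
		"var/log/syslog":               f,
		"var/log/apache2/access/a.log": f,
		"var/log/apache2/access/b.log": f,
		"var/log/apache2/error/e.log":  f,
		"var/log/nginx/access/a.log":   f,
		"var/log/nginx/error":          emptyDir,
		"var/log/mysql":                emptyDir,
	}
}

func main() {
	counts := map[string]int{}
	if _, err := countPerDir(demoFS(), "var/log", counts); err != nil {
		panic(err)
	}
	fmt.Println(counts["var/log"], counts["var/log/apache2"], counts["var/log/apache2/access"])
}
```

On a real disk the same function could be called with `os.DirFS`; filters such as name patterns or size limits are left out of this sketch.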

@Jaeyo
Contributor Author

Jaeyo commented Oct 2, 2018

@Samuel-BF Thanks for your suggestion. I would appreciate a fix at your convenience.

Samuel-BF pushed a commit to Samuel-BF/telegraf that referenced this pull request Oct 5, 2018
@Samuel-BF
Contributor

Samuel-BF commented Oct 5, 2018

Well, I had already rewritten the plugin without filepath.Walk (using recursive functions instead) for PR #4778.

It seemed simpler to modify that version to include your work (the Directories option), which is what I did in 18373c7.

One detail: your version actually parses the directories n+1 times, where n is the recursion depth of the directories (once in globpath.Match and once per recursion level).

I ran one test comparing our two versions on my system (against the telegraf source tree):

$ time ./telegraf --config test.conf --test # PR 4778
[... lots of outputs]
real    0m0,717s
user    0m0,370s
sys     0m0,426s
$ time ./telegraf --config test.conf --test # PR 4752
real    0m3,159s
user    0m1,849s
sys     0m1,725s

Of course, the gain is proportional to your folder depth.
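
The dependence on depth can be made concrete with a rough count of directory reads for the /var/log hierarchy mentioned earlier in the thread. This is a back-of-the-envelope model, not a measurement: it assumes every directory (including the base) is matched by the glob, that each match triggers its own full recursive walk, and it ignores the extra glob-matching pass. `dirReads` is a hypothetical helper.

```go
package main

import "fmt"

// dirReads compares two strategies. depths holds, for every directory, its
// depth below the base directory (the base itself has depth 0).
func dirReads(depths []int) (perDirWalk, singlePass int) {
	for _, d := range depths {
		perDirWalk += d + 1 // read by its own walk plus one walk per ancestor
		singlePass++        // read exactly once
	}
	return perDirWalk, singlePass
}

func main() {
	// /var/log, then apache2, nginx, mysql, then access/ and error/ under
	// apache2 and nginx.
	depths := []int{0, 1, 1, 1, 2, 2, 2, 2}
	walks, once := dirReads(depths)
	fmt.Printf("per-directory walks: %d reads, single pass: %d reads\n", walks, once)
}
```

Under these assumptions the eight directories cost 19 reads with one walk per matched directory and 8 reads in a single pass, and the gap widens as the tree gets deeper.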

@sometimesfood
Contributor

@Samuel-BF: While I appreciate the better efficiency and the resulting speedup, I dislike the recursive_print_size option. Hardcoding this type of behaviour seems like a bad idea to me.

@Samuel-BF
Contributor

Samuel-BF commented Oct 5, 2018

> @Samuel-BF: While I appreciate the better efficiency and the resulting speedup, I dislike the recursive_print_size option. Hardcoding this type of behaviour seems like a bad idea to me.

Let me explain my use case.
I'm a sysadmin who wants to monitor employees' use of a shared disk (they each have a personal folder on it, with no quotas yet). I want to be able to identify folders that take up a lot of space. Without this option, I have to:

  • either send information for all directories to my InfluxDB backend (which produces a lot of unnecessary data)
  • or restrict the information to a predefined set of folders (e.g. .../users/*)

An option like recursive_print_size gives me a configurable balance between these two extremes.

But I agree, it doesn't seem very elegant, in either the name or the logic. Any suggestions? I'll check whether the processor or aggregator plugins allow me to drop measurements.

@sometimesfood
Contributor

sometimesfood commented Oct 5, 2018

I think I understand your use case, but IMHO this should be addressed on a different abstraction level. Using processors or aggregators might be a solution to this type of problem. To be honest I haven't checked yet whether these are viable options, but I think adding this type of feature should be postponed until we have evaluated simpler options to do the same thing.

I added some comments to your pull request (#4778), sorry about splitting up the discussion.

@danielnelson danielnelson merged commit 030f944 into influxdata:master Oct 5, 2018
@danielnelson
Contributor

I added a few comments to #4778 about the proposed options, but we can add other changes from there on top of this work.

rgitzel pushed a commit to rgitzel/telegraf that referenced this pull request Oct 17, 2018
otherpirate pushed a commit to otherpirate/telegraf that referenced this pull request Mar 15, 2019
otherpirate pushed a commit to otherpirate/telegraf that referenced this pull request Mar 15, 2019
dupondje pushed a commit to dupondje/telegraf that referenced this pull request Apr 22, 2019
athoune pushed a commit to bearstech/telegraf that referenced this pull request Apr 17, 2020
This pull request was closed.