Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable/disable functionality for monitoring components #118

Closed
wants to merge 12 commits into from

Conversation

graeme-a-stewart
Copy link
Member

@graeme-a-stewart graeme-a-stewart commented Jun 3, 2020

Closes #107

This PR builds off #115, so it's WIP until that is accepted. However, it's essentially ready to go in its own right.

A new switch, -m/--monitor, is added that allows monitoring components to be enabled or disabled. The switch can be given multiple times and allows for a comma separated list of monitors to be given. If the name of the monitor is given the component is enabled. If the name is prefixed with ~ then it is disabled. The special name all sets the default status (useful when given as ~all that switches all components off by default).

The wallmon component cannot be disabled as it's needed by the main monitor loop.

switch effect
-m ~netmon disable network monitoring
-m ~netmon,~iomon disable network and io monitoring
-m ~netmon -m ~iomon disable network and io monitoring
-m ~all,memmon,cpumon only enable memory and cpu monitoring
-m ~all -m memmon,cpumon only enable memory and cpu monitoring

As part of this PR pidutils.cpp is renamed to prmonutils.cpp as it now contains generally utilities for the main prmon code (cf. utils.cpp which is for the monitors).

@graeme-a-stewart graeme-a-stewart added enhancement New feature or request work in progress Further development needed before this pull request can be accepted labels Jun 3, 2020
@graeme-a-stewart graeme-a-stewart requested a review from amete June 3, 2020 19:03
@graeme-a-stewart graeme-a-stewart changed the title Enable disable Enable disable functionality for monitoring components Jun 3, 2020
@graeme-a-stewart graeme-a-stewart changed the title Enable disable functionality for monitoring components Enable/disable functionality for monitoring components Jun 3, 2020
@graeme-a-stewart
Copy link
Member Author

Note to self - README.md needs updated!

@graeme-a-stewart
Copy link
Member Author

Just rebased this PR to go cleanly on top of master

@graeme-a-stewart graeme-a-stewart removed the work in progress Further development needed before this pull request can be accepted label Jun 4, 2020
@graeme-a-stewart
Copy link
Member Author

I think this guy is ready to go now. After that we can finally make a proper PR for the NVIDIA GPU monitoring. The yak is shaved.

In our case the class name can always be used
Change indenting and lower case method names
New command line option "-m" than enables on/off (~mon = off)
Can be given multiple times or comma separated
Remove debug printouts from option parsing
The "+MON" option was never implemented and doesn't
really seem to be necessary (avoid having two ways
to do things, it's just confusing)
Probably the user got something wrong, so we should print this warning
(but we do not abort the job)
Copy link
Collaborator

@amete amete left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for this @graeme-a-stewart. I love the idea but I'm a bit conflicted w/ the set of rules we have. Please see below.

Comment on lines +287 to +292
<< "[--monitor, -m mon] Enable or disable monitors:\n"
<< " ~NAME disables that monitor\n"
<< " NAME enables that monitor\n"
<< " comma separation supported or specifing\n"
<< " multiple times\n"
<< " Special name '[~]all' sets default state\n"
Copy link
Collaborator

@amete amete Jun 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love the idea but I'm a bit conflicted w/ the set of rules we have.

For example, if as a user I'd like to only monitor cpumon, my initial reaction would be to execute prmon -m cpumon -- [command]. However, doing so I end up getting everything (w/o any message) because I should've used a rather awkward prmon -m ~all,cpumon -- [command].

Can we do something like this instead:

  • If no -m is provided collect everything (default as is now)
  • Don't allow mixing ~ and non-~
    • If the default is all then one can turn a subset off things w/ a simple -m ~a,~b
    • If the user provides -m a,b then only enable those monitors

In all cases, we better print exactly what's going to be monitored. Currently we only print a message saying wallmon cannot be turned-off if the user tries to do so.

Then there are edge cases, which can be blamed on the user, but -m ~all,all ends up collecting only wallmon but -m all,~all ends up collecting all (i.e. the options are not commutative). This can be eliminated if we don't allow mixing but at the very least perhaps one can print a warning in such cases. If a simple -m ~all is provided (one can argue that's silly but) we can simply bail out printing a message that the user asked for monitoring nothing.

Also another issue is that, w/ this approach we're altering the orders of the columns in the text file. Now this is not really a problem on our side per se (for example the plotting script uses column names). However, ADC is making plots using these files and if they have hardcoded column numbers we should make sure they're aware. Technically it's someone else's problem but would be nice to give a heads up.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like your proposal for the syntax - I was probably thinking too generally and then it does become a bit confusing (especially with all being the implicit default).

The issue with ~all,all != all,~all is just impossible to solve satisfactorily, but is avoided if we disallow mixing on/off arguments.

Re. the columns, we should check that they use column names as we can't really guaranttee any particular ordering.

@graeme-a-stewart graeme-a-stewart added the work in progress Further development needed before this pull request can be accepted label Jun 5, 2020
@graeme-a-stewart
Copy link
Member Author

I actually just thought of something that wouldn't fit in this scheme. If we were to develop any components where we did not want to have them enabled by default (it could happen, if something was expensive to monitor or required a special setup) then that wouldn't fit into this on/off scheme.

@graeme-a-stewart
Copy link
Member Author

Just thinking about this again, maybe we could add two switches, enable and monitor.

  • If the enable option is given then it means enable exclusively those monitoring components (this is the behaviour you requested)

  • If the monitor option is given it takes the syntax above, which is more complex, but also more flexible

  • If both options are given it's an option error

Maybe I am overthinking it (!), but I do now think that a better state for device level monitoring components would be disabled by default, because they only give sensible values on a machine used exclusively by the monitored job. e.g., I would consider adding device level GPU stats like i/o, but I would disable that by default.

@amete
Copy link
Collaborator

amete commented Jun 12, 2020

To be honest this is one of those things where I'd prefer a pragmatic approach. So far we've been monitoring everything and this hasn't been a problem really. From your preliminary results, the overhead that comes from the new GPU monitoring is practically negligible, too. Therefore, my inclination is to enable everything by default (device-level or otherwise - in the current implementation we don't treat these differently anyhow). I'd keep a single argument and allow the user to either specify what should be exclusively monitored (-m a,b,c meaning only a,b,c gets monitored) or what shouldn't (-m ~a,~b,~c meaning everything but a,b,c). I think this would cover the overwhelming majority of the use cases in practice. In most platforms we don't even have GPUs installed at this point, so although enabled by default GPU monitoring will not add anything extra.

@graeme-a-stewart graeme-a-stewart added this to the v2.2 milestone Sep 8, 2020
@graeme-a-stewart
Copy link
Member Author

I went back to this again, in the light of the discussion with Vince and this is now implemented in a much simpler way in #178

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request work in progress Further development needed before this pull request can be accepted
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Options to enable/disable monitoring components
2 participants