Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to enable RMM logging #542

Merged
merged 15 commits into from
Mar 2, 2021

Conversation

charlesbluca
Copy link
Member

Continuation of #322; this PR adds:

  • Keyword argument rmm_log_directory to LocalCUDACluster and CUDAWorker to enable per-worker RMM logging to a specified directory
  • Option --rmm-log-directory to the dask-cuda-worker CLI to achieve the same as above
  • Option --rmm-log-directory to the benchmarks to enable worker and scheduler RMM logging
  • Tests to test_local_cuda_cluster.py and test_dask_cuda_worker.py to check that logging is happening

The log files use the following naming convention:

  • rmm_log_<N>.dev0.txt for workers spawned through the construction of a cluster
  • rmm_log_<IP>:<PORT>.dev0.txt for workers spawned using the CLI
  • rmm_log_scheduler.dev0.txt for schedulers

Some questions:

  • Should we leave dev0 in the log file names? It seems redundant, but I didn't really put much time into seeing if we could stop RMM from appending that to the file names.
  • Are the tests sufficient? Right now they are only checking that the memory resources are rmm.mr.LoggingResourceAdaptors when logging is enabled but a check that the files were actually generated could be nice.
  • Should we consider adding logging to the scheduler/client?

cc @jakirkham @pentschev

@charlesbluca charlesbluca added 3 - Ready for Review Ready for review by team feature request New feature or request non-breaking Non-breaking change labels Feb 26, 2021
@charlesbluca charlesbluca requested a review from a team as a code owner February 26, 2021 02:16
@github-actions github-actions bot added the python python code needed label Feb 26, 2021
@codecov-io
Copy link

codecov-io commented Feb 26, 2021

Codecov Report

Merging #542 (cbfafaf) into branch-0.19 (fd48b7d) will increase coverage by 30.67%.
The diff coverage is 100.00%.

Impacted file tree graph

@@               Coverage Diff                @@
##           branch-0.19     #542       +/-   ##
================================================
+ Coverage        61.77%   92.45%   +30.67%     
================================================
  Files               22       16        -6     
  Lines             2459     1603      -856     
================================================
- Hits              1519     1482       -37     
+ Misses             940      121      -819     
Impacted Files Coverage Δ
dask_cuda/cuda_worker.py 78.75% <ø> (ø)
dask_cuda/cli/dask_cuda_worker.py 96.96% <100.00%> (+1.58%) ⬆️
dask_cuda/local_cuda_cluster.py 81.39% <100.00%> (+0.21%) ⬆️
dask_cuda/utils.py 90.76% <100.00%> (+1.14%) ⬆️
dask_cuda/proxify_host_file.py 100.00% <0.00%> (ø)
dask_cuda/_version.py
dask_cuda/explicit_comms/dataframe/shuffle.py 97.94% <0.00%> (+0.16%) ⬆️
dask_cuda/proxy_object.py 91.36% <0.00%> (+1.27%) ⬆️
... and 5 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fd48b7d...cbfafaf. Read the comment docs.

Copy link
Member

@pentschev pentschev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good @charlesbluca , I've made a few requests and wrote a couple of questions, please take a look when you have a chance.

@@ -112,6 +112,15 @@
"WARNING: managed memory is currently incompatible with NVLink, "
"trying to enable both will result in an exception.",
)
@click.option(
"--rmm-log-directory",
default=None,
Copy link
Member

@quasiben quasiben Feb 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When using a Pool and RMM is set to None, will there be an error or will RMM not log ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now if --disable-rmm-pool is explicitly passed, there will be no logging. Do you have a preference on what should happen in that case @quasiben ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this referring to the dask-cuda-worker CLI or the benchmark options? In both cases, if --rmm-log-directory is set to None, then logging will be set to False in their respective RMM setups and it will be disabled.

Like @pentschev said, if --disable-rmm-pool is passed in the benchmarks the worker/scheduler's calls to rmm.reintialize will be skipped altogether and logging will be disabled. This relates somewhat to the conversation we had above - is there any utility to enabling RMM logging even if we aren't using a memory pool or managed memory?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to pass --rmm-log-directory without a directory? If so, what happens?

Copy link
Member Author

@charlesbluca charlesbluca Feb 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch! If it's the last argument provided, it gives an error:

>>> dask-cuda-worker localhost:8786 --rmm-log-directory
Error: --rmm-log-directory option requires an argument

However, if it isn't the last argument, the following argument is interpreted as the directory to write to:

>>> dask-cuda-worker localhost:8786 --rmm-log-directory --no-dashboard
...
>>> ls -- --no-dashboard/
rmm_log_127.0.0.1:34325.dev0.txt  rmm_log_127.0.0.1:35957.dev0.txt 

Is there anything we can do to check for this case?

Copy link
Member

@jakirkham jakirkham Feb 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it would be good if it could error consistently. I'm sure there is a way for click to do this, but I don't know offhand what it is

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The closest I could find is the handling of option-like arguments, but I think that's limited to click.arguments; setting ignore_unknown_options to False in the command context didn't change anything here.

If we do find a general solution for this, it would probably be good to open up a separate PR applying it throughout the CLI, as it looks like most of the options behave in a similar fashion; they just usually lead to errors later on unrelated to click.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like @pentschev said, if --disable-rmm-pool is passed in the benchmarks the worker/scheduler's calls to rmm.reintialize will be skipped altogether and logging will be disabled. This relates somewhat to the conversation we had above - is there any utility to enabling RMM logging even if we aren't using a memory pool or managed memory?

Somehow I thought this comment was for the benchmarks only. Regardless, I think @charlesbluca has already clarified.

The closest I could find is the handling of option-like arguments, but I think that's limited to click.arguments; setting ignore_unknown_options to False in the command context didn't change anything here.

If we do find a general solution for this, it would probably be good to open up a separate PR applying it throughout the CLI, as it looks like most of the options behave in a similar fashion; they just usually lead to errors later on unrelated to click.

Even though @jakirkham raises a good point, I wouldn't worry too much about it in this PR, as this is also an issue with all other arguments that require a value and somehow we didn't have too many issues with it. I think we could simply open an issue to eventually address that but not worry about it in this PR.

Copy link
Member

@pentschev pentschev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing the questions I raised @charlesbluca , LGTM now.

@quasiben quasiben added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Mar 2, 2021
@quasiben quasiben removed the 3 - Ready for Review Ready for review by team label Mar 2, 2021
@quasiben
Copy link
Member

quasiben commented Mar 2, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit f99a037 into rapidsai:branch-0.19 Mar 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge feature request New feature or request non-breaking Non-breaking change python python code needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants