Add option to enable RMM logging #542

charlesbluca · 2021-02-26T02:16:23Z

Continuation of #322; this PR adds:

Keyword argument rmm_log_directory to LocalCUDACluster and CUDAWorker to enable per-worker RMM logging to a specified directory
Option --rmm-log-directory to the dask-cuda-worker CLI to achieve the same as above
Option --rmm-log-directory to the benchmarks to enable worker and scheduler RMM logging
Tests to test_local_cuda_cluster.py and test_dask_cuda_worker.py to check that logging is happening

The log files use the following naming convention:

rmm_log_<N>.dev0.txt for workers spawned through the construction of a cluster
rmm_log_<IP>:<PORT>.dev0.txt for workers spawned using the CLI
rmm_log_scheduler.dev0.txt for schedulers

Some questions:

Should we leave dev0 in the log file names? It seems redundant, but I didn't really put much time into seeing if we could stop RMM from appending that to the file names.
Are the tests sufficient? Right now they are only checking that the memory resources are rmm.mr.LoggingResourceAdaptors when logging is enabled but a check that the files were actually generated could be nice.
Should we consider adding logging to the scheduler/client?

cc @jakirkham @pentschev

This reverts commit 7658158.

This reverts commit 21aabc4.

This reverts commit e2d6c0c.

codecov-io · 2021-02-26T02:58:05Z

Codecov Report

Merging #542 (cbfafaf) into branch-0.19 (fd48b7d) will increase coverage by 30.67%.
The diff coverage is 100.00%.

@@               Coverage Diff                @@
##           branch-0.19     #542       +/-   ##
================================================
+ Coverage        61.77%   92.45%   +30.67%     
================================================
  Files               22       16        -6     
  Lines             2459     1603      -856     
================================================
- Hits              1519     1482       -37     
+ Misses             940      121      -819

Impacted Files	Coverage Δ
dask_cuda/cuda_worker.py	`78.75% <ø> (ø)`
dask_cuda/cli/dask_cuda_worker.py	`96.96% <100.00%> (+1.58%)`	⬆️
dask_cuda/local_cuda_cluster.py	`81.39% <100.00%> (+0.21%)`	⬆️
dask_cuda/utils.py	`90.76% <100.00%> (+1.14%)`	⬆️
dask_cuda/proxify_host_file.py	`100.00% <0.00%> (ø)`
dask_cuda/_version.py
dask_cuda/explicit_comms/dataframe/shuffle.py	`97.94% <0.00%> (+0.16%)`	⬆️
dask_cuda/proxy_object.py	`91.36% <0.00%> (+1.27%)`	⬆️
... and 5 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fd48b7d...cbfafaf. Read the comment docs.

pentschev

Overall looks good @charlesbluca , I've made a few requests and wrote a couple of questions, please take a look when you have a chance.

dask_cuda/benchmarks/utils.py

dask_cuda/cli/dask_cuda_worker.py

dask_cuda/local_cuda_cluster.py

quasiben · 2021-02-26T13:18:41Z

dask_cuda/cli/dask_cuda_worker.py

@@ -112,6 +112,15 @@
    "WARNING: managed memory is currently incompatible with NVLink, "
    "trying to enable both will result in an exception.",
 )
+@click.option(
+    "--rmm-log-directory",
+    default=None,


When using a Pool and RMM is set to None, will there be an error or will RMM not log ?

Right now if --disable-rmm-pool is explicitly passed, there will be no logging. Do you have a preference on what should happen in that case @quasiben ?

Is this referring to the dask-cuda-worker CLI or the benchmark options? In both cases, if --rmm-log-directory is set to None, then logging will be set to False in their respective RMM setups and it will be disabled.

Like @pentschev said, if --disable-rmm-pool is passed in the benchmarks the worker/scheduler's calls to rmm.reintialize will be skipped altogether and logging will be disabled. This relates somewhat to the conversation we had above - is there any utility to enabling RMM logging even if we aren't using a memory pool or managed memory?

Is it possible to pass --rmm-log-directory without a directory? If so, what happens?

Nice catch! If it's the last argument provided, it gives an error:

>>> dask-cuda-worker localhost:8786 --rmm-log-directory Error: --rmm-log-directory option requires an argument

However, if it isn't the last argument, the following argument is interpreted as the directory to write to:

>>> dask-cuda-worker localhost:8786 --rmm-log-directory --no-dashboard ... >>> ls -- --no-dashboard/ rmm_log_127.0.0.1:34325.dev0.txt rmm_log_127.0.0.1:35957.dev0.txt

Is there anything we can do to check for this case?

Yeah it would be good if it could error consistently. I'm sure there is a way for click to do this, but I don't know offhand what it is

The closest I could find is the handling of option-like arguments, but I think that's limited to click.arguments; setting ignore_unknown_options to False in the command context didn't change anything here.

If we do find a general solution for this, it would probably be good to open up a separate PR applying it throughout the CLI, as it looks like most of the options behave in a similar fashion; they just usually lead to errors later on unrelated to click.

Like @pentschev said, if --disable-rmm-pool is passed in the benchmarks the worker/scheduler's calls to rmm.reintialize will be skipped altogether and logging will be disabled. This relates somewhat to the conversation we had above - is there any utility to enabling RMM logging even if we aren't using a memory pool or managed memory?

Somehow I thought this comment was for the benchmarks only. Regardless, I think @charlesbluca has already clarified.

The closest I could find is the handling of option-like arguments, but I think that's limited to click.arguments; setting ignore_unknown_options to False in the command context didn't change anything here.

If we do find a general solution for this, it would probably be good to open up a separate PR applying it throughout the CLI, as it looks like most of the options behave in a similar fashion; they just usually lead to errors later on unrelated to click.

Even though @jakirkham raises a good point, I wouldn't worry too much about it in this PR, as this is also an issue with all other arguments that require a value and somehow we didn't have too many issues with it. I think we could simply open an issue to eventually address that but not worry about it in this PR.

pentschev

Thanks for addressing the questions I raised @charlesbluca , LGTM now.

quasiben · 2021-03-02T21:35:25Z

@gpucibot merge

charlesbluca added 13 commits February 22, 2021 11:07

Add option to enable RMM logging

c117d95

Clarify Dask config fallback

349b3d9

Change RMM logging flag to directory argument

7658158

Run pre-commit hooks

21aabc4

Revert "Change RMM logging flag to directory argument"

e2d6c0c

This reverts commit 7658158.

Undo revert - must specify a logging directory

9eeaab0

This reverts commit 21aabc4.

Second attempt at undoing revert

0914624

This reverts commit e2d6c0c.

Set attribute for logging directory

4320928

Change log file naming convention

2ad262a

Add RMM logging option to benchmarks

09611dd

Add test cases for RMM logging

9543208

Run pre-commit hooks

4494e80

Clarify that scheduler RMM pool is also logged

cbfafaf

charlesbluca added 3 - Ready for Review Ready for review by team feature request New feature or request non-breaking Non-breaking change labels Feb 26, 2021

charlesbluca requested a review from a team as a code owner February 26, 2021 02:16

github-actions bot added the python python code needed label Feb 26, 2021

charlesbluca mentioned this pull request Feb 26, 2021

Provide option for RMM logging #522

Closed

pentschev requested changes Feb 26, 2021

View reviewed changes

dask_cuda/benchmarks/utils.py Outdated Show resolved Hide resolved

dask_cuda/cli/dask_cuda_worker.py Outdated Show resolved Hide resolved

dask_cuda/local_cuda_cluster.py Outdated Show resolved Hide resolved

dask_cuda/local_cuda_cluster.py Outdated Show resolved Hide resolved

quasiben reviewed Feb 26, 2021

View reviewed changes

charlesbluca added 2 commits February 26, 2021 07:34

Created dedicated function for RMM log file name constuction

3c5ab1f

Clarify logging capabilities in docstrings

f8fc53b

pentschev approved these changes Feb 26, 2021

View reviewed changes

quasiben added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Mar 2, 2021

quasiben approved these changes Mar 2, 2021

View reviewed changes

quasiben removed the 3 - Ready for Review Ready for review by team label Mar 2, 2021

rapids-bot bot merged commit f99a037 into rapidsai:branch-0.19 Mar 2, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add option to enable RMM logging #542

Add option to enable RMM logging #542

charlesbluca commented Feb 26, 2021

codecov-io commented Feb 26, 2021 •

edited

Loading

pentschev left a comment

quasiben Feb 26, 2021 •

edited

Loading

pentschev Feb 26, 2021

charlesbluca Feb 26, 2021

jakirkham Feb 26, 2021

charlesbluca Feb 26, 2021 •

edited

Loading

jakirkham Feb 26, 2021 •

edited

Loading

charlesbluca Feb 26, 2021

pentschev Feb 26, 2021

pentschev left a comment

quasiben commented Mar 2, 2021

Add option to enable RMM logging #542

Add option to enable RMM logging #542

Conversation

charlesbluca commented Feb 26, 2021

codecov-io commented Feb 26, 2021 • edited Loading

Codecov Report

pentschev left a comment

Choose a reason for hiding this comment

quasiben Feb 26, 2021 • edited Loading

Choose a reason for hiding this comment

pentschev Feb 26, 2021

Choose a reason for hiding this comment

charlesbluca Feb 26, 2021

Choose a reason for hiding this comment

jakirkham Feb 26, 2021

Choose a reason for hiding this comment

charlesbluca Feb 26, 2021 • edited Loading

Choose a reason for hiding this comment

jakirkham Feb 26, 2021 • edited Loading

Choose a reason for hiding this comment

charlesbluca Feb 26, 2021

Choose a reason for hiding this comment

pentschev Feb 26, 2021

Choose a reason for hiding this comment

pentschev left a comment

Choose a reason for hiding this comment

quasiben commented Mar 2, 2021

codecov-io commented Feb 26, 2021 •

edited

Loading

quasiben Feb 26, 2021 •

edited

Loading

charlesbluca Feb 26, 2021 •

edited

Loading

jakirkham Feb 26, 2021 •

edited

Loading