Add option to enable RMM logging #542
Codecov Report
@@             Coverage Diff              @@
##           branch-0.19    #542    +/-   ##
============================================
+ Coverage      61.77%   92.45%   +30.67%
============================================
  Files             22       16        -6
  Lines           2459     1603      -856
============================================
- Hits            1519     1482       -37
+ Misses           940      121      -819
Continue to review full report at Codecov.
Overall looks good @charlesbluca, I've made a few requests and written a couple of questions, please take a look when you have a chance.
@@ -112,6 +112,15 @@
    "WARNING: managed memory is currently incompatible with NVLink, "
    "trying to enable both will result in an exception.",
)
@click.option(
    "--rmm-log-directory",
    default=None,
When using a pool and RMM is set to None, will there be an error or will RMM not log?
Right now if --disable-rmm-pool is explicitly passed, there will be no logging. Do you have a preference on what should happen in that case @quasiben?
Is this referring to the dask-cuda-worker CLI or the benchmark options? In both cases, if --rmm-log-directory is set to None, then logging will be set to False in their respective RMM setups and it will be disabled.
Like @pentschev said, if --disable-rmm-pool is passed in the benchmarks, the worker/scheduler's calls to rmm.reinitialize will be skipped altogether and logging will be disabled. This relates somewhat to the conversation we had above - is there any utility to enabling RMM logging even if we aren't using a memory pool or managed memory?
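As a rough illustration of the behaviour described above, the mapping from the directory option to RMM's logging settings could look something like the sketch below. The helper name rmm_logging_kwargs and the worker_name argument are hypothetical and are not taken from this PR; only the logging and log_file_name keywords correspond to parameters of rmm.reinitialize.

```python
import os


def rmm_logging_kwargs(rmm_log_directory, worker_name):
    """Build the logging-related keyword arguments for rmm.reinitialize.

    Hypothetical helper for illustration only; not dask-cuda's actual code.
    """
    if rmm_log_directory is None:
        # No directory given: logging stays off and no file name is passed.
        return {"logging": False, "log_file_name": None}
    # Assumption: RMM adds the device suffix (e.g. ".dev0") to this name itself.
    log_file_name = os.path.join(rmm_log_directory, f"rmm_log_{worker_name}.txt")
    return {"logging": True, "log_file_name": log_file_name}


print(rmm_logging_kwargs(None, "0"))
print(rmm_logging_kwargs("logs", "0"))
```

The resulting dictionary could then be unpacked into the rmm.reinitialize call alongside the pool and managed-memory settings, assuming that call is made at all; as noted above, it is skipped when --disable-rmm-pool is passed in the benchmarks.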
Is it possible to pass --rmm-log-directory without a directory? If so, what happens?
Nice catch! If it's the last argument provided, it gives an error:
>>> dask-cuda-worker localhost:8786 --rmm-log-directory
Error: --rmm-log-directory option requires an argument
However, if it isn't the last argument, the following argument is interpreted as the directory to write to:
>>> dask-cuda-worker localhost:8786 --rmm-log-directory --no-dashboard
...
>>> ls -- --no-dashboard/
rmm_log_127.0.0.1:34325.dev0.txt rmm_log_127.0.0.1:35957.dev0.txt
Is there anything we can do to check for this case?
Yeah it would be good if it could error consistently. I'm sure there is a way for click to do this, but I don't know offhand what it is
The closest I could find is the handling of option-like arguments, but I think that's limited to click.argument; setting ignore_unknown_options to False in the command context didn't change anything here.
If we do find a general solution for this, it would probably be good to open up a separate PR applying it throughout the CLI, as it looks like most of the options behave in a similar fashion; they just usually lead to errors later on unrelated to click.
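One possible direction for such a check, sketched here under assumptions rather than as a confirmed fix, is to validate the parsed value and reject anything that looks like another flag. The helper name reject_option_like is hypothetical. In click, a check like this could be called from an option's callback and raise click.BadParameter instead of ValueError, but that wiring is not tested here.

```python
def reject_option_like(value):
    """Return value unchanged, or raise if it looks like another CLI flag.

    Hypothetical helper for illustration; not part of dask-cuda or click.
    """
    if value is not None and value.startswith("-"):
        raise ValueError(f"expected a directory, got option-like value {value!r}")
    return value


# Try a missing value, a normal directory, and a swallowed flag.
for candidate in (None, "logs", "--no-dashboard"):
    try:
        print(repr(reject_option_like(candidate)))
    except ValueError as exc:
        print(f"Error: {exc}")
```

The trade-off is that a directory whose name really does start with a dash would also be rejected, so this is a heuristic and not a general solution.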
> Like @pentschev said, if --disable-rmm-pool is passed in the benchmarks, the worker/scheduler's calls to rmm.reinitialize will be skipped altogether and logging will be disabled. This relates somewhat to the conversation we had above - is there any utility to enabling RMM logging even if we aren't using a memory pool or managed memory?
Somehow I thought this comment was for the benchmarks only. Regardless, I think @charlesbluca has already clarified.
> The closest I could find is the handling of option-like arguments, but I think that's limited to click.argument; setting ignore_unknown_options to False in the command context didn't change anything here.
> If we do find a general solution for this, it would probably be good to open up a separate PR applying it throughout the CLI, as it looks like most of the options behave in a similar fashion; they just usually lead to errors later on unrelated to click.
Even though @jakirkham raises a good point, I wouldn't worry too much about it in this PR, as this is also an issue with all other arguments that require a value and somehow we didn't have too many issues with it. I think we could simply open an issue to eventually address that but not worry about it in this PR.
Thanks for addressing the questions I raised @charlesbluca, LGTM now.
@gpucibot merge
Continuation of #322; this PR adds:
- rmm_log_directory to LocalCUDACluster and CUDAWorker to enable per-worker RMM logging to a specified directory
- --rmm-log-directory to the dask-cuda-worker CLI to achieve the same as above
- --rmm-log-directory to the benchmarks to enable worker and scheduler RMM logging
- tests in test_local_cuda_cluster.py and test_dask_cuda_worker.py to check that logging is happening

The log files use the following naming convention:
- rmm_log_<N>.dev0.txt for workers spawned through the construction of a cluster
- rmm_log_<IP>:<PORT>.dev0.txt for workers spawned using the CLI
- rmm_log_scheduler.dev0.txt for schedulers

Some questions:
- dev0 in the log file names? It seems redundant, but I didn't really put much time into seeing if we could stop RMM from appending that to the file names.
- rmm.mr.LoggingResourceAdaptors when logging is enabled, but a check that the files were actually generated could be nice.

cc @jakirkham @pentschev
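For the idea of checking that log files were actually generated, a test could glob the log directory for names matching the convention listed above. This is a minimal sketch: the helper name find_rmm_logs is hypothetical, and the files are created by hand as stand-ins for the output of a real worker run.

```python
import glob
import os
import tempfile


def find_rmm_logs(log_directory):
    """Return sorted paths matching rmm_log_*.dev*.txt in log_directory."""
    pattern = os.path.join(log_directory, "rmm_log_*.dev*.txt")
    return sorted(glob.glob(pattern))


# Stand-in for a real run: create empty files using the names described above.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("rmm_log_0.dev0.txt", "rmm_log_scheduler.dev0.txt", "other.txt"):
        open(os.path.join(tmp, name), "w").close()
    found = [os.path.basename(path) for path in find_rmm_logs(tmp)]
    print(found)
```

In a real test, the temporary directory would be passed as rmm_log_directory to the cluster or worker, and the assertion would be that the returned list is non-empty once the workers are up.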