-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add capability to log spilling #442
Conversation
The LoggedBuffer class is a zict.Buffer subclass, simply adding a logging capability while keeping original functionality unchanged.
This is composed of two classes LoggedNanny and LoggedWorker, which are subclasses of Nanny and Worker, respectively. These classes are required to set the worker's address of the DeviceHostFile object, once the worker has been started.
Codecov Report
@@ Coverage Diff @@
## branch-0.19 #442 +/- ##
================================================
+ Coverage 61.06% 90.43% +29.37%
================================================
Files 22 16 -6
Lines 2571 1663 -908
================================================
- Hits 1570 1504 -66
+ Misses 1001 159 -842
Continue to review full report at Codecov.
|
I don't disagree that parts are a bit hacky and but it's cool to see this functionality. I think it also helps us explore what we want to deliver for users and then we can have a more complete idea before upstreaming.
|
Yes, I intended to write that in the description, but ended up forgetting about it. It would be easier to extend that to |
@EvenOldridge friendly reminder here, is that in line with what you want for #438 ? |
Looks like a good start. Is there a way to track how much is being spilled and the time spent spilling? And possibly a summary at the end of the total time spent spilling? |
Latest changes add size of the value spilled and time taken, it now looks like the following:
@EvenOldridge is that what you had in mind? |
Yeah, I like that. Is it possible to get the total spill time summary at the end? Also might be useful to set a threshold of logging to cut down on filling up the logs. |
With the latest commits users can query the total spilling time and print it: spilling_time = await client.run(lambda dask_worker: dask_worker.data.get_total_spilling_time())
print(spilling_time) The output looks like the following: {'tcp://127.0.0.1:35603': {'Total spilling time from Device to Host': 0.14711833000183105, 'Total spilling time from Host to Device': 0.21700596809387207, 'Total spilling time from Host to Disk': 0.13082504272460938, 'Total spilling time from Disk to Host': 0.16785812377929688}} As for the threshold, I'm not confident in a metric for that. However, if that's really necessary, I think we might want to do it in a different manner, as per #438 (comment) . In the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, nice work @pentschev
Thanks @madsbk for reviewing. @EvenOldridge could you check with the customers who have raised this issue whether this is good for their use case? |
@EvenOldridge friendly reminder here, could you check with customers if this covers their use case? :) |
Sorry for missing this. This is great @pentschev - Thanks for the work here! I think this is a reliable way for both developers and users to directly measure the spilling overhead for a given workflow. In the long run, it would be nice for users to have the option to visualize spilling behavior with the dask dashboard, but I think this is even more generally useful than that. It seems easy enough to document the necessary steps for cluster creation and then NVT should be able to automatically query spilling metrics during/after computation (if the user is executing a workflow with a dask-cuda client). |
cc @jacobtomlinson (in case you have thoughts on how we might visualize spilling 🙂) |
@rjzamora thanks for checking this PR. From your comment, my understanding is that this PR is useful for the use case you mentioned, but long-term it would be nice to also have that in the Dask dashboard, am I correct? If so, I would go ahead and merge this and the dashboard could be addressed later on, probably with the help of @jacobtomlinson . |
@gpucibot merge |
This PR allows enabling logs for spilling operations, as requested in #438 . This is definitely not the cleanest solution, but it allows us to test without changing anything in Distributed or Zict, although these changes could be done in those two projects in the future preventing us from creating subclasses. We might also want to have a specific configuration for logging, once we're confident what's the best way to handle that, we would need to decide whether that should be a Distributed or Dask-CUDA configuration.
Here's an example of how logs look like: