-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add zero energy warning #93
Add zero energy warning #93
Conversation
Thanks for the PR! Does |
There's one error, "zeus/utils/env.py:28:12: E721 Use |
Ah I see. I'll review the PR this afternoon (Eastern Time). |
Sounds good, env.py code updated! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again for your work.
I left quite a few comments, which pretty much comes from the distinction of code that works and good code. When situations permit, we should always make it a habit to write good code, and the purpose of my review is to help you through that learning process. Please let me know if you have questions.
Specifically for those that suggest specific changes, you can accept them inside the GitHub PR interface (the Add suggestion to batch button).
tests/test_monitor.py
Outdated
@@ -21,6 +21,7 @@ | |||
|
|||
import pynvml | |||
import pytest | |||
import warnings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is used when triggering the warning message with warnings.warn.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not in this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry got it, removed it from the test_monitor.py file.
tests/test_monitor.py
Outdated
test_measurement = monitor.end_window("window0", sync_cuda=False) | ||
assert all(value is 0.0 for value in test_measurement.gpu_energy.values()) | ||
|
||
pynvml_mock.nvmlDeviceGetTotalEnergyConsumption.side_effect = original_side_effect |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need to restore the original side effect. The test case is over anyway, isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh right. Thanks for the comment.
@@ -394,6 +394,15 @@ def end_window( | |||
gpu_energy_consumption[gpu_index] = power[gpu_index] * ( | |||
time_consumption - power_measurement_time | |||
) | |||
# Trigger a warning if energy consumption is zero and approx_instant_energy is not enabled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Trigger a warning if energy consumption is zero and approx_instant_energy is not enabled. | |
# Trigger a warning if energy consumption is zero and approx_instant_energy is not enabled. |
zeus/monitor/energy.py
Outdated
@@ -394,6 +394,15 @@ def end_window( | |||
gpu_energy_consumption[gpu_index] = power[gpu_index] * ( | |||
time_consumption - power_measurement_time | |||
) | |||
# Trigger a warning if energy consumption is zero and approx_instant_energy is not enabled. | |||
if not self.approx_instant_energy and all( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if not self.approx_instant_energy and all( | |
if not self.approx_instant_energy and any( |
I think this should be any
. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't each energy value in gpu_energy_consumption be 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since counters on each GPU update independently, for a moderately short window, some GPUs may end up with 0 while other GPUs see the counter update once. In this case, I think we should issue the warning.
zeus/monitor/energy.py
Outdated
energy == 0.0 for energy in gpu_energy_consumption.values() | ||
): | ||
warnings.warn( | ||
"Energy consumption is observed as zero. Consider turning on approx_instant_energy option.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Warning messages should be more detailed, since its primary purpose is to inform the user. For instance, see this. So, something like:
The energy consumption of one or more GPUs was measured as zero. This means that the time duration of the measurement window was shorter than the GPU's energy counter update period. Consider turning on the `approx_instant_energy` option in `ZeusMonitor`, which approximates the energy consumption of a short time window as instant power draw x window duration.
zeus/monitor/energy.py
Outdated
UserWarning, | ||
stacklevel=2, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UserWarning
is redundant since it's the default.- Why stack level 2? I feel like it should be 1.
tests/test_monitor.py
Outdated
def zero_energy_side_effect(handle): | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def zero_energy_side_effect(handle): | |
def zero_energy_side_effect(handle): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this defined in line 281 and used in line 461? Move definition right next to usage.
tests/test_monitor.py
Outdated
energy_counters = { | ||
f"handle{i}": itertools.count(start=1000, step=3) | ||
for i in nvml_gpu_indices | ||
if not is_old_nvml[i] | ||
} | ||
|
||
if "window0" in monitor.measurement_states: | ||
return 0.0 | ||
return next(energy_counters[handle]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know why this should be implemented in such a complicated way. I was imagining something like this:
pynvml_mock.nvmlDeviceGetTotalEnergyConsumption.side_effect = lambda: return 0.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, probably should be lambda _: 0.0
to ignore the handle
argument.
tests/test_monitor.py
Outdated
with pytest.warns( | ||
UserWarning, | ||
match="Energy consumption is observed as zero. Consider turning on approx_instant_energy option.", | ||
): | ||
monitor.begin_window("window0", sync_cuda=False) | ||
test_measurement = monitor.end_window("window0", sync_cuda=False) | ||
assert all(value is 0.0 for value in test_measurement.gpu_energy.values()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with pytest.warns( | |
UserWarning, | |
match="Energy consumption is observed as zero. Consider turning on approx_instant_energy option.", | |
): | |
monitor.begin_window("window0", sync_cuda=False) | |
test_measurement = monitor.end_window("window0", sync_cuda=False) | |
assert all(value is 0.0 for value in test_measurement.gpu_energy.values()) | |
monitor.begin_window("window0", sync_cuda=False) | |
with pytest.warns( | |
UserWarning, | |
match="Energy consumption is observed as zero. Consider turning on approx_instant_energy option.", | |
): | |
test_measurement = monitor.end_window("window0", sync_cuda=False) | |
assert all(value == 0.0 for value in test_measurement.gpu_energy.values()) |
pytest.warns
should only wrap around the code it actually tests.is
vs==
. If this wasn't just a typing mistake, you need to invest some time learning the Python language itself deeper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, thanks!
tests/test_monitor.py
Outdated
@@ -440,3 +453,19 @@ def assert_log_file_row(row: str, name: str, begin_time: int, elapsed_time: int) | |||
assert_measurement( | |||
"window5", measurement, begin_time=25, elapsed_time=8, assert_calls=False | |||
) | |||
|
|||
######################################## | |||
# Test content of `energy.py`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bad comment because
- what it says is trivial (the entire file is about testing
energy.py
), - it's too general (the code that follows this comment tests a very specific functionality, not the entire
energy.py
file), - and it breaks consistency among boxed comments in this file (testing zero-energy warnings is not significant enough to be a boxed comment by itself).
Remove the boxed comment and just have a single line comment simply describing what the following code tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed the comment. Thanks for letting me know.
Could you rebase to |
…to align with changes in energy.py
af303f9
to
fd0e91f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pushed nits that were not addressed. Good to go now. Thanks!
Summary
This pull request adds a warning for cases where energy consumption is observed as zero and the
approx_instant_energy
option is not enabled. This helps users identify when they might need to enable this option for more accurate energy measurements.Changes
zeus/monitor/energy.py
to trigger a warning if energy consumption is zero andapprox_instant_energy
is not enabled.tests/test_monitor.py
to ensure the warning is triggered correctly.Motivation
The change ensures users are aware of potential inaccuracies in energy consumption measurements when the
approx_instant_energy
option is not enabled. This can prevent misinterpretations of energy data, particularly in environments where precise energy tracking is crucial.