Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add zero energy warning #93

Merged
merged 6 commits into from
Jul 11, 2024

Conversation

sharonsyh
Copy link
Collaborator

Summary

This pull request adds a warning for cases where energy consumption is observed as zero and the approx_instant_energy option is not enabled. This helps users identify when they might need to enable this option for more accurate energy measurements.

Changes

  • Added a check in zeus/monitor/energy.py to trigger a warning if energy consumption is zero and approx_instant_energy is not enabled.
  • Modified the test in tests/test_monitor.py to ensure the warning is triggered correctly.

Motivation

The change ensures users are aware of potential inaccuracies in energy consumption measurements when the approx_instant_energy option is not enabled. This can prevent misinterpretations of energy data, particularly in environments where precise energy tracking is crucial.

@jaywonchung
Copy link
Member

Thanks for the PR! Does bash scripts/lint.sh pass locally?

@sharonsyh
Copy link
Collaborator Author

There's one error, "zeus/utils/env.py:28:12: E721 Use is and is not for type comparisons, or isinstance() for isinstance checks", which already existed before the modification.

@jaywonchung
Copy link
Member

Ah I see. ruff was updated to 0.5.1 and they newly introduced that check. Please fix it. Updating the line to valtype is bool should be enough.

I'll review the PR this afternoon (Eastern Time).

@sharonsyh
Copy link
Collaborator Author

Sounds good, env.py code updated!

Copy link
Member

@jaywonchung jaywonchung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for your work.

I left quite a few comments, which pretty much comes from the distinction of code that works and good code. When situations permit, we should always make it a habit to write good code, and the purpose of my review is to help you through that learning process. Please let me know if you have questions.

Specifically for those that suggest specific changes, you can accept them inside the GitHub PR interface (the Add suggestion to batch button).

@@ -21,6 +21,7 @@

import pynvml
import pytest
import warnings
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is used when triggering the warning message with warnings.warn.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in this file.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry got it, removed it from the test_monitor.py file.

test_measurement = monitor.end_window("window0", sync_cuda=False)
assert all(value is 0.0 for value in test_measurement.gpu_energy.values())

pynvml_mock.nvmlDeviceGetTotalEnergyConsumption.side_effect = original_side_effect
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need to restore the original side effect. The test case is over anyway, isn't it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh right. Thanks for the comment.

@@ -394,6 +394,15 @@ def end_window(
gpu_energy_consumption[gpu_index] = power[gpu_index] * (
time_consumption - power_measurement_time
)
# Trigger a warning if energy consumption is zero and approx_instant_energy is not enabled.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Trigger a warning if energy consumption is zero and approx_instant_energy is not enabled.
# Trigger a warning if energy consumption is zero and approx_instant_energy is not enabled.

@@ -394,6 +394,15 @@ def end_window(
gpu_energy_consumption[gpu_index] = power[gpu_index] * (
time_consumption - power_measurement_time
)
# Trigger a warning if energy consumption is zero and approx_instant_energy is not enabled.
if not self.approx_instant_energy and all(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if not self.approx_instant_energy and all(
if not self.approx_instant_energy and any(

I think this should be any. WDYT?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't each energy value in gpu_energy_consumption be 0?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since counters on each GPU update independently, for a moderately short window, some GPUs may end up with 0 while other GPUs see the counter update once. In this case, I think we should issue the warning.

energy == 0.0 for energy in gpu_energy_consumption.values()
):
warnings.warn(
"Energy consumption is observed as zero. Consider turning on approx_instant_energy option.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning messages should be more detailed, since its primary purpose is to inform the user. For instance, see this. So, something like:

The energy consumption of one or more GPUs was measured as zero. This means that the time duration of the measurement window was shorter than the GPU's energy counter update period. Consider turning on the `approx_instant_energy` option in `ZeusMonitor`, which approximates the energy consumption of a short time window as instant power draw x window duration.

Comment on lines 403 to 404
UserWarning,
stacklevel=2,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. UserWarning is redundant since it's the default.
  2. Why stack level 2? I feel like it should be 1.

Comment on lines 281 to 282
def zero_energy_side_effect(handle):

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def zero_energy_side_effect(handle):
def zero_energy_side_effect(handle):

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this defined in line 281 and used in line 461? Move definition right next to usage.

Comment on lines 283 to 291
energy_counters = {
f"handle{i}": itertools.count(start=1000, step=3)
for i in nvml_gpu_indices
if not is_old_nvml[i]
}

if "window0" in monitor.measurement_states:
return 0.0
return next(energy_counters[handle])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why this should be implemented in such a complicated way. I was imagining something like this:

pynvml_mock.nvmlDeviceGetTotalEnergyConsumption.side_effect = lambda: return 0.0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, probably should be lambda _: 0.0 to ignore the handle argument.

Comment on lines 463 to 469
with pytest.warns(
UserWarning,
match="Energy consumption is observed as zero. Consider turning on approx_instant_energy option.",
):
monitor.begin_window("window0", sync_cuda=False)
test_measurement = monitor.end_window("window0", sync_cuda=False)
assert all(value is 0.0 for value in test_measurement.gpu_energy.values())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
with pytest.warns(
UserWarning,
match="Energy consumption is observed as zero. Consider turning on approx_instant_energy option.",
):
monitor.begin_window("window0", sync_cuda=False)
test_measurement = monitor.end_window("window0", sync_cuda=False)
assert all(value is 0.0 for value in test_measurement.gpu_energy.values())
monitor.begin_window("window0", sync_cuda=False)
with pytest.warns(
UserWarning,
match="Energy consumption is observed as zero. Consider turning on approx_instant_energy option.",
):
test_measurement = monitor.end_window("window0", sync_cuda=False)
assert all(value == 0.0 for value in test_measurement.gpu_energy.values())
  1. pytest.warns should only wrap around the code it actually tests.
  2. is vs ==. If this wasn't just a typing mistake, you need to invest some time learning the Python language itself deeper.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks!

@@ -440,3 +453,19 @@ def assert_log_file_row(row: str, name: str, begin_time: int, elapsed_time: int)
assert_measurement(
"window5", measurement, begin_time=25, elapsed_time=8, assert_calls=False
)

########################################
# Test content of `energy.py`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bad comment because

  1. what it says is trivial (the entire file is about testing energy.py),
  2. it's too general (the code that follows this comment tests a very specific functionality, not the entire energy.py file),
  3. and it breaks consistency among boxed comments in this file (testing zero-energy warnings is not significant enough to be a boxed comment by itself).

Remove the boxed comment and just have a single line comment simply describing what the following code tests.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the comment. Thanks for letting me know.

@jaywonchung
Copy link
Member

Could you rebase to master? There was a problem in our CI that prevented tests from running on PRs that were submitted from forked repos. I believe I've fixed this with #95.

@sharonsyh sharonsyh force-pushed the add-zero-energy-warning branch from af303f9 to fd0e91f Compare July 10, 2024 06:42
Copy link
Member

@jaywonchung jaywonchung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed nits that were not addressed. Good to go now. Thanks!

@jaywonchung jaywonchung linked an issue Jul 11, 2024 that may be closed by this pull request
@jaywonchung jaywonchung merged commit 869a675 into ml-energy:master Jul 11, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add warning for too short window in ZeusMonitor
2 participants