Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feat] Prometheus metric export #134

Open
wants to merge 54 commits into
base: master
Choose a base branch
from
Open

Conversation

sharonsyh
Copy link
Collaborator

This pull request introduces Prometheus-based metric tracking for energy and power usage within the Zeus framework. It includes functionality for monitoring GPU, CPU, and DRAM energy usage via Histograms, Cumulative Counters, and Gauges.

  • zeus/metric.py:
    A new module that introduces EnergyHistogram, EnergyCumulativeCounter, and PowerGauge classes. These classes enable real-time monitoring of CPU, GPU, and DRAM energy and power consumption by integrating with Prometheus.

  • zeus/prometheus.yml:
    Configuration file for setting up Prometheus monitoring.

  • zeus/docker-compose.yml:
    A Docker Compose file for easily setting up Prometheus with the project for local or cloud-based monitoring.

  • Modified pyproject.toml:
    Added prometheus-client as an optional dependency for Prometheus metric integration.

@sharonsyh sharonsyh changed the title Prometheus Integration Prometheus Integration - Branch Updated Oct 18, 2024
Copy link
Member

@jaywonchung jaywonchung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great work! This is an important piece in making Zeus more usable in a real world scenario. I looked over it at a mid- to high-level (not the nitty gritty details yet) and left some comments. Let me know what you think.

docker/prometheus/prometheus.yml Outdated Show resolved Hide resolved
docs/measure/index.md Outdated Show resolved Hide resolved
docs/measure/index.md Outdated Show resolved Hide resolved
docs/measure/index.md Outdated Show resolved Hide resolved
examples/prometheus/train_single.py Outdated Show resolved Hide resolved
zeus/metric.py Outdated Show resolved Hide resolved
zeus/metric.py Outdated Show resolved Hide resolved
zeus/metric.py Outdated Show resolved Hide resolved
zeus/metric.py Outdated Show resolved Hide resolved
zeus/metric.py Outdated Show resolved Hide resolved
@jaywonchung jaywonchung changed the title Prometheus Integration - Branch Updated [Feat] Prometheus metric export Nov 13, 2024
Co-authored-by: Jae-Won Chung <jwnchung@umich.edu>
sharonsyh and others added 12 commits November 28, 2024 21:42
- Changed metric instantiation to accept CPU and GPU indices directly instead of class objects.
- Improved multiprocessing logic to address and fix pickle-related errors.
- Added consistent handling for sync_execution across begin_window and end_window calls for all metrics.
- Centralized bucket range validation and default handling for EnergyHistogram.
- Improved error handling and logging for multiprocessing processes.
- Standardized Prometheus metric labels (e.g., window and index) across Histogram, Counter, and Gauge.
- Updated docstrings for consistency and clarity across all Metric subclasses.
Adjust target names to standardize pushgateway references, ensuring consistency with the Docker Compose configuration.
docs/measure/index.md Outdated Show resolved Hide resolved
docs/measure/index.md Outdated Show resolved Hide resolved
docs/measure/index.md Outdated Show resolved Hide resolved
docs/measure/index.md Outdated Show resolved Hide resolved
docs/measure/index.md Outdated Show resolved Hide resolved
docs/measure/index.md Outdated Show resolved Hide resolved
docs/measure/index.md Outdated Show resolved Hide resolved
Comment on lines 22 to 23
from PIL import Image, ImageFile, UnidentifiedImageError
#set_start_method("fork", force=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the random image-related stuff that were added and completely remove the power limit optimizer, not commenting it out. Overall, this file needs cleanup.

Comment on lines 203 to 208
# Histogram to track energy consumption over time
energy_histogram = EnergyHistogram(cpu_indices=[0,1], gpu_indices=[0], prometheus_url='http://localhost:9091', job='training_energy_histogram')
# Gauge to track power consumption over time
power_gauge = PowerGauge(gpu_indices=[0], update_period=2, prometheus_url='http://localhost:9091', job='training_power_gauge')
# Counter to track energy consumption over time
energy_counter = EnergyCumulativeCounter(cpu_indices=[0,1], gpu_indices=[0], update_period=2, prometheus_url='http://localhost:9091', job='training_energy_counter')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These lines are very long. They're better off using multiple lines, like before the change.

Comment on lines 215 to 217
energy_histogram.begin_window("training_energy")
energy_histogram.end_window("training_energy")
train(train_loader, model, criterion, optimizer, epoch, args)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't this begin_window, train, then end_window?

print(f"Top-1 accuracy: {acc1}")

# Allow metrics to capture remaining data before shutting down monitoring.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These comments are useful. Please bring them back.

@@ -430,3 +418,4 @@ def accuracy(output, target, topk=(1,)):

if __name__ == "__main__":
main()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Newline.

gpu_energy={0: 30.0, 1: 35.0, 2: 40.0},
cpu_energy={0: 20.0, 1: 25.0},
gpu_energy={0: 50.0, 1: 100.0, 2: 200.0},
cpu_energy={0: 40.0, 1: 50.0},
dram_energy={},
Copy link
Member

@jaywonchung jaywonchung Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If mock CPU 0 supports DRAM energy measurement (in mock_get_cpus), shouldn't this be something like dram_energy={0: 10.0}?

The metrics would be expecting the monitor to provide DRAM energy measurements for CPU 0, but if the Measurement object has nothing, shouldn't it raise an error?


Args:
name (str): Name of the measurement window.
sync_execution (bool): Whether to execute synchronously. Defaults to None.
Copy link
Member

@jaywonchung jaywonchung Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is wrong. See ZeusMonitor.

@@ -54,6 +73,9 @@ class EnergyHistogram(Metric):
gpu_bucket_range: Histogram buckets for GPU energy.
cpu_bucket_range: Histogram buckets for CPU energy.
dram_bucket_range: Histogram buckets for DRAM energy.
gpu_histograms: A single Prometheus Histogram metric for all GPU energy consumption, indexed by window and GPU index.
cpu_histograms: A single Prometheus Histogram metric for all CPU energy consumption, indexed by window and CPU index.
dram_histograms: A single Prometheus Histogram metric for all DRAM energy consumption, indexed by window and DRAM index.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the entire Attributes section. They're not intended to be public attributes AFAIK.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For every class.

Add link to the push gateway

Co-authored-by: Jae-Won Chung <jwnchung@umich.edu>
@@ -63,7 +85,7 @@ def __init__(
prometheus_url: str,
job: str,
gpu_bucket_range: Sequence[float] = [50.0, 100.0, 200.0, 500.0, 1000.0],
cpu_bucket_range: Sequence[float] = [10.0, 20.0, 50.0, 100.0, 200.0],
cpu_bucket_range: Sequence[float] = [10.0, 50.0, 100.0, 500.0, 1000.0],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You updated the default range, but you never reflected the change in any docstring or doc. (1) Update the docstring, and (2) remove the defaults listed in measure.md and instead point people to the API reference page for defaults.

sharonsyh and others added 2 commits December 10, 2024 13:33
Generalize the device as {component} with the note that Gauge only supports GPU

Co-authored-by: Jae-Won Chung <jwnchung@umich.edu>
Co-authored-by: Jae-Won Chung <jwnchung@umich.edu>
self.energy_monitor.begin_window(
f"__EnergyHistogram_{name}", sync_execution=True
)
self.energy_monitor.begin_window(f"__EnergyHistogram_{name}", sync_execution)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.energy_monitor.begin_window(f"__EnergyHistogram_{name}", sync_execution)
self.energy_monitor.begin_window(f"__EnergyHistogram_{name}", sync_execution=sync_execution)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto for end_window.

sharonsyh and others added 4 commits December 10, 2024 13:35
Co-authored-by: Jae-Won Chung <jwnchung@umich.edu>
Co-authored-by: Jae-Won Chung <jwnchung@umich.edu>
Co-authored-by: Jae-Won Chung <jwnchung@umich.edu>
Co-authored-by: Jae-Won Chung <jwnchung@umich.edu>
@@ -288,28 +319,36 @@ def begin_window(self, name: str) -> None:
self.update_period,
self.prometheus_url,
self.job,
sync_execution,
Copy link
Member

@jaywonchung jaywonchung Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is wrong. If sync_execution is True, you need to call zeus.utils.framework.sync_execution on the main thread there the application is running. On the other hand, the power/energy monitor process's ZeusMonitor should always be invoked with sync_execution=False.

Read ZeusMonitor to see how sync_execution (sometimes a boolean parameter and other times a function in zeus.utils.framework) is being used.

Comment on lines +329 to +331
self.window_state[name] = MonitoringProcessState(
queue=self.queue, proc=self.proc
)
Copy link
Member

@jaywonchung jaywonchung Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you putting these in self.queue and self.proc??

Comment on lines +344 to +347
if self.queue is not None:
self.queue.put("stop")
else:
raise RuntimeError("Queue is not initialized")
Copy link
Member

@jaywonchung jaywonchung Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.queue can the queue from any random window??? More specifically, it's going to be the queue that belongs to the most recently started window.

This level of quality is completely unacceptable. Please re-check the correctness of every line of code and documentation, and then ask for review.

Copy link
Member

@jaywonchung jaywonchung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments on changes to make. I think they will be more or less straightforward ones. Let's hope this is the final round of change requests. Thanks!

@jaywonchung jaywonchung linked an issue Dec 19, 2024 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants