[Feat] Prometheus metric export #134

sharonsyh · 2024-10-18T05:29:25Z

This pull request introduces Prometheus-based metric tracking for energy and power usage within the Zeus framework. It includes functionality for monitoring GPU, CPU, and DRAM energy usage via Histograms, Cumulative Counters, and Gauges.

zeus/metric.py:
A new module that introduces EnergyHistogram, EnergyCumulativeCounter, and PowerGauge classes. These classes enable real-time monitoring of CPU, GPU, and DRAM energy and power consumption by integrating with Prometheus.
zeus/prometheus.yml:
Configuration file for setting up Prometheus monitoring.
zeus/docker-compose.yml:
A Docker Compose file for easily setting up Prometheus with the project for local or cloud-based monitoring.
Modified pyproject.toml:
Added prometheus-client as an optional dependency for Prometheus metric integration.


jaywonchung

Thanks for the great work! This is an important piece in making Zeus more usable in a real world scenario. I looked over it at a mid- to high-level (not the nitty gritty details yet) and left some comments. Let me know what you think.

docker/prometheus/prometheus.yml

docs/measure/index.md

examples/prometheus/train_single.py

zeus/metric.py

Co-authored-by: Jae-Won Chung <jwnchung@umich.edu>

examples/prometheus/requirements.txt

docs/measure/index.md

- Changed metric instantiation to accept CPU and GPU indices directly instead of class objects.
- Improved multiprocessing logic to address and fix pickle-related errors.
- Added consistent handling for sync_execution across begin_window and end_window calls for all metrics.
- Centralized bucket range validation and default handling for EnergyHistogram.
- Improved error handling and logging for multiprocessing processes.
- Standardized Prometheus metric labels (e.g., window and index) across Histogram, Counter, and Gauge.
- Updated docstrings for consistency and clarity across all Metric subclasses.

Adjust target names to standardize pushgateway references, ensuring consistency with the Docker Compose configuration.


docs/measure/index.md

jaywonchung · 2024-12-10T18:19:17Z

examples/prometheus/train_single.py

+from PIL import Image, ImageFile, UnidentifiedImageError
+#set_start_method("fork", force=True)


Please remove the random image-related stuff that was added, and completely remove the power limit optimizer instead of commenting it out. Overall, this file needs cleanup.

jaywonchung · 2024-12-10T18:21:14Z

examples/prometheus/train_single.py

+    # Histogram to track energy consumption over time
+    energy_histogram = EnergyHistogram(cpu_indices=[0,1], gpu_indices=[0], prometheus_url='http://localhost:9091', job='training_energy_histogram')
+    # Gauge to track power consumption over time
+    power_gauge = PowerGauge(gpu_indices=[0], update_period=2, prometheus_url='http://localhost:9091', job='training_power_gauge')
+    # Counter to track energy consumption over time
+    energy_counter = EnergyCumulativeCounter(cpu_indices=[0,1], gpu_indices=[0], update_period=2, prometheus_url='http://localhost:9091', job='training_energy_counter')


These lines are very long. They're better off using multiple lines, like before the change.

jaywonchung · 2024-12-10T18:21:50Z

examples/prometheus/train_single.py

+        energy_histogram.begin_window("training_energy")
+        energy_histogram.end_window("training_energy")
+        train(train_loader, model, criterion, optimizer, epoch, args)


Why isn't this begin_window, train, then end_window?
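A minimal sketch of the ordering the comment asks for: open the window, run the training step, then close the window. `RecordingMetric`, `measured`, and `train_one_epoch` are hypothetical stand-ins that only record the call order; they are not part of Zeus, and the real example would call `energy_histogram.begin_window(...)`, `train(...)`, and `energy_histogram.end_window(...)` directly.

```python
from contextlib import contextmanager


class RecordingMetric:
    """Hypothetical stand-in for a window-based metric; it only logs calls."""

    def __init__(self):
        self.events = []

    def begin_window(self, name):
        self.events.append(("begin", name))

    def end_window(self, name):
        self.events.append(("end", name))


@contextmanager
def measured(metric, name):
    """Open the window before the body runs and close it afterwards."""
    metric.begin_window(name)
    try:
        yield
    finally:
        # Close the window even if the body raises.
        metric.end_window(name)


def train_one_epoch(events):
    # Placeholder for the real train(...) call in the example script.
    events.append(("train", None))


metric = RecordingMetric()
with measured(metric, "training_energy"):
    train_one_epoch(metric.events)
```

The context manager is optional; its only purpose here is to make it hard to place the work outside the window by accident.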

jaywonchung · 2024-12-10T18:22:33Z

examples/prometheus/train_single.py

        print(f"Top-1 accuracy: {acc1}")

-    # Allow metrics to capture remaining data before shutting down monitoring.


These comments are useful. Please bring them back.

jaywonchung · 2024-12-10T18:22:48Z

examples/prometheus/train_single.py

@@ -430,3 +418,4 @@ def accuracy(output, target, topk=(1,)):

 if __name__ == "__main__":
    main()
+


jaywonchung · 2024-12-10T18:25:13Z

tests/test_metric.py

-            gpu_energy={0: 30.0, 1: 35.0, 2: 40.0},
-            cpu_energy={0: 20.0, 1: 25.0},
+            gpu_energy={0: 50.0, 1: 100.0, 2: 200.0},
+            cpu_energy={0: 40.0, 1: 50.0},
            dram_energy={},


If mock CPU 0 supports DRAM energy measurement (in mock_get_cpus), shouldn't this be something like dram_energy={0: 10.0}?

The metrics would be expecting the monitor to provide DRAM energy measurements for CPU 0, but if the Measurement object has nothing, shouldn't it raise an error?
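One way to make the mismatch described above visible, sketched with hypothetical names: `FakeMeasurement` stands in for the monitor's measurement result and `check_dram_coverage` is an assumed validation helper, not an existing Zeus function. Whether the real metric should raise in this case is exactly the open question in the comment.

```python
from dataclasses import dataclass, field


@dataclass
class FakeMeasurement:
    """Hypothetical stand-in for the monitor's measurement result."""

    gpu_energy: dict = field(default_factory=dict)
    cpu_energy: dict = field(default_factory=dict)
    dram_energy: dict = field(default_factory=dict)


def check_dram_coverage(measurement, dram_capable_cpus):
    """Raise if a CPU that reports DRAM support has no DRAM reading."""
    missing = [i for i in dram_capable_cpus if i not in measurement.dram_energy]
    if missing:
        raise ValueError(f"Missing DRAM energy for CPU indices: {missing}")


# Mock data that agrees with a mock CPU 0 that supports DRAM measurement.
consistent = FakeMeasurement(
    gpu_energy={0: 50.0, 1: 100.0, 2: 200.0},
    cpu_energy={0: 40.0, 1: 50.0},
    dram_energy={0: 10.0},
)
check_dram_coverage(consistent, dram_capable_cpus=[0])  # passes silently

# Mock data with an empty dram_energy, as in the diff above.
inconsistent = FakeMeasurement(cpu_energy={0: 40.0, 1: 50.0}, dram_energy={})
try:
    check_dram_coverage(inconsistent, dram_capable_cpus=[0])
    raised = False
except ValueError:
    raised = True
```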

jaywonchung · 2024-12-10T18:28:48Z

zeus/metric.py

+
+        Args:
+            name (str): Name of the measurement window.
+            sync_execution (bool): Whether to execute synchronously. Defaults to None.


This is wrong. See ZeusMonitor.

jaywonchung · 2024-12-10T18:30:28Z

zeus/metric.py

@@ -54,6 +73,9 @@ class EnergyHistogram(Metric):
        gpu_bucket_range: Histogram buckets for GPU energy.
        cpu_bucket_range: Histogram buckets for CPU energy.
        dram_bucket_range: Histogram buckets for DRAM energy.
+        gpu_histograms: A single Prometheus Histogram metric for all GPU energy consumption, indexed by window and GPU index.
+        cpu_histograms: A single Prometheus Histogram metric for all CPU energy consumption, indexed by window and CPU index.
+        dram_histograms: A single Prometheus Histogram metric for all DRAM energy consumption, indexed by window and DRAM index.


Remove the entire Attributes section. They're not intended to be public attributes AFAIK.

For every class.

Add link to the push gateway

Co-authored-by: Jae-Won Chung <jwnchung@umich.edu>

jaywonchung · 2024-12-10T18:32:47Z

zeus/metric.py

@@ -63,7 +85,7 @@ def __init__(
        prometheus_url: str,
        job: str,
        gpu_bucket_range: Sequence[float] = [50.0, 100.0, 200.0, 500.0, 1000.0],
-        cpu_bucket_range: Sequence[float] = [10.0, 20.0, 50.0, 100.0, 200.0],
+        cpu_bucket_range: Sequence[float] = [10.0, 50.0, 100.0, 500.0, 1000.0],


You updated the default range, but you never reflected the change in any docstring or doc. (1) Update the docstring, and (2) remove the defaults listed in measure.md and instead point people to the API reference page for defaults.

Generalize the device as {component} with the note that Gauge only supports GPU

Co-authored-by: Jae-Won Chung <jwnchung@umich.edu>

Co-authored-by: Jae-Won Chung <jwnchung@umich.edu>

jaywonchung · 2024-12-10T18:34:03Z

zeus/metric.py

-        self.energy_monitor.begin_window(
-            f"__EnergyHistogram_{name}", sync_execution=True
-        )
+        self.energy_monitor.begin_window(f"__EnergyHistogram_{name}", sync_execution)


Suggested change

-self.energy_monitor.begin_window(f"__EnergyHistogram_{name}", sync_execution)
+self.energy_monitor.begin_window(f"__EnergyHistogram_{name}", sync_execution=sync_execution)

Ditto for end_window.

Co-authored-by: Jae-Won Chung <jwnchung@umich.edu>

jaywonchung · 2024-12-10T18:46:15Z

zeus/metric.py

@@ -288,28 +319,36 @@ def begin_window(self, name: str) -> None:
                self.update_period,
                self.prometheus_url,
                self.job,
+                sync_execution,


This is wrong. If sync_execution is True, you need to call zeus.utils.framework.sync_execution on the main thread where the application is running. On the other hand, the power/energy monitor process's ZeusMonitor should always be invoked with sync_execution=False.

Read ZeusMonitor to see how sync_execution (sometimes a boolean parameter and other times a function in zeus.utils.framework) is being used.
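A rough sketch of the split described above, under stated assumptions: `fake_sync_execution` stands in for the framework sync helper, `polling_worker` stands in for the monitoring process, and a thread is used in place of a process so the example stays self-contained. None of these names come from Zeus. The point is only where each call happens: the sync runs in the caller, and the monitor side is always told `sync_execution=False`.

```python
import threading

sync_calls = []      # names of threads in which the sync helper ran
monitor_calls = []   # (window name, sync_execution flag, thread name)


def fake_sync_execution(gpu_indices):
    """Stand-in for the framework sync helper; records the calling thread."""
    sync_calls.append(threading.current_thread().name)


def polling_worker(name):
    """Stand-in for the monitoring process; its monitor never syncs."""
    # Represents monitor.begin_window(name, sync_execution=False).
    monitor_calls.append((name, False, threading.current_thread().name))


def begin_window(name, gpu_indices, sync_execution=True):
    if sync_execution:
        # Synchronize in the caller, where the application's work was issued.
        fake_sync_execution(gpu_indices)
    # The flag is deliberately not forwarded to the worker.
    worker = threading.Thread(target=polling_worker, args=(name,), name="monitor")
    worker.start()
    return worker


worker = begin_window("training_power", gpu_indices=[0], sync_execution=True)
worker.join()
```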

jaywonchung · 2024-12-10T18:47:09Z

zeus/metric.py

+        self.window_state[name] = MonitoringProcessState(
+            queue=self.queue, proc=self.proc
+        )


Why are you putting these in self.queue and self.proc??

jaywonchung · 2024-12-10T18:55:28Z

zeus/metric.py

+        if self.queue is not None:
+            self.queue.put("stop")
+        else:
+            raise RuntimeError("Queue is not initialized")


self.queue can be the queue from any random window??? More specifically, it's going to be the queue that belongs to the most recently started window.

This level of quality is completely unacceptable. Please re-check the correctness of every line of code and documentation, and then ask for review.
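A minimal sketch of keeping the queue and worker per window instead of on `self.queue` and `self.proc`, so that ending a window stops the worker that belongs to that window. `WindowState` and `WindowedPoller` are hypothetical names, and `threading` plus `queue.Queue` are used here as stand-ins for the multiprocessing primitives so the example runs anywhere.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass
class WindowState:
    """Per-window handle: the stop queue and worker that belong to one window."""

    stop_queue: queue.Queue
    worker: threading.Thread


class WindowedPoller:
    def __init__(self):
        self.window_state = {}  # window name -> WindowState
        self.stopped = []       # names of windows whose worker saw "stop"

    def _poll(self, name, stop_queue):
        # Block until this window's own queue delivers the stop signal.
        if stop_queue.get() == "stop":
            self.stopped.append(name)

    def begin_window(self, name):
        if name in self.window_state:
            raise ValueError(f"Window '{name}' already exists")
        stop_queue = queue.Queue()
        worker = threading.Thread(target=self._poll, args=(name, stop_queue))
        worker.start()
        # Nothing is stored on self directly; state is looked up by name.
        self.window_state[name] = WindowState(stop_queue, worker)

    def end_window(self, name):
        state = self.window_state.pop(name, None)
        if state is None:
            raise ValueError(f"No window named '{name}'")
        state.stop_queue.put("stop")
        state.worker.join()


poller = WindowedPoller()
poller.begin_window("a")
poller.begin_window("b")
poller.end_window("a")  # stops window "a", not the most recent window "b"
stopped_after_first = list(poller.stopped)
poller.end_window("b")
```

With a single shared `self.queue`, the first `end_window("a")` call would have signalled the worker for window "b", which is the failure mode the comment points out.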

jaywonchung

Left some comments on changes to make. I think they will be more or less straightforward ones. Let's hope this is the final round of change requests. Thanks!


sharonsyh added 2 commits October 18, 2024 01:25

Add metric.py, prometheus configs, and modify pyproject.toml

9bf5ce9

Reformat metric.py with black

dd881e8

sharonsyh changed the title ~~Prometheus Integration~~ Prometheus Integration - Branch Updated Oct 18, 2024

sharonsyh added 3 commits November 9, 2024 13:58

Add metric monitoring section to documentation

d21cfd1

Add unit tests for EnergyHistogram, EnergyCumulativeCounter, and PowerGauge

4681796

Add train_single.py for testing energy monitoring metrics

e8bfe7b

jaywonchung reviewed Nov 11, 2024


jaywonchung changed the title ~~Prometheus Integration - Branch Updated~~ [Feat] Prometheus metric export Nov 13, 2024

Update docs/measure/index.md

1b9e541

Co-authored-by: Jae-Won Chung <jwnchung@umich.edu>

jaywonchung linked an issue Nov 23, 2024 that may be closed by this pull request

[RFC] Integration of Prometheus Push Gateway and Energy Metrics Collection in Zeus #125

Open

jaywonchung reviewed Nov 23, 2024


examples/prometheus/requirements.txt (outdated)

jaywonchung reviewed Nov 23, 2024


docs/measure/index.md

sharonsyh and others added 12 commits November 28, 2024 21:42

Update prometheus.yml

3569b68

Adjust target names to standardize pushgateway references, ensuring consistency with the Docker Compose configuration.

Improve example training script to include Zeus metrics

29e615b

Remove unintended file tests/test_metric.py from repository

2ae388f

Update the doc on Metrics Monitoring and Assumptions

6a9daa5

Update index.md

69c42da

Update index.md

4704a67

Add README for example training file with Zeus energy metrics integration

5666ba5

Add Metric Name Construction section on index.md

863f257

Update index.md

35ab267

Update README.md to include the dependency on prometheus_client

4aa0f39

Update unit tests for the modified metric.py

1e996a4

sharonsyh requested a review from jaywonchung November 30, 2024 05:23

sharonsyh added 5 commits November 30, 2024 01:00

Fix formatting issues detected by black

8e1d35b

Fix formatting issues detected by black

7e47fbb

Fix formatting issues detected by black

52f2fa7

Fix formatting issues detected by black

30b807e

Resolve unbound variable errors

c53c72e