
[Feat] Prometheus metric export #134

Open. Wants to merge 54 commits into base: master.

Commits (54)
9bf5ce9
Add metric.py, prometheus configs, and modify pyproject.toml
sharonsyh Oct 18, 2024
dd881e8
Reformat metric.py with black
sharonsyh Oct 18, 2024
d21cfd1
Add metric monitoring section to documentation
sharonsyh Nov 9, 2024
4681796
Add unit tests for EnergyHistogram, EnergyCumulativeCounter, and Powe…
sharonsyh Nov 9, 2024
e8bfe7b
Add train_single.py for testing energy monitoring metrics
sharonsyh Nov 11, 2024
1b9e541
Update docs/measure/index.md
sharonsyh Nov 17, 2024
26e925e
Refactor metric initialization and multiprocessing logic in metric.py
sharonsyh Nov 29, 2024
3569b68
Update prometheus.yml
sharonsyh Nov 29, 2024
29e615b
Improve example training script to include Zeus metrics
sharonsyh Nov 29, 2024
2ae388f
Remove unintended file tests/test_metric.py from repository
sharonsyh Nov 29, 2024
6a9daa5
Update the doc on Metrics Monitoring and Assumptions
sharonsyh Nov 29, 2024
69c42da
Update index.md
sharonsyh Nov 29, 2024
4704a67
Update index.md
sharonsyh Nov 29, 2024
5666ba5
Add README for example training file with Zeus energy metrics integra…
sharonsyh Nov 29, 2024
863f257
Add Metric Name Construction section on index.md
sharonsyh Nov 29, 2024
35ab267
Update index.md
sharonsyh Nov 29, 2024
4aa0f39
Update README.md to include the dependency on prometheus_client
sharonsyh Nov 29, 2024
1e996a4
Update unit tests for the modified metric.py
sharonsyh Nov 30, 2024
8e1d35b
Fix formatting issues detected by black
sharonsyh Nov 30, 2024
7e47fbb
Fix formatting issues detected by black
sharonsyh Nov 30, 2024
52f2fa7
Fix formatting issues detected by black
sharonsyh Nov 30, 2024
30b807e
Fix formatting issues detected by black
sharonsyh Nov 30, 2024
c53c72e
Resolve unbound variable errors
sharonsyh Nov 30, 2024
ffa46d1
Resolve unbound variable errors
sharonsyh Nov 30, 2024
5f67d5c
Specify type for the args
sharonsyh Nov 30, 2024
b96b32e
Fix formatting issues detected by black
sharonsyh Nov 30, 2024
9aa1888
Merge master
jaywonchung Nov 30, 2024
b62ed5d
Fix energy histogram to properly handle default bucket ranges
sharonsyh Nov 30, 2024
0dbb236
Add the mock_push_to_gateway Patch to each test
sharonsyh Nov 30, 2024
753f1de
Update gpu_bucket_range, cpu_bucket_range, and dram_bucket_range in t…
sharonsyh Nov 30, 2024
77ff075
Patch to mock urllib.request.urlopen preventing attempts to an actual…
sharonsyh Dec 1, 2024
793a186
Patch to mock urllib.request.urlopen preventing attempts to an actual…
sharonsyh Dec 1, 2024
30b7e1c
Patch to mock prometheus_client.exposition.push_to_gateway external c…
sharonsyh Dec 1, 2024
3ab4d89
Patch to http.client.HTTPConnection
sharonsyh Dec 1, 2024
f032488
Remove unneccessary mock
sharonsyh Dec 1, 2024
73d8c8c
Add zeus.metric to the list
sharonsyh Dec 6, 2024
c153ef2
Update reference link for each class
sharonsyh Dec 6, 2024
0aa03d4
Move line for prometheus-client
sharonsyh Dec 6, 2024
0e0e63c
feat: Add multiprocessing dict and sync execution for begin/end window
sharonsyh Dec 6, 2024
9228b74
Add error handling for queue
sharonsyh Dec 6, 2024
866365e
Add a call to train() in main
sharonsyh Dec 6, 2024
53f6c62
Refactor tests for the modified code
sharonsyh Dec 7, 2024
c014ff1
Reformat the metric monitoring section for consistency
sharonsyh Dec 9, 2024
f7e5d79
Setup Guide -> Local Setup Guide
sharonsyh Dec 9, 2024
f8d5b67
Add condition for using put() with empty queue
sharonsyh Dec 9, 2024
8c5456e
Import the SpawnProcess class from multiprocessing.context
sharonsyh Dec 9, 2024
4c2e794
Update docs/measure/index.md
sharonsyh Dec 10, 2024
d8a6f1c
Update docs/measure/index.md
sharonsyh Dec 10, 2024
d85f255
Update docs/measure/index.md
sharonsyh Dec 10, 2024
5f9cc6b
Update docs/measure/index.md
sharonsyh Dec 10, 2024
0f8d550
Update docs/measure/index.md
sharonsyh Dec 10, 2024
49acc9a
Update docs/measure/index.md
sharonsyh Dec 10, 2024
f18ecb9
Update docs/measure/index.md
sharonsyh Dec 10, 2024
2276ac2
Remove power_limit_optimizer and bring back the original code for ima…
sharonsyh Dec 10, 2024
25 changes: 25 additions & 0 deletions docker/prometheus/docker-compose.yml
@@ -0,0 +1,25 @@
version: '3.7'
services:
prometheus:
image: prom/prometheus
volumes:
- "./prometheus.yml:/etc/prometheus/prometheus.yml"
networks:
- localprom
ports:
- 9090:9090
node-exporter:
image: prom/node-exporter
networks:
- localprom
ports:
- 9100:9100
pushgateway:
image: prom/pushgateway
networks:
- localprom
ports:
- 9091:9091
networks:
localprom:
driver: bridge
14 changes: 14 additions & 0 deletions docker/prometheus/prometheus.yml
@@ -0,0 +1,14 @@
global:
scrape_interval: 15s

scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'pushgateway'
static_configs:
- targets: ['pushgateway:9091']
- job_name: 'node'
static_configs:
- targets: ['node-exporter:9100']

234 changes: 233 additions & 1 deletion docs/measure/index.md
@@ -88,7 +88,7 @@ To only measure the energy consumption of the CPU used by the current Python pro

You can pass in `cpu_indices=[]` or `gpu_indices=[]` to [`ZeusMonitor`][zeus.monitor.ZeusMonitor] to disable either CPU or GPU measurements.

```python hl_lines="2 5-7"
```python hl_lines="2 5-15"
from zeus.monitor import ZeusMonitor
from zeus.device.cpu import get_current_cpu_index

@@ -114,6 +114,218 @@ if __name__ == "__main__":
avg_energy = sum(map(lambda m: m.total_energy, steps)) / len(steps)
print(f"One step takes {avg_time} s and {avg_energy} J for the CPU.")
```
## Metric Monitoring

Zeus allows for efficient monitoring of energy and power consumption for GPUs, CPUs, and DRAM using Prometheus. It tracks key metrics such as energy usage, power draw, and cumulative consumption. Users can define measurement windows to track energy usage for specific operations, enabling granular analysis and optimization.

!!! Assumption
A [Prometheus Push Gateway](https://prometheus.io/docs/instrumenting/pushing/) must be deployed and accessible. This ensures that metrics collected by Zeus can be pushed to Prometheus.

### Local Setup Guide

#### Step 1: Install and Start the Prometheus Push Gateway
Choose either Option 1 (Binary) or Option 2 (Docker).

##### Option 1: Download Binary
1. Visit the [Prometheus Push Gateway Download Page](https://prometheus.io/download/#pushgateway).
2. Download the appropriate binary for your operating system.
3. Extract the binary:
```sh
tar -xvzf prometheus-pushgateway*.tar.gz
cd prometheus-pushgateway-*
```
4. Start the Push Gateway:
```sh
./prometheus-pushgateway --web.listen-address=:9091
```
5. Verify the Push Gateway is running by visiting http://localhost:9091 in your browser.

##### Option 2: Using Docker
1. Pull the official Prometheus Push Gateway Docker image:
```sh
docker pull prom/pushgateway
```
2. Run the Push Gateway in a container:
```sh
docker run -d -p 9091:9091 prom/pushgateway
```
3. Verify it is running by visiting http://localhost:9091 in your browser.

#### Step 2: Install and Configure Prometheus
1. Visit the [Prometheus Download Page](https://prometheus.io/download/#prometheus).
2. Download the appropriate binary for your operating system.
3. Extract the binary:
```sh
tar -xvzf prometheus*.tar.gz
cd prometheus-*
```
4. Update the Prometheus configuration file `prometheus.yml` to scrape metrics from the Push Gateway:
```yaml
scrape_configs:
- job_name: 'pushgateway'
honor_labels: true
static_configs:
- targets: ['localhost:9091'] # Replace with your Push Gateway URL
```
5. Start Prometheus:
```sh
./prometheus --config.file=prometheus.yml
```
6. Verify Prometheus is running by visiting http://localhost:9090 in your browser, or by running `curl http://localhost:9090/api/v1/targets` to check that the Push Gateway target is being scraped.

### Metric Name Construction

Zeus organizes metrics using **static metric names** and **dynamic labels** for flexibility and ease of querying in Prometheus. Metric names are static and cannot be overridden, but users can customize the context of the metrics by naming the window when using `begin_window()` and `end_window()`.

#### Metric Name
- For Histogram: `energy_monitor_{component}_energy_joules`
- For Counter: `energy_monitor_{component}_energy_joules`
- For Gauge: `power_monitor_{component}_power_watts`

Note that Gauge only supports the GPU component at the moment. Tracking issue: [#128](https://github.com/ml-energy/zeus/issues/128)


Here, `{component}` is one of `gpu`, `cpu`, or `dram`.

#### Labels
- window: The user-defined window name passed to `begin_window()` and `end_window()` (e.g., `energy_histogram.begin_window("epoch_energy")`).
- index: The index of the device (e.g., `0` for GPU 0).
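Putting the naming scheme together, the construction can be sketched as a small helper. This is a hypothetical function for illustration only; Zeus does not expose it:

```python
def metric_name(kind: str, component: str) -> str:
    """Build a Zeus-style metric name from the metric kind and the component.

    kind: "histogram", "counter", or "gauge"
    component: "gpu", "cpu", or "dram"
    """
    if component not in ("gpu", "cpu", "dram"):
        raise ValueError(f"unknown component: {component}")
    if kind in ("histogram", "counter"):
        # Histograms and Counters share the same energy metric name.
        return f"energy_monitor_{component}_energy_joules"
    if kind == "gauge":
        if component != "gpu":
            raise ValueError("Gauge currently supports only the GPU component")
        return f"power_monitor_{component}_power_watts"
    raise ValueError(f"unknown kind: {kind}")

print(metric_name("histogram", "gpu"))  # energy_monitor_gpu_energy_joules
print(metric_name("gauge", "gpu"))      # power_monitor_gpu_power_watts
```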

### Usage and Initialization
[`EnergyHistogram`][zeus.metric.EnergyHistogram] records energy consumption data for GPUs, CPUs, and DRAM in Prometheus Histograms. This is ideal for observing how often energy usage falls within specific ranges.

```python hl_lines="2 5-15"
from zeus.metric import EnergyHistogram

if __name__ == "__main__":
# Initialize EnergyHistogram
energy_histogram = EnergyHistogram(
cpu_indices=[0],
gpu_indices=[0],
prometheus_url='http://localhost:9091',
job='training_energy_histogram'
)

for epoch in range(100):
# Start monitoring energy for the entire epoch
energy_histogram.begin_window("epoch_energy")
# Perform epoch-level operations
train_one_epoch(train_loader, model, optimizer, criterion, epoch, args)
acc1 = validate(val_loader, model, criterion, args)
# End monitoring energy for the epoch
energy_histogram.end_window("epoch_energy")
print(f"Epoch {epoch} completed. Validation Accuracy: {acc1}%")

```
You can use the `begin_window` and `end_window` methods to define a measurement window, similar to other ZeusMonitor operations. Energy consumption data will be recorded for the entire duration of the window.

!!! Tip
You can customize the bucket ranges for GPUs, CPUs, and DRAM during initialization to tailor the granularity of energy monitoring. For example:
    ```python
energy_histogram = EnergyHistogram(
cpu_indices=[0],
gpu_indices=[0],
prometheus_url="http://localhost:9091",
job="training_energy_histogram",
gpu_bucket_range=[10.0, 25.0, 50.0, 100.0],
cpu_bucket_range=[5.0, 15.0, 30.0, 50.0],
dram_bucket_range=[2.0, 8.0, 20.0, 40.0],
)
```

If no custom bucket ranges are specified, Zeus uses the following defaults:

- GPU: `[50.0, 100.0, 200.0, 500.0, 1000.0]`
- CPU: `[10.0, 20.0, 50.0, 100.0, 200.0]`
- DRAM: `[5.0, 10.0, 20.0, 50.0, 150.0]`

!!! Warning
    Empty bucket ranges (e.g., `[]`) are not allowed and will raise an error. Provide a valid range for each device type, or use the defaults.
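
Conceptually, each recorded energy value lands in the first bucket whose upper bound is at least the value (Prometheus histogram buckets are cumulative, keyed by `le`). A standard-library sketch of that bucketing, using the default GPU ranges above:

```python
import bisect

# Default GPU bucket upper bounds from the documentation above.
GPU_BUCKETS = [50.0, 100.0, 200.0, 500.0, 1000.0]

def bucket_for(value, bounds):
    """Return the upper bound of the smallest bucket containing value, or '+Inf'."""
    i = bisect.bisect_left(bounds, value)
    return bounds[i] if i < len(bounds) else "+Inf"

print(bucket_for(75.0, GPU_BUCKETS))    # 100.0
print(bucket_for(50.0, GPU_BUCKETS))    # 50.0 (bounds are inclusive, like 'le')
print(bucket_for(1500.0, GPU_BUCKETS))  # +Inf
```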


[`EnergyCumulativeCounter`][zeus.metric.EnergyCumulativeCounter] monitors cumulative energy consumption. It tracks energy usage over time, without resetting the values, and is updated periodically.

```python hl_lines="2 5-15"

from zeus.metric import EnergyCumulativeCounter

if __name__ == "__main__":

cumulative_counter_metric = EnergyCumulativeCounter(
cpu_indices=[0],
gpu_indices=[0],
update_period=2,
prometheus_url='http://localhost:9091',
job='energy_counter_job'
)
train_loader = range(10)
val_loader = range(5)

cumulative_counter_metric.begin_window("training_energy_monitoring")

for epoch in range(100):
print(f"\n--- Epoch {epoch} ---")
train_one_epoch(train_loader, model, optimizer, criterion, epoch, args)
acc1 = validate(val_loader, model, criterion, args)
print(f"Epoch {epoch} completed. Validation Accuracy: {acc1:.2f}%.")

# Simulate additional operations outside of training
print("\nSimulating additional operations...")
time.sleep(10)

cumulative_counter_metric.end_window("training_energy_monitoring")
```
In this example, `cumulative_counter_metric` monitors energy usage throughout the entire training process rather than on a per-epoch basis. The `update_period` parameter defines how often the energy measurements are updated and pushed to Prometheus.
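The periodic update behind a cumulative counter can be sketched roughly as follows. This is a simplified, hardware-free stand-in: the real `EnergyCumulativeCounter` runs its loop in a separate process and pushes to the Prometheus Push Gateway:

```python
import time

def run_counter_loop(read_energy, push, update_period, steps):
    """Periodically read a monotonically increasing energy total and push it.

    read_energy: callable returning cumulative Joules since the window began
    push: callable receiving the latest total (stand-in for a Prometheus push)
    update_period: seconds to wait between updates
    steps: number of update iterations to run
    """
    total = 0.0
    for _ in range(steps):
        total = read_energy()  # counters only ever grow
        push(total)
        time.sleep(update_period)
    return total

# Usage with fake readings (no GPUs needed):
readings = iter([10.0, 25.0, 42.0])
pushed = []
final = run_counter_loop(lambda: next(readings), pushed.append, 0.0, 3)
print(final)   # 42.0
print(pushed)  # [10.0, 25.0, 42.0]
```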

[`PowerGauge`][zeus.metric.PowerGauge] tracks real-time GPU power consumption using Prometheus Gauges, which are well suited to values that fluctuate over time, such as power draw.

```python hl_lines="2 5-15"
from zeus.metric import PowerGauge

if __name__ == "__main__":

power_gauge_metric = PowerGauge(
gpu_indices=[0],
update_period=2,
prometheus_url='http://localhost:9091',
job='power_gauge_job'
)
train_loader = range(10)
val_loader = range(5)

power_gauge_metric.begin_window("training_power_monitoring")

for epoch in range(100):
print(f"\n--- Epoch {epoch} ---")
train_one_epoch(train_loader, model, optimizer, criterion, epoch, args)
acc1 = validate(val_loader, model, criterion, args)
print(f"Epoch {epoch} completed. Validation Accuracy: {acc1:.2f}%.")

# Simulate additional operations outside of training
print("\nSimulating additional operations...")
time.sleep(10)

power_gauge_metric.end_window("training_power_monitoring")
```
The `update_period` parameter defines how often the power readings are updated and pushed to Prometheus.
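
In contrast to a counter, a gauge does not accumulate: each update simply overwrites the previous value, so it can move both up and down. A tiny sketch of that semantics:

```python
def latest_gauge_value(readings):
    """A gauge reflects only the newest sample; history is not accumulated."""
    gauge = None
    for watts in readings:
        gauge = watts  # each update overwrites the previous value
    return gauge

# Power draw rises, then falls; the gauge ends at the last reading.
print(latest_gauge_value([120.0, 250.0, 180.0]))  # 180.0
```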


### How to Query Metrics in Prometheus

Energy for a specific window:
```promql
energy_monitor_gpu_energy_joules{window="epoch_energy"}
```

Sum of energy for a specific window:
```promql
sum(energy_monitor_gpu_energy_joules) by (window)
```

Sum of energy for specific GPU across all windows:
```promql
sum(energy_monitor_gpu_energy_joules{index="0"})
```
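
The same queries can also be issued programmatically through Prometheus's HTTP API (`GET /api/v1/query`). A standard-library sketch that only builds the request URL, assuming Prometheus at `localhost:9090` (the request is not actually sent here):

```python
from urllib.parse import urlencode

def prometheus_query_url(base, promql):
    """Build a GET URL for Prometheus's instant-query HTTP API."""
    return f"{base}/api/v1/query?" + urlencode({"query": promql})

url = prometheus_query_url(
    "http://localhost:9090",
    "sum(energy_monitor_gpu_energy_joules) by (window)",
)
print(url)
# http://localhost:9090/api/v1/query?query=sum%28energy_monitor_gpu_energy_joules%29+by+%28window%29
```

Fetching `url` with `urllib.request.urlopen` would return a JSON body whose `data.result` field holds one series per `window` label.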

## CLI power and energy monitor

@@ -149,3 +361,23 @@
Total energy (J):
{'GPU0': 198.52566362297537, 'GPU1': 206.22215216255188, 'GPU2': 201.08565518283845, 'GPU3': 201.79834523367884}
```

## Hardware Support
We currently support both NVIDIA GPUs (via NVML) and AMD GPUs (via AMD SMI, with ROCm 6.1 or later).

### `get_gpus`
The [`get_gpus`][zeus.device.get_gpus] function returns a [`GPUs`][zeus.device.gpu.GPUs] object, which can be either an [`NVIDIAGPUs`][zeus.device.gpu.NVIDIAGPUs] or [`AMDGPUs`][zeus.device.gpu.AMDGPUs] object depending on the availability of `nvml` or `amdsmi`. Each [`GPUs`][zeus.device.gpu.GPUs] object contains one or more [`GPU`][zeus.device.gpu.common.GPU] instances, which are specifically [`NVIDIAGPU`][zeus.device.gpu.nvidia.NVIDIAGPU] or [`AMDGPU`][zeus.device.gpu.amd.AMDGPU] objects.

These [`GPU`][zeus.device.gpu.common.GPU] objects directly call respective `nvml` or `amdsmi` methods, providing a one-to-one mapping of methods for seamless GPU abstraction and support for multiple GPU types. For example:
- [`NVIDIAGPU.getName`][zeus.device.gpu.nvidia.NVIDIAGPU.getName] calls `pynvml.nvmlDeviceGetName`.
- [`AMDGPU.getName`][zeus.device.gpu.amd.AMDGPU.getName] calls `amdsmi.amdsmi_get_gpu_asic_info`.
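
The mapping can be pictured as a thin wrapper layer. The sketch below uses fake device classes in place of the real ones, which delegate to `pynvml` and `amdsmi`:

```python
class GPU:
    """Common interface; concrete subclasses delegate to a vendor library."""
    def getName(self) -> str:
        raise NotImplementedError

class FakeNVIDIAGPU(GPU):
    # The real NVIDIAGPU would call pynvml.nvmlDeviceGetName(handle).
    def getName(self) -> str:
        return "NVIDIA A40"

class FakeAMDGPU(GPU):
    # The real AMDGPU would call amdsmi.amdsmi_get_gpu_asic_info(handle).
    def getName(self) -> str:
        return "AMD MI250"

def get_gpus(vendor: str) -> list:
    """Pick the concrete implementation; the real get_gpus detects availability."""
    return [FakeNVIDIAGPU()] if vendor == "nvidia" else [FakeAMDGPU()]

print([g.getName() for g in get_gpus("nvidia")])  # ['NVIDIA A40']
```

Callers program against the `GPU` interface only, so the same monitoring code works on either vendor.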

### Notes on AMD GPUs

#### AMD GPUs Initialization
`amdsmi.amdsmi_get_energy_count` sometimes returns invalid values on certain GPUs or ROCm versions (e.g., MI100 on ROCm 6.2). See [ROCm issue #38](https://github.com/ROCm/amdsmi/issues/38) for details. During [`AMDGPUs`][zeus.device.gpu.AMDGPUs] object initialization, we call `amdsmi.amdsmi_get_energy_count` twice for each GPU, with a 0.5-second delay between the calls. The difference between the two readings is compared against power measurements over the same interval to determine whether `amdsmi.amdsmi_get_energy_count` is stable and reliable. Initialization takes 0.5 seconds regardless of the number of AMD GPUs.
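
That reliability check can be sketched as follows. The reader callables are hypothetical stand-ins for the actual `amdsmi` calls, and the 10% tolerance is an illustrative choice:

```python
import time

def energy_counter_is_reliable(read_energy_uj, read_power_w, delay_s=0.5, tolerance=0.1):
    """Read the energy counter twice and compare its implied average power
    against a direct power reading; reject the counter if they disagree.

    read_energy_uj: callable returning the cumulative energy counter (microjoules)
    read_power_w: callable returning the measured power draw (watts)
    """
    e1 = read_energy_uj()
    time.sleep(delay_s)
    e2 = read_energy_uj()
    implied_w = (e2 - e1) / 1e6 / delay_s  # microjoule delta -> average watts
    expected_w = read_power_w()
    if expected_w <= 0:
        return False
    return abs(implied_w - expected_w) / expected_w <= tolerance

# Plausible counter: 300 W over 0.5 s adds 150 J = 1.5e8 microjoules.
ticks = iter([0.0, 1.5e8])
print(energy_counter_is_reliable(lambda: next(ticks), lambda: 300.0))  # True
```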

`amdsmi.amdsmi_get_power_info` provides `average_socket_power` and `current_socket_power` fields, but the `current_socket_power` field is sometimes unsupported and returns `N/A`. During [`AMDGPUs`][zeus.device.gpu.AMDGPUs] object initialization, this method is checked, and if `N/A` is returned, [`AMDGPU.getInstantPowerUsage`][zeus.device.gpu.amd.AMDGPU.getInstantPowerUsage] is disabled and [`AMDGPU.getAveragePowerUsage`][zeus.device.gpu.amd.AMDGPU.getAveragePowerUsage] should be used instead.

#### Supported AMD SMI Versions
Only ROCm 6.1 and later are supported, as the AMD SMI power and energy APIs return incorrect values on earlier versions. For more information, see [ROCm issue #22](https://github.com/ROCm/amdsmi/issues/22). Ensure your `amdsmi` and ROCm versions are up to date.
2 changes: 1 addition & 1 deletion examples/pipeline_frequency_optimizer/profile_p2p.py
@@ -1,4 +1,4 @@
"""Profile the power cosumtion of the GPU while waiting on P2P communication."""
"""Profile the power consumption of the GPU while waiting on P2P communication."""

import os
import time
2 changes: 1 addition & 1 deletion examples/power_limit_optimizer/README.md
@@ -7,7 +7,7 @@ The former script is for simple single GPU training, whereas the latter is for d

## Dependencies

All packages (including torchvision) are pre-installed if you're using our [Docker image](https://ml.energy/zeus/getting_started/environment/).
All packages (including torchvision) are pre-installed if you're using our [Docker image](https://ml.energy/zeus/getting_started/#using-docker).
You just need to download and extract the ImageNet data and mount it to the Docker container with the `-v` option (first step below).

1. Download the ILSVRC2012 dataset from [the ImageNet homepage](http://www.image-net.org/).
55 changes: 55 additions & 0 deletions examples/prometheus/README.md
@@ -0,0 +1,55 @@
# Integrating Prometheus metric monitoring with ImageNet training

This example will demonstrate how to integrate Zeus with `torchvision` and the ImageNet dataset.

[`train_single.py`](train_single.py) and [`train_dp.py`](train_dp.py) were adapted and simplified from [PyTorch's example training code for ImageNet](https://github.com/pytorch/examples/blob/main/imagenet/main.py).
The former script is for simple single GPU training, whereas the latter is for data parallel training with PyTorch DDP and [`torchrun`](https://pytorch.org/docs/stable/elastic/run.html).

## Dependencies

All packages (including torchvision and prometheus_client) are pre-installed if you're using our [Docker image](https://ml.energy/zeus/getting_started/environment/).
You just need to download and extract the ImageNet data and mount it to the Docker container with the `-v` option (first step below).

1. Download the ILSVRC2012 dataset from [the ImageNet homepage](http://www.image-net.org/).
Then, extract archives using [this script](https://github.com/pytorch/examples/blob/main/imagenet/extract_ILSVRC.sh) provided by PyTorch.
1. Install `zeus` and build the power monitor, following [Installing and Building](https://ml.energy/zeus/getting_started/installing_and_building/).
1. Install `torchvision`:
```sh
pip install torchvision==0.15.2
```
1. Install `prometheus_client` through the Zeus `prometheus` extra:
```sh
pip install zeus-ml[prometheus]
```

## EnergyHistogram, PowerGauge, and EnergyCumulativeCounter
- [`EnergyHistogram`](https://ml.energy/zeus/reference/metric/#zeus.metric.EnergyHistogram): Records energy consumption data for GPUs, CPUs, and DRAM and pushes the data to Prometheus as histogram metrics. This is useful for tracking energy usage distribution over time.
- [`PowerGauge`](https://ml.energy/zeus/reference/metric/#zeus.metric.PowerGauge): Monitors real-time GPU power usage and pushes the data to Prometheus as gauge metrics, which are updated at regular intervals.
- [`EnergyCumulativeCounter`](https://ml.energy/zeus/reference/metric/#zeus.metric.EnergyCumulativeCounter): Tracks cumulative energy consumption over time for CPUs and GPUs and pushes the results to Prometheus as counter metrics.

## `ZeusMonitor` and `GlobalPowerLimitOptimizer`

- [`ZeusMonitor`](http://ml.energy/zeus/reference/monitor/#zeus.monitor.ZeusMonitor): Measures the GPU time and energy consumption of arbitrary code blocks.
- [`GlobalPowerLimitOptimizer`](https://ml.energy/zeus/reference/optimizer/power_limit/#zeus.optimizer.power_limit.GlobalPowerLimitOptimizer): Online-profiles each power limit with `ZeusMonitor` and finds the cost-optimal power limit.

## Example command

You can specify the maximum training time slowdown factor (1.0 means no slowdown) by setting `ZEUS_MAX_SLOWDOWN`. The default is set to 1.1 in this example script, meaning the lowest power limit that keeps training time inflation within 10% will be automatically found.
`GlobalPowerLimitOptimizer` supports other optimal power limit selection strategies. See [here](https://ml.energy/zeus/reference/optimizer/power_limit).

```bash
# Single-GPU
python train_single.py \
[DATA_DIR] \
--gpu 0 `# Specify the GPU id to use`

# Multi-GPU Data Parallel
torchrun \
--nnodes 1 \
--nproc_per_node gpu `# Number of processes per node, should be equal to the number of GPUs.` \
`# When set to 'gpu', it means use all the GPUs available.` \
train_dp.py \
[DATA_DIR]
```


2 changes: 2 additions & 0 deletions examples/prometheus/requirements.txt
@@ -0,0 +1,2 @@
torch
torchvision