Commit

Users/prachisingla/adding more metrics for nvidia smi (#280)
* Adding additional metrics to nvidia-smi

* Fixing unit tests

* Improving parser

* Fixing default value

---------

Co-authored-by: Prachi Singla <prachisingla@microsoft.com>
psingla1210 and Prachi Singla authored Mar 19, 2024
1 parent 4472660 commit 8adcd16
Showing 8 changed files with 397 additions and 63 deletions.
@@ -0,0 +1,2 @@
timestamp, name, pci.bus_id, driver_version, pstate, pcie.link.gen.max, pcie.link.gen.current, utilization.gpu [%], utilization.memory [%], temperature.gpu, temperature.memory, power.draw.average [W], clocks.current.graphics [MHz], clocks.current.sm [MHz], clocks.current.video [MHz], clocks.current.memory [MHz], memory.total [MiB], memory.free [MiB], memory.used [MiB], power.draw.instant [W], pcie.link.gen.gpucurrent, pcie.link.width.current, ecc.errors.corrected.volatile.device_memory, ecc.errors.corrected.volatile.dram, ecc.errors.corrected.volatile.sram, ecc.errors.corrected.volatile.total, ecc.errors.corrected.aggregate.device_memory, ecc.errors.corrected.aggregate.dram, ecc.errors.corrected.aggregate.sram, ecc.errors.corrected.aggregate.total, ecc.errors.uncorrected.volatile.device_memory, ecc.errors.uncorrected.volatile.dram, ecc.errors.uncorrected.volatile.sram, ecc.errors.uncorrected.volatile.total, ecc.errors.uncorrected.aggregate.device_memory, ecc.errors.uncorrected.aggregate.dram, ecc.errors.uncorrected.aggregate.sram, ecc.errors.uncorrected.aggregate.total
2024/03/15 13:57:59.791, NVIDIA H100 80GB HBM3, 00000000:0A:00.0, 535.161.07, P0, 5, 5, 0, 0, 26, 35, 70.89, 345, 345, 765, 2619, 81559, 81007, 0, 70.68, 5, 16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
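The fixture above is comma-separated, with a space after each comma and units embedded in some column names (for example `power.draw.average [W]`). As a rough illustration of how such output could be read, here is a minimal Python sketch. It is not the parser from this commit; the function names `parse_nvidia_smi_csv` and `_to_number` are hypothetical, and the sample is an abbreviated subset of the columns in the fixture.

```python
import csv
import io

# Abbreviated subset of the fixture's columns and first data row.
SAMPLE = """\
timestamp, name, pci.bus_id, driver_version, pstate, utilization.gpu [%], temperature.gpu, power.draw.average [W], memory.total [MiB]
2024/03/15 13:57:59.791, NVIDIA H100 80GB HBM3, 00000000:0A:00.0, 535.161.07, P0, 0, 26, 70.89, 81559
"""


def _to_number(value):
    """Return value as a float when it parses as one, else the original string."""
    try:
        return float(value)
    except ValueError:
        return value


def parse_nvidia_smi_csv(text):
    """Parse CSV text (header row first) into one dict per data row."""
    # skipinitialspace drops the space that follows each comma.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    rows = [row for row in reader if row]
    if not rows:
        return []
    header = [name.strip() for name in rows[0]]
    return [
        {name: _to_number(raw.strip()) for name, raw in zip(header, row)}
        for row in rows[1:]
    ]


records = parse_nvidia_smi_csv(SAMPLE)
```

Fields that are not plain numbers, such as the timestamp, bus ID, driver version, and `P0`, stay as strings, so the caller decides how to interpret them.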

This file was deleted.

This file was deleted.

@@ -0,0 +1,9 @@
timestamp, name, pci.bus_id, driver_version, pstate, pcie.link.gen.max, pcie.link.gen.current, utilization.gpu [%], utilization.memory [%], temperature.gpu, temperature.memory, power.draw.average [W], clocks.current.graphics [MHz], clocks.current.sm [MHz], clocks.current.video [MHz], clocks.current.memory [MHz], memory.total [MiB], memory.free [MiB], memory.used [MiB], power.draw.instant [W], pcie.link.gen.gpucurrent, pcie.link.width.current, ecc.errors.corrected.volatile.device_memory, ecc.errors.corrected.volatile.dram, ecc.errors.corrected.volatile.sram, ecc.errors.corrected.volatile.total, ecc.errors.corrected.aggregate.device_memory, ecc.errors.corrected.aggregate.dram, ecc.errors.corrected.aggregate.sram, ecc.errors.corrected.aggregate.total, ecc.errors.uncorrected.volatile.device_memory, ecc.errors.uncorrected.volatile.dram, ecc.errors.uncorrected.volatile.sram, ecc.errors.uncorrected.volatile.total, ecc.errors.uncorrected.aggregate.device_memory, ecc.errors.uncorrected.aggregate.dram, ecc.errors.uncorrected.aggregate.sram, ecc.errors.uncorrected.aggregate.total
2024/03/15 13:57:59.791, NVIDIA H100 80GB HBM3, 00000000:0A:00.0, 535.161.07, P0, 5, 5, 0, 0, 26, 35, 70.89, 345, 345, 765, 2619, 81559, 81007, 0, 70.68, 5, 16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
2024/03/15 13:57:59.815, NVIDIA H100 80GB HBM3, 00000000:11:00.0, 535.161.07, P0, 5, 5, 0, 0, 26, 34, 71.71, 345, 345, 765, 2619, 81559, 81007, 0, 72.05, 5, 16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
2024/03/15 13:57:59.838, NVIDIA H100 80GB HBM3, 00000000:18:00.0, 535.161.07, P0, 5, 5, 0, 0, 25, 33, 70.78, 345, 345, 765, 2619, 81559, 81007, 0, 70.97, 5, 16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
2024/03/15 13:57:59.861, NVIDIA H100 80GB HBM3, 00000000:69:00.0, 535.161.07, P0, 5, 5, 0, 0, 25, 33, 72.17, 345, 345, 765, 2619, 81559, 81007, 0, 71.44, 5, 16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
2024/03/15 13:57:59.791, NVIDIA H100 80GB HBM3, 00000000:0A:00.0, 535.161.07, P0, 5, 5, 0, 0, 26, 35, 70.89, 345, 345, 765, 2619, 81559, 81007, 0, 70.68, 5, 16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
2024/03/15 13:57:59.815, NVIDIA H100 80GB HBM3, 00000000:11:00.0, 535.161.07, P0, 5, 5, 0, 0, 26, 34, 71.71, 345, 345, 765, 2619, 81559, 81007, 0, 72.05, 5, 16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
2024/03/15 13:57:59.838, NVIDIA H100 80GB HBM3, 00000000:18:00.0, 535.161.07, P0, 5, 5, 0, 0, 25, 33, 70.78, 345, 345, 765, 2619, 81559, 81007, 0, 70.97, 5, 16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
2024/03/15 13:57:59.861, NVIDIA H100 80GB HBM3, 00000000:69:00.0, 535.161.07, P0, 5, 5, 0, 0, 25, 33, 72.17, 345, 345, 765, 2619, 81559, 81007, 0, 71.44, 5, 16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
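This second fixture holds several samples for each of four GPUs, distinguished by `pci.bus_id`. One way a consumer might summarize such data is to average a metric per GPU. The sketch below assumes that grouping by bus ID is the desired behavior, which the commit itself does not state; `mean_by_gpu` is a hypothetical helper, and the sample keeps only a few columns and two of the GPUs from the fixture.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Reduced sample: a few columns and two of the four GPUs from the fixture.
SAMPLE = """\
timestamp, pci.bus_id, temperature.gpu, power.draw.average [W]
2024/03/15 13:57:59.791, 00000000:0A:00.0, 26, 70.89
2024/03/15 13:57:59.815, 00000000:11:00.0, 26, 71.71
2024/03/15 13:57:59.791, 00000000:0A:00.0, 26, 70.89
2024/03/15 13:57:59.815, 00000000:11:00.0, 26, 71.71
"""


def mean_by_gpu(text, column):
    """Average a numeric column per GPU, keyed by pci.bus_id."""
    samples = defaultdict(list)
    # skipinitialspace drops the space after each comma, in headers too.
    for row in csv.DictReader(io.StringIO(text), skipinitialspace=True):
        samples[row["pci.bus_id"]].append(float(row[column]))
    return {bus_id: mean(values) for bus_id, values in samples.items()}


averages = mean_by_gpu(SAMPLE, "power.draw.average [W]")
```

Keying on `pci.bus_id` rather than `name` matters here because all GPUs in the fixture report the same model name.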