Consistently reports 0 W for DRAM consumption #108

Closed
wanecek opened this issue May 7, 2021 · 9 comments · Fixed by #114
Labels
bug Something isn't working good first issue Good for newcomers help wanted Extra attention is needed

Comments

@wanecek

wanecek commented May 7, 2021

Bug description

Scaphandre reports 0 W as the DRAM consumption.

Expected behavior

I hoped scaphandre would include DRAM measurements.

Screenshots

Screenshot-2021-05-07T09:09:55

Screenshot-2021-05-07T09:27:00

Environment

  • Linux distribution version: Ubuntu 20.04.2
  • Kernel version: 5.4.0-67-generic
  • CPU: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
  • Scaphandre version: a7b0159ce5135e2b25fe4c97a86434d30f6e0982 (self-built). Also tried pre-built docker image.

Additional context

I realize that this may very well be the intended behavior and not a bug (e.g. because the Intel RAPL sensor or powercap doesn't expose DRAM for this CPU model). I was mainly hoping for clarification on this: is it something I will be able to fix (maybe I'm just operating scaphandre wrong or am missing something silly?), or is it just the way it is? Thank you for having written this really useful tool!

Also, could it simply be because the memory usage is low, i.e. around 1GB used with 11GB buff/cached?

root@Ubuntu-2004-focal-64-minimal ~ # free -m
              total        used        free      shared  buff/cache   available
Mem:          31900        1381       19428         140       11090       29945
Swap:         16366           0       16366
@wanecek wanecek added the bug Something isn't working label May 7, 2021
@wanecek
Author

wanecek commented May 7, 2021

From reading other issues and how they're debugged, maybe this is useful? Sorry, I know very little about Powercap 😅

Just ask if there's anything I can do to help in the debugging here :)

root@Ubuntu-2004-focal-64-minimal ~ # ls /sys/class/powercap/
intel-rapl  intel-rapl:0  intel-rapl:0:0  intel-rapl:0:1  intel-rapl:0:2
root@Ubuntu-2004-focal-64-minimal ~ # ls /sys/class/powercap/intel-rapl:0:2
constraint_0_max_power_uw    device               name
constraint_0_name            enabled              power
constraint_0_power_limit_uw  energy_uj            subsystem
constraint_0_time_window_us  max_energy_range_uj  uevent
root@Ubuntu-2004-focal-64-minimal /sys/class/powercap/intel-rapl:0:2 # cat energy_uj
185204869380
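For context, energy_uj is a cumulative counter in microjoules, so a power figure comes from the difference between two reads divided by the elapsed time. Below is a minimal Rust sketch of that arithmetic. It assumes the counter wraps around at the value found in max_energy_range_uj; the function name and signature are hypothetical and are not scaphandre's API.

```rust
/// Average power in microwatts between two samples of an `energy_uj` counter.
///
/// Hypothetical helper for illustration only, not scaphandre's implementation.
/// `max_range_uj` is assumed to be the content of `max_energy_range_uj`.
fn power_microwatts(
    prev_uj: u64,
    curr_uj: u64,
    max_range_uj: u64,
    elapsed_us: u64,
) -> Option<u64> {
    if elapsed_us == 0 {
        // No time elapsed: power is undefined.
        return None;
    }
    let delta_uj = if curr_uj >= prev_uj {
        curr_uj - prev_uj
    } else {
        // Assumption: the counter wrapped around once between the two reads.
        max_range_uj.saturating_sub(prev_uj) + curr_uj
    };
    // microjoules per microsecond equals watts; scale by 1e6 for microwatts.
    // The intermediate product is computed in u128 to avoid overflow.
    Some((delta_uj as u128 * 1_000_000 / elapsed_us as u128) as u64)
}
```

With two reads taken one second apart that differ by 1 500 000 uJ, this yields 1 500 000 microwatts, i.e. 1.5 W.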

@bpetit bpetit added the good first issue Good for newcomers label May 9, 2021
@bpetit
Contributor

bpetit commented May 9, 2021

Hi !

Thanks for reporting that. Could you confirm that /sys/class/powercap/intel-rapl:0:2/name contains "dram" ?

What's also strange is that the sum of the "core" and "uncore" power consumption is above the total power consumption of the host. I couldn't reproduce it on the machines I have access to. The behaviour may differ depending on the hardware and the kernel version. We will surely have to ask you to run some tests to work on it.

I've added this issue to the global board. I'd prefer someone else to look at it, but I'll do it if nothing has moved in a while.

@bpetit bpetit added the help wanted Extra attention is needed label May 9, 2021
@wanecek
Author

wanecek commented May 9, 2021

Hi @bpetit, thank you for your quick and kind reply! Yeah, indeed, the sums not adding up is strange, now that you mention it...

I can confirm that the name contains (or equals) "dram". Just let me know if there's any more information that I can provide to help you debug, or if there's any specific place I should start when debugging. I started trying to debug the issue, but didn't get very far. Will try some more later this week.

root@Ubuntu-2004-focal-64-minimal ~ # cat /sys/class/powercap/intel-rapl:0:2/name
dram

@wanecek
Author

wanecek commented May 10, 2021

Update: I discovered powerapi-ng/energy-scripts today, which seems to be a shell script reading from the powercap sensor. I was able to have it report DRAM data back to me, suggesting that at least my sensor is able to report data on the dram usage (unless the energy-scripts file is lying to me, haha!). On the other hand, it reports Uncore as 0.

root@Ubuntu-2004-focal-64-minimal ~ # ./measureit.sh

 ----------------------------------------------
|               execution time  (us)           |
 ----------------------------------------------
|               2393                           |
 ----------------------------------------------
| Socket    | Component  | energy (uJ)         |
 ----------------------------------------------
| Socket    | CORE       | 55176               |
| Socket    | CPU        | 59753               |
| Socket    | DRAM       | 30579               |
| Socket    | UNCORE     | 0                   |
 ----------------------------------------------

As far as I can tell from the source code, however, measureit.sh reads from a different path than scaphandre: /sys/devices/virtual/powercap/intel-rapl.


The same applies when I run the rapl-read script from http://web.eece.maine.edu/~vweaver/projects/rapl/:

root@Ubuntu-2004-focal-64-minimal ~ # gcc rapl-read.c -lm -o rapl-read.out
root@Ubuntu-2004-focal-64-minimal ~ # ./rapl-read.out -s

RAPL read -- use -s for sysfs, -p for perf_event, -m for msr

Found Kaby Lake Processor type
	0 (0), 1 (0), 2 (0), 3 (0), 4 (0), 5 (0), 6 (0), 7 (0)
	
	Detected 8 cores in 1 packages


Trying sysfs powercap interface to gather results

	Sleeping 1 second

	Package 0
		package-0	: 1.356625J
		core	: 0.942808J
		uncore	: 0.000000J
		dram	: 1.069944J

root@Ubuntu-2004-focal-64-minimal ~ # ./rapl-read.out -p

RAPL read -- use -s for sysfs, -p for perf_event, -m for msr

Found Kaby Lake Processor type
	0 (0), 1 (0), 2 (0), 3 (0), 4 (0), 5 (0), 6 (0), 7 (0)
	
	Detected 8 cores in 1 packages


Trying perf_event interface to gather results

	Event=energy-cores Config=1 scale=2.32831e-10 units=Joules
	Event=energy-gpu Config=4 scale=2.32831e-10 units=Joules
	Event=energy-pkg Config=2 scale=2.32831e-10 units=Joules
	Event=energy-ram Config=3 scale=2.32831e-10 units=Joules

	Sleeping 1 second

	Package 0:
		energy-cores Energy Consumed: 0.878906 Joules
		energy-gpu Energy Consumed: 0.000000 Joules
		energy-pkg Energy Consumed: 1.285156 Joules
		energy-ram Energy Consumed: 1.052246 Joules
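The scale=2.32831e-10 printed above is approximately 2^-32. As a rough sketch of the arithmetic, assuming that the perf_event interface reports a raw counter value that is multiplied by this scale to obtain joules (the function names here are hypothetical):

```rust
/// Converts a raw perf RAPL counter value to joules using the advertised scale.
/// Hypothetical helper; assumes joules = raw * scale.
fn raw_to_joules(raw: u64, scale: f64) -> f64 {
    raw as f64 * scale
}

/// Average power in watts for an energy amount consumed over `seconds`.
/// Returns `None` for a non-positive interval, where power is undefined.
fn average_watts(joules: f64, seconds: f64) -> Option<f64> {
    if seconds > 0.0 {
        Some(joules / seconds)
    } else {
        None
    }
}
```

Since the tool sleeps for one second, the joule figures above can be read directly as average watts over that interval, e.g. about 1.05 W for energy-ram.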

@wanecek wanecek changed the title from "Consistently replorts 0 W for DRAM consumption" to "Consistently reports 0 W for DRAM consumption" May 11, 2021
@wanecek
Author

wanecek commented May 12, 2021

I'm continuing my futile attempts to debug this.

TL;DR: Could it be that the exporter(s) assume the domains are in the order [core, uncore, dram], while they (depending on what fs::read_dir inside powercap_rapl.rs returns) can be in any order? 🤔

When I log the counter_uj_path of each socket, I see that only one socket is created (and iterated over), and the path to its energy counter file is /sys/class/powercap/intel-rapl:0/energy_uj.

I have no familiarity with Rust. Can I somehow build with a debug flag to view the output of the debug!/trace! macros? :)

diff --git a/src/exporters/stdout.rs b/src/exporters/stdout.rs
index e22efb3..52265f1 100644
--- a/src/exporters/stdout.rs
+++ b/src/exporters/stdout.rs
@@ -99,6 +99,10 @@ impl StdoutExporter {

     fn iterate(&mut self) {
         self.topology.refresh();
+        println!("We have {} sockets", &self.topology.sockets.len());
+        for s in &self.topology.sockets {
+            println!("{}: path was {}", s.id.to_string(), s.counter_uj_path);
+        }
         self.show_metrics();
     }

diff --git a/src/sensors/mod.rs b/src/sensors/mod.rs
index 32fbcdc..69cca6c 100644
--- a/src/sensors/mod.rs
+++ b/src/sensors/mod.rs
@@ -609,6 +609,7 @@ impl CPUSocket {
         counter_uj_path: String,
         buffer_max_kbytes: u16,
     ) -> CPUSocket {
+        println!("Created socket with path: {}", counter_uj_path);
         CPUSocket {
             id,
             domains,

On the other hand, when logging the path(s) used of each domain in the StdoutExporter, I am a bit confused about which path should be read for each domain. With my limited understanding, it almost looks like they are in the wrong order, and that instead of the order [core, uncore, dram], the domains are ordered [dram, core, uncore]...

recorded 1074525 MicroWatts at 1620813069.509794259s
Domain path was: /sys/class/powercap/intel-rapl:0:2/energy_uj

recorded 1099626 MicroWatts at 1620813069.509870416s
Domain path was: /sys/class/powercap/intel-rapl:0:0/energy_uj

recorded 0 MicroWatts at 1620813069.509940331s
Domain path was: /sys/class/powercap/intel-rapl:0:1/energy_uj

Host:	1.520388 W	Core		Uncore		DRAM
Socket0	1.521252 W	1.074525 W	1.099626 W	0 W	
diff --git a/src/exporters/stdout.rs b/src/exporters/stdout.rs
index e22efb3..62c9096 100644
--- a/src/exporters/stdout.rs
+++ b/src/exporters/stdout.rs
@@ -89,7 +89,13 @@ impl StdoutExporter {
         if let Some(socket) = socket_present {
             let mut domains_power: Vec<Option<Record>> = vec![];
             for d in socket.get_domains_passive() {
-                domains_power.push(d.get_records_diff_power_microwatts());
+                let power = d.get_records_diff_power_microwatts();
+                match power {
+                    Some(ref val) => println!("{}", val),
+                    None => println!("No power!"),
+                }
+                println!("Domain path was: {}\n", d.counter_uj_path);
+                domains_power.push(power);
             }
             domains_power
         } else {
# cat /sys/class/powercap/intel-rapl:0:0/name
core
# cat /sys/class/powercap/intel-rapl:0:1/name
uncore
# cat /sys/class/powercap/intel-rapl:0:2/name
dram

Indeed, this is the order that they are added to the socket:

Adding domain to socket dram (/sys/class/powercap/intel-rapl:0:2/energy_uj)
Adding domain to socket core (/sys/class/powercap/intel-rapl:0:0/energy_uj)
Adding domain to socket uncore (/sys/class/powercap/intel-rapl:0:1/energy_uj)
diff --git a/src/sensors/mod.rs b/src/sensors/mod.rs
index 32fbcdc..6a85856 100644
--- a/src/sensors/mod.rs
+++ b/src/sensors/mod.rs
@@ -207,6 +207,7 @@ impl Topology {
         buffer_max_kbytes: u16,
     ) {
         let iterator = self.sockets.iter_mut();
+        println!("Adding domain to socket {} ({})", name, uj_counter);
         for socket in iterator {
             if socket.id == socket_id {
                 socket.safe_add_domain(Domain::new(

@wanecek
Author

wanecek commented May 12, 2021

I think the above is even clearer in the JSON exporter, which uses a hard-coded names array (exporters/json.rs:166). When I log each domain's own name next to the corresponding array entry, I get the following:

Domain.name is dram, but names[index] is core
Domain.name is core, but names[index] is uncore
Domain.name is uncore, but names[index] is dram
diff --git a/src/exporters/json.rs b/src/exporters/json.rs
index a6a3ca3..4a3a68e 100644
--- a/src/exporters/json.rs
+++ b/src/exporters/json.rs
@@ -177,12 +177,17 @@ impl JSONExporter {
                 let domains = socket
                     .get_domains_passive()
                     .iter()
-                    .map(|d| d.get_records_diff_power_microwatts())
+                    .enumerate()
+                    .map(|(index, d)| {
+                        println!("Domain.name is {}, but names[index] is {}", d.name, names[index]);
+                        return d.get_records_diff_power_microwatts();
+                    })
                     .map(|record| record.map(|d| d.value))
                     .enumerate()
                     .map(|(index, d)| {
                         let domain_power =
                             d.map(|value| value.parse::<u64>().unwrap()).unwrap_or(0);
+
                         Domain {
                             name: names[index].to_string(),
                             consumption: domain_power as f32,

@PierreRust
Collaborator

Hi,
I just had a look, and there is indeed a problem in the way the stdout exporter prints the results: the order of domains is assumed to be [Core, Uncore, DRAM], which is not necessarily the case. It's probably the same with JSON; I have not checked yet.

I'll try to push a PR.
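One way to avoid depending on the order is to look each column up by the domain's own name. The following is only a sketch of that idea and not necessarily what the eventual PR does; DomainReading and columns_by_name are hypothetical names, and the struct is a simplified stand-in for scaphandre's Domain type.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for a RAPL domain and its latest power value.
struct DomainReading {
    name: String,
    microwatts: u64,
}

/// Returns one value per requested column, matched by domain name rather than
/// by position. A column with no matching domain yields `None`.
fn columns_by_name(readings: &[DomainReading], columns: &[&str]) -> Vec<Option<u64>> {
    // Index the readings by name so their discovery order no longer matters.
    let by_name: HashMap<&str, u64> = readings
        .iter()
        .map(|r| (r.name.as_str(), r.microwatts))
        .collect();
    columns.iter().map(|c| by_name.get(*c).copied()).collect()
}
```

With the readings logged earlier in this thread (dram first, then core, then uncore), asking for the columns ["core", "uncore", "dram"] returns the values in the header's order regardless of how the domains were discovered.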

@wanecek
Copy link
Author

wanecek commented May 12, 2021

@PierreRust That's fantastic, many thanks!

Yeah, I tried a naive solution: sorting the folder names inside powercap_rapl.rs (see below). This resulted in me finally getting DRAM results with both the JSON and stdout exporters 🎉

Forgive the probably horrendously unidiomatic Rust:

diff --git a/src/sensors/powercap_rapl.rs b/src/sensors/powercap_rapl.rs
index 208b420..e57eb35 100644
--- a/src/sensors/powercap_rapl.rs
+++ b/src/sensors/powercap_rapl.rs
@@ -73,8 +73,17 @@ impl Sensor for PowercapRAPLSensor {
         }
         let mut topo = Topology::new();
         let re_domain = Regex::new(r"^.*/intel-rapl:\d+:\d+$").unwrap();
-        for folder in fs::read_dir(&self.base_path).unwrap() {
-            let folder_name = String::from(folder.unwrap().path().to_str().unwrap());
+
+        let mut folder_names = fs::read_dir(&self.base_path)
+            .unwrap()
+            .map(|folder|
+                String::from(folder.unwrap().path().to_str().unwrap())
+            )
+            .collect::<Vec<_>>();
+
+        folder_names.sort();
+
+        for folder_name in folder_names {
             // let's catch domain folders
             if re_domain.is_match(&folder_name) {
                 // let's get the second number of the intel-rapl:X:X string
root@Ubuntu-2004-focal-64-minimal ~/scaphandre/src #
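One caveat with a plain string sort is that it is lexicographic, so a path ending in intel-rapl:0:10 would sort before intel-rapl:0:2. A possible variant, sketched here with hypothetical helper names and assuming paths of the form .../intel-rapl:S:D, sorts on the parsed socket and domain numbers instead:

```rust
/// Parses the trailing "intel-rapl:S:D" component of a path into (S, D).
/// Returns `None` for paths that do not have that shape (e.g. "intel-rapl:0").
fn domain_key(path: &str) -> Option<(u32, u32)> {
    let leaf = path.rsplit('/').next()?;
    let rest = leaf.strip_prefix("intel-rapl:")?;
    let (socket, domain) = rest.split_once(':')?;
    Some((socket.parse().ok()?, domain.parse().ok()?))
}

/// Sorts paths by numeric (socket, domain) ids.
/// Paths that are not domain folders get a `None` key and sort first.
fn sort_domain_paths(paths: &mut [String]) {
    paths.sort_by_key(|p| domain_key(p));
}
```

This only makes the iteration order deterministic. The exporters would still rely on the numeric order matching [core, uncore, dram], which happens to hold on this machine.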

@bpetit
Contributor

bpetit commented May 14, 2021

Well done ! Looks like a bug solved ! :)

@bpetit bpetit added this to General Jun 19, 2024
@bpetit bpetit moved this to Previous releases in General Jun 19, 2024