Add support for collecting/visualizing Perf Profile data #77

janaknat · 2023-08-18T20:58:29Z

We are using the perf tool to gather the profiling data. A single perf profile produces two outputs. The first is the raw list of top functions, gathered using perf report --stdio --percent-limit 1. The second is a flamegraph generated from the output of perf script using the flamegraph Rust crate. The flamegraph generated this way is very nearly the same as the one generated with Brendan Gregg's Perl script.

A new flag is introduced for aperf record called --intensive. To enable perf profiling, the user must specify the --intensive flag during aperf record.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

compprofile.tar.gz

wash-amzn · 2023-08-22T14:24:50Z

README.md

@@ -91,6 +91,8 @@ To see a step-by-step example, please see our example [here](./EXAMPLE.md)

 `-vv, --verbose --verbose` more verbose messages

+`--intensive` gather data for which the CPU utilization is high when collected


This should also indicate what that currently includes (so you know what you're getting in exchange for the CPU usage).

wash-amzn · 2023-08-22T14:29:37Z

src/data/flamegraphs.rs

+            Err(e) => if e.kind() == ErrorKind::NotFound {
+                error!("Perf command not found");
+            },


What about other errors?

I'll add a default case for all other error types.
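
A minimal sketch of what such a default case could look like, assuming the error comes from launching an external command. The helper names `describe_spawn_error` and `check_tool` are hypothetical and are not taken from the PR; only the `NotFound` arm mirrors the diff above.

```rust
use std::io::{Error, ErrorKind};
use std::process::Command;

/// Turn a failure to launch a command into a message.
/// Hypothetical helper; the last arm is the "default case" discussed above.
pub fn describe_spawn_error(program: &str, e: &Error) -> String {
    match e.kind() {
        ErrorKind::NotFound => format!("{} command not found", program),
        ErrorKind::PermissionDenied => format!("permission denied running {}", program),
        // Default case: every other error kind is still reported, not dropped.
        _ => format!("could not run {}: {}", program, e),
    }
}

/// Hypothetical check: try to launch `program --version` and report launch errors.
pub fn check_tool(program: &str) -> Result<(), String> {
    match Command::new(program).arg("--version").output() {
        Ok(_) => Ok(()),
        Err(e) => Err(describe_spawn_error(program, &e)),
    }
}
```

With this shape, a missing binary and an unexpected I/O failure both surface as messages, so the caller can log them with `error!` instead of silently continuing.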

wash-amzn · 2023-08-22T14:36:06Z

src/html_files/index.html

 		<div class="tab">
 			<button class="tablinks" name="system_info" id="default">SUT Config</button>
-			<button class="tablinks" name="sysctl">Sysctl Data</button>
 			<button class="tablinks" name="cpu_utilization">CPU Utilization</button>
+			<button class="tablinks" name="flamegraphs">Flamegraphs</button>
+			<button class="tablinks" name="top_functions">Top Functions</button>
 			<button class="tablinks" name="processes">Processes</button>
+			<button class="tablinks" name="perfstat">PMU Stats</button>
 			<button class="tablinks" name="meminfo">Meminfo</button>
-			<button class="tablinks" name="vmstat">VM Stat</button>
 			<button class="tablinks" name="kernel_config">Kernel Config</button>
+			<button class="tablinks" name="sysctl">Sysctl Data</button>
+			<button class="tablinks" name="vmstat">VM Stat</button>
 			<button class="tablinks" name="interrupts">Interrupt Data</button>
 			<button class="tablinks" name="disk_stats">Disk Stats</button>
-			<button class="tablinks" name="perfstat">PMU Stats</button>
 			<button class="tablinks" name="netstat">Net Stats</button>
 		</div>


We need to start trying to organize these, e.g. the two new ones would be under a "profiling" category. Static data would be another. Beyond that the grouping isn't as clear, maybe system, devices, and performance (just making names up).

This order is based on what people generally look at first.

It's already a mess and is going to get worse.

Related: the profiling-based pages, if shown despite there being no data, should explain why there's no data and how to collect it.

wash-amzn · 2023-08-22T14:37:55Z

README.md

@@ -91,6 +91,8 @@ To see a step-by-step example, please see our example [here](./EXAMPLE.md)

 `-vv, --verbose --verbose` more verbose messages

+`--intensive` gather data for which the CPU utilization is high when collected


I think the idea of having a more general argument is likely going to be regretted later. At least for these features, I'm starting to think this should be more like --profiling, and then that opens up the future options --profiling-frequency and others, naturally.

The aim was to have an intensive flag which in the future would include other types of data collection. It would also prevent us from having multiple options per data type. We would then have a --disable <datatype1, datatype2, ..., datatype n> which could control which ones were gathered.

I know what the intent was, but I am doubting that it will work out (as in it will start getting awkward).

I've changed it to --profile.
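
To illustrate the reviewer's point that a dedicated flag leaves room for related options, here is a rough sketch using only the standard library. `--profile` comes from this thread; `--profiling-frequency` is only floated above as a possible future option, and the `ProfilingOptions` type and `parse_profiling` function are hypothetical. The default of 99 matches the `-F 99` passed to perf record in this PR.

```rust
/// Options that exist only when profiling was requested (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct ProfilingOptions {
    /// Sampling frequency in Hz.
    pub frequency: u32,
}

/// Scan raw arguments for `--profile` and the hypothetical
/// `--profiling-frequency <hz>`; returns `None` when profiling is off.
pub fn parse_profiling(args: &[&str]) -> Result<Option<ProfilingOptions>, String> {
    let mut enabled = false;
    let mut frequency: u32 = 99;
    let mut iter = args.iter();
    while let Some(arg) = iter.next() {
        match *arg {
            "--profile" => enabled = true,
            "--profiling-frequency" => {
                let value = iter.next().ok_or("--profiling-frequency needs a value")?;
                frequency = value
                    .parse::<u32>()
                    .map_err(|_| format!("invalid frequency: {}", value))?;
            }
            _ => {} // Unrelated arguments are ignored in this sketch.
        }
    }
    Ok(enabled.then(|| ProfilingOptions { frequency }))
}
```

Because the options live in an `Option`, code that does not profile never sees them.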

wash-amzn · 2023-08-22T14:38:46Z

src/data.rs

+pub struct PrepareParams {
+    pub time: u64,
+    pub file_path: String,
+    pub dir: String,
+}


This kind of thing will turn into a kitchen sink over time. Along with switching to a command line option to enable profiling, this would be profiling options, and you should not force it upon every data type when it only applies to one.

If you are running a custom preparation step, the lib will give you the kitchen sink and you are free to use whichever ones you need.

Similar to the "--intensive" approach, I don't think this is going to be ideal long-term.

wash-amzn · 2023-08-22T14:40:16Z

src/data.rs

+            is_cpu_intensive: false,
+            prepare_params: PrepareParams::new(),


You really need to find a way such that data types are constructed and added to a list during initialization, rather than all possible ones are statically created and then optionally ignored/skipped at runtime. As we get more types that require more of their own configuration (and optional-ness), this current approach is not going to scale well.
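
One possible shape for this, sketched under stated assumptions: the `DataType` trait, the `Settings` struct, and `build_data_types` are all hypothetical stand-ins and not the project's real types. The point is only that a data type is constructed, with its own configuration, when it is actually going to run.

```rust
/// Minimal stand-in for a collectable data type (hypothetical trait).
pub trait DataType {
    fn name(&self) -> &str;
}

pub struct CpuUtilization;

impl DataType for CpuUtilization {
    fn name(&self) -> &str {
        "cpu_utilization"
    }
}

/// Profiling owns its configuration instead of reading a shared parameter struct.
pub struct PerfProfile {
    pub frequency: u32,
}

impl DataType for PerfProfile {
    fn name(&self) -> &str {
        "perf_profile"
    }
}

/// Settings decided at startup (hypothetical).
pub struct Settings {
    pub profile_frequency: Option<u32>,
}

/// Construct only the data types that will actually be collected.
pub fn build_data_types(settings: &Settings) -> Vec<Box<dyn DataType>> {
    let mut list: Vec<Box<dyn DataType>> = vec![Box::new(CpuUtilization)];
    if let Some(frequency) = settings.profile_frequency {
        list.push(Box::new(PerfProfile { frequency }));
    }
    list
}
```

Nothing downstream needs an "is this enabled" check, because a disabled data type is simply absent from the list.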

wash-amzn · 2023-08-22T14:40:53Z

src/data.rs

+    pub fn is_cpu_intensive(&mut self) {
+        self.is_cpu_intensive = true;
+    }


Something that starts with is_ should be an accessor, not a mutator.

:( . It was supposed to imply 'hey this data type is_cpu_intensive'.

set_cpu_intensive() or set_cpu_intensive(bool)

or just drop it since you already can access the .is_cpu_intensive attribute directly (and some places do)
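
For reference, the naming convention being suggested looks roughly like this. `DataTypeEntry` is a hypothetical struct used only for illustration; the field is kept private here so that the two methods are the only way to reach it.

```rust
/// Hypothetical holder for the flag discussed above.
#[derive(Default)]
pub struct DataTypeEntry {
    cpu_intensive: bool,
}

impl DataTypeEntry {
    /// Accessor: the `is_` prefix reads the flag and takes `&self`.
    pub fn is_cpu_intensive(&self) -> bool {
        self.cpu_intensive
    }

    /// Mutator: the `set_` prefix changes the flag and takes `&mut self`.
    pub fn set_cpu_intensive(&mut self, value: bool) {
        self.cpu_intensive = value;
    }
}
```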

wash-amzn · 2023-08-22T14:45:22Z

The example report demonstrates one of the big problems here: the need for debuginfo to get symbols. The flamegraph is mostly [unknown]s.

There should be some way to let the user know which are necessary (I know if you try to gdb something it will give you a full command to install all relevant missing debuginfos, I don't recall perf offering such a nicety).

wash-amzn · 2023-08-24T19:02:32Z

src/data/perf_profile.rs

+impl CollectData for PerfProfileRaw {
+    fn prepare_data_collector(&mut self, params: PrepareParams) -> Result<()> {
+        match Command::new("perf")
+            .args(["record", "-a", "-q", "-g", "-k", "1", "-F", "99", "-e", "cpu-clock:pppH", "-o", &params.data_file_path, "--", "sleep", &params.collection_time.to_string()])


What is the significance of params.data_file_path? It's not the final output file here (hasn't been run through the passes like "inject").

It is the perf.data file. APerf names it perf_profile_timestamp. On the report generation side, we pass this through the perf report --stdio -g none --percent-limit 1 command.

Explain what the definition of data_file_path is. There has to be something very significant about it, particularly in light of having data types that create other files of their own choosing under the data_dir path. What are you going to do when a data type outputs two files, both of which have to be known at report generation time?

During initialization, every data type has a file created by the lib at data_dir/datatype_timestamp.bin. The data_file_path is the path of this file. A datatype will only have 1 file created for it. If more files need to be created, use the prepare_data_collector and do your custom process in data_dir.

Ah, the problem is that you're deferring the inject to report time, so you were still getting away with each data type producing only exactly 1 file.

Since inject has to be done at the end of collection, what are you going to do? Have data_file_path be the injected file, and then create a temporary under data_dir for the real-time data collection?

Moved the inject step to after_data_collection().
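
A rough sketch of running the inject step at the end of collection. The `perf inject -j -i <in> -o <out>` arguments are the ones shown elsewhere in this PR; the function signatures, and the choice of which file is the raw recording and which is the injected output, are assumptions for illustration and not the merged code.

```rust
use std::path::Path;
use std::process::Command;

/// Build (but do not run) the `perf inject` step.
pub fn inject_command(raw: &Path, injected: &Path) -> Command {
    let mut cmd = Command::new("perf");
    cmd.arg("inject")
        .arg("-j")
        .arg("-i")
        .arg(raw)
        .arg("-o")
        .arg(injected);
    cmd
}

/// Hypothetical end-of-collection hook: run inject while the profiled
/// processes still exist on the machine. Returns whether perf succeeded.
pub fn after_data_collection(raw: &Path, injected: &Path) -> std::io::Result<bool> {
    let status = inject_command(raw, injected).status()?;
    Ok(status.success())
}
```

Keeping command construction separate from execution makes the argument list easy to check without having perf installed.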

wash-amzn · 2023-08-25T15:10:36Z

src/data/flamegraphs.rs

+        let out_jit = Command::new("perf")
+            .args(["inject", "-j", "-i", &file_name, "-o", perf_jit_loc.clone().to_str().unwrap()])
+            .status();


You can't do the inject at report generation time unless you happen to be on the same machine (and more than that, the same processes (as in exact same process, not just the same name) are running)

You also need to make sure that by the end of collection your resulting file has all symbol names already gathered in it, because they won't necessarily be resolvable at report generation time.

Yes. The expectation is you run aperf report on the SUT where aperf record was run. The inject step can be moved to the aperf record after_data_collection part.

wash-amzn · 2023-08-28T14:17:52Z

src/data/flamegraphs.rs

+        if perf_jit_loc.exists() {
+            let out_script = Command::new("perf")
+                .args(["script", "-f", "-i", perf_jit_loc.to_str().unwrap()])
+                .output()?;
+            write!(script_out, "{}", std::str::from_utf8(&out_script.stdout)?.to_string())?;
+            Folder::default().collapse_file(Some(script_loc), collapse_out)?;
+            fg_out = std::fs::OpenOptions::new().read(true).write(true).truncate(true).open(fg_loc)?;
+            flamegraph::from_files(&mut Options::default(), &vec![collapse_loc.to_path_buf()], fg_out)?;
+        }
+        let processed_data = vec![ProcessedData::Flamegraph(profile)];
+        Ok(processed_data)


Nothing in the if appears to have any side-effects that modify profile, so how does any subsequent code/UI/whatever behave differently based on whether perf_jit_loc existed or not?

Before the if, we write the default fg_out. If the if is successful, fg_out gets overwritten. fg_out is what the UI is looking at. The profile is something we have to return to the caller.
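
The pattern described here (write a default file, then overwrite it only when the input exists) can be sketched as follows. This is an illustration under assumptions: `write_flamegraph`, the `render` callback, and the placeholder wording are hypothetical, and `render` stands in for the real perf script, collapse, and flamegraph steps.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Placeholder shown when no profile was collected (wording is illustrative).
pub const NO_DATA_SVG: &str = "<svg xmlns=\"http://www.w3.org/2000/svg\"><text x=\"10\" y=\"20\">No profile data collected. Run aperf record with --profile.</text></svg>";

/// Write the placeholder first, then overwrite it only if `input` exists.
/// Returns `true` when a real flamegraph replaced the placeholder.
pub fn write_flamegraph<F>(out: &Path, input: &Path, render: F) -> io::Result<bool>
where
    F: FnOnce(&Path) -> io::Result<String>,
{
    // The UI always finds a valid file, even when nothing was collected.
    fs::write(out, NO_DATA_SVG)?;
    if !input.exists() {
        return Ok(false);
    }
    // `render` runs before the second write, so a rendering error
    // leaves the placeholder in place.
    fs::write(out, render(input)?)?;
    Ok(true)
}
```

Returning a boolean gives the caller something observable, which addresses the question of how later code can tell the two cases apart.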

wash-amzn reviewed Aug 22, 2023

janaknat force-pushed the perf-profile branch 2 times, most recently from b1ac230 to 8f86a18 Compare August 24, 2023 18:24

wash-amzn reviewed Aug 24, 2023

wash-amzn reviewed Aug 25, 2023

Add support for collecting/visualizing Perf Profile data

a7cf732

janaknat force-pushed the perf-profile branch from 8f86a18 to a7cf732 Compare August 25, 2023 19:58

wash-amzn reviewed Aug 28, 2023

wash-amzn approved these changes Aug 28, 2023

janaknat merged commit cfb2e49 into main Aug 29, 2023
4 checks passed

janaknat deleted the perf-profile branch September 1, 2023 18:28
