Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for collecting/visualizing Perf Profile data #77

Merged
merged 1 commit into from
Aug 29, 2023

Conversation

janaknat
Copy link
Contributor

We are using the perf tool to gather the profiling data. There are two outputs from 1 perf profile data. First, is the raw top functions gathered by using perf report --stdio --percent-limit 1. The second, is a flamegraph generated using the output of perf script and the flamegraph rust crate. The flamegraph generated this way is very nearly the same as the one generated with Brendan Greggs' perl script.

A new flag is introduced for aperf record called --intensive. To enable perf profiling, the user must specify the --intensive flag during aperf record.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

compprofile.tar.gz

@@ -91,6 +91,8 @@ To see a step-by-step example, please see our example [here](./EXAMPLE.md)

`-vv, --verbose --verbose` more verbose messages

`--intensive` gather data for which the CPU utilization is high when collected
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also indicate what that current includes (so you know what you're getting in exchange for the CPU usage)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

Comment on lines 92 to 99
Err(e) => if e.kind() == ErrorKind::NotFound {
error!("Perf command not found");
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about other errors?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add a default case for all other error types.

Comment on lines 11 to 25
<div class="tab">
<button class="tablinks" name="system_info" id="default">SUT Config</button>
<button class="tablinks" name="sysctl">Sysctl Data</button>
<button class="tablinks" name="cpu_utilization">CPU Utilization</button>
<button class="tablinks" name="flamegraphs">Flamegraphs</button>
<button class="tablinks" name="top_functions">Top Functions</button>
<button class="tablinks" name="processes">Processes</button>
<button class="tablinks" name="perfstat">PMU Stats</button>
<button class="tablinks" name="meminfo">Meminfo</button>
<button class="tablinks" name="vmstat">VM Stat</button>
<button class="tablinks" name="kernel_config">Kernel Config</button>
<button class="tablinks" name="sysctl">Sysctl Data</button>
<button class="tablinks" name="vmstat">VM Stat</button>
<button class="tablinks" name="interrupts">Interrupt Data</button>
<button class="tablinks" name="disk_stats">Disk Stats</button>
<button class="tablinks" name="perfstat">PMU Stats</button>
<button class="tablinks" name="netstat">Net Stats</button>
</div>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to start trying to organize these, e.g. the the two new ones would be under a "profiling" category. Static data would be another. Beyond that the grouping isn't as clear, maybe system, devices, and performance (just making names up).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This order is based on what people generally look at first.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's already a mess and is going to get worse.

Related: the profiling-based pages, if shown despite there being no data, should explain why there's no data and how to collect it.

@@ -91,6 +91,8 @@ To see a step-by-step example, please see our example [here](./EXAMPLE.md)

`-vv, --verbose --verbose` more verbose messages

`--intensive` gather data for which the CPU utilization is high when collected
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the idea of having a more general argument is likely going to be regretted later. At least for these features, I'm starting to think this should be more like --profiling, and then that opens up the future options --profiling-frequency and others, naturally.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The aim was to have a intensive flag which in the future would include other types of data collection. It would also prevent us from having multiple options for per data type. We would then have a --disable <datatype1, datatype2, ..., datatype n> which could control which ones were gathered.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know what the intent was, but I am doubting that it will work out (as in it will start getting awkward).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed it to --profile.

src/data.rs Outdated
Comment on lines 43 to 48
pub struct PrepareParams {
pub time: u64,
pub file_path: String,
pub dir: String,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This kind of thing will turn into a kitchen sink over time. Along with switching to a command line option to enable profiling, this would be profiling options, and you should not force it upon every data type when it only applies to one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are running a custom preparation step, the lib will give you the kitchen sink and you are free to use whichever ones you need.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the "--intensive" approach, I don't think this is not going to be ideal long-term.

src/data.rs Outdated
Comment on lines 79 to 78
is_cpu_intensive: false,
prepare_params: PrepareParams::new(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You really need to find a way such that data types are constructed and added to a list during initialization, rather than all possible ones are statically created and then optionally ignored/skipped at runtime. As we get more types that require more of their own configuration (and optional-ness), this current approach is not going to scale well.

src/data.rs Outdated
Comment on lines 88 to 90
pub fn is_cpu_intensive(&mut self) {
self.is_cpu_intensive = true;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something that starts with is_ should be an accessor, not a mutator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:( . It was supposed to imply 'hey this data type is_cpu_intensive'.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set_cpu_intensive() or set_cpu_intensive(bool)

or just drop it since you already can access the .is_cpu_intensive attribute directly (and some places do)

@wash-amzn
Copy link
Contributor

The example report demonstrates one of the big problems here, need for debuginfo to get symbols. The flamegraph is mostly [unknown]s.

There should be some way to let the user know which are necessary (I know if you try to gdb something it will give you a full command to install all relevant missing debuginfos, I don't recall perf offering such a nicety).

@janaknat janaknat force-pushed the perf-profile branch 2 times, most recently from b1ac230 to 8f86a18 Compare August 24, 2023 18:24
impl CollectData for PerfProfileRaw {
fn prepare_data_collector(&mut self, params: PrepareParams) -> Result<()> {
match Command::new("perf")
.args(["record", "-a", "-q", "-g", "-k", "1", "-F", "99", "-e", "cpu-clock:pppH", "-o", &params.data_file_path, "--", "sleep", &params.collection_time.to_string()])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the significance of params.data_file_path? It's not the final output file here (hasn't been run through the passes like "inject").

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is the perf.data file. APerf names is to perf_profile_timestamp. In the report generation side, we pass this through the perf report --stdio -g none --percent-limit 1 command.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explain what the definition of data_file_path is. There has to be something very significant about it, particularly in light of having data types that create other files of their own choosing under the data_dir path. What are you going to do when a data type outputs two files, both of which have to be known at report generation time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During initialization, every data type has a file created by the lib at data_dir/datatype_timestamp.bin. The data_file_path is the path of this file. A datatype will only have 1 file created for it. If more files need to be created, use the prepare_data_collector and do your custom process in data_dir.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, the problem is that you're deferring the inject to report time, so you were still getting away with each data type producing only exactly 1 file.

Since inject has to be done at the end of collection, what are you going to do? Have data_file_path be the injected file, and then create a temporary under data_dir for the real-time data collection?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved the inject step to after_data_collection().

Comment on lines 77 to 79
let out_jit = Command::new("perf")
.args(["inject", "-j", "-i", &file_name, "-o", perf_jit_loc.clone().to_str().unwrap()])
.status();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't do the inject at report generation time unless you happen to be on the same machine (and more than that, the same processes (as in exact same process, not just the same name) are running)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You also need to make sure that by the end of collection your resulting file has all symbol names already gathered in it, because they won't necessarily be resolvable at report generation time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The expectation is you run aperf report on the SUT where aperf record was run. The inject step can be moved to the aperf record after_data_collection part.

Comment on lines +104 to +114
if perf_jit_loc.exists() {
let out_script = Command::new("perf")
.args(["script", "-f", "-i", perf_jit_loc.to_str().unwrap()])
.output()?;
write!(script_out, "{}", std::str::from_utf8(&out_script.stdout)?.to_string())?;
Folder::default().collapse_file(Some(script_loc), collapse_out)?;
fg_out = std::fs::OpenOptions::new().read(true).write(true).truncate(true).open(fg_loc)?;
flamegraph::from_files(&mut Options::default(), &vec![collapse_loc.to_path_buf()], fg_out)?;
}
let processed_data = vec![ProcessedData::Flamegraph(profile)];
Ok(processed_data)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing in the if appears to have any side-effects that modify profile, so how does any subsequent code/UI/whatever behave differently based on whether perf_jit_loc existed or not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before the if, we write the default fg_out. If the if is successful, fg_out gets overwritten. fg_out is what the UI is looking at. The profile is something we have to return to the caller.

@janaknat janaknat merged commit cfb2e49 into main Aug 29, 2023
4 checks passed
@janaknat janaknat deleted the perf-profile branch September 1, 2023 18:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants