
Add support for collecting, visualizing per-process data #66

Merged 5 commits into main on Jul 7, 2023
Conversation

janaknat (Contributor)

Shows only the top 15 processes recorded during the collection period.

Also, update the way we detect a run that lacks data of a particular type. Previously, if one run was missing a recorded data type, the visualization was left empty for all runs. With this update, only the run without the data is left without a visualization.

Attached a tarball of the aperf report which exercises the changes.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
comp.tar.gz

janaknat added 2 commits May 31, 2023 17:44
With a boolean, if one run doesn't have the data, no data is visualized for any of the runs.
@wash-amzn (Contributor) left a comment:

Threads can have a different name from their parent process. This happens quite a bit for server applications, so aggregating by name rather than by PIDs and explicit parent/child relationships could really be a problem here.

};
let name = pstat.comm;
let pid = pstat.pid;
let time_ticks = pstat.utime as i64 + pstat.stime as i64 + pstat.cutime + pstat.cstime +
Contributor:

Are those actually signed values?

Contributor Author:

Some were signed, some were unsigned. Changed all to unsigned. Not sure how you can have negative CPU usage.

Comment on lines +121 to +140
let mut total_time: u64 = 1;
if let TimeEnum::TimeDiff(v) = values.last().unwrap().time - values[0].time {
if v > 0 {
total_time = v;
}
}
Contributor:

If I collect with period = 0, there's only one data point, and letting total_time default to 1 is wrong.

Contributor Author:

Changed code so that we can't give a time of 0 for the collection interval or collection period.
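The fix described above can be sketched as a guard that refuses a zero-length collection window rather than silently substituting 1 (a hypothetical helper, not the PR's actual code):

```rust
/// Hypothetical sketch: compute the collection window from the first and
/// last sample timestamps (in seconds), rejecting a zero-length interval
/// (e.g. period = 0, a single data point) instead of defaulting to 1.
fn elapsed_seconds(first: u64, last: u64) -> Result<u64, String> {
    let diff = last.saturating_sub(first);
    if diff == 0 {
        Err("collection interval must be non-zero".to_string())
    } else {
        Ok(diff)
    }
}
```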

Comment on lines 170 to 174
for (name, entry) in &process_map {
process_list.insert(name.to_string(), entry.total_cpu_time / total_time);
}
let mut ordered_process_list = Vec::from_iter(process_list);
ordered_process_list.sort_by(|&(_, a), &(_, b)| b.cmp(&a));
Contributor:

Why put everything in a b-tree if you're just going to then pull everything out and re-sort it by value (discarding the work the b-tree did to sort by key)?
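The reviewer's point can be illustrated by sorting the pairs directly; `top_by_cpu` and the bare `u64` value are stand-ins for the PR's actual structures:

```rust
use std::collections::HashMap;

/// Illustrative alternative: collect (name, cpu_time) pairs into a Vec and
/// sort by value directly, skipping the intermediate key-ordered BTreeMap.
fn top_by_cpu(process_map: &HashMap<String, u64>) -> Vec<(String, u64)> {
    let mut list: Vec<(String, u64)> = process_map
        .iter()
        .map(|(name, cpu)| (name.clone(), *cpu))
        .collect();
    // Descending by CPU time.
    list.sort_by(|a, b| b.1.cmp(&a.1));
    list
}
```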

let mut process_entry = ProcessEntry {
name: entry.name.clone(),
prev_sample: sample.clone(),
collection_time: TimeEnum::TimeDiff(total_time),
Contributor:

This is a fixed value being attached to every ProcessEntry. If all of them have the same value, why store it in them at all?

Contributor Author:

Removed it. The collection_time is present only once now.

Comment on lines 181 to 187
let pe = process_map.get(name);
for sample in &pe.unwrap().samples {
if sample.cpu_time != 0 {
not_zero = true;
break;
}
}
Contributor:

Can't you just use the total_cpu_time you already have there (instead of looking for at least one non-zero sample)?

Contributor Author:

Yeah. Used the total_cpu_time.

}
}
if not_zero {
let name_split: Vec<&str> = name.split("-").collect();
Contributor:

The process name can contain a -, so you need to re-jigger this one way or another not to break in that case.

Contributor Author:

Changed it to ';'.

Contributor:

The process name can likely contain a semicolon as well

Contributor Author:

If I use serialization like serde, the CPU usage increases by 0.5%, so that is not an option. It is unlikely that process names will contain a semicolon; however, I can change the separator to "=>".
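One delimiter-free alternative (a sketch, not what the PR does) is to key the aggregation map by a (name, pid) tuple, so no character in the process name can collide with a separator:

```rust
use std::collections::HashMap;

/// Sketch: aggregate CPU ticks keyed by (name, pid) instead of a
/// "name<sep>pid" string, so names containing '-' or ';' are safe.
fn aggregate_ticks(samples: &[(&str, i32, u64)]) -> HashMap<(String, i32), u64> {
    let mut cpu_by_key: HashMap<(String, i32), u64> = HashMap::new();
    for (name, pid, ticks) in samples {
        *cpu_by_key.entry((name.to_string(), *pid)).or_insert(0) += *ticks;
    }
    cpu_by_key
}
```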

Comment on lines 205 to 209
end_entry.entries.push(pe.unwrap().clone());
end_map.insert(actual_name.to_string(), end_entry);
count = count + 1;
},
Contributor:

The actual logic for which processes are considered in the top 15 is pretty complicated to explain. I think it's the 15 processes (keyed by name, not pid), ordered by each one's top CPU-using thread (not all threads combined, where a thread is defined as being in the same process or in a child process with the same name).

I doubt I got it right. The point is that this is really complicated to explain to someone, so it's going to be an ongoing head-scratcher for users and anyone who has to answer users' questions.

let value_zero = values[0].clone();
let time_zero = value_zero.time;
let mut end_map: HashMap<String, EndEntry> = HashMap::new();
let end_values: Vec<EndEntry>;
Contributor:

I finally figured out what "end" really is, but the troubling thing is that even now giving it a name is tough. It's not the total aggregate per process (since we stop at 16 threads). Regardless, it needs a different name.

@wash-amzn (Contributor) left a comment:

The processes tab would be a good place to have things like number of processes created, context switches, etc.

let pid = pstat.pid;
let time_ticks = pstat.utime + pstat.stime + pstat.cutime as u64 + pstat.cstime as u64 +
pstat.guest_time.unwrap_or_default() + pstat.cguest_time.unwrap_or_default() as u64;
if time_ticks / procfs::ticks_per_second()? as u64 <= 0 {
Contributor:

I don't think you want to be constantly calling ticks_per_second; AFAICT it winds up making a sysconf system call each time (and then again a couple of lines down).

Contributor Author:

Okay. Rust and static global. Eek.
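A minimal sketch of the static-global approach in safe Rust, using `std::sync::OnceLock` so the sysconf-backed lookup runs at most once; `query_ticks_per_second` is a placeholder for the real `procfs::ticks_per_second()`:

```rust
use std::sync::OnceLock;

/// Placeholder for the real sysconf(_SC_CLK_TCK)-backed lookup.
fn query_ticks_per_second() -> u64 {
    100
}

/// Cached accessor: the underlying query runs at most once per process;
/// later calls return the memoized value without a system call.
fn ticks_per_second() -> u64 {
    static TICKS: OnceLock<u64> = OnceLock::new();
    *TICKS.get_or_init(query_ticks_per_second)
}
```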

}
/* Order the processes by Total CPU Time per collection time */
let mut ordered_process_list: Vec<ProcessEntry> = process_map.clone().into_values().collect();
ordered_process_list.sort_by(|a, b| (b.total_cpu_time / total_time).cmp(&(a.total_cpu_time / total_time)));
Contributor:

Dividing both numbers by total_time doesn't change the ordering.

Contributor Author:

Yeah.

if entry.total_cpu_time != 0 {
match end_map.get_mut(&entry.name) {
Some(e) => {
if e.total_threads > 16 {
Contributor:

You aren't actually counting threads; it's the number of different processes with the same name that you're aggregating.

Contributor Author:

Gah. Changing it to total_processes.

if count >= 15 {
break;
}
Contributor:

Couldn't you aggregate everything else into an "Other" entry, so then at least the sum of everything would match the system's overall cpu usage? It would avoid hiding a bunch of usage when there's a long tail of low-cpu usage processes that together use a significant amount.
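The suggestion above could look roughly like this (illustrative types; the PR's real entries carry more than a name and a total):

```rust
/// Sketch of an "Other" bucket: keep the top `n` entries by CPU time and
/// fold the tail into one aggregate, so the parts still sum to the whole
/// even with a long tail of low-usage processes.
fn top_n_with_other(mut list: Vec<(String, u64)>, n: usize) -> Vec<(String, u64)> {
    list.sort_by(|a, b| b.1.cmp(&a.1));
    if list.len() > n {
        let tail: u64 = list[n..].iter().map(|(_, cpu)| cpu).sum();
        list.truncate(n);
        list.push(("Other".to_string(), tail));
    }
    list
}
```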

Contributor Author:

We could. That has a high probability of confusing the user though.

Contributor:

I assure you the current behavior is going to be more confusing.

Contributor Author:

I would prefer to keep the current functionality as is. I'd increase the process count per graph to 32 and see how it fares.

if entry.total_cpu_time != 0 {
match end_map.get_mut(&entry.name) {
Some(e) => {
if e.total_threads > 16 {
Contributor:

Why stop at aggregating together 16?

Contributor Author:

We have to stop somewhere. For a Linux kernel build, there are thousands of sub-processes. The choices are 16, 32, or 64 (pushing it).

Contributor:

What are you actually saving? You've already iterated through everything once, so it's O(2n) vs O(n)? Not even that: the pass above is linear in the collection time and the number of processes; this one is linear in just the number of processes.

Contributor Author:

We cannot have thousands of entries in the graph. Also, the size of the report would become huge if we measure long-running, fork-bombing processes.

Contributor:

This case isn't about changing the size of any data structure, only aggregating together CPU time (more additions).

Regardless, I'll re-read all of this and see if I'm misunderstanding things.

Contributor:

I think I see what you were talking about on output size. However, there's a better way to do this and get the correct result.

Right now you're aggregating by pid, sorting those aggregates, and then partially (depending on who you ask partially would mean incorrectly) re-aggregating based on process name. Just aggregate on process name in the first place, then the sort will give you the actual top N that you want without a second partial aggregation that will have cases of giving the wrong results.

Contributor:

Doing a single aggregation isn't quite as easy as simply dropping the pid part of the key, you'll need to have a second-level map in each entry to keep the last, per-pid, samples, in order to compute diffs and add time to the process-by-name's aggregate
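A sketch of that shape, with hypothetical types standing in for the PR's structures: group by name at the top level, and keep a per-pid sub-map of the last raw counter so each new sample can still be diffed against the right baseline:

```rust
use std::collections::HashMap;

/// Per-name aggregate with a nested per-pid map of last-seen tick counts.
struct NameAggregate {
    total_cpu_time: u64,
    last_ticks_by_pid: HashMap<i32, u64>,
}

/// Fold one (name, pid, cumulative ticks) sample into the by-name map.
/// The first sample for a pid establishes its baseline and adds no time.
fn record(map: &mut HashMap<String, NameAggregate>, name: &str, pid: i32, ticks: u64) {
    let agg = map.entry(name.to_string()).or_insert_with(|| NameAggregate {
        total_cpu_time: 0,
        last_ticks_by_pid: HashMap::new(),
    });
    let prev = agg.last_ticks_by_pid.insert(pid, ticks).unwrap_or(ticks);
    agg.total_cpu_time += ticks.saturating_sub(prev);
}
```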

ordered_process_list.sort_by(|a, b| (b.total_cpu_time).cmp(&(a.total_cpu_time)));

/*
* Create 1 entry per process name. Add similarly named processes to it.
Contributor:

Not really "similarly named", it's an exact match grouping.

@janaknat (Contributor Author):

3 Linux kernel builds which fork off a bunch of make sub-processes. Plotly was able to handle it without issues. The run was for 1 hour.
aperf_report_fullbuild4.tar.gz

@janaknat janaknat force-pushed the processes branch 3 times, most recently from b6f296f to 2c00e6b Compare July 5, 2023 16:18
janaknat and others added 3 commits July 5, 2023 19:08
action-rs/toolchain already includes cargo. Don't use another actions-rs for cargo commands.
@janaknat janaknat merged commit ca998e1 into main Jul 7, 2023
@janaknat janaknat deleted the processes branch July 31, 2023 20:30