Add support for collecting, visualizing per-process data #66
Conversation
With a boolean, if one run doesn't have the data, no data is visualized for any of the runs.
Threads can have a different name from their parent process. This happens quite a bit for server applications, so aggregating by name rather than by pids and explicit relationship attributes could really be a problem here.
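To make the reviewer's point concrete, here is a minimal sketch of aggregating by an explicit parent relationship instead of by name. All type and field names are illustrative, not aperf's actual types.

```rust
use std::collections::HashMap;

// Hypothetical per-thread sample; fields are illustrative only.
#[derive(Clone)]
struct ThreadSample {
    pid: i32,    // thread id
    ppid: i32,   // parent process id
    comm: String,
    cpu_ticks: u64,
}

// Aggregate CPU time by the parent pid, so a thread that renamed itself
// (comm differs from the parent's comm) still counts toward its process.
fn aggregate_by_parent(samples: &[ThreadSample]) -> HashMap<i32, u64> {
    let mut totals: HashMap<i32, u64> = HashMap::new();
    for s in samples {
        *totals.entry(s.ppid).or_insert(0) += s.cpu_ticks;
    }
    totals
}
```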
src/data/processes.rs
Outdated
```rust
};
let name = pstat.comm;
let pid = pstat.pid;
let time_ticks = pstat.utime as i64 + pstat.stime as i64 + pstat.cutime + pstat.cstime +
```
Are those actually signed values?
Some were signed, some were unsigned. Changed all to unsigned. Not sure how you can have negative CPU usage.
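On the "negative CPU usage" point: cumulative tick counters from /proc only grow, so unsigned diffs are safe; a saturating subtraction additionally guards against a counter reset instead of panicking on underflow. `cpu_delta` is a hypothetical helper, not code from this PR.

```rust
// Delta between two cumulative, unsigned CPU tick counters.
// saturating_sub clamps to 0 if the counter ever resets (e.g. pid reuse),
// rather than underflowing.
fn cpu_delta(prev_ticks: u64, curr_ticks: u64) -> u64 {
    curr_ticks.saturating_sub(prev_ticks)
}
```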
```rust
let mut total_time: u64 = 1;
if let TimeEnum::TimeDiff(v) = values.last().unwrap().time - values[0].time {
    if v > 0 {
        total_time = v;
    }
}
```
If I collect with a period = 0, there's only one data point, and letting total_time be 1 is wrong.
Changed the code so that we can't give a time of 0 for the collection interval or collection period.
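One way to make that guard explicit is to return `None` instead of a placeholder 1 when there is only a single data point. This is a sketch with illustrative names, not the PR's actual code.

```rust
// Derive the collection interval from the first and last sample timestamps.
// With zero or one sample (period = 0) there is no interval, so return None
// rather than silently substituting 1.
fn collection_interval(timestamps: &[u64]) -> Option<u64> {
    match (timestamps.first(), timestamps.last()) {
        (Some(&first), Some(&last)) if last > first => Some(last - first),
        _ => None,
    }
}
```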
src/data/processes.rs
Outdated
```rust
for (name, entry) in &process_map {
    process_list.insert(name.to_string(), entry.total_cpu_time / total_time);
}
let mut ordered_process_list = Vec::from_iter(process_list);
ordered_process_list.sort_by(|&(_, a), &(_, b)| b.cmp(&a));
```
Why put everything in a b-tree if you're just going to then pull everything out and re-sort it by value (discarding the work the b-tree did to sort by key)?
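The reviewer's suggestion can be sketched as collecting straight into a `Vec` and sorting once by value, skipping the intermediate map entirely (`top_by_cpu` is a hypothetical helper):

```rust
// Sort (name, cpu_time) pairs descending by CPU time in one pass,
// instead of building a BTreeMap (sorted by key) and re-sorting its
// contents by value afterward.
fn top_by_cpu(entries: &[(String, u64)]) -> Vec<(String, u64)> {
    let mut v: Vec<(String, u64)> = entries.to_vec();
    v.sort_by(|a, b| b.1.cmp(&a.1)); // descending by CPU time
    v
}
```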
src/data/processes.rs
Outdated
```rust
let mut process_entry = ProcessEntry {
    name: entry.name.clone(),
    prev_sample: sample.clone(),
    collection_time: TimeEnum::TimeDiff(total_time),
```
This is a fixed value being attached to every ProcessEntry. If all of them have the same value, why store it in them at all?
Removed it. The collection_time is present only once now.
src/data/processes.rs
Outdated
```rust
let pe = process_map.get(name);
for sample in &pe.unwrap().samples {
    if sample.cpu_time != 0 {
        not_zero = true;
        break;
    }
}
```
Can't you just use the total_cpu_time you already have there (instead of looking for at least one non-zero sample)?
Yeah. Used the total_cpu_time.
src/data/processes.rs
Outdated
```rust
    }
}
if not_zero {
    let name_split: Vec<&str> = name.split("-").collect();
```
The process name can contain a -, so you need to re-jigger this one way or another not to break in that case.
Changed it to ';'.
The process name can likely contain a semicolon as well
If I use serialization like serde, the CPU usage increases by 0.5%, so that is not an option. It is unlikely that process names will have a semicolon; however, I can change it to using "=>" as the separator.
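An alternative that sidesteps separators entirely: key the map on a (name, pid) tuple, so no character in the process name can collide with a delimiter and no serde-style serialization is needed. This is a sketch; the helper name and map shape are hypothetical.

```rust
use std::collections::HashMap;

// Key the per-process map on a tuple instead of a "name<sep>pid" string,
// so names containing '-', ';', or "=>" need no escaping.
type Key = (String, i32); // (process name, pid)

fn insert_sample(map: &mut HashMap<Key, u64>, name: &str, pid: i32, ticks: u64) {
    // Accumulate CPU ticks for this exact (name, pid) pair.
    *map.entry((name.to_string(), pid)).or_insert(0) += ticks;
}
```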
src/data/processes.rs
Outdated
```rust
end_entry.entries.push(pe.unwrap().clone());
end_map.insert(actual_name.to_string(), end_entry);
count = count + 1;
},
```
The actual logic here for which processes are considered in the top 15 is pretty complicated to explain. I think it's the 15 processes (keyed by their name, not pid), ordered by each one's top CPU-using thread (not all threads combined, where a thread is defined as being in the same process or in a child process with the same name).
I doubt I got it right. The point is that this is really complicated to explain to someone, so it's going to be an ongoing head-scratcher for users and anyone who has to answer users' questions.
src/data/processes.rs
Outdated
```rust
let value_zero = values[0].clone();
let time_zero = value_zero.time;
let mut end_map: HashMap<String, EndEntry> = HashMap::new();
let end_values: Vec<EndEntry>;
```
I finally figured out what "end" really is, but the troubling thing is that even now giving it a name is tough. It's not a total aggregate per process (since we stop at 16 threads). Regardless, it needs a different name.
The processes tab would be a good place to have things like number of processes created, context switches, etc.
src/data/processes.rs
Outdated
```rust
let pid = pstat.pid;
let time_ticks = pstat.utime + pstat.stime + pstat.cutime as u64 + pstat.cstime as u64 +
    pstat.guest_time.unwrap_or_default() + pstat.cguest_time.unwrap_or_default() as u64;
if time_ticks / procfs::ticks_per_second()? as u64 <= 0 {
```
I don't think you want to be constantly calling ticks_per_second, AFAICT it winds up making a sysconf system call each time (and then again a couple lines down)
Okay. Rust and static global. Eek.
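The "static global" fix can be sketched with `std::sync::OnceLock`, so the expensive lookup runs exactly once. Here `raw_ticks_per_second` is a placeholder standing in for the real `procfs::ticks_per_second()` call (which does a sysconf syscall).

```rust
use std::sync::OnceLock;

// Placeholder for the real sysconf-backed lookup; 100 is a typical
// Linux clock-tick rate, used here only so the sketch runs.
fn raw_ticks_per_second() -> u64 {
    100
}

// Cache the result on first use; later calls return the cached value
// without touching the kernel again.
fn ticks_per_second() -> u64 {
    static TICKS: OnceLock<u64> = OnceLock::new();
    *TICKS.get_or_init(raw_ticks_per_second)
}
```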
src/data/processes.rs
Outdated
```rust
}
/* Order the processes by Total CPU Time per collection time */
let mut ordered_process_list: Vec<ProcessEntry> = process_map.clone().into_values().collect();
ordered_process_list.sort_by(|a, b| (b.total_cpu_time / total_time).cmp(&(a.total_cpu_time / total_time)));
```
Dividing both numbers by the same total_time doesn't change the ordering.
Yeah.
src/data/processes.rs
Outdated
```rust
if entry.total_cpu_time != 0 {
    match end_map.get_mut(&entry.name) {
        Some(e) => {
            if e.total_threads > 16 {
```
You aren't actually counting threads; it's the number of different processes with the same name you're aggregating.
Gah. Changing it to total_processes.
src/data/processes.rs
Outdated
```rust
if count >= 15 {
    break;
}
```
Couldn't you aggregate everything else into an "Other" entry, so then at least the sum of everything would match the system's overall cpu usage? It would avoid hiding a bunch of usage when there's a long tail of low-cpu usage processes that together use a significant amount.
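The suggested "Other" bucket could look like this sketch (function and names are hypothetical, not from the PR):

```rust
// Keep the top N entries by CPU time and fold the remainder into one
// "Other" entry, so the sum of everything plotted still matches the
// system's overall CPU usage.
fn top_n_with_other(mut entries: Vec<(String, u64)>, n: usize) -> Vec<(String, u64)> {
    entries.sort_by(|a, b| b.1.cmp(&a.1));
    if entries.len() > n {
        let other: u64 = entries[n..].iter().map(|e| e.1).sum();
        entries.truncate(n);
        entries.push(("Other".to_string(), other));
    }
    entries
}
```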
We could. That has a high probability of confusing the user though.
I assure you the current behavior is going to be more confusing.
I would prefer to have the current set of functionality as is. I'd increase the processes count per graph to 32 and see how it fares.
src/data/processes.rs
Outdated
```rust
if entry.total_cpu_time != 0 {
    match end_map.get_mut(&entry.name) {
        Some(e) => {
            if e.total_threads > 16 {
```
Why stop at aggregating together 16?
We have to stop somewhere. For a Linux kernel build, there are thousands of sub-processes. The choices are 16, 32, or 64 (pushing it).
What are you actually saving? You've already iterated through everything once, so O(2n) vs O(n)? Not even that: the loop above is linear in the amount of time and the number of processes, while this one is linear only in the number of processes.
We cannot have thousands of entries in the graph. Also, the size of the report would become huge if we measure long-running, fork-bombing processes.
This case isn't about changing the size of any data structure, only aggregating together CPU time (more additions).
Regardless, I'll re-read all of this and see if I'm misunderstanding things.
I think I see what you were talking about on output size. However, there's a better way to do this and get the correct result.
Right now you're aggregating by pid, sorting those aggregates, and then partially (depending on who you ask partially would mean incorrectly) re-aggregating based on process name. Just aggregate on process name in the first place, then the sort will give you the actual top N that you want without a second partial aggregation that will have cases of giving the wrong results.
Doing a single aggregation isn't quite as easy as simply dropping the pid part of the key, you'll need to have a second-level map in each entry to keep the last, per-pid, samples, in order to compute diffs and add time to the process-by-name's aggregate
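A sketch of that single-aggregation idea, with the second-level per-pid map holding the last sample so per-pid diffs can still feed the by-name aggregate. All types and names here are illustrative, not aperf's.

```rust
use std::collections::HashMap;

// One aggregate per process name, plus a nested map of the last
// cumulative tick count seen for each pid under that name.
#[derive(Default)]
struct NameAggregate {
    total_cpu_time: u64,
    last_sample_by_pid: HashMap<i32, u64>, // pid -> last cumulative ticks
}

// Record a sample: diff against this pid's previous cumulative counter
// and add the delta to the name-level total. The first sample for a pid
// contributes no delta.
fn record(map: &mut HashMap<String, NameAggregate>, name: &str, pid: i32, ticks: u64) {
    let agg = map.entry(name.to_string()).or_default();
    let prev = agg.last_sample_by_pid.insert(pid, ticks).unwrap_or(ticks);
    agg.total_cpu_time += ticks.saturating_sub(prev);
}
```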
src/data/processes.rs
Outdated
```rust
ordered_process_list.sort_by(|a, b| (b.total_cpu_time).cmp(&(a.total_cpu_time)));

/*
 * Create 1 entry per process name. Add similarly named processes to it.
```
Not really "similarly named", it's an exact match grouping.
3 Linux kernel builds which fork off a bunch of make sub-processes. Plotly was able to handle it without issues. The run was for 1 hour.
b6f296f to 2c00e6b
action-rs/toolchain already includes cargo. Don't use another actions-rs for cargo commands.
Shows only the top 15 processes recorded during the collection period.
Also, update the way we detect a run not having data of a particular type. Previously, if any one run did not contain data of a recorded type, the visualization for that type would be empty for all runs. With this update, only the run without the data has no visualization.
Attached a tarball of the aperf report which exercises the changes.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
comp.tar.gz