Investigate and at least document why a compilation process may create a difference between total host power and sum of processes power #20

Closed
bpetit opened this issue Nov 30, 2020 · 4 comments · Fixed by #132
Labels
bug Something isn't working

Comments

@bpetit
Contributor

bpetit commented Nov 30, 2020

While running a build --release of scaphandre with lto=thin, I see that the host's total power differs from the sum of all processes' power, and the two never converge.

[Screenshot: screen-weird-diff cleaned]

This needs to be investigated, then either fixed if fixable, or documented if it is due to the behavior of the powercap RAPL counters (or we could even file a bug report against the powercap or Intel RAPL module).

I have never seen that kind of glitch before with any type of process...

@bpetit bpetit added the bug Something isn't working label Dec 1, 2020
@PierreRust
Collaborator

PierreRust commented Jan 21, 2021

My guess is that the issue does not come from the RAPL measurement but from the way this power is split among processes. I suspect that the way per-process CPU usage (%) is computed can cause the sum of CPU usage to exceed 100% (x CPU count), which leads to a sum of power greater than what RAPL exposes.
I had a quick look at the way this CPU usage is computed and I cannot really say what would be wrong. I guess we should log the computed CPU usage (and maybe the output of total_time_jiffies) to see if there is anything fishy going on.
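
To make the hypothesis above concrete, here is a minimal sketch of a proportional split. It is not scaphandre's actual code: `split_power` and its parameters are hypothetical names, and it assumes each process is attributed host power in proportion to its share of the jiffies consumed over the sampling interval.

```rust
/// Hypothetical helper (not scaphandre's API): split the host power reported
/// by RAPL across processes in proportion to their share of CPU time.
///
/// * `host_power_w`: host power over the interval, in watts.
/// * `process_jiffies`: jiffies consumed by each process over the interval.
/// * `total_jiffies`: jiffies consumed by the whole host over the interval.
fn split_power(host_power_w: f64, process_jiffies: &[u64], total_jiffies: u64) -> Vec<f64> {
    // Avoid a division by zero when no CPU time was recorded.
    if total_jiffies == 0 {
        return vec![0.0; process_jiffies.len()];
    }
    process_jiffies
        .iter()
        .map(|&j| host_power_w * (j as f64) / (total_jiffies as f64))
        .collect()
}
```

With this kind of split, the per-process values add up to at most the host power only while the per-process jiffies and the host total describe the same interval. If the denominator is too small (for example because the two were sampled at slightly different times, which is an assumption here and not a confirmed cause), the shares sum to more than 100% and so does the attributed power.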

As I said in the chat, I'm also afraid that CPU usage is not an accurate way of splitting energy among processes, but that's a bit off topic and could be the subject of a separate discussion.

Actually, on my laptop, it happens almost all the time:
[image]
It seems to be more common when the load is high, which might explain why you saw it during compilation.

@bpetit
Contributor Author

bpetit commented Jan 22, 2021

That's interesting. I don't see a real difference on my laptop. On metrics.hubblo.org there are some differences, but the sum is not always higher; it is sometimes lower too. This made me think that the way the CPU time allocated to processes is stored in /proc may differ from one CPU model to another.

I have never had a "constant" difference pattern like yours. It would be very interesting to compare those contexts.

I am thinking about adding more internal metrics to the prometheus exporter to help with such investigations.

@bpetit
Contributor Author

bpetit commented Feb 1, 2021

@PierreRust I got the same pattern as you in other use cases (sometimes the pattern also changes over a long period of time). I'll try to work on that next week. Do you have any resources to share about best practices for splitting consumption across processes? I'd like to think about and discuss the best way to be more accurate while staying light and simple.

For the record, I think we should investigate how the data in /proc is computed, as it is what scaph bases the split on.
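
As a starting point for that investigation, here is a minimal sketch of where per-process CPU time lives in /proc. Per proc(5), /proc/<pid>/stat exposes utime and stime as fields 14 and 15, in clock ticks. The helper name `cpu_ticks_from_stat` is hypothetical, and whether scaph reads exactly these fields is an assumption, not something confirmed in this thread.

```rust
/// Hypothetical helper: extract utime + stime (in clock ticks) from the
/// content of a /proc/<pid>/stat file, passed in as a string.
fn cpu_ticks_from_stat(stat: &str) -> Option<u64> {
    // Field 2 (comm) is wrapped in parentheses and may itself contain spaces
    // or ')', so resume parsing after the last ')'.
    let end_of_comm = stat.rfind(')')?;
    let fields: Vec<&str> = stat[end_of_comm + 1..].split_whitespace().collect();
    // fields[0] is field 3 (state), so field 14 (utime) sits at index 11
    // and field 15 (stime) at index 12.
    let utime: u64 = fields.get(11)?.parse().ok()?;
    let stime: u64 = fields.get(12)?.parse().ok()?;
    Some(utime + stime)
}
```

Logging this value for each process alongside the host-wide total would show directly whether the per-process sums ever exceed the total over an interval.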

@bpetit
Contributor Author

bpetit commented Oct 30, 2021

Hi @PierreRust

I get much better results (comparing the total power consumption of all processes to the total power consumption of the host) with this fix: #132

Could you try that too and give me your thoughts?

Thanks!

@bpetit bpetit added this to General Jun 19, 2024
@bpetit bpetit moved this to Previous releases in General Jun 19, 2024