Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ProcTotal Inaccurate in MiscStats #1606

Closed
1 task done
eric1234 opened this issue Mar 5, 2024 · 4 comments · Fixed by #1612
Closed
1 task done

ProcTotal Inaccurate in MiscStats #1606

eric1234 opened this issue Mar 5, 2024 · 4 comments · Fixed by #1612

Comments

@eric1234
Copy link
Contributor

eric1234 commented Mar 5, 2024

Describe the bug

PR #682 added the ability to get the total number of processes via the MiscStats struct. It gets this information by reading the /proc/loadavg file which looks like:

0.35 0.22 0.18 2/1095 5671

The proc man page describes that file as:

The first three fields in this file are load average figures giving the number of jobs in the run queue (state R) or waiting for disk I/O (state D) averaged over 1, 5, and 15 minutes. They are the same as the load average numbers given by uptime(1) and other programs. The fourth field consists of two numbers separated by a slash (/). The first of these is the number of currently runnable kernel scheduling entities (processes, threads). The value after the slash is the number of kernel scheduling entities that currently exist on the system. The fifth field is the PID of the process that was most recently created on the system.

The bolded text is the key part. In my example above, it's showing 1095 as the total number of kernel scheduled entities. But note it uses that term instead of "processes". It even indicates in parentheses that "kernel scheduled entities" is both processes and threads. But by the name ProcTotal and the fact that all other numbers in MiscStat are processes only (not threads) this is implying I have 1095 processes when I do not. Thread count varies greatly. If I run it just a few minutes later it's showing 1127:

0.22 0.27 0.22 1/1127 5877

Also if we actually count the processes in my system we find we only have a bit over 300:

%> $ ps -A | wc -l
307

Expected behavior

I would expect ProcsTotal to return the total number of processes not processes and threads. I don't know if there is something else that can be read in /proc that gives just the total number of processes (I couldn't find anything) or if the attribute needs to be deprecated and removed for being in-accurate.

I came across this due to OpenTelemtry using this struct to help measure the number of processes in the system and getting wildly high numbers. It actually calculates this number itself by reading all directories under /proc that are number. But then it only uses that as a starting place and offsets it with what this struct is providing saying:

Processes are actively changing as we run this code, so this reason the above loop will tend to underestimate process counts. getMiscStats is a single read/syscall so it should be more accurate.

Based on that I don't think the OpenTelemetry strategy will be sufficient to get an accurate process count.

Environment (please complete the following information):

  • Linux: [paste contents of /etc/os-release and the result of uname -a]
$ cat /etc/os-release
NAME="Pop!_OS"
VERSION="22.04 LTS"
ID=pop
ID_LIKE="ubuntu debian"
PRETTY_NAME="Pop!_OS 22.04 LTS"
VERSION_ID="22.04"
HOME_URL="https://pop.system76.com"
SUPPORT_URL="https://support.system76.com"
BUG_REPORT_URL="https://github.com/pop-os/pop/issues"
PRIVACY_POLICY_URL="https://system76.com/privacy"
VERSION_CODENAME=jammy
UBUNTU_CODENAME=jammy
LOGO=distributor-logo-pop-os
$ uname -a
Linux personal 6.6.10-76060610-generic #202401051437~1704728131~22.04~24d69e2 SMP PREEMPT_DYNAMIC Mon J x86_64 x86_64 x86_64 GNU/Linux
@shirou
Copy link
Owner

shirou commented Mar 10, 2024

As I understand it, your argument is as follows.

The ProcsTotal of MiscStat returned by load.Misc() is supposed to return the number of Processes that do not include Threads by name, but actually returns a value that does include Threads. This is because /proc/loadavg is read in load.Misc(), and the value of this loadavg contains threads.

After reading man proc, I agree with you. However, it seems to me that the only way to get a pure process count that does not include threads is to count /proc as you write, or call the ps command. Is there any other way?

I will try to find a better way, but if you know of any, I would appreciate it if you could let me know.

Thank you.

@eric1234
Copy link
Contributor Author

However, it seems to me that the only way to get a pure process count that does not include threads is to count /proc as you write, or call the ps command. Is there any other way?

I don't think so. I also searched through the man page as well as did some general web searches and the general consensus is to count /proc as gopsutil is already doing with readPidsFromDir.

OpenTelemetry expressed concern about this strategy being accurate due to changing processes while the count is going on. This initially made me think this wasn't a viable strategy.

But it's also trying to collect status information on the pids as it's going. If we are just trying to get a process count then it's really just the one call to Readdirnames which if we dig down into go seems to be just a single syscall also so I think in this context it would be accurate.

I would be open to putting together a PR that refactors load.Misc to instead using the existing readPidsFromDir instead of the loadavg file. This should mean gopsutil provides the same interface externally but will report an accurate total process count. Let me know if that sounds good and I'll work on it.

@shirou
Copy link
Owner

shirou commented Mar 11, 2024

Great to hear that! There does not seem to be a way to get the exact value in the proc file system, using readPidsFromDir seems like a good way to get number of processes.

I think it would be better to move readPidsFromDir() to internal/common/common_linux.go and call it from the process and load packages. If you could create a PR, it would be greatly appreciated. Thank you!

@eric1234
Copy link
Contributor Author

Set down to implement this and it looks like there is already a different method in common_linux.go that counts the number of processes by reading the directories entries that look like a number. NumberProcWithContext is very similar to the readPidsFromDir. I think the only main difference is the former is based around a context while the later is based around a path.

I'm going to work on making getProcsTotal read from the existing NumberProcWithContext already in common. That will fix my issue. Then as a separate bit of work if we wanted to DRY up NumberProcWithContext and readPidsFromDir that could be done.

eric1234 added a commit to eric1234/gopsutil that referenced this issue Mar 21, 2024
The `ProcsTotal` in the `MiscStat` structure was very inaccurate. It was reading
a value which is the total number of kernel scheduling entities. This includes
both processes and threads significantly overcounting.

This instead uses an existing method already in common to count the number of
processes via the /proc filesystem where any directory is a number. This should
still be a single syscall to read that directory entry.

This fixes shirou#1606.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants