Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect procstat IO counters #9189

Closed
phemmer opened this issue Apr 26, 2021 · 13 comments · Fixed by #15186
Closed

Incorrect procstat IO counters #9189

phemmer opened this issue Apr 26, 2021 · 13 comments · Fixed by #15186
Assignees
Labels
area/procstat bug unexpected problem or unintended behavior help wanted Request for community participation, code, contribution size/m 2-4 day effort upstream bug or issues that rely on dependency fixes

Comments

@phemmer
Copy link
Contributor

phemmer commented Apr 26, 2021

Relevant telegraf.conf:

[[inputs.procstat]]
pid_file = "/foo.pid"

System info:

Telegraf v1.18.1
Linux

Docker

Steps to reproduce:

  1. Run telegraf
  2. View metrics

Expected behavior:

  • The "read_bytes" metric should show the number of bytes read by the process.
  • The "write_bytes" metric should show the number of bytes written by the process.

Actual behavior:

  • The "read_bytes" metric tracks the number of bytes read from block devices.
  • The "write_bytes" metric tracks the number of bytes written to block devices.
  • There are no metrics which track the number of bytes read by the process.
  • There are no metrics which track the number of bytes written by the process.

Additional info:

The plugin is pulling the read_bytes and write_bytes values from /proc/$pid/io. These metrics track block device IO, not all IO. For all IO the rchar and wchar metrics need to be used.

@phemmer phemmer added the bug unexpected problem or unintended behavior label Apr 26, 2021
@powersj
Copy link
Contributor

powersj commented Apr 11, 2022

next steps: add the additional metrics requested, clearly explain the existing fields measure data read/written from block devices and new fields of process_read_bytes and process_write_bytes (happy to hear different/better names) are specific to what the process has read/write.

@powersj powersj added the help wanted Request for community participation, code, contribution label Apr 11, 2022
@powersj powersj added the size/m 2-4 day effort label Aug 2, 2022
@teclab-at
Copy link

Is this why I am seeing weird numbers? Because I am running rsync between two disks (simply a mirror operation) and it shows about ~80MB/s. But when looking at read_bytes and write_bytes I get ~600 (whatever unit that is). When I divide it by eight it somehow resembles the actuelly traffic rate.

thx

@srebhan
Copy link
Member

srebhan commented Nov 21, 2023

This is an upstream issue as gopsutil does not report those metrics: shirou/gopsutil#672

@srebhan srebhan added the upstream bug or issues that rely on dependency fixes label Nov 21, 2023
@srebhan srebhan self-assigned this Nov 21, 2023
@srebhan
Copy link
Member

srebhan commented Apr 18, 2024

@phemmer please test the binary in #15186 which adds the new fields cached_read_bytes and cached_write_bytes for Linux.

@phemmer
Copy link
Contributor Author

phemmer commented Apr 18, 2024

I have not tested, but I do not think names such as cached_read_bytes are appropriate. The rchar and wchar fields have nothing to do with caching. If a process writes to the TTY, that is counted. If the process writes to another process over a pipe, that is counted. If the process writes to a network socket, that is counted. None of these are related to caching.

@srebhan
Copy link
Member

srebhan commented Apr 18, 2024

@phemmer can you suggest another name or should we keep rchar and wchar?

@phemmer
Copy link
Contributor Author

phemmer commented Apr 18, 2024

I don't know. First it depends on whether the other operating system which is supported here (windows) uses disk IO or all IO for its counters. If on windows read_bytes tracks all IO, and on linux read_bytes tracks disk IO, then I'd say there's quite the mess going on. I don't think it would be appropriate to have one field mean different things based on the operating system, and then have a field with a different name mean the same thing as another field with a different name. So the question is, is the value consistent? If not, then I suspect the best option is to fix the issue, in which case the value represented is going to have to change on one of the operating systems. In which case I would propose read_bytes track exactly that, bytes read. Not bytes read from disk. If you want to track bytes read from disk, the name should be disk_read_bytes. However if the usage is consistent across both OSs, then you get to either deal with a breaking change, or have a mess of a naming situation, as read_bytes implies all IO since the name is not qualified. You could have all_read_bytes or whatever, but then the naming is just going to be confusing, and it won't be obvious to people that read_bytes is really restricted to disk.

@srebhan
Copy link
Member

srebhan commented Apr 18, 2024

That sounds reasonable. However, I have no idea about the situation on other OSes. Will probably take a look next week...

@srebhan
Copy link
Member

srebhan commented Apr 19, 2024

@phemmer it looks like the underlying data-structure is about disk I/O according to this documentation.
So I would keep the read_bytes and write_bytes fields as is and add total_read_bytes and total_write_bytes for Linux and improve the documentation. Does that work for you?

@phemmer
Copy link
Contributor Author

phemmer commented Apr 19, 2024

That's not the data that gopsutil is gathering. As can be seen here, it calls procGetProcessIoCounters which is the API call GetProcessIoCounters and provides IO_COUNTERS. The documentation on IO_COUNTERS does not mention anything about disk, and says all I/O. From various searching on the internet, I can find comments from people confirming that it provides all IO, not just disk.

@srebhan
Copy link
Member

srebhan commented Apr 19, 2024

Ok thanks! So I will change the PR to report rchar and wchar as read_bytes and write_bytes and the current values as disk_read_bytes and disk_write_bytes!?

@phemmer
Copy link
Contributor Author

phemmer commented Apr 19, 2024

That proposal seems like it immediately fixes the problem. The values will have accurate names, and the values will be consistent across OSs. But it is a breaking change. Though this whole thing might be qualified as a bug, so is really a fix. However there is the underlying issue that this is all caused by a gopsutil inconsistency issue. The value means one thing on one OS, and something else on another OS. If they ever fix the inconsistency, what is going to prevent telegraf from breaking again?

@srebhan
Copy link
Member

srebhan commented Apr 19, 2024

Nothing, but that's a long history of gopsutils breaking us... :-(

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/procstat bug unexpected problem or unintended behavior help wanted Request for community participation, code, contribution size/m 2-4 day effort upstream bug or issues that rely on dependency fixes
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants