Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collect number of process FDs and /proc/self/limits #90

Merged
merged 6 commits into from
Nov 6, 2023

Conversation

lissyx
Copy link
Contributor

@lissyx lissyx commented Oct 12, 2023

this actually depends on rust-minidump/rust-minidump#881 to the minidump-common crate would need to be updated

@lissyx lissyx requested a review from Jake-Shadle as a code owner October 12, 2023 10:53
@lissyx lissyx force-pushed the linuxFD branch 6 times, most recently from ad40d66 to 4ccc432 Compare October 12, 2023 15:49
Copy link
Collaborator

@Jake-Shadle Jake-Shadle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there some prior work you can point to for this? AFAICT neither breakpad nor crashpad do something similar, so it would help to know why this change is useful. In particular the /proc/{crash_thread}/fds doesn't really seem very useful as it only records the number of files, not their names/attributes/etc.

@lissyx
Copy link
Contributor Author

lissyx commented Oct 13, 2023

Is there some prior work you can point to for this? AFAICT neither breakpad nor crashpad do something similar, so it would help to know why this change is useful. In particular the /proc/{crash_thread}/fds doesn't really seem very useful as it only records the number of files, not their names/attributes/etc.

So we have had an unexplained rise of socketpair failure on release 116 and 117 of firefox, and it would be "magically" disappearing with 118. One of the working assumption is that this is due to fd exhaustion, but I have had a look at many of those bugs and it's unclear whether we might not be rather hitting an OOM condition.

Either way, we lack proper view to assert the root cause here, and collecting /proc/PID/limits as well as the amount of FDs is a tentative to get more data and be more certain of why we crashed when we hit those conditions.

@lissyx
Copy link
Contributor Author

lissyx commented Oct 13, 2023

(I should probably have opened it as a draft PR though)

@Jake-Shadle
Copy link
Collaborator

Ok, that makes sense. I think this can be merged as is, but I think before it is released (once rust-minidump/rust-minidump#881 is merged and released) it would make sense to spec out the contents of the streams, for example changing the FDs stream to be the true companion of the limits stream, with one field for each limit as well as a version header so that the fields can be changed over time, as things like OOM killing could also be inferred from this information, as limiting to just FDs feels too specific for your particular use case.

@gabrielesvelto
Copy link
Contributor

FYI I'm looking at the existing structures that are used for Windows, maybe we could repurpose one to store the fd count like we do for other pieces of information.

@Jake-Shadle
Copy link
Collaborator

Yah, that could work as well, just feels too specific and limiting to have an entire stream for a single number.

@luser
Copy link

luser commented Oct 13, 2023

So we have had an unexplained rise of socketpair failure on release 116 and 117 of firefox, and it would be "magically" disappearing with 118. One of the working assumption is that this is due to fd exhaustion, but I have had a look at many of those bugs and it's unclear whether we might not be rather hitting an OOM condition.

Thanks for the explanation! Since we already capture /proc/self/status, does the FDSize entry there give you useful info? Per the kernel docs, that's "number of file descriptor slots currently allocated", which is only an upper bound on the number of open file descriptors. The kernel appears to allocate 256 slots by default on my machine and allocates them in chunks of 256 more as needed until the process hits the limit. Capturing /proc/self/limits definitely seems useful since that will let you compare the two values.

@lissyx
Copy link
Contributor Author

lissyx commented Oct 14, 2023

So we have had an unexplained rise of socketpair failure on release 116 and 117 of firefox, and it would be "magically" disappearing with 118. One of the working assumption is that this is due to fd exhaustion, but I have had a look at many of those bugs and it's unclear whether we might not be rather hitting an OOM condition.

Thanks for the explanation! Since we already capture /proc/self/status, does the FDSize entry there give you useful info? Per the kernel docs, that's "number of file descriptor slots currently allocated", which is only an upper bound on the number of open file descriptors. The kernel appears to allocate 256 slots by default on my machine and allocates them in chunks of 256 more as needed until the process hits the limit. Capturing /proc/self/limits definitely seems useful since that will let you compare the two values.

that's indeed one of the things I considered, but when I tested, ~28 opened FDs would show up as FDSize=512 and I'm worried it might not be precise enough that we dont get a clear view and we would end up with still not being sure. I'll see if I can extract more from telemetry around that metric anyway.

@gabrielesvelto
Copy link
Contributor

We've discussed this a bit offline and I came up with an idea to gather richer information that would still help us explain the Firefox issue we're dealing with. Windows minidumps have introduced in recent times the HANDLE_DATA_STREAM stream which records all the handles opened by a process. I could repurpose it to record all the open fds, which we could then use to inspect them (and count them, obviously).

src/linux/minidump_writer.rs Outdated Show resolved Hide resolved
@lissyx
Copy link
Contributor Author

lissyx commented Nov 3, 2023

BTW I'm unsure what do to about aff27cf ?

@gabrielesvelto
Copy link
Contributor

@Jake-Shadle should I merge PR #94 first?

@Jake-Shadle
Copy link
Collaborator

@lissyx please keep it, as the comment says, if you link with LLD it will by default use a 64 bit hash for the build id which causes some tests to fail, but it can't be on by default since the flags only work on LLD and not ld/gold.

@Jake-Shadle
Copy link
Collaborator

@gabrielesvelto sure, then this PR can rebase and bump the stream count and get merged.

@Jake-Shadle Jake-Shadle merged commit 349a65d into rust-minidump:main Nov 6, 2023
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants