Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the mechanism used to suspend threads #28

Closed
gabrielesvelto opened this issue May 24, 2022 · 6 comments · Fixed by #108
Closed

Improve the mechanism used to suspend threads #28

gabrielesvelto opened this issue May 24, 2022 · 6 comments · Fixed by #108

Comments

@gabrielesvelto
Copy link
Contributor

Suspending the target process threads is currently an all-or-nothing operation that prevents minidump generation upon failure. The system we currently use is to iterate over the threads we've enumerated during initialization. This is an inherently racy operation as new threads can be created in the meantime and existing ones can quit. This is partially handled in that we tolerate being unable to suspend certain threads assuming that we stop at least a single one.

There should be ways to improve this but they're not clear-cut and probably needs some experimentation. One possibility is to send a SIGSTOP signal to the target process and wait for it to settle down before ptrace()-ing it. SIGSTOP is a big hammer that needs to be used with caution so I don't know what kind of issues we might encounter, but it could make the process more reliable by getting rid of some races (or at least leaving the kernel to handle them).

Additionally we might consider dumping threads we couldn't suspend. The data we'll get might be garbage but I don't really mind, we get garbage all the time in crash reports, it could also be useful. Let's say it would certainly be better than nothing. The reason why we never tried to do it with Breakpad is simple enough: handling errors is a chore! But here it's a lot easier. We could factorize the code for writing a single thread in thread_list_stream::write() and ignore individual failures (or even better, record them!).

@cpeterso
Copy link

This work is related to Firefox Android's "no crashing thread identified" bug where many crash reports are missing thread stack traces:

https://bugzilla.mozilla.org/show_bug.cgi?id=1644486

afranchuk added a commit to afranchuk/minidump-writer that referenced this issue Mar 20, 2024
things.

If this fails, we continue as we used to. This is an attempt to get a
consistent/static process state.

Closes rust-minidump#28.
Jake-Shadle added a commit that referenced this issue Mar 21, 2024
…ace things. (#108)

* Send SIGSTOP to linux/android processes before doing other procfs/ptrace
things.

If this fails, we continue as we used to. This is an attempt to get a
consistent/static process state.

Closes #28.

* Fix mac

* Fix warning about unused doc comment

* Allow user customization of timeout

Not sure this is needed, but better to allow users to customize this rather than rely on a hardcoded value

* Update nix

---------

Co-authored-by: Jake Shadle <jake.shadle@embark-studios.com>
@afranchuk
Copy link
Contributor

I've just tested this in the android emulator by running all test suites there.

I guess the easier way to do this might have been to install rust in the emulator, but instead I messed around with cargo configuration to set up the NDK linker and a runner script which pushed files with adb. There was an additional code change needed: adding support for an env variable to specify the test helper binary which is used to create child processes. The test suites previously used cargo run or CARGO_BIN_EXE_test env vars. I added a bit of code to use a TEST_HELPER env var if set. There's also an extra step of running cargo run --bin test to get it copied to the emulator since cargo test won't do that. I can upstream this if we think it useful, otherwise compiling and running in the emulator is very likely less setup/friction.

All tests pass in the emulator! Which gives a bit more confidence that there's nothing weird around SIGSTOP on android.

@Jake-Shadle
Copy link
Collaborator

I can upstream this if we think it useful, otherwise compiling and running in the emulator is very likely less setup/friction.

I'd very much appreciate android tests being run in this repo's CI if you can get it working reliably

@afranchuk
Copy link
Contributor

afranchuk commented Apr 5, 2024

@Jake-Shadle I see in the CI that cross is already being used, but the test lines are commented out. Was that due to flakiness? I was considering possibly using https://github.com/ReactiveCircus/android-emulator-runner.

@Jake-Shadle
Copy link
Collaborator

Cross uses qemu for running test binaries, which doesn't support ptrace at all, unfortunately.

@afranchuk
Copy link
Contributor

Oh I see. I'll try out this other action, it seems to make it a bit easier than manually writing all the stuff to fetch and start the emulator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants