Retry if pthread_create fails with EAGAIN #824
Ping? |
May I please ping this? |
Would it be possible to land this? Thanks. |
TBB has a bug in thread creation which is exposed with usage in mold. See rui314/mold#410, rui314/mold#600. We apply a patch that has been upstreamed to TBB at oneapi-src/oneTBB#824 to fix this. The patch implements a mechanism similar to one used by Go [1], and has been adopted in the Arch and OpenSUSE TBB packages. While we're here, let's align the `cmake` invocation with other formulae, fix references to `python3.10` (#108008), and change the `-rpath` flag so that it does not require relocation when pouring a bottle (which should speed bottle pour times up slightly). [1] https://go-review.googlesource.com/c/go/+/33894/ Closes #108431. Signed-off-by: Sean Molenaar <1484494+SMillerDev@users.noreply.github.com> Signed-off-by: BrewTestBot <1589480+BrewTestBot@users.noreply.github.com>
Sorry this test might just be flakey/failing randomly on my system... Please ignore, will do more testing :) |
Can someone in the TBB team review and merge this patch? More and more Linux distros are cherrypicking this as an unofficial patch. |
@rui314 According to https://man7.org/linux/man-pages/man3/pthread_create.3.html
Or here https://pubs.opengroup.org/onlinepubs/9699919799/
So this means that the system either lacks the necessary resources or has hit some system-imposed limit. Is there any thread, article or bug report where it is mentioned that |
I've seen |
Also, if this is a particular problem/bug, I would expect the solution to be wrapped in a macro. |
There seem to be some discussions on the Go discussion board, which you can reach from https://go-review.googlesource.com/c/go/+/33894/. Note that Go does the same thing as this patch. I also got many bug reports caused by the spurious failure of I can wrap this in a function (not a macro, because that's bad for readability), but since this function itself is a wrapper function to call |
A macro could show that this is a known problem because, as I said, this behavior is counterintuitive (and so is the pthread_create API). |
We need to count the number of iterations while retrying, so a macro wouldn't be a good choice. Let me factor it out as a function. |
9f1c18a to ea895c5
I factored out the code to a new function and wrote a test. |
@rui314 I tried to reproduce this failure with the test you proposed, and it didn't fail. Are there special conditions that should be met to reproduce it? |
I can reproduce it quite reliably when using the mold linker in the GCC test suite on a fairly modern AMD Zen system. |
According to Dmitry Vyukov's tweet, it is a general problem with pthread_create. |
I don't know if there's a bug filed to glibc for this particular issue. glibc's Speaking of the retry count, 20 is copied from Go. I don't know if it is the best max retry count, but it should at least be battle-tested in the wild. |
Uh, what? pthread_create does a whole lot of stuff, for example apply the pthread_attr_t, allocate a stack, allocate thread local storage, and various stuff I don't know what it means. https://github.com/bminor/glibc/blob/master/nptl/pthread_create.c#L619 Even glibc clone() isn't a straight syscall wrapper; glibc clone() calls a function in the child, but kernel clone() returns twice. https://man7.org/linux/man-pages/man2/clone.2.html#NOTES https://github.com/bminor/glibc/blob/master/sysdeps/unix/sysv/linux/x86_64/clone.S |
Sorry for the noise. I wrote somewhat more compact code for the same thing, which does not call
|
@rui314 Could you please apply @rozhuk-im's proposal and remove the test (it doesn't reproduce the problem)? |
@pavelkumbrasev Do you want me to remove |
Simplified the code and removed the test. |
Actually, the original code was better. It didn't call nanosleep with a 0-millisecond delay, and the new code introduced an unnecessary 20-millisecond sleep if every pthread_create call failed. I'll roll it back. |
I don't know why, but that test fails even without my change. |
Yes, I was wrong about that.
It behaves the same way as your original code.
|
It actually doesn't. i has already been incremented before control reaches here. |
On many Unix-like systems, pthread_create can fail spuriously even if the running machine has enough resources to spawn a new thread. Therefore, if EAGAIN is returned from pthread_create, we actually have to try again. I observed this issue when running the mold linker (https://github.com/rui314/mold) under a heavy load. mold uses OneTBB for parallelization. As another data point, Go has the same logic to retry on EAGAIN: https://go-review.googlesource.com/c/go/+/33894/ nanosleep is defined in POSIX 2001, so I believe that all Unix-like systems support it. Signed-off-by: Rui Ueyama <ruiu@cs.stanford.edu>
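The retry behavior described in this thread can be sketched roughly as follows. This is only an illustration, not the code that was merged into oneTBB: `retry_on_eagain` and `create_thread_with_retry` are hypothetical names, the limit of 20 attempts is the figure quoted from Go in the discussion, and the back-off of one extra millisecond per attempt is an assumption. The sketch never sleeps for 0 ms and does not sleep after the final failed attempt, which are the two properties discussed above.

```cpp
#include <cerrno>
#include <ctime>
#include <pthread.h>

// Hypothetical helper (not part of oneTBB): calls `create` until it returns
// something other than EAGAIN, or until `max_attempts` calls have been made.
// Returns the last error code, so 0 means success.
template <typename Create>
int retry_on_eagain(Create create, int max_attempts = 20) {
    int err = EAGAIN;
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        err = create();
        if (err != EAGAIN)
            return err;  // success (0) or an error that retrying cannot fix
        if (attempt == max_attempts)
            break;       // out of attempts: give up without sleeping again
        // Assumed back-off: sleep `attempt` milliseconds, so never 0 ms.
        timespec delay{};
        delay.tv_sec = attempt / 1000;
        delay.tv_nsec = (attempt % 1000) * 1000000L;
        nanosleep(&delay, nullptr);
    }
    return err;
}

// Hypothetical wrapper showing how pthread_create would plug in.
// pthread_create reports failure through its return value, not errno.
inline int create_thread_with_retry(pthread_t* handle, const pthread_attr_t* attr,
                                    void* (*routine)(void*), void* arg) {
    return retry_on_eagain(
        [&] { return pthread_create(handle, attr, routine, arg); });
}
```

Passing the creation step as a callable keeps the retry policy separate from pthread_create itself, so the policy can be exercised with a fake that returns EAGAIN a fixed number of times instead of relying on a loaded machine.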
Please take another look. I have now inlined the function, because it now applies to all systems that use pthread_create and the function looked too small to be outlined (and I didn't like its complicated parameter signature). |
The problem with this approach is that tests or applications that use oneTBB might still terminate, because 20 attempts is still no guarantee of success (as with the test that I mentioned above). |
In a situation in which pthread_create fails 20 times in a row, other resources are very likely to be very low as well, and the system probably isn't stable anyway; the OOM killer might kick in and kill your application however robust it is, for example. There are many other failure scenarios. So can we merge this patch now and discuss further improvements after that? Practically, this patch alone seems to eliminate all crashes we've observed in the wild. |
I have analyzed failures in |
It seems that this test is unstable by itself. Could you please apply these changes?
diff --git a/test/tbb/test_eh_thread.cpp b/test/tbb/test_eh_thread.cpp
index 51b97976..4a883511 100644
--- a/test/tbb/test_eh_thread.cpp
+++ b/test/tbb/test_eh_thread.cpp
@@ -119,6 +119,7 @@ TEST_CASE("Too many threads") {
return;
}
}
+ tbb::global_control g(tbb::global_control::max_allowed_parallelism, 2);
g_exception_caught = false;
try {
// Initialize the library to create worker threads
@@ -132,7 +133,9 @@ TEST_CASE("Too many threads") {
}
// Do not CHECK to avoid memory allocation (we can be out of memory)
if (!g_exception_caught) {
- FAIL("No exception was caught");
+ // There is no guarantee that new thread creation will fail even if we directly set the limit
+ // because another process might free resources during library initialization.
+ WARN_MESSAGE(false, "No exception was thrown on library initialization");
}
finalize();
}).join(); |
@pavelkumbrasev Maybe that should be submitted as an independent patch rather than as part of this one, if it fixes the flakiness of the test itself. Once you submit that change, I'll rebase this patch. |
Hi all, I got redirected here after a lot of investigations... I'm now trying to work out which version of TBB has the fix in: I'm seeing this issue with |
It should be 2021.9.0: v2021.8.0...v2021.9.0#diff-9cb055f5be587e2b154a4ff943f714c689b0aae1e8547e97bcbb76d89e0598fd |
Thanks @ZhongRuoyu . We will have to upgrade: thanks. |