Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry if pthread_create fails with EAGAIN #824

Merged
merged 1 commit into from
Oct 26, 2022

Conversation

rui314
Copy link
Contributor

@rui314 rui314 commented May 7, 2022

Description

On many Unix-like systems, pthread_create can fail spuriously even if
the running machine has enough resources to spawn a new thread.
Therefore, if EAGAIN is returned from pthread_create, we actually have
to try again.

I observed this issue when running the mold linker
(https://github.com/rui314/mold) under a heavy load. mold uses OneTBB
for parallelization.

As another data point, Go has the same logic to retry on EAGAIN:
https://go-review.googlesource.com/c/go/+/33894/

nanosleep is defined in POSIX 2001, so I believe that all Unix-like
systems support it.

  • - git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details)

Type of change

  • bug fix - change that fixes an issue
  • new feature - change that adds functionality
  • tests - change in tests
  • infrastructure - change in infrastructure and CI
  • documentation - documentation update

Tests

  • added - required for new features and some bug fixes
  • not needed

Documentation

  • updated in # - add PR number
  • needs to be updated
  • not needed

Breaks backward compatibility

  • Yes
  • No
  • Unknown

Notify the following users

Other information

@rui314
Copy link
Contributor Author

rui314 commented May 18, 2022

Ping?

@rui314 rui314 mentioned this pull request Jun 17, 2022
6 tasks
pld-gitsync pushed a commit to pld-linux/tbb that referenced this pull request Jun 18, 2022
@marxin
Copy link

marxin commented Jul 22, 2022

May I please ping this?

@sylvestre
Copy link

Would it be possible to land this ? thanks

carlocab added a commit to carlocab/homebrew-core that referenced this pull request Aug 19, 2022
TBB has a bug in thread creation which is exposed with usage in mold.
See rui314/mold#410, rui314/mold#600. We apply a patch that has been
upstreamed to TBB at oneapi-src/oneTBB#824 to fix this.

The patch implements a mechanism similar to one used by Go [1], and has
been adopted in the Arch and OpenSUSE TBB packages.

While we're here, let's align the `cmake` invocation with other
formulae, fix references to `python3.10` (Homebrew#108008), and change the
`-rpath` flag so that it does not require relocation when pouring a
bottle (which should speed bottle pour times up slightly).

[1] https://go-review.googlesource.com/c/go/+/33894/
BrewTestBot pushed a commit to Homebrew/homebrew-core that referenced this pull request Aug 19, 2022
TBB has a bug in thread creation which is exposed with usage in mold.
See rui314/mold#410, rui314/mold#600. We apply a patch that has been
upstreamed to TBB at oneapi-src/oneTBB#824 to fix this.

The patch implements a mechanism similar to one used by Go [1], and has
been adopted in the Arch and OpenSUSE TBB packages.

While we're here, let's align the `cmake` invocation with other
formulae, fix references to `python3.10` (#108008), and change the
`-rpath` flag so that it does not require relocation when pouring a
bottle (which should speed bottle pour times up slightly).

[1] https://go-review.googlesource.com/c/go/+/33894/

Closes #108431.

Signed-off-by: Sean Molenaar <1484494+SMillerDev@users.noreply.github.com>
Signed-off-by: BrewTestBot <1589480+BrewTestBot@users.noreply.github.com>
@MatthewGentoo
Copy link

MatthewGentoo commented Aug 19, 2022

This seems to cause one of the tests to fail:

62/136 Testing: test_eh_thread
62/136 Test: test_eh_thread
Command: "/var/tmp/portage/dev-cpp/tbb-2021.7.0_rc1/work/oneTBB-2021.7.0-rc1_build-abi_x86_64.amd64/gnu_11.3_cxx11_64_relwithdebinfo/test_eh_t
hread" "--force-colors=1"
Directory: /var/tmp/portage/dev-cpp/tbb-2021.7.0_rc1/work/oneTBB-2021.7.0-rc1_build-abi_x86_64.amd64/gnu_11.3_cxx11_64_relwithdebinfo
"test_eh_thread" start time: Aug 19 16:44 BST
Output:
----------------------------------------------------------
[doctest] doctest version is "2.4.7"
[doctest] run with "--help" for options
===============================================================================
/var/tmp/portage/dev-cpp/tbb-2021.7.0_rc1/work/oneTBB-2021.7.0-rc1/test/tbb/test_eh_thread.cpp:90:
TEST CASE:  Too many threads

/var/tmp/portage/dev-cpp/tbb-2021.7.0_rc1/work/oneTBB-2021.7.0-rc1/test/tbb/test_eh_thread.cpp:90: ERROR: test case THREW exception: Resource temporarily unavailable

===============================================================================
[doctest] test cases: 1 | 0 passed | 1 failed | 0 skipped
[doctest] assertions: 2 | 2 passed | 0 failed |
[doctest] Status: FAILURE!
<end of output>
Test time =   0.00 sec
----------------------------------------------------------
Test Failed.
"test_eh_thread" end time: Aug 19 16:44 BST
"test_eh_thread" time elapsed: 00:00:00
----------------------------------------------------------

It seems like this test is supposed to check for the case where there are too many threads and pthread_create returns EAGAIN, but I'm not sure what to look at following that to debug. Maybe thread_monitor::launch is now throwing the wrong kind of exception so the test isn't catching it?


Sorry this test might just be flakey/failing randomly on my system... Please ignore, will do more testing :)

@rui314
Copy link
Contributor Author

rui314 commented Oct 10, 2022

Can someone in the TBB team review and merge this patch? More and more Linux distros are cherrypicking this as an unofficial patch.

@isaevil
Copy link
Contributor

isaevil commented Oct 10, 2022

@rui314 According to https://man7.org/linux/man-pages/man3/pthread_create.3.html

EAGAIN A system-imposed limit on the number of threads was
encountered. There are a number of limits that may
trigger this error: the RLIMIT_NPROC soft resource limit
(set via setrlimit(2)), which limits the number of
processes and threads for a real user ID, was reached; the
kernel's system-wide limit on the number of processes and
threads, /proc/sys/kernel/threads-max, was reached (see
proc(5)); or the maximum number of PIDs,
/proc/sys/kernel/pid_max, was reached (see proc(5)).

Or here https://pubs.opengroup.org/onlinepubs/9699919799/

[EAGAIN]
The system lacked the necessary resources to create another thread, or the system-imposed limit on the total number of threads in a process {PTHREAD_THREADS_MAX} would be exceeded.

So this means that the system either lacks the necessary resources, or has been set some system limit.

Is there any thread, article or bug report where mentioned that pthread_create can fail even with having enough resources or respecting a set limit?

@lnicola
Copy link

lnicola commented Oct 10, 2022

I've seen pthread_create fail with EAGAIN on busy systems, without reaching any user or kernel limit I was aware of. I suspect @rui314 has a collection of bug reports like this.

@pavelkumbrasev
Copy link
Contributor

pavelkumbrasev commented Oct 10, 2022

Also if this is some particular problem/bug I would expect to wrap the solution into macros.
In my point of view this particular workaround looks strange because OS reported that there are not enough resources but we spins and wait for something. (It is counter intuitive)

@rui314
Copy link
Contributor Author

rui314 commented Oct 10, 2022

There seems to be some discussions on Go discussion board, which you can visit from https://go-review.googlesource.com/c/go/+/33894/. Note that Go does the same thing as this is. I also got many bug reports caused by the spurious failure of pthread_create.

I can wrap this with a function (not a macro because it's bad for readability), but since this function itself is a wrapper function to call pthread_create, wrapping it with one more function doesn't seems like a good idea.

@pavelkumbrasev
Copy link
Contributor

The macro could showed that this is known problem, because as I said this behavior is counter intuitive (and pthread_create API).
Could you also please add a test case that will reproduce this problem?

@rui314
Copy link
Contributor Author

rui314 commented Oct 10, 2022

We need to count the number of iterations while retrying, so macro wouldn't be a good choice. Let me factor it out as a function.

@rui314 rui314 force-pushed the master branch 3 times, most recently from 9f1c18a to ea895c5 Compare October 12, 2022 08:06
@rui314
Copy link
Contributor Author

rui314 commented Oct 12, 2022

I factored out the code to a new function and wrote a test.

@pavelkumbrasev
Copy link
Contributor

@rui314 I tried reproduce this failure with test you proposed and it didn't fail. Are there special conditions that should be met to reproduce it?
Also, I think if proposed solution will be used on system with real limit we might face with situation when thread will wait on this loop instead useful work.

@marxin
Copy link

marxin commented Oct 13, 2022

The test cannot reproduce the issue unless your system is under a heavy load, and even under a heavy load, the probably that the problem occur is very low. I don't think you can see a failure easily.

I can quite stably reproduce that when using mold linker in GCC test suite on a pretty modern AMD Zen system.
When running 16 parallel tests each spawning 16 threads in mold.

@pavelkumbrasev
Copy link
Contributor

Accordance with Dmitry Vyukov tweet it is general problem with pthread_create.
Is there issue that was created on glibc maintainers? (As I understand this issue persist for more than 6 years already)
I also found out that some other projects use retry loop on pthread_create. But the amount of retry iterations depends on project. Were there any checked performed to pick retry constant = 20 or it is copy from Go?
We will discuss the way how oneTBB should handle this particular problem. It could be graceful degradation or some retry loop.

@rui314
Copy link
Contributor Author

rui314 commented Oct 13, 2022

I don't know if there's a bug filed to glibc for this particular issue. glibc's pthread_create(2) is after all just a wrapper to the system call, so it's not their problem, no?

Speaking of the retry count, 20 is copied from Go. I don't know if it is the best max retry count, but it should at least be battle-tested in the wild.

@Alcaro
Copy link

Alcaro commented Oct 13, 2022

glibc's pthread_create(2) is after all just a wrapper to the system call, so it's not their problem, no?

Uh, what? pthread_create does a whole lot of stuff, for example apply the pthread_attr_t, allocate a stack, allocate thread local storage, and various stuff I don't know what it means. https://github.com/bminor/glibc/blob/master/nptl/pthread_create.c#L619

Even glibc clone() isn't a straight syscall wrapper; glibc clone() calls a function in the child, but kernel clone() returns twice. https://man7.org/linux/man-pages/man2/clone.2.html#NOTES https://github.com/bminor/glibc/blob/master/sysdeps/unix/sysv/linux/x86_64/clone.S

@rozhuk-im
Copy link

Sorry for noise, I wrote a bit compact code for same, which does not call nanosleep() with zeros on first error.

static int
pthread_create_eagain(pthread_t *handle, const pthread_attr_t *attr,
    void *(*fn)(void*), void *arg) {
	int error;
	const size_t max_tries = 20;
	struct timespec rqts = {0, 0};

	for (size_t i = 1; i <= max_tries; i ++) {
		error = pthread_create(handle, attr, fn, arg);
		if (0 == error || /* Ok, done. */
		    EAGAIN != error) /* Other error. */
			return (error);
		/* Retry after tries * 1 millisecond. */
		rqts.tv_nsec = (i * 1000 * 1000);
		nanosleep(&rqts, NULL);
	}

	return (EAGAIN);
}

@pavelkumbrasev
Copy link
Contributor

@rui314 Could you please apply @rozhuk-im proposal and remove test (it doesn't reproduce the problem)?

@rui314
Copy link
Contributor Author

rui314 commented Oct 21, 2022

@pavelkumbrasev Do you want me to remove #ifdef __linux__? There's no guarantee that other systems will never fail with the same spurious error.

@rui314
Copy link
Contributor Author

rui314 commented Oct 21, 2022

Simplified the code and removed the test.

@rui314
Copy link
Contributor Author

rui314 commented Oct 21, 2022

Actually the original code was better. I didn't call nanosleep with 0 milliseconds delay, and the new code introduced an unnecessary 20 milliseconds sleep if all pthread_create failed. I'll roll it back.

src/tbb/rml_thread_monitor.h Outdated Show resolved Hide resolved
src/tbb/rml_thread_monitor.h Outdated Show resolved Hide resolved
src/tbb/rml_thread_monitor.h Outdated Show resolved Hide resolved
@pavelkumbrasev
Copy link
Contributor

Could you please check why test is failing?
image

@rui314
Copy link
Contributor Author

rui314 commented Oct 21, 2022

I don't know why, but that test fails even without my change.

@rozhuk-im
Copy link

I didn't call nanosleep with 0 milliseconds delay,

Yes, I was wrong with that.

and the new code introduced an unnecessary 20 milliseconds sleep if all pthread_create failed.

It do like you original code do.

    // Retry after tries * 1 millisecond.
    struct timespec ts = {0, tries * 1000 * 1000};

@rui314
Copy link
Contributor Author

rui314 commented Oct 21, 2022

It do like you original code do.

It actually doesn't. i was incremented before the control reaches here.

On many Unix-like systems, pthread_create can fail spuriously even if
the running machine has enough resources to spawn a new thread.
Therefore, if EAGAIN is returned from pthread_create, we actually have
to try again.

I observed this issue when running the mold linker
(https://github.com/rui314/mold) under a heavy load. mold uses OneTBB
for parallelization.

As another data point, Go has the same logic to retry on EAGAIN:
https://go-review.googlesource.com/c/go/+/33894/

nanosleep is defined in POSIX 2001, so I believe that all Unix-like
systems support it.

Signed-off-by: Rui Ueyama <ruiu@cs.stanford.edu>
@rui314
Copy link
Contributor Author

rui314 commented Oct 22, 2022

Please take another look. Now I inlined the function because it's now applied to all systems that uses pthread_create and the function look too small to be outlined (and I didn't like its complicated parameter signatures).

@pavelkumbrasev
Copy link
Contributor

The problem with this approach that tests or application that uses oneTBB might terminate because it is just 20 attempts still no guarantee for success. (as with the test that I mentioned above)
The idea is to combine this approach with graceful degradation.

@rui314
Copy link
Contributor Author

rui314 commented Oct 24, 2022

In a situation in which pthread_create fails 20 times in a row, other resources are very likely to be also very low, and the system isn't probably stable anyway; the OOM killer might kick in and kill your application however robust it is, for example. There are many other failure scenarios. So can we merge this patch now and discuss further improvements after that? Practically, this patch alone seems to eliminate all crashes we've observed in the wild.

@pavelkumbrasev
Copy link
Contributor

I have analyzed failures in test_eh_thread and seems that this code line: FAIL("No exception was caught"); leads to fail.
The test logic: set limit on threads = P -> create P threads that means that limit is reached -> try to execute parallel_for that should create some threads, but we already reached the limit, so it should throw exception in this case -> for some reason this patch helps skip the limit and create a thread and that why test fails.
It might be even worse if created worker thread will call pthread_create after it and fail on it that will lead to test termination. So behavior of this test become pretty unstable.

@pavelkumbrasev
Copy link
Contributor

It seems that this test is unstable by itself. Could you please apply this changes?

diff --git a/test/tbb/test_eh_thread.cpp b/test/tbb/test_eh_thread.cpp
index 51b97976..4a883511 100644
--- a/test/tbb/test_eh_thread.cpp
+++ b/test/tbb/test_eh_thread.cpp
@@ -119,6 +119,7 @@ TEST_CASE("Too many threads") {
                 return;
             }
         }
+        tbb::global_control g(tbb::global_control::max_allowed_parallelism, 2);
         g_exception_caught = false;
         try {
             // Initialize the library to create worker threads
@@ -132,7 +133,9 @@ TEST_CASE("Too many threads") {
         }
         // Do not CHECK to avoid memory allocation (we can be out of memory)
         if (!g_exception_caught) {
-            FAIL("No exception was caught");
+            // There is no guarantee that new thread creation will fail even if we directly set the limit
+            // because another process might free resources during library initialization.
+            WARN_MESSAGE(false, "No exception was thrown on library initialization");
         }
         finalize();
     }).join();

@rui314
Copy link
Contributor Author

rui314 commented Oct 25, 2022

@pavelkumbrasev Maybe that should be submitted as an independent patch than a part of this patch, if it fixes the test itself's flakiness. Once you submit that change I'll rebase this patch.

@mattgodbolt
Copy link

Hi all, I got redirected here after a lot of investigations... I'm now trying to work out which version of TBB has the fix in: I'm seeing this issue with mold as per the description above with mold 1.11.0 and TBB 2021.8.0, but never saw it with 2021.7.0.

@ZhongRuoyu
Copy link

I'm now trying to work out which version of TBB has the fix in

It should be 2021.9.0: v2021.8.0...v2021.9.0#diff-9cb055f5be587e2b154a4ff943f714c689b0aae1e8547e97bcbb76d89e0598fd

@mattgodbolt
Copy link

Thanks @ZhongRuoyu . We will have to upgrade: thanks.

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.