Severe thread starvation issues #41586
Thanks for this great writeup. We also face these same issues (though to a lesser degree of operational severity), as described in #34267 (comment). As @tkf referred to in #34267 (comment), an attractive option may be to add dynamically scoped scheduling policies; then normal code would keep using the default policy, while latency-sensitive code could opt in to a different one. Getting this working in full generality with custom user-defined schedulers seems like a longer-term design project, but adding some tunable knobs to the builtin scheduler feels more doable.
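A purely hypothetical sketch of what such a dynamically scoped policy could look like; `with_scheduler` and the `:low_latency` policy do not exist in Base, and the stub here just runs the block so the snippet is self-contained:

```julia
# Hypothetical API sketch only: `with_scheduler` and :low_latency do not exist
# in Base. The stub simply runs the block so the snippet is self-contained.
with_scheduler(f, policy::Symbol) = f()

# Latency-sensitive code opts in to the policy for the tasks it spawns...
heartbeat = with_scheduler(:low_latency) do
    Threads.@spawn begin
        for _ in 1:10
            sleep(1.0)   # e.g. export a heartbeat metric each second
        end
    end
end

# ...while normal code elsewhere keeps the default scheduling policy.
work = Threads.@spawn sum(rand(10^6))
wait(heartbeat); wait(work)
```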
Related to this, what about having a compiler plugin insert implicit yield points for any code called in a latency-sensitive scheduling context? This is like what the Go compiler does (or used to do) by inserting yield points at every function call boundary. This has performance overhead, but for Julia we could avoid emitting that code unless called in a latency-sensitive context. (IIUC Go has now moved away from doing this at compile time and now has runtime asynchronous preemption of goroutines: https://medium.com/a-journey-with-go/go-asynchronous-preemption-b5194227371c. That seems really great, but also a huge pile of work.)
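For concreteness, the effect of such compiler-inserted yield points on a tight loop would be roughly what you get by writing the `yield()` calls by hand; an illustrative sketch:

```julia
# Today: a non-allocating loop with no calls that reach a yield point can
# monopolize its thread for the entire computation.
function busy_sum(n)
    s = 0.0
    for i in 1:n
        s += sin(float(i))
    end
    return s
end

# Roughly what automatic yield-point insertion would amount to: periodically
# give the scheduler a chance to run other tasks queued on this thread.
function busy_sum_yielding(n)
    s = 0.0
    for i in 1:n
        s += sin(float(i))
        i % 1_000_000 == 0 && yield()
    end
    return s
end
```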
Since this can introduce races in code that relies on the lack of yield points, getting rid of such code and asking users to write fully "thread-like" code may be one option. But I think it'd be nice (and possible) to be able to declare that a given function is blocking. If we have such annotations, a compiler (plugin) can safely insert yield points.
The issue with that is that you will likely run a mix of sensitive and non-sensitive code, and the non-sensitive code will mess up your latency. So systematically figuring out where we are missing safepoints for GC latencies would be important; the obvious one is loops (maybe we can insert safepoints after vectorization?) and probably non-allocating, non-inlined function calls?
Our systems are complex enough that migration to new versions is non-trivial. We are working on it, but it will take a while to make it viable. While the task migrations will likely make this issue less frequent, it is still possible to generate enough non-yielding tasks to block critical operations, so the potential for this remains even then.
Having a …
Right, I guess there are two related but different issues:
Isn't the requirement here rather that tasks originating from other workers can't use the specified worker? We'd need to handle cases where tasks are scheduled from the dedicated worker, since otherwise we can't call arbitrary library functions from the dedicated worker. An interesting case is when a library (that you don't control) uses

```julia
@sync begin
    @spawn f()
    g()
end
```

If the task spawned for `f()` can only run on the other (possibly busy) workers, the dedicated worker ends up blocked in the `@sync` wait. Alternatively, maybe we can pull in outsider tasks (if they're not already running) upon `wait`.
Didn't Go move away from this approach? GopherCon 2020: Austin Clements - Pardon the Interruption: Loop Preemption in Go 1.14 - YouTube. If it didn't work in Go, I think it's much less likely to work in Julia, since people write very performance-sensitive (throughput-oriented) loops.
Our GC safepoints are async based, e.g. they are implemented as a load from a location that will segfault, and the fault handler will then switch to GC. I think the Go approach was to do more work on the user side of the safepoint. Adding an extra load in the backedge is certainly sub-optimal (and potentially additional work to maintain the data in the gcframe). We already have the infrastructure for preempting the code (that is part of how the profiler works), but we are missing a way to find GC pointers at the interrupt location, e.g. stackmaps (let's not do that), conservative scanning (meh), or on ARM we could do pointer tagging (xD). I would say we should see how big a performance impact it would actually be to late-insert a safepoint into the backedges of the loop. Also, non-allocating functions are missing safepoints.

So the recent work on reducing allocation might have made this problem worse :)
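At the Julia source level, `GC.safepoint()` already exposes a safepoint poll for code that wants to opt in manually; what is discussed above is having the compiler place an equivalent check in loop backedges automatically. A small sketch of the manual version:

```julia
# A non-allocating, non-yielding loop: without any safepoint, a thread running
# this cannot respond to another thread's stop-the-world GC request.
function spin(n)
    s = 1.0
    for i in 1:n
        s = muladd(s, 0.999999, 1.0e-6)
        # Manual safepoint poll: cheap when no collection is requested,
        # parks this thread when one is.
        i % 10_000 == 0 && GC.safepoint()
    end
    return s
end
```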
The load in the safepoint is sandwiched with seq_cst fences. They are signal fences and so "free" for CPUs, but aren't they harmful to the optimizer?
Yeah, that's why I was thinking of inserting it at the end of the optimization pipeline.
We discussed today having a Multithreading Birds of a Feather working group discussion entirely focused on this topic to work through the design. The main challenge here remains designing the feature, so meeting together could help unblock it. 👍
Regarding the direction for adding dedicated worker threads, I think it's worth considering not adding a Julia API and only providing a command line option to do it. This way, application authors can isolate latency-critical code, but package authors are encouraged to write composable concurrent/parallel programs. Having a single central resource manager is important. Not many languages have this, and it'd be nice if we could keep this property.
I disagree; making users figure out how many threads to start on the CLI is annoying and error-prone. I'd consider it much more important to document that such an API is only to be used very sparingly, and that most users shouldn't even consider using it. I agree that it would be ideal if we could always write composable concurrent programs, but that is not the case. We have libraries like CUDA which we have no control over, and we have to use whatever API they give us, including all their non-composable, blocking routines. Providing a way for us to work around such problematic APIs is key to allowing the rest of our program to remain composable.
I've suggested a couple of ways of solving the tricky CUDA integration with the current concurrency API in Julia. I don't think a dedicated worker is necessary or that it would ease the implementation. It's also not straightforward to get this right due to the priority inversion I mentioned in #41586 (comment). I think a more convincing use case for the dedicated worker would be Distributed.jl and Dagger.jl (as the distributed scheduling is latency sensitive). But Distributed.jl can control the command line flags of the worker processes, and so it can create a "manager thread" for each worker process quite easily. It can't add the dedicated worker thread in the main process, but the main process usually doesn't do the computation in Distributed.jl. It's also possible to create a scheduler process and move the controlling responsibility out of the main process (which could be better for distributing schedulers anyway).
For the VS Code extension, we would also like to be able to add a low-latency worker thread without the need for command line arguments. One scenario for us is that a user starts a Julia process (without us being able to control or add any command line args), and then we give the user the ability to connect this process to the extension integration points; in that scenario, we would then want to run a low-latency worker thread for the communication in that Julia process. If the only way to add a dedicated worker thread was via a command line option, we couldn't solve that scenario.
Dedicated threads/threadpool PR is posted: #42302 |
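For anyone reading this later: the threadpool support that eventually shipped (Julia 1.9 and newer) can be used roughly as below. The exact API may differ in details from the PR as first posted, and `report_heartbeat`/`heavy_query` are placeholder application functions:

```julia
# Start Julia with, e.g., `julia --threads 8,1` to get 8 default-pool threads
# plus 1 interactive thread reserved for latency-sensitive tasks.

report_heartbeat() = println("heartbeat @ ", time())   # placeholder
heavy_query() = sum(sin, 1:10^8)                        # placeholder

# Latency-sensitive work goes to the interactive pool...
heartbeat = Threads.@spawn :interactive begin
    for _ in 1:10
        report_heartbeat()
        sleep(1.0)
    end
end

# ...while throughput-oriented work stays in the default pool.
work = Threads.@spawn :default heavy_query()

wait(work); wait(heartbeat)
```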
Mitigated by #41616 |
🎉 thanks everyone! Great to see this closed 👍 |
Summary
We have observed highly unpredictable thread starvation occurring on our production servers that are written in Julia. In particular, this thread starvation causes our monitoring (instrumentation) subsystem to stall for extended periods of time (tens of minutes, up to hours in extreme cases). This often leads to a total loss of visibility into the operation of our production systems, which makes it virtually impossible to diagnose the services and to run a reliable production system.
Solving this issue is a high priority for us at RelationalAI due to its impact on our production systems.
Background
Our database server (rai-server) is written in Julia. The production deployment currently runs on Julia 1.6.1 with multi-threading enabled and the number of threads set to the number of cores of the underlying virtual machines (we are using the Azure cloud).
Our monitoring subsystem relies heavily on background tasks: tasks that are spawned at server startup and periodically kick off certain operations, such as collecting system-level metrics (e.g. by reading /proc files) and so on.

In addition to metrics that are updated via background tasks, many other metrics are updated as soon as the events of interest occur (e.g. when the pager allocates memory for a new page, we increment a metric that represents the number of pages currently in use).
Users can run database queries. When these queries are received via HTTP, they're parsed, analyzed, and executed. During execution, a potentially large number of tasks with very long execution times can be generated. However, we currently do not have a good understanding of the fine-grained details (e.g. the number of tasks spawned, the duration of individual tasks). Some workloads may trigger long-running, CPU-heavy work within a single task.
Periodic tasks
The monitoring subsystem and other batch or maintenance tasks use the concept of a periodic task, which implements roughly the following pseudo-code:
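A minimal sketch of that pattern, with `collect_metrics` standing in for the real periodic work:

```julia
collect_metrics() = nothing   # placeholder for the actual periodic work

# Simplified periodic task: do the work, sleep for the period, repeat until
# asked to stop.
function periodic_task(period, running::Ref{Bool})
    while running[]
        collect_metrics()
        sleep(period)
    end
end

running = Ref(true)
task = Threads.@spawn periodic_task(1.0, running)
```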
In reality, the actual code is somewhat more complex because the above code reacts slowly to termination signals, so we are attempting to use an analogue of the C++ `wait_for` function that can wait for a specific period or a notification signal (whichever comes first).
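One way to sketch such a `wait_for` analogue with what Base already provides is `timedwait` polling a shutdown flag; the flag and names here are assumptions, not the actual rai-server code:

```julia
const shutdown = Threads.Atomic{Bool}(false)   # assumed shutdown flag

# Rough analogue of C++ wait_for: returns when shutdown is requested or when
# `period` seconds have elapsed, whichever comes first.
wait_for(period) = timedwait(() -> shutdown[], period; pollint = 0.1)

# In the periodic task above, `wait_for(period)` would replace `sleep(period)`
# so that a termination request is noticed within one poll interval rather
# than one full period.
```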
Observed symptoms
On server startup, we kick off a periodic task that increments a `server_heartbeats` metric once per second. Under normal conditions, when we plot the rate of change, we should get a flat line that is close to 1. However, when the server starts handling user queries, the rate of change of this metric dips well below 1, indicating that the background task is not running as frequently as it should.

Sometimes we experience a complete blackout during which rai-server does not export any metrics at all. In some situations, this complete blackout can last tens of minutes or, in the example below, roughly 90 minutes. This seems to indicate that the statsd export thread is not running as frequently as it should.
Suspected root cause
We suspect that the following factors contribute to this issue:
Specifically, we think that the combination of (1), (3), (4), and maybe (5) is the primary cause that can explain the observed symptoms. The critical background tasks are scheduled on arbitrary threads. Whenever user-generated tasks are spawned, they can get scheduled on those same threads. If any of these tasks are long-lived (we have queries that may run for several hours), do not yield, and get scheduled on the same thread as the critical tasks, they can and will block them.
Suggested solutions
In the end, this is all about sharing limited resources (threads) among tasks and enforcing some form of isolation. Ultimately, we need to be able to flag work that is critical (latency-sensitive) and ensure that other tasks will not be able to starve it. There are several possible approaches:
We think that solution (3) is the most robust and flexible option, and we would like feedback on its feasibility and complexity.
Example
The following example reproduces the thread starvation. The `metronome` function runs at a fixed interval and measures the interval accuracy (the drift between the expected and observed interval). The `run_for` function implements a long-running, non-yielding task that is spawned and blocks `metronome`.
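The original code is not shown above; the following sketch follows the description (the function names match, but the structure and constants are assumptions):

```julia
# Ticks every `interval` seconds and reports how far each tick drifted from
# its expected schedule; under starvation the drift grows far beyond the interval.
function metronome(interval, duration)
    t0 = time()
    n = 0
    while time() - t0 < duration
        sleep(interval)
        n += 1
        drift = (time() - t0) - n * interval
        println("tick $n: drift = $(round(drift; digits = 3)) s")
    end
end

# Long-running, non-yielding busy loop: it never reaches a yield point, so any
# task queued behind it on the same thread is starved.
function run_for(seconds)
    t0 = time()
    x = 0.0
    while time() - t0 < seconds
        x += sin(x + 1.0)   # pure CPU work, no allocation, no yields
    end
    return x
end

# Spawn enough non-yielding tasks to occupy every thread; the metronome's
# drift then spikes until they finish.
function reproduce(; seconds = 10)
    m = Threads.@spawn metronome(0.1, seconds + 5)
    blockers = [Threads.@spawn run_for(seconds) for _ in 1:Threads.nthreads()]
    foreach(wait, blockers)
    wait(m)
end
```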