Severe thread starvation issues #41586
Thanks for this great writeup. We also face these same issues (though to a lesser degree of operational severity), as described in #34267 (comment). As @tkf referred to in #34267 (comment), an attractive option may be to add dynamically scoped scheduling policies; then normal code would keep using the default policy, while latency-sensitive code could opt in to a different one. Getting this working in full generality with custom user-defined schedulers seems like a longer-term design project, but adding some tunable knobs to the builtin scheduler feels more doable.
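A purely hypothetical sketch of what such a dynamically scoped policy could look like; `with_scheduler` and the `:low_latency` policy do not exist in Base, and the stub here just runs the block so the snippet is self-contained:

```julia
# Hypothetical API sketch only: `with_scheduler` and :low_latency do not exist
# in Base. The stub simply runs the block so the snippet is self-contained.
with_scheduler(f, policy::Symbol) = f()

# Latency-sensitive code opts in to the policy for the tasks it spawns...
heartbeat = with_scheduler(:low_latency) do
    Threads.@spawn begin
        for _ in 1:10
            sleep(1.0)   # e.g. export a heartbeat metric each second
        end
    end
end

# ...while normal code elsewhere keeps the default scheduling policy.
work = Threads.@spawn sum(rand(10^6))
wait(heartbeat); wait(work)
```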
Related to this, what about having a compiler plugin insert implicit yield points for any code called in a latency-sensitive scheduling context? This is like what the Go compiler does (or used to do) by inserting yield points at every function call boundary. This has performance overhead, but for Julia we could avoid emitting that code unless called in a latency-sensitive context. (IIUC Go has now moved away from doing this at compile time and now has runtime asynchronous preemption of goroutines: https://medium.com/a-journey-with-go/go-asynchronous-preemption-b5194227371c. That seems really great, but also a huge pile of work.)
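For concreteness, the effect of such compiler-inserted yield points on a tight loop would be roughly what you get by writing the `yield()` calls by hand; an illustrative sketch:

```julia
# Today: a non-allocating loop with no calls that reach a yield point can
# monopolize its thread for the entire computation.
function busy_sum(n)
    s = 0.0
    for i in 1:n
        s += sin(float(i))
    end
    return s
end

# Roughly what automatic yield-point insertion would amount to: periodically
# give the scheduler a chance to run other tasks queued on this thread.
function busy_sum_yielding(n)
    s = 0.0
    for i in 1:n
        s += sin(float(i))
        i % 1_000_000 == 0 && yield()
    end
    return s
end
```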
Since this can introduce races in code that relies on the lack of yield points, getting rid of such code and asking users to write fully "thread-like" code may be one option. But I think it'd be nice (and possible) to be able to declare that a given function is blocking. If we have such annotations, a compiler (plugin) can safely insert yield points.
The issue with that is that you will likely run a mix of sensitive and non-sensitive code, and the non-sensitive code will mess up your latency. So systematically figuring out where we are missing safepoints for GC latencies would be important; the obvious one is loops (maybe we can insert safepoints after vectorization?) and probably non-allocating, non-inlined function calls?
Our systems are complex enough that migration to new versions is non-trivial. We are working on it, but it will take a while to make it viable. While the task migrations will likely make this issue less frequent, it is still possible to generate enough non-yielding tasks to block critical operations, so the potential for this remains even then.
Having a …
Right, I guess there are two related but different issues:
Isn't the requirement here rather that tasks originating from other workers can't use the specified worker? We'd need to handle cases where tasks are scheduled from the dedicated worker, since otherwise we can't call arbitrary library functions from the dedicated worker. An interesting case is when a library (that you don't control) uses

```julia
@sync begin
    @spawn f()
    g()
end
```

If the task spawned for `f()` can only run on the other (possibly busy) workers, the dedicated worker ends up blocked in the `@sync` wait. Alternatively, maybe we can pull in outsider tasks (if they're not already running) upon `wait`.
Didn't Go move away from this approach? GopherCon 2020: Austin Clements - Pardon the Interruption: Loop Preemption in Go 1.14 - YouTube. If it didn't work in Go, I think it's much less likely to work in Julia, since people write very performance-sensitive (throughput-oriented) loops.
Our GC safepoints are async based, e.g. they are implemented as a load from a location that will segfault, and the fault handler will then switch to GC. I think the Go approach was to do more work on the user side of the safepoint. Adding an extra load in the backedge is certainly sub-optimal (and potentially additional work to maintain the data in the gcframe). We already have the infrastructure for preempting the code (that is part of how the profiler works), but we are missing a way to find GC pointers at the interrupt location, e.g. stackmaps (let's not do that), conservative scanning (meh), or on ARM we could do pointer tagging (xD). I would say we should see how big a performance impact it would actually be to late-insert a safepoint into the backedges of the loop. Also, non-allocating functions are missing safepoints.

So the recent work on reducing allocation might have made this problem worse :)
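At the Julia source level, `GC.safepoint()` already exposes a safepoint poll for code that wants to opt in manually; what is discussed above is having the compiler place an equivalent check in loop backedges automatically. A small sketch of the manual version:

```julia
# A non-allocating, non-yielding loop: without any safepoint, a thread running
# this cannot respond to another thread's stop-the-world GC request.
function spin(n)
    s = 1.0
    for i in 1:n
        s = muladd(s, 0.999999, 1.0e-6)
        # Manual safepoint poll: cheap when no collection is requested,
        # parks this thread when one is.
        i % 10_000 == 0 && GC.safepoint()
    end
    return s
end
```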
The load in the safepoint is sandwiched with seq_cst fences. They are signal fences and so "free" for CPUs, but aren't they harmful to the optimizer?
Yeah, that's why I was thinking of inserting it at the end of the optimization pipeline.
We discussed today having a Multithreading Birds of a Feather working group discussion entirely focused on this topic to work through the design. The main challenge here remains designing the feature, so meeting together could help unblock it. 👍
Regarding the direction for adding dedicated worker threads, I think it's worth considering not adding a Julia API and only providing a command line option to do it. This way, application authors can isolate latency-critical code, but package authors are encouraged to write composable concurrent/parallel programs. Having a single central resource manager is important. Not many languages have this, and it'd be nice if we could keep this property.
I disagree; making users figure out how many threads to start on the CLI is annoying and error-prone. I'd consider it much more important to document that such an API is only to be used very sparingly, and that most users shouldn't even consider using it. I agree that it would be ideal if we could always write composable concurrent programs, but that is not the case. We have libraries like CUDA which we have no control over, and we have to use whatever API they give us, including all their non-composable, blocking routines. Providing a way for us to work around such problematic APIs is key to allowing the rest of our program to remain composable.
I've suggested a couple of ways of solving the tricky CUDA integration with the current concurrency API in Julia. I don't think a dedicated worker is necessary or that it would ease the implementation. It's also not straightforward to get this right due to the priority inversion I mentioned in #41586 (comment). I think a more convincing use case for the dedicated worker would be Distributed.jl and Dagger.jl (as the distributed scheduling is latency sensitive). But Distributed.jl can control the command line flags of the worker processes, and so it can create a "manager thread" for each worker process quite easily. It can't add the dedicated worker thread in the main process, but the main process usually doesn't do the computation in Distributed.jl. It's also possible to create a scheduler process and move the controlling responsibility out of the main process (which could be better for distributing schedulers anyway).
For the VS Code extension, we would also like to be able to add a low-latency worker thread without the need for command line arguments. One scenario for us is that a user starts a Julia process (without us being able to control or add any command line args), and then we give the user the ability to connect this process to the extension integration points; in that scenario, we would then want to run a low-latency worker thread for the communication in that Julia process. If the only way to add a dedicated worker thread was via a command line option, we couldn't solve that scenario.
Dedicated threads/threadpool PR is posted: #42302 |
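For anyone reading this later: the threadpool support that eventually shipped (Julia 1.9 and newer) can be used roughly as below. The exact API may differ in details from the PR as first posted, and `report_heartbeat`/`heavy_query` are placeholder application functions:

```julia
# Start Julia with, e.g., `julia --threads 8,1` to get 8 default-pool threads
# plus 1 interactive thread reserved for latency-sensitive tasks.

report_heartbeat() = println("heartbeat @ ", time())   # placeholder
heavy_query() = sum(sin, 1:10^8)                        # placeholder

# Latency-sensitive work goes to the interactive pool...
heartbeat = Threads.@spawn :interactive begin
    for _ in 1:10
        report_heartbeat()
        sleep(1.0)
    end
end

# ...while throughput-oriented work stays in the default pool.
work = Threads.@spawn :default heavy_query()

wait(work); wait(heartbeat)
```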
Mitigated by #41616 |
🎉 thanks everyone! Great to see this closed 👍 |
Summary
We have observed highly unpredictable thread starvation occurring on our production servers that are written in Julia. In particular, this thread starvation causes our monitoring (instrumentation) subsystem to stall for extended periods of time (tens of minutes, up to hours in extreme cases). This often leads to a total loss of visibility into the operation of our production systems, which makes it virtually impossible to diagnose the services and to run a reliable production system.
Solving this issue is a high priority for us at RelationalAI due to its impact on our production systems.
Background
Our database server (rai-server) is written in Julia. The production deployment currently runs on Julia 1.6.1 with multi-threading enabled and the number of threads set to the number of cores of the underlying virtual machines (we are using the Azure cloud).
Our monitoring subsystem relies heavily on background tasks: tasks that are spawned at server startup and periodically kick off certain operations, such as collecting system-level metrics (e.g. by reading /proc files) and so on.

In addition to metrics that are updated via background tasks, many other metrics are updated as soon as the events of interest occur (e.g. when the pager allocates memory for a new page, we increment a metric that represents the number of pages currently in use).
Users can run database queries. When these queries are received via HTTP, they're parsed, analyzed, and executed. During execution, a potentially large number of tasks with very long execution times can be generated. However, we currently do not have a good understanding of the fine-grained details (e.g. the number of tasks spawned, the duration of individual tasks). Some workloads may trigger long-running, CPU-heavy work within a single task.
Periodic tasks
The monitoring subsystem and other batch or maintenance tasks use the concept of a periodic task, which implements roughly the following pseudo-code:
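A minimal sketch of that pattern, with `collect_metrics` standing in for the real periodic work:

```julia
collect_metrics() = nothing   # placeholder for the actual periodic work

# Simplified periodic task: do the work, sleep for the period, repeat until
# asked to stop.
function periodic_task(period, running::Ref{Bool})
    while running[]
        collect_metrics()
        sleep(period)
    end
end

running = Ref(true)
task = Threads.@spawn periodic_task(1.0, running)
```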
In reality, the actual code is somewhat more complex because the above code reacts slowly to termination signals, so we are attempting to use an analogue of the C++ `wait_for` function that can wait for a specific period or a notification signal (whichever comes first).
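One way to sketch such a `wait_for` analogue with what Base already provides is `timedwait` polling a shutdown flag; the flag and names here are assumptions, not the actual rai-server code:

```julia
const shutdown = Threads.Atomic{Bool}(false)   # assumed shutdown flag

# Rough analogue of C++ wait_for: returns when shutdown is requested or when
# `period` seconds have elapsed, whichever comes first.
wait_for(period) = timedwait(() -> shutdown[], period; pollint = 0.1)

# In the periodic task above, `wait_for(period)` would replace `sleep(period)`
# so that a termination request is noticed within one poll interval rather
# than one full period.
```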
Observed symptoms
On server startup, we kick off a periodic task that increments a `server_heartbeats` metric once per second. Under normal conditions, when we plot the rate of change, we should get a flat line that is close to 1. However, when the server starts handling user queries, the rate of change of this metric dips well below 1, indicating that the background task is not running as frequently as it should.

Sometimes we experience a complete blackout during which rai-server does not export any metrics at all. In some situations, this complete blackout can last tens of minutes or, in the example below, roughly 90 minutes. This seems to indicate that the statsd export thread is not running as frequently as it should.
Suspected root cause
We suspect that the following factors contribute to this issue:
Specifically, we think that the combination of (1), (3), (4), and maybe (5) is the primary cause that can explain the observed symptoms. The critical background tasks are scheduled on arbitrary threads. Whenever user-generated tasks are spawned, they can get scheduled on those same threads. If any of these tasks are long-lived (we have queries that may run for several hours), do not yield, and get scheduled on the same thread as the critical tasks, they can and will block them.
Suggested solutions
In the end, this is all about sharing limited resources (threads) among tasks and enforcing some form of isolation. Ultimately, we need to be able to flag work that is critical (latency-sensitive) and ensure that other tasks will not be able to starve it. There are several possible approaches:
We think that solution (3) is the most robust and flexible option, and we would like feedback on its feasibility and complexity.
Example
The following example reproduces the thread starvation. The `metronome` function runs at a fixed interval and measures the interval accuracy (the drift between the expected and observed interval). The `run_for` function implements a long-running, non-yielding task that is spawned and blocks `metronome`.
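The original code is not shown above; the following sketch follows the description (the function names match, but the structure and constants are assumptions):

```julia
# Ticks every `interval` seconds and reports how far each tick drifted from
# its expected schedule; under starvation the drift grows far beyond the interval.
function metronome(interval, duration)
    t0 = time()
    n = 0
    while time() - t0 < duration
        sleep(interval)
        n += 1
        drift = (time() - t0) - n * interval
        println("tick $n: drift = $(round(drift; digits = 3)) s")
    end
end

# Long-running, non-yielding busy loop: it never reaches a yield point, so any
# task queued behind it on the same thread is starved.
function run_for(seconds)
    t0 = time()
    x = 0.0
    while time() - t0 < seconds
        x += sin(x + 1.0)   # pure CPU work, no allocation, no yields
    end
    return x
end

# Spawn enough non-yielding tasks to occupy every thread; the metronome's
# drift then spikes until they finish.
function reproduce(; seconds = 10)
    m = Threads.@spawn metronome(0.1, seconds + 5)
    blockers = [Threads.@spawn run_for(seconds) for _ in 1:Threads.nthreads()]
    foreach(wait, blockers)
    wait(m)
end
```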