
WIP: Fixing the @sync/@async disappearing exception issue #38916

Closed
wants to merge 2 commits

Conversation

@tro3 commented Dec 17, 2020

Fixing #32677. This is the "disappearing exception" problem in the @sync/@async structure. The problem occurs because sync_end currently waits for the tasks in order, and therefore locks up when an earlier task is waiting for a later task that throws an exception.

This mod changes sync_end so that it monitors Tasks/Futures in parallel, so the order no longer matters and this specific lockup condition cannot occur. Lockups due to poorly constructed Channels or other resources are not addressed - this is just a bandage to help the programmer, making sure exceptions propagate rather than hang. It is one step toward a more complete structured-concurrency solution; I recommend unifying the Task/Future interfaces as that happens, btw.
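To illustrate the idea, here is a minimal sketch of the difference between waiting on tasks in order and monitoring them in parallel. The helper names (`sync_end_sequential`, `sync_end_parallel`) are hypothetical, not the PR's actual code:

```julia
# Current behavior: wait for each task in registration order.
# If an earlier task is blocked waiting on a later task that throws,
# this loop never reaches the failed task's exception.
function sync_end_sequential(tasks)
    for t in tasks
        wait(t)
    end
end

# Proposed behavior: funnel every completion (or exception) into one
# channel, so the first failure is observed regardless of task order.
function sync_end_parallel(tasks)
    done = Channel{Any}(length(tasks))
    for t in tasks
        @async begin
            try
                wait(t)
                put!(done, nothing)   # clean completion
            catch e
                put!(done, e)         # failure, in whatever order it happens
            end
        end
    end
    for _ in tasks
        r = take!(done)
        r isa Exception && throw(r)
    end
end
```

With `sync_end_parallel`, the lockup example below raises the `UndefVarError` instead of hanging, because the failing task's monitor reports first.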

This has been tested against 1.5-release, 1.6-release, and master, and tests have been added to check the lockup condition for @async, Threads.@spawn, and Distributed.@spawnat. If you would prefer that this go into Experimental, just let me know and I can move and resubmit.

@tro3 (Author) commented Dec 17, 2020

An example of code that hangs currently and is fixed by this PR:

c = Channel(0)                      # unbuffered: put! blocks until a matching take!
@sync begin
    @async begin
        println("$(current_task())")
        put!(c, 0)                  # blocks forever: the matching take! never runs
        println("Done $(current_task())")
    end
    @async begin
        println("$(current_task())")
        undefined()                 # throws UndefVarError before take!(c) is reached
        take!(c)
        println("Done $(current_task())")
    end
    println("$(current_task())")
end

@jonas-schulze added the "error handling" label (Handling of exceptions by Julia or the user) on Dec 18, 2020
@JeffBezanson (Member)

Welcome, thanks for looking at this!

The main approach I can think of to solve this is the existing Experimental.@sync, which propagates the first exception that happens. IIUC, the algorithm here is similar, but batches up whatever failures have happened as soon as we detect at least one. I'm not sure that adds much over just throwing the first one. It also changes the semantics of Base.@sync, which is supposed to wait for all tasks to finish. Of course, many argue that such a thing shouldn't exist, but to the extent we want that behavior I don't think the problem can be fixed.

There is also an issue with the implementation here, in that it busy waits, e.g. @sync @async wait() spins at 100% CPU.
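For reference, the fail-fast behavior of `Experimental.@sync` mentioned above can be exercised as follows - a hedged sketch, assuming `Base.Experimental.@sync` as shipped since Julia 1.3, which rethrows the first failure without waiting on the remaining tasks in order:

```julia
# Base.Experimental.@sync rethrows the first exception from any enclosed
# task; the still-sleeping task does not delay the rethrow.
threw = try
    Base.Experimental.@sync begin
        @async error("fails first")
        @async sleep(60)     # still running when the error is rethrown
    end
    false
catch
    true
end
```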

@tro3 (Author) commented Dec 18, 2020

@JeffBezanson - the existing Experimental.@sync will still lock up. It does throw the first exception to occur, but it still wait()s on the Tasks in order. The hang occurs when an earlier Task in that order is waiting for a later one, but the later one throws; Experimental.@sync will then never get to the exception. The example code above hangs under both the current and the Experimental versions.

This PR just moves to parallel monitoring to remove that order dependence. But it does not change the semantics - the PR's sync_end will not return until all Tasks have completed and no new Tasks are coming in. In fact, I had to jump through some hoops around Distributed to make sure exceptions get handed back the same way, specifically to avoid messing with the interface.

@tro3 (Author) commented Dec 18, 2020

Agreed on the @sync @async wait() issue. That is a problem in the current implementation as well, and it's really an issue for a larger structured-concurrency effort. It translates literally to "wait for the task that has been told to sit indefinitely". :-)

@StefanKarpinski (Member)

> Agreed on the @sync @async wait() issue. That is a problem in the current implementation as well.

On master I see no CPU utilization for that.

@tro3 (Author) commented Dec 18, 2020

Oh, I see what you're saying. Yes, the parallel monitor is "active" in the sense that it keeps checking the Task status while that Task is waiting. In the original implementation with this scenario, it ends with neither Task scheduled to do anything at all. But a control-C is still detected (presumably through task_end_hook), so let me chew on that and see if there is a way to improve this condition here. First thought is forcing a minimum time between yield()s, but I'll have to check CPU usage on such a strategy.

Thanks!

@StefanKarpinski (Member)

Waiting in a slightly less busy manner is generally not what one wants to do. Rather waiting should be event-driven: the process should do nothing until some event occurs which causes it to stop having to wait (timer, input, etc.).

@tro3 (Author) commented Dec 18, 2020

But that passivity is what leads to the lock in the first place. Task A launches Tasks B and C, and begins waiting specifically for Task B to awaken it. Task B waits for a Channel that Task C is supposed to write to. But Task C throws an error first. So no Tasks are scheduled, A is frozen until B wakes it, B is frozen until C wakes it, and C is frozen waiting to hand an exception back to A.

The only way to beat this is a select command ("wait for one of B or C"). That's what this PR attempts, but it uses CPU cycles to do the job. I'll take a look at the lower-level wait mechanisms and see if there is a way to wire in a more passive version.
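A select-style wait can itself be event-driven rather than polling. The sketch below (the name `wait_any` is hypothetical, not part of this PR or of Base) gives each task a small monitor that reports into one shared channel; the parent then blocks in a single `take!`, which is an ordinary event-driven wait with no busy loop:

```julia
# Event-driven "select" over tasks: returns whichever task finishes
# (or throws) first. The parent does nothing until some monitor
# reports - no CPU is burned while everything is blocked.
function wait_any(ts::Task...)
    firstdone = Channel{Task}(length(ts))
    for t in ts
        @async begin
            try wait(t) catch end   # a failed task still counts as "done"
            put!(firstdone, t)
        end
    end
    take!(firstdone)                # blocks until the first completion
end
```

In the A/B/C scenario above, the parent would call `wait_any(B, C)`: when C throws, its monitor fires and the parent wakes up, even though B is still blocked.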

@tro3 (Author) commented Dec 18, 2020

My original version went a different route - adding a "wake up the parent" function to each Task and waiting instead of yielding in sync_end. This implementation was less invasive, but that earlier version would have had improved CPU efficiency. I'll see if I can combine that with the select thing in some way.
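A rough sketch of that "wake the parent" idea (not the PR's actual code; the name `sync_end_notify` is hypothetical): each child task notifies a shared Condition when it finishes, and the parent sleeps on that Condition instead of spinning on yield():

```julia
# Parent blocks on a Condition; each completed child decrements the
# counter and wakes the parent, which rechecks and sleeps again until
# all children are done. No polling loop, so no CPU burn while idle.
function sync_end_notify(tasks)
    cond = Threads.Condition()
    remaining = Ref(length(tasks))
    for t in tasks
        @async begin
            try wait(t) catch end   # completion or failure both count
            lock(cond) do
                remaining[] -= 1
                notify(cond)
            end
        end
    end
    lock(cond) do
        while remaining[] > 0
            wait(cond)              # releases the lock while sleeping
        end
    end
end
```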

@tro3 (Author) commented Dec 19, 2020

Okay guys, thanks for the feedback. I just pushed a commit that moves the task monitor loop from active/yield to passive/wait, so that @sync @async wait() does not crush the CPU, and keeps the PR non-invasive (only modifying sync_end). I also added a slew of tests covering every pairwise combination and ordering of @async, Threads.@spawn, and Distributed.@spawnat, both to ensure that exceptions are always thrown and that all tasks are complete before the main task continues (to address @JeffBezanson's concern above). Please let me know what you think.

@tro3 (Author) commented Dec 19, 2020

BTW, I tried getting a wait(Vector{Condition}) to work, but the InvasiveLinkedList structure used for waitq will not allow a Task to be "notifyable" by more than one Condition. A method like wait(Vector{Condition}) is a better solution, IMO, but I don't know the reason for the usage of the ILL here and it is way too low-level (with too many unknown side-effects) for me to mess with. This PR works well, at the minimal expense of having to create monitor tasks.

@tro3 (Author) commented Dec 19, 2020

Also, I did not test this latest commit against 1.5-release and 1.6-release, only master. But the change is still limited to one function, and that function's interface remains unchanged, so I suspect it can still be backported. Let me know if you need confirmation.

@tro3 force-pushed the sync_end branch 4 times, most recently from 6cdb94d to 43e0897, on December 21, 2020
@tro3 changed the title from "RFC: Fixing the @sync/@async disappearing exception issue" to "WIP: Fixing the @sync/@async disappearing exception issue" on Dec 21, 2020
@tro3 (Author) commented Dec 21, 2020

Moving to WIP until I can debug some random build-check failures on Windows (it passes on win64 at home). Will get rid of / squash the commits before I move back to RFC.

@tro3 (Author) commented Dec 25, 2020

@JeffBezanson, @StefanKarpinski - okay, after wrapping my head around the lower-level scheduling details, I saw that I should move the above structure to a channel-based "completion notification" system that counts the remaining tasks... which sounded familiar. So I went back to my Experimental.sync_end testbench and found a bug in the test that made it appear not to handle the scenario.

So there's no need for this PR - you already have it covered in Experimental. Why not move it to Base, BTW? If you want, when Experimental does move over, I can put the lockup tests here into a separate PR for regression purposes. Thanks for the learnings, guys.

@tro3 closed this Dec 25, 2020
@tro3 deleted the sync_end branch December 25, 2020 15:06
@tro3 mentioned this pull request Dec 25, 2020
@tro3 (Author) commented Dec 25, 2020

FWIW, when I combine the structure I envisioned (which created an extra "completion or exception" Channel) with the more efficient loop in Experimental, I end up with the code below. Its only advantage is that it mimics the exception behavior of the current Base.sync_end.

function _schedule_parent(t, rs)
    # Spawn a monitor task that waits on `t` and reports back on the
    # results channel `rs`: the exception on failure, `nothing` on success.
    schedule(Task(() -> begin
        ex = nothing
        try
            wait(t)
        catch e
            ex = e
        finally
            put!(rs, ex)
        end
    end))
end

function sync_end(c::Channel{Any})
    try
        n = 0                       # number of tasks still being monitored
        isready(c) || return        # no tasks were registered
        while true
            t = take!(c)
            if t isa Exception
                # First failure seen: drain any other pending failures and
                # rethrow them together, mimicking the CompositeException
                # behavior of the current Base.sync_end.
                c_ex = CompositeException([t])
                while isready(c)
                    t = take!(c)
                    t isa Exception && push!(c_ex, t)
                end
                throw(c_ex)
            elseif isnothing(t)
                n -= 1              # one monitored task completed cleanly
                n == 0 && !isready(c) && break
            else
                n += 1              # a newly registered task: start monitoring it
                _schedule_parent(t, c)
            end
        end
    finally
        close(c)
    end
    nothing
end

4 participants