
WIP: Fixing the @sync/@async disappearing exception issue #38916

Closed
wants to merge 2 commits

Conversation

@tro3 commented Dec 17, 2020

Fixing #32677. This is the "disappearing exception" problem in the @sync/@async structure. The problem occurs because sync_end currently waits for the tasks in order, and therefore locks up when an earlier task is waiting for a later task that throws an exception.

This mod changes sync_end so that it monitors Tasks/Futures in parallel, so the order no longer matters and this specific lockup condition cannot occur. Lockups due to poorly constructed Channels or other resources are not addressed - this is just a bandage to help the programmer, making sure exceptions propagate rather than hang. It is one step toward a more complete structured-concurrency solution; I recommend unifying the Task/Future interfaces as that happens, btw.
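To illustrate the idea, here is a minimal sketch of the difference between waiting on tasks in order and monitoring them in parallel. The helper names (`sync_end_sequential`, `sync_end_parallel`) are hypothetical, not the PR's actual code:

```julia
# Current behavior: wait for each task in registration order.
# If an earlier task is blocked waiting on a later task that throws,
# this loop never reaches the failed task's exception.
function sync_end_sequential(tasks)
    for t in tasks
        wait(t)
    end
end

# Proposed behavior: funnel every completion (or exception) into one
# channel, so the first failure is observed regardless of task order.
function sync_end_parallel(tasks)
    done = Channel{Any}(length(tasks))
    for t in tasks
        @async begin
            try
                wait(t)
                put!(done, nothing)   # clean completion
            catch e
                put!(done, e)         # failure, in whatever order it happens
            end
        end
    end
    for _ in tasks
        r = take!(done)
        r isa Exception && throw(r)
    end
end
```

With `sync_end_parallel`, the lockup example below raises the `UndefVarError` instead of hanging, because the failing task's monitor reports first.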

This has been tested against 1.5-release, 1.6-release, and master, and tests have been added to check the lockup condition for @async, Threads.@spawn, and Distributed.@spawnat. If you would prefer that this go into Experimental, just let me know and I can move and resubmit.

@tro3 (Author) commented Dec 17, 2020

An example of code that hangs currently and is fixed by this PR:

c = Channel(0)                      # unbuffered: put! blocks until a matching take!
@sync begin
    @async begin
        println("$(current_task())")
        put!(c, 0)                  # blocks forever: the matching take! never runs
        println("Done $(current_task())")
    end
    @async begin
        println("$(current_task())")
        undefined()                 # throws UndefVarError before take!(c) is reached
        take!(c)
        println("Done $(current_task())")
    end
    println("$(current_task())")
end

@jonas-schulze added the "error handling" label (Handling of exceptions by Julia or the user) on Dec 18, 2020
@JeffBezanson (Member)

Welcome, thanks for looking at this!

The main approach I can think of to solve this is the existing Experimental.@sync, which propagates the first exception that happens. IIUC, the algorithm here is similar, but batches up whatever failures have happened as soon as we detect at least one. I'm not sure that adds much over just throwing the first one. It also changes the semantics of Base.@sync, which is supposed to wait for all tasks to finish. Of course, many argue that such a thing shouldn't exist, but to the extent we want that behavior I don't think the problem can be fixed.

There is also an issue with the implementation here, in that it busy waits, e.g. @sync @async wait() spins at 100% CPU.
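For reference, the fail-fast behavior of `Experimental.@sync` mentioned above can be exercised as follows - a hedged sketch, assuming `Base.Experimental.@sync` as shipped since Julia 1.3, which rethrows the first failure without waiting on the remaining tasks in order:

```julia
# Base.Experimental.@sync rethrows the first exception from any enclosed
# task; the still-sleeping task does not delay the rethrow.
threw = try
    Base.Experimental.@sync begin
        @async error("fails first")
        @async sleep(60)     # still running when the error is rethrown
    end
    false
catch
    true
end
```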

@tro3 (Author) commented Dec 18, 2020

@JeffBezanson - the existing Experimental.@sync will still lock up. It does throw the first exception to occur, but it still wait()s on the Tasks in order. The hang occurs when an earlier Task in that order is waiting for a later one, but the later one throws; Experimental.@sync will then never get to the exception. The example code above hangs under both the current and the Experimental versions.

This PR just moves to parallel monitoring to remove that order dependence. But it does not change the semantics - the PR's sync_end will not return until all Tasks have completed and no new Tasks are coming in. In fact, I had to jump through some hoops around Distributed to make sure exceptions get handed back the same way, specifically to avoid messing with the interface.

@tro3 (Author) commented Dec 18, 2020

Agreed on the @sync @async wait() issue. That is a problem in the current implementation as well, and it's really an issue for a larger structured-concurrency effort. It translates literally to "wait for the task that has been told to sit indefinitely". :-)

@StefanKarpinski (Member)

> Agreed on the @sync @async wait() issue. That is a problem in the current implementation as well.

On master I see no CPU utilization for that.

@tro3 (Author) commented Dec 18, 2020

Oh, I see what you're saying. Yes, the parallel monitor is "active" in the sense that it keeps checking the Task status while that Task is waiting. In the original implementation with this scenario, it ends with neither Task scheduled to do anything at all. But a control-C is still detected (presumably through task_end_hook), so let me chew on that and see if there is a way to improve this condition here. First thought is forcing a minimum time between yield()s, but I'll have to check CPU usage on such a strategy.

Thanks!

@StefanKarpinski (Member)

Waiting in a slightly less busy manner is generally not what one wants to do. Rather waiting should be event-driven: the process should do nothing until some event occurs which causes it to stop having to wait (timer, input, etc.).

@tro3 (Author) commented Dec 18, 2020

But that passivity is what leads to the lock in the first place. Task A launches Tasks B and C, and begins waiting specifically for Task B to awaken it. Task B waits for a Channel that Task C is supposed to write to. But Task C throws an error first. So no Tasks are scheduled, A is frozen until B wakes it, B is frozen until C wakes it, and C is frozen waiting to hand an exception back to A.

The only way to beat this is a select command ("wait for one of B or C"). That's what this PR attempts, but it uses CPU cycles to do the job. I'll take a look at the lower-level wait mechanisms and see if there is a way to wire in a more passive version.
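A select-style wait can itself be event-driven rather than polling. The sketch below (the name `wait_any` is hypothetical, not part of this PR or of Base) gives each task a small monitor that reports into one shared channel; the parent then blocks in a single `take!`, which is an ordinary event-driven wait with no busy loop:

```julia
# Event-driven "select" over tasks: returns whichever task finishes
# (or throws) first. The parent does nothing until some monitor
# reports - no CPU is burned while everything is blocked.
function wait_any(ts::Task...)
    firstdone = Channel{Task}(length(ts))
    for t in ts
        @async begin
            try wait(t) catch end   # a failed task still counts as "done"
            put!(firstdone, t)
        end
    end
    take!(firstdone)                # blocks until the first completion
end
```

In the A/B/C scenario above, the parent would call `wait_any(B, C)`: when C throws, its monitor fires and the parent wakes up, even though B is still blocked.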

@tro3 (Author) commented Dec 18, 2020

My original version went a different route - adding a "wake up the parent" function to each Task and waiting instead of yielding in sync_end. This implementation was less invasive, but that earlier version would have had improved CPU efficiency. I'll see if I can combine that with the select thing in some way.
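A rough sketch of that "wake the parent" idea (not the PR's actual code; the name `sync_end_notify` is hypothetical): each child task notifies a shared Condition when it finishes, and the parent sleeps on that Condition instead of spinning on yield():

```julia
# Parent blocks on a Condition; each completed child decrements the
# counter and wakes the parent, which rechecks and sleeps again until
# all children are done. No polling loop, so no CPU burn while idle.
function sync_end_notify(tasks)
    cond = Threads.Condition()
    remaining = Ref(length(tasks))
    for t in tasks
        @async begin
            try wait(t) catch end   # completion or failure both count
            lock(cond) do
                remaining[] -= 1
                notify(cond)
            end
        end
    end
    lock(cond) do
        while remaining[] > 0
            wait(cond)              # releases the lock while sleeping
        end
    end
end
```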

@tro3 (Author) commented Dec 19, 2020

Okay guys, thanks for the feedback. I just pushed a commit that moves the task monitor loop from active/yield to passive/wait, so that @sync @async wait() does not crush the CPU, and keeps the PR non-invasive (only modifying sync_end). I also added a slew of tests covering every pairwise combination and ordering of @async, Threads.@spawn, and Distributed.@spawnat, both to ensure that exceptions are always thrown and that all tasks are complete before the main task continues (to address @JeffBezanson's concern above). Please let me know what you think.

@tro3 (Author) commented Dec 19, 2020

BTW, I tried getting a wait(Vector{Condition}) to work, but the InvasiveLinkedList structure used for waitq will not allow a Task to be "notifyable" by more than one Condition. A method like wait(Vector{Condition}) is a better solution, IMO, but I don't know the reason for the usage of the ILL here and it is way too low-level (with too many unknown side-effects) for me to mess with. This PR works well, at the minimal expense of having to create monitor tasks.

@tro3 (Author) commented Dec 19, 2020

Also, I did not test this latest commit against 1.5-release and 1.6-release, only master. But the change is still limited to one function, and that function's interface remains unchanged, so I suspect it can still be backported. Let me know if you need confirmation.

@tro3 force-pushed the sync_end branch 4 times, most recently from 6cdb94d to 43e0897, on December 21, 2020
@tro3 changed the title from "RFC: Fixing the @sync/@async disappearing exception issue" to "WIP: Fixing the @sync/@async disappearing exception issue" on Dec 21, 2020
@tro3 (Author) commented Dec 21, 2020

Moving to WIP until I can debug some random build-check failures on Windows (it passes on win64 at home). Will get rid of / squash the commits before I move back to RFC.

@tro3 (Author) commented Dec 25, 2020

@JeffBezanson, @StefanKarpinski - okay, after wrapping my head around the lower-level scheduling details, I saw that I should move the above structure to a channel-based "completion notification" system that counts the remaining tasks... which sounded familiar. So I went back to my Experimental.sync_end testbench and found a bug in the test that made it appear not to handle the scenario.

So there's no need for this PR - you already have it covered in Experimental. Why not move it to Base, BTW? If you want, when Experimental does move over, I can put the lockup tests here into a separate PR for regression purposes. Thanks for the learnings, guys.

@tro3 closed this Dec 25, 2020
@tro3 deleted the sync_end branch December 25, 2020 15:06
@tro3 mentioned this pull request Dec 25, 2020
@tro3 (Author) commented Dec 25, 2020

FWIW, when I combine the structure I envisioned (which created an extra "completion or exception" Channel) with the more efficient loop in Experimental, I end up with the code below. Its only advantage is that it mimics the exception behavior of the current Base.sync_end.

function _schedule_parent(t, rs)
    # Spawn a monitor task that waits on `t` and reports back on the
    # results channel `rs`: the exception on failure, `nothing` on success.
    schedule(Task(() -> begin
        ex = nothing
        try
            wait(t)
        catch e
            ex = e
        finally
            put!(rs, ex)
        end
    end))
end

function sync_end(c::Channel{Any})
    try
        n = 0                       # number of tasks still being monitored
        isready(c) || return        # no tasks were registered
        while true
            t = take!(c)
            if t isa Exception
                # First failure seen: drain any other pending failures and
                # rethrow them together, mimicking the CompositeException
                # behavior of the current Base.sync_end.
                c_ex = CompositeException([t])
                while isready(c)
                    t = take!(c)
                    t isa Exception && push!(c_ex, t)
                end
                throw(c_ex)
            elseif isnothing(t)
                n -= 1              # one monitored task completed cleanly
                n == 0 && !isready(c) && break
            else
                n += 1              # a newly registered task: start monitoring it
                _schedule_parent(t, c)
            end
        end
    finally
        close(c)
    end
    nothing
end

4 participants