Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remove "unhandled task failure" message printing #27722

Merged
merged 1 commit into from
Jun 28, 2018
Merged

Conversation

JeffBezanson
Copy link
Member

This is a repeat of #12736; I'm going to propose this again.

  • With the new multi-threading system approaching, it will be helpful to remove as much logic as possible from task startup and shutdown. Ideally we'd like to remove task_done_hook entirely. It's bad because every task has to pay for everything in that hook whether it needs the feature or not. Instead there can be just one mechanism (waiting in a task's completion queue) you can use to do whatever you need when a task exits. I plan to remove TASKDONE_HOOKS as well in a future PR.
  • Other asynchronous programming systems don't do this. One source I found says that C# handles this on task GC, which has also been proposed. I think that's still too complex and costly, and also somewhat arbitrary: if you want to see exceptions, make sure you don't keep any references to the task? If you're going to go to that trouble, it's easier to add a try-catch or start a monitoring task.
  • I'll say it again: having to set a SUPPRESS_EXCEPTION_PRINTING flag sometimes is just silly. No such thing should ever exist.

@ararslan ararslan added parallelism Parallel or distributed computation error handling Handling of exceptions by Julia or the user labels Jun 21, 2018
@staticfloat
Copy link
Member

So I guess the flipside is that we should just make it easy for users to launch tasks with wrappers around them; e.g. there should be a @schedule_and_print_errors or something that wraps a Task in nice things like this; that gives us fast Tasks if we want it, with the user being able to decide what should or should not receive the special attentions of an exception catcher/printer.

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Jun 22, 2018

Or do something like trio: #6283 (comment). Basic premise: tasks must form a tree and all child tasks must terminate before their parent task does. There's an escape hatch, however, where you can spawn a task that outlives the current task by spawning it explicitly into a "nursery" which then becomes responsible for that task. This way errors can propagate up the task tree in a well-defined manner and there's always some task that's responsible for it. Note this comment from the trio docs:

This is why you have to use async with to get a nursery: it gives us a way to make sure that the child calls can’t run away and get lost. One reason this is important is that if there’s a bug or other problem in one of the children, and it raises an exception, then it lets us propagate that exception into the parent; in many other frameworks, exceptions like this are just discarded. Trio never discards exceptions.

Yes, that's exactly the property we want. It would also mean that cancelling tasks would make sense without chaos: when you cancel a task it cancels all its child tasks; it doesn't matter how many subtasks some task spawned in order to do its work, if you cancel a task they're all neatly killed.

@vtjnash
Copy link
Member

vtjnash commented Jun 22, 2018

should just make it easy for users to launch tasks with wrappers around them

+1. I think we need a better approach to launching tasks, as Stefan suggested above, before we remove this.

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Jun 22, 2018

In considering this, I'm not sure if we'd need a "nursery" object type since we have first-class tasks: if you want to spawn a task that outlives the current task, you could be required to explicitly given it a parent task which is responsible for it (gets its unhandled exceptions, cancels it if cancelled). All you really need is a task tree which exceptions propagate up and cancellations propagate down. With normal blocking APIs, the parent is the caller; with some APIs you can give an explicit parent task.

@JeffBezanson
Copy link
Member Author

I believe the way Trio ensures exceptions are handled is roughly that it requires all tasks to be spawned inside the equivalent of our @sync, which calls wait on all of them at the end of the block. Simply assigning a parent to a task is not enough, since you still have to decide when to print (or otherwise handle) the child task's exception. AFAIU, in Trio you still must reach the parent task's call to wait before anything happens with the child exception. It does not spontaneously get printed as a result of the wait not having been reached yet. Indeed, the only way to implement @sync or something like Trio's nurseries in julia is to use the SUPPRESS_EXCEPTION_PRINTING flag. Without it there's no hope of getting the right behavior.

Put differently, the only way Trio argues against this change is if it has a SUPPRESS_EXCEPTION_PRINTING flag, and that flag has generally been considered a good API.

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Jun 22, 2018

Simply assigning a parent to a task is not enough, since you still have to decide when to print (or otherwise handle) the child task's exception.

My understanding is that if a child task dies, the error propagates into the parent. If the parent handles it and continues, all is well (the child is still dead, but everything else continues). If the parent doesn't handle the child's exception, then it too dies, which kills off all its children—i.e. the sibling tasks of the original child task—and the error propagates further up to the grandparent of the original child. If no task ever handles the exception, it eventually reaches the root of the task tree and the entire process terminates with the child's exception as the cause. Hence the claim "Trio never discards exceptions"—either some task handles an exception or it kills the process.

@JeffBezanson
Copy link
Member Author

If the parent doesn't handle the child's exception, then it too dies

I assume that happens when the parent task exits? I.e. when a task exits, it waits on all of its child tasks, unless they were explicitly "detached" somehow. That might be ok. But it still doesn't argue for printing an exception as soon as a task dies before anybody has called wait on it, so I still want to merge this PR.

@StefanKarpinski
Copy link
Member

so I still want to merge this PR.

I'm not really arguing against this PR but I do think that we should do something so that every exception must be handled. It's true that the trio approach does not entail a failed task printing its error as soon as it ends, it just has a comparable effect because if no one catches the error then the root task prints its error and quits.

I.e. when a task exits, it waits on all of its child tasks, unless they were explicitly "detached" somehow.

I'm not 100% clear on that. The parent may be "blocked" until all of its children return which is a somewhat different model than what we've had. But I could be wrong.

@JeffBezanson
Copy link
Member Author

if no one catches the error then the root task prints its error and quits.

I'm fine with something like this in theory. But the whole problem is defining what it means for nobody to catch the error --- how long do you wait for somebody to catch the error? And Trio does provide an escape hatch:

The only way to violate causality (which is an important feature, just one that needs to be handled carefully) is to explicitly create a nursery object in one task and then pass it into another task.
(from https://trio.readthedocs.io/en/latest/design.html)

I'm not sure I fully understand this, but I believe it means you can have a "rogue" nursery sitting around with possibly-failed tasks in it, waiting (possibly forever) for somebody to look at it.

@JeffBezanson JeffBezanson added the triage This should be discussed on a triage call label Jun 25, 2018
@JeffBezanson
Copy link
Member Author

Does anybody consider this a breaking API change? Seems like a borderline case.

@JeffBezanson
Copy link
Member Author

Triage seems to be ok with this. We can try to do something better in the future, possibly handling it in the gc.

@JeffBezanson JeffBezanson merged commit e55f9b5 into master Jun 28, 2018
@JeffBezanson JeffBezanson deleted the jb/taskfailure branch June 28, 2018 18:37
@JeffBezanson JeffBezanson removed the triage This should be discussed on a triage call label Jun 28, 2018
@vtjnash
Copy link
Member

vtjnash commented Jun 28, 2018

:,( The end of a useful feature that has helped find numerous bugs and broken tests. I will impatiently await the day when we get that something better in the future!

@JeffBezanson
Copy link
Member Author

Are you ok with printing on gc? Or anything that's not a race condition? The race condition is really what I object to most here.

@StefanKarpinski
Copy link
Member

I think printing on GC seems like a good compromise.

@vtjnash
Copy link
Member

vtjnash commented Jun 28, 2018

Yes, I agree that it is nice to remove the race condition, although since it only happens when the code fails, it seems like it would be rare to observe in practice. Handling it in the GC (or at least, spawning a task to do the printing) seems good.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
error handling Handling of exceptions by Julia or the user parallelism Parallel or distributed computation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants