What to do about asynchronous exceptions #52291
This was originally a Slack thread, so I missed some prior discussion, but cancellation was also extensively discussed in #33248.
@gbaraldi raised the question of what to do on allocation failure. Copying my response here: I think we need to separate allocation failure into two distinct cases.

I think it's fine to have an explicit, synchronous exception for the former. I think the latter just needs to freeze the task - potentially with some notification to an OOM monitor. We just do way too much explicit allocation to have this turn into an exception. The optimizer should also have the liberty to turn the former into the latter (if it can prove the allocation is small, or if it knows the allocation can be elided entirely, regardless of size).
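For reference, an explicit oversized allocation already surfaces as a synchronous, catchable error today. A minimal sketch (the size below is illustrative; it merely needs to exceed any plausible address space):

```julia
# An allocation request far beyond the machine's address space fails
# inside the allocator and is reported synchronously as OutOfMemoryError,
# rather than asynchronously interrupting the task.
err = try
    Vector{Float64}(undef, 10^18)  # ~8 exabytes
    nothing
catch e
    e
end
```

The open question above is about the other case: small, pervasive allocations, where an exception at every possible allocation site is unworkable.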
Does it mean that Ctrl-C would not actually interrupt the process, but only "hide" that it keeps running?
It would be even more frustrating to realise that the "interrupted" process actually has not been interrupted and keeps changing the workspace state in the background.
Yes, but this is the "emergency fallback mode", and would have to be appropriately messaged to the user. I'm thinking red flashing REPL prompt or something with a big red warning message.
I think it would be fine to suspend the hung thread by default, with an option to resume it explicitly using some command. Again, 99% of users are not expected to ever hit this state. It's supposed to improve the situation where currently you get crashes, segfaults and arbitrary memory writes on ^C.
Perhaps the "emergency fallback mode" should use a different combination instead of Ctrl-C. SIGINT has a clear semantic of interrupting the process, and it doesn't necessarily imply that the user should regain control at any cost. It might be counterintuitive to let the user "interrupt" the process and then inform them that it hasn't actually been interrupted with flashy messages. Instead of abruptly exiting after multiple Ctrl-C presses, Julia could display a message suggesting a different keyboard combination for entering the "emergency fallback mode."
It sounds like a good idea, but it's not what I would expect from the intended function of Ctrl-C. There are specific signals (
This is mostly about the behavior in the REPL. In that instance, Julia is taking on job control responsibilities, and there's no reason to require that to match POSIX job control semantics. That said, I think the idea of separating cancellation and suspension is reasonable. We'd have to play with it and see what people like best.
Nathaniel J Smith of "
Since this is not modeled by the exception logic, and it can interrupt arbitrary program state or corrupt locks (leading to hangs and other issues), as well as frequently segfaulting again afterwards, print a message as soon as we notice things are going badly, before attempting to recover. For example:

```
$ ./julia -e 'f() = f(); f()'
Warning: detected a stack overflow; program state may be corrupted, so further execution might be unreliable.
ERROR: StackOverflowError:
Stacktrace:
 [1] f() (repeats 2 times)
   @ Main ./none:1
```

Refs #52291
Note that this can actually cause segfaults currently:
The TLDR here is that type inference proves that
Why did we turn on exception type inference if we are not doing it soundly? Introducing UB for any user who uses
^C has never been sound; e.g. if you hit it in LLVM, it's pretty common to just get corrupted state and crash. That said, as we optimize exceptions more, this issue is of course more pressing. I agree that the exception type confusion (which this issue predates, so I didn't include it in the original list) is particularly nasty. That said, I think we should look at what it takes to fix this properly. Playing with the "how UB is it" dial is fine, but the correct answer is obviously just to turn that dial all the way to the left.
Following up on recent discussions caused by our improvements to the modeling of exception propagation, I've been thinking again about the semantics of asynchronous exceptions (in particular StackOverflowError and InterruptException). This is of course not a new discussion. See e.g. #4037, #7026, #15514. However, because of our recent improvements to exception flow modeling, I think this issue has gained new urgency.
Current situation
Before jumping into some discussion about potential enhancements, let me summarize some relevant history and our current state. Let me know if I forgot anything and I'll edit it in.
InterruptException
By default, we defer interrupt exceptions to the next GC safepoint. This helps avoid corruption caused by unwinding over state in C that isn't interrupt safe. It helps a bit, but of course, if you are actually in this kind of region, your experience will be something like the following:
Which isn't any better than before (because we're falling back to what we used to do) and arguably worse (because you had to press ^C many times).
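For completeness, code can already shield a non-interrupt-safe region from ^C using `disable_sigint` (existing Base API), which masks SIGINT delivery for the duration of the passed function:

```julia
# disable_sigint masks SIGINT while the closure runs, so an asynchronous
# InterruptException cannot unwind through this region; the interrupt is
# delivered once the mask is lifted.
result = disable_sigint() do
    # stand-in for a non-interrupt-safe C call
    1 + 1
end
```

This is opt-in per call site, though, which is exactly why the safepoint deferral above exists as a blanket default.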
StackOverflowError
StackOverflowError just exposes the OS notion of stack overflow (i.e. if something touches the guard page, the OS sends us a SEGV, which we turn into an appropriate Julia error). We are slightly better here than we used to be, since we now at least stack-probe large allocations (#40068).
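Concretely, the guard-page mechanism is what makes the following work at all (a minimal sketch):

```julia
# Runaway recursion eventually touches the guard page; the OS delivers
# SIGSEGV, which the runtime rethrows as a catchable StackOverflowError.
recurse() = recurse()

err = try
    recurse()
    nothing
catch e
    e
end
```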
Nevertheless, this again is still not particularly well defined semantically. For example, is the following actually semantically sound:
`julia/base/strings/io.jl`, lines 32 to 40 at 187e8c2
I think the answer is probably "no", because setting up the exception frame touches the stack, so we could be generating a StackOverflowError after the lock, but before we enter the try/finally region. Additionally, if we are close enough to the stack limit to cause a stack overflow, there's no guarantee that we won't immediately hit that same stack overflow again trying to run the unlock code.
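For illustration, the shape of the code in question is roughly the following (a reconstruction, not the exact source); the comments mark the two hazard windows described above:

```julia
function print_locked(io::IO, xs...)
    lock(io)          # the lock is held from here on
    # Hazard 1: a StackOverflowError raised while setting up the try
    # frame below would unwind past the lock without ever releasing it.
    try
        for x in xs
            print(io, x)
        end
    finally
        # Hazard 2: if we are already near the stack limit, running the
        # unlock code itself may overflow again.
        unlock(io)
    end
    return nothing
end
```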
Recent try/finally elision
On master, inference has the capability to reason about the type of exceptions and whether or not catch blocks run. As a result, we can end up eliding try/finally blocks if everything inside the try block is proven nothrow:
As a result, we can get the following behavior:
i.e. the finally block never ran.
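A minimal sketch of the shape involved: under ordinary synchronous semantics the `finally` always runs, but once inference proves the `try` body nothrow and elides the enter/leave bracketing, an asynchronous `InterruptException` delivered inside the body can skip the cleanup entirely:

```julia
const cleanup_ran = Ref(false)

function guarded()
    try
        # Provably nothrow, so on master the compiler may elide the
        # try/finally bracketing for this block entirely.
        return 1 + 1
    finally
        # With the bracketing elided, an asynchronous interrupt landing
        # in the body above can unwind without ever running this.
        cleanup_ran[] = true
    end
end
```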
Some thoughts on how to move forward
I don't think I really have answers, but here are some scattered thoughts:
A possible design for cancellation
I think the general consensus among language and API designers is that arbitrary cancellation is unworkable as an interface. Instead, one should favor explicit cancellation requests and cancellation checks. In that vein, we could consider having an explicit `@cancel_check` macro that expands to a `cancellation_requested` check. For more complex use cases, `cancellation_requested` could be called directly and additional cleanup performed (e.g. requesting the cancellation of any synchronously launched I/O operations). As an additional optimization, we can take advantage of our (recently much improved) `:effect_free` modeling to add the ability to reset (by longjmp) to the previous `cancellation_requested` check if there have been no intervening side effects. This extension could then also be used by external C libraries to register their own cancellation mechanism, in effect giving us back some variant of scoped asynchronous cancellation, but only when semantically unobservable or explicitly opted into.

That of course leaves the question of what would happen if there is no cancellation point set. My preference here would be to wait a reasonable amount of time (a few seconds or so, bypassable by a second press of ^C) and, if no cancellation point is reached in time, suspend the offending task rather than throwing into it.
This way, we never throw any unsafe asynchronous exceptions that could corrupt the process state, but give the user back a REPL that they can use to either investigate the problem or at least save any in-progress work they may have. There are few things more frustrating than losing your workspace state because the ^C you pressed happened to corrupt and crash your process.
One final note here is to ask what should happen while we're in inference or LLVM. Since they are not modeled, we are not semantically supposed to throw any InterruptExceptions there. With the design above, the answer would be that on entry we would stop inferring/compiling things, instead proceeding in the interpreter in the hope of hitting the next cancellation point as quickly as possible. If cancellation becomes active while we are compiling, we would try to bail out as soon as feasible.
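The cancellation-point design could be sketched as follows. Note that `cancellation_requested` and `@cancel_check` are hypothetical names from the proposal above; nothing like them exists in Base today, and the runtime would set the flag on ^C rather than user code:

```julia
# Hypothetical sketch of the proposed API; none of these names exist in Base.
const CANCEL_REQUESTED = Threads.Atomic{Bool}(false)

# In the real design this flag would be set by the runtime on ^C.
cancellation_requested() = CANCEL_REQUESTED[]

macro cancel_check()
    quote
        if cancellation_requested()
            throw(InterruptException())
        end
    end
end

function long_computation(n)
    acc = 0
    for i in 1:n
        @cancel_check   # explicit cancellation point: the only place we throw
        acc += i
    end
    return acc
end
```

The key property is that `InterruptException` is only ever raised at the marked points, so the compiler's exception and effects modeling remains sound everywhere else.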
My recommendation
Having written all this down, I think my preference would be a combination of the above cancellation proposal with some mechanism to avoid StackOverflowErrors entirely. To start with, I think we could enable some sort of segmented task stack support, but treat triggering it as an error to be thrown at the next cancellation point. I think we should also investigate whether we can more fully model a function's stack size requirements, since we now tend to devirtualize more aggressively. If we can, then we could consider using a segmented stack mechanism more widely, but even if there is some performance penalty, getting rid of the possibility of asynchronous exceptions is well worth it.