Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong stacktrace shown when rethrowing an exception after catching another one #19979

Closed
martinholters opened this issue Jan 11, 2017 · 25 comments
Labels
error handling Handling of exceptions by Julia or the user

Comments

@martinholters
Copy link
Member

function foo()
    try
        error("error in foo")
    catch e
    end
end

function bar()
    try
        error("error in bar")
    catch e
        foo()
        rethrow(e)
    end
end

bar()

produces the misleading

ERROR: error in bar
Stacktrace:
 [1] foo() at ./REPL[1]:3
 [2] bar() at ./REPL[2]:5

This just hit me in a really wicked situation: The stacktrace actually lead me to a line erroneously throwing the reported exception. I fixed it (to throw the correct exception), but the same error kept showing, leaving me completely puzzled for quite some time...

@kshyatt kshyatt added the error handling Handling of exceptions by Julia or the user label Jan 11, 2017
@JeffBezanson
Copy link
Member

We should somehow clarify that if you care about the backtrace of an exception, you need to call catch_backtrace immediately.

@samoconnor
Copy link
Contributor

samoconnor commented Nov 14, 2017

I've been trying to figure out why the backtraces are mostly corrupted when my Retry.jl package is used. I have always assumed that my naive macro hacking was messing up the line number metadata in the AST somehow. I spent a long time chasing that down, but I now think the root of the problem is this issue.

It seems there are three problems: the backtrace can be empty; the rethrown backtrace can be wrong; or the rethrown error and backtrace can be wrong.

The minimal cases I've boiled it down to follow:

test1() -- works

Julia Version 0.7.0-DEV.2098 (2017-10-10 11:37 UTC)

function f()
    throw("Error in f()!")
end

function test1()
    try
        f()
    catch e
        if e == "foo"
            println("Got foo!")
        end
        rethrow(e)
    end
end
ERROR: LoadError: "Error in f()!"
Stacktrace:
 [1] f() at /Users/sam/julia/v0.5/Retry/src/extest.jl:2
 [2] test1() at /Users/sam/julia/v0.5/Retry/src/extest.jl:8
...
in expression starting at /Users/sam/julia/v0.5/Retry/src/extest.jl:50

test2() -- No stacktrace

function test2()
    try
        f()
    catch e
        println("catch!") # Update:, yield() is enough to loose the backtrace
        rethrow(e)
    end
end
catch!
ERROR: LoadError: "Error in f()!"
in expression starting at /Users/sam/julia/v0.5/Retry/src/extest.jl:50

test3() -- Wrong stacktrace

function isfoo(x)
    try
        x.code == "foo"
    catch e
        false
    end
end


function test3()
    try
        f()
    catch e
        if isfoo(e)
            println("Got foo!")
        end
        rethrow(e)
    end
end
ERROR: LoadError: "Error in f()!"
Stacktrace:
 [1] isfoo(::String) at /Users/sam/julia/v0.5/Retry/src/extest.jl:29
 [2] test3() at /Users/sam/julia/v0.5/Retry/src/extest.jl:43
...
in expression starting at /Users/sam/julia/v0.5/Retry/src/extest.jl:50

test4() -- Wrong error rethrown

The doc for rethrow() says "The default argument is the current exception (if called within a catch block)", but in this case rethrow is throwing the exception that was caught and ignored by isfoo():

function test4()
    try
        f()
    catch e
        if isfoo(e)
            println("Got foo!")
        end
        rethrow()
    end
end
ERROR: LoadError: type String has no field code
Stacktrace:
 [1] isfoo(::String) at /Users/sam/julia/v0.5/Retry/src/extest.jl:29
 [2] test3() at /Users/sam/julia/v0.5/Retry/src/extest.jl:43

@samoconnor
Copy link
Contributor

@JeffBezanson I'm not sure how to apply your advice that "you need to call catch_backtrace immediately".

I see that the runtime calls catch_backtrace() in _start().

It seems like maybe I need to do tmp = catch_backtrace() at the top of catch and set_global_catch_backtrace(tmp) before rethrow() to undo any backtrace corruption that happens in the catch block, but I can't see anything like set_global_catch_backtrace.

I found that exec_program in repl.c does a manual save and restore of the ptls->bt_data.

Maybe record_backtrace should save the old ptls->bt_data before overwriting it if there is already a current ptls->exception_in_transit ?

@samoconnor
Copy link
Contributor

related:
#12541
#12485

@samoconnor
Copy link
Contributor

Based on reading the issues references above, and looking at the code,
It seems that rethrow only works if the catch block does not:

  • cause a task switch,
  • throw an exception or
  • call anything that might cause a task switch or an exception.

Is that correct? @vtjnash @yuyichao @Keno @s2maki

@samoconnor
Copy link
Contributor

samoconnor commented Nov 15, 2017

related to #20784 ?
#13947

samoconnor added a commit to JuliaWeb/Retry.jl that referenced this issue Nov 21, 2017
Retry.jl used to use try/catch inside the catch block's "@ignore if ..."
condition. The purpose of this was to treat if conditions that throw
errors as false. e.f. @ignore if e.reason == "Foo" end might throw an
exception if e does not have a reason field. Retry.jl used to allow this
and treat it as false.

At present doing a nested try/catch in a catch block causes the Julia
stacktrace to be corrupted: JuliaLang/julia#19979 (comment)

After this commit, Retry.jl not requires @ignore if conditions to return
true or false and not throw exceptions.

The ecode() function has been exported to allow replacing

    @ignore if e.code == "404" end

with

    @ignore if ecode(e) == "404" end
samoconnor added a commit to JuliaCloud/AWSCore.jl that referenced this issue Nov 21, 2017
JuliaWeb/Retry.jl@6c565a3

JuliaLang/julia#19979 (comment)

Retry.jl no longer allows @ignore if conditions to throw errors.

To support this:
 - Dynamic generation of exception types is removed from AWSException.jl
   (All exceptions are now type AWSException with fields: code, message,
info and cause.
 - Exception handling conditions have added type checks.
@s2maki
Copy link
Contributor

s2maki commented Nov 27, 2017

I've definitely seen the case for test4 cause the wrong exception to be thrown, and the missing stack trace problem from test2. Our code that exhibits this behavior definitely seems to fit these examples.

@JeffBezanson
Copy link
Member

I think the fastest way to fix this is to add a method of rethrow that accepts a backtrace; that will at least allow it to be preserved manually via

catch e
    bt = catch_backtrace()
    ...
    rethrow(e, bt)
end

@JeffBezanson
Copy link
Member

Here's a patch for 0.6:

--- a/src/task.c
+++ b/src/task.c
@@ -576,6 +576,16 @@ JL_DLLEXPORT void jl_rethrow_other(jl_value_t *e)
     throw_internal(e);
 }
 
+JL_DLLEXPORT void jl_rethrow_other_bt(jl_value_t *e, jl_value_t *bt)
+{
+    size_t btsz = jl_array_len(bt);
+    size_t sz = btsz > JL_MAX_BT_SIZE ? JL_MAX_BT_SIZE : btsz;
+    jl_ptls_t ptls = jl_get_ptls_states();
+    memcpy(ptls->bt_data, jl_array_data(bt), sz * sizeof(void*));
+    ptls->bt_size = sz;
+    throw_internal(e);
+}
+
 JL_DLLEXPORT jl_task_t *jl_new_task(jl_function_t *start, size_t ssize)

With wrapper:

rethrow(e, bt::Array{Ptr{Void},1}) = ccall(:jl_rethrow_other_bt, Union{}, (Any,Any), e, bt)

This can be added to 0.7 as well though the patch will be significantly different. In fact we should probably fully deprecate rethrow to throw (allowing 1 or 2 arguments).

@c42f
Copy link
Member

c42f commented Jan 14, 2018

Naively it seems possible to fix this and make catch_backtrace/rethrow reliable by saving and restoring the backtrace data appropriately - somewhat similarly to jl_apply_with_saved_exception_state but with various complications.

Is it worth considering this approach or is there some fundamental reason why it won't work?

@Keno
Copy link
Member

Keno commented Jan 14, 2018

The problem is that collecting the backtrace is expensive, so you only want to do that if you're actually gonna save it. It is quite possible however to have some sort of magic symbol that the compiler looks for to see if it needs to save the backtrace (which could but need not be syntactically exposed). @JeffBezanson said there was an issue about that at some point, but I couldn't find it.

@c42f
Copy link
Member

c42f commented Jan 14, 2018

I spent a few hours poking around in the runtime. The backtrace may be expensive, but it looks like the cost of recording it is unconditionally borne inside jl_throw. The only way to prevent that is to somehow have a flag set in the error handler state. A syntactic hint like this old proposal: https://gist.github.com/vtjnash/9cd53434cc7f7e2f7a29 would allow that.

But that seems like a distraction - what I'm proposing is that if there's a current backtrace, we should record it when entering a try block, and restore it when leaving. Seems like a bit of pointer shuffling may be all that's required in jl_enter_handler/jl_pop_handler? And it should only hurt if you've already thrown an exception.

For context, I'm just mulling over what to do about Jeff's comment at #25370 (comment) and wondering if a solution to the underlying problem is within reach.

@c42f
Copy link
Member

c42f commented Jan 15, 2018

all that's required

Hah, I should have known better :-) . I started prototyping a solution, but restoring the exception can't happen inside jl_pop_handler / jl_eh_restore_state because that occurs before the code in the catch block (duh!). There's also some careful gc rooting required.

I now think we should generalize the exception_in_transit to be a stack of (exception,backtrace) states in transit. If you throw Exception2 while handling Exception1, you'd like to know about Exception1 as the root cause while further handling Exception2. This would mean jl_throw needs to push the new exception onto the stack. The lowering of catch blocks would need a final step added to pop the exception stack back to the state it was when the try was entered. I think I can see a trick to give JL_CATCH the same behavior.

To address the task switching problems, the current exception stack would need to be saved and restored during context switching which means more copies of bt_data floating around.

If we did all this, I believe catch_backtrace() can be made reliable and there would be no measurable performance penalty outside a catch block.

@samoconnor
Copy link
Contributor

@c42f a stack of backtraces is what I had in mind #19979 (comment) ...

Maybe record_backtrace should save the old ptls->bt_data before overwriting it if there is already a current ptls->exception_in_transit ?

@JeffBezanson
Copy link
Member

Yes, a stack of backtraces is a good idea. That would let us keep catch_backtrace and rethrow, and fix finally as well. rethrow() is also more efficient than having the user (or compiler) call catch_backtrace() and pass the result manually, since that requires allocating more julia objects to reflect the data safely.

@samoconnor
Copy link
Contributor

This reminds me of an old niggling thought: can we do better that the catch + rethrow API?
Can we do it in a way that discourages accidental loss (non-rethrowing) of important errors?
Can it be done in a way that makes requests for type-based catch redundant?

peek e as an alternative to catch e but with no need to rethrow. i.e. the error still propagates up the stack.

try
    ...
peek e
    @warn "my code caused an error" e
end

catch e where automatic rethrow is the default unless delete!(e) is called

try
    ...
catch e
    e isa IOError && e.code == :FOO && delete!(e)
end

@JeffBezanson JeffBezanson added this to the 1.0.x milestone Jan 23, 2018
@c42f
Copy link
Member

c42f commented Jan 25, 2018

@samoconnor that's an interesting idea; I think the key thing you're doing with the peek is capturing state from the scope where the peek occurs, and communicating it to an outer scope.

I think a better way to do this might be to add a new exception to the stack - that is, instead of the peek, to simply have:

try
    ...
catch
    throw(MyException("my code caused an error"))
end

Notably, because this throw occurs before the end of the catch block, the previous exceptions in the stack are preserved. So in the outer scope you have the full context, but you can also capture and transport arbitrary information in MyException.

In general I think the pattern of logging an exception and rethrowing it isn't great: Somewhere at an outer scope the exception will also be caught and reported in some way (probably via logging but without rethrowing). Logging and rethrowing just leads to duplicate error information due to several catch/log/rethrows down the stack. This is very confusing in log files. I'd say it's much better to preserve the information and make it available as a full stack of exceptions at outer scope instead.

@samoconnor
Copy link
Contributor

I now think we should generalize the exception_in_transit to be a stack of (exception,backtrace) states in transit.

Hi @c42f,
Have you been able to make any progress towards this?
I have a protocol layer stack where there are multiple layers that need to: catch errors, inspect error type, clean up layer-local resources, then retrhow. As things stand, the stack traces are almost always corrupted or missing by the time the exceptions reach the application layer. As things stand, I often find myself commenting-out all the intermediate try/catches to get a clean look at the original stack trace. This is especially frustrating because of catch catching all exception types. I have this hierarchy of fine grained IO exception handlers, but I have to disable it all if I make a typo in some code that leads to an argument error, or and undefined var error or a bounds error.

It would be really awesome to have rethrow working :)

(This reminds me of: #15906)

@c42f
Copy link
Member

c42f commented Feb 22, 2018

It's an interesting problem but I haven't made any implementation progress. My original motivation was to solve #25370 (comment), but in the end it didn't seem that fixing catch_backtrace() would help much. (Not to detract from the host of other good reasons to do this.)

@c42f
Copy link
Member

c42f commented Jul 2, 2018

I had a 24 hour plane journey so I started looking into this in more detail. Here's some notes I made for myself about how exceptions and backtraces are implemented in the runtime (perhaps could become devdocs if there's interest?)

Capturing backtraces

In the runtime, backtraces are recorded by calling the low level function
jl_unw_stepn. jl_unw_stepn can fill in two C arrays:

  • ip which is an array of native instruction pointers mixed with interpreter
    state where necessary. This array tells us which code was being executed
    at the time of the backtrace.
  • sp which is an array of pointers to native stack frames containing the
    program data. This optional array tells us which data was in use at the
    time of the backtrace.

For use from julia code, jl_unw_stepn is wrapped by jl_backtrace_from_here
in order to return arrays of julia types. For use internally by the runtime,
jl_unw_stepn is called directly from various functions. For instance,
throwing an exception using jl_throw records the backtrace into a raw ip
buffer attached to the thread local storage. This is a bare minimum of raw
information for efficiency and needs to be further converted into julia-level
information when calling catch_backtrace and associated functions.

Format of the raw instruction pointer array

The julia runtime executes programs as a mixture of interpreted and compiled
code. Therefore, a comprehensive instruction pointer array must contains a
mixture of interpreter program counters and native instruction pointers:

  • Interpreter program counters are coded as a triple consisting of a special
    sentinel value (uintptr_t)-1, followed by a jl_value_t pointer to the
    MethodInstance being interpreted, and an integer program counter indexing
    into the code for that method.
  • Native program instruction pointers are a single void pointer to the native
    code being executed.

To get stack traces to work when interpreted and compiled code are mixed
requires a bit of black magic to detect the interpreter frames using the raw
instruction pointer, and extract the interpreter state from the native stack.
This is done by always entering the interperter via a special thunk
enter_interpreter_frame, whose code address can be known to the runtime. The
stack layout in this hand written assembly function is also carefully
controlled so that the interpreter state can be extracted in a well defined
way.

Exception handling

Try-catch mechanism

Julia exception handling is built on top of the C setjmp/longjmp mechanism
with some additional logic in jl_eh_restore_state to restore the state of the
task when a longjmp occurs. This additional logic includes:

  • Removing GC roots in abandoned frames by popping the GC stack
  • Restoring task local GC state
  • Unlocking any mutexes which were held in the abandoned frames
  • Restoring the signal mask

The state to be restored must be recorded at the try site and be accessible at
the catch site both in C and in julia. In both cases it's ultimately stored on
the stack in a jl_handler_t which is initialized by jl_enter_handler and
consumed by jl_eh_restore_state.

The C mechanism is as follows:

  • Use of the JL_TRY macro introduces a local variable of type jl_handler_t,
    initializes it before calling setjmp, and calls jl_eh_restore_state if no
    exception is thrown.
  • JL_CATCH adds the else clause for entry when an exception triggers a
    longjmp. This includes a call to jl_eh_restore_state which uses the
    local jl_handler_t variable to restore task local state.

The julia codegen mechnism essentially mirrors the C version, but is a little
more complicated as lowering, code generation, and an LLVM optimization pass
are involved. Briefly, lowering turns try/catch into
Expr(:enter) and Expr(:leave) in the lowered linearized code. These are
then translated during LLVM codegen into calls to the C runtime functions given
above, by way of an intermediate dummy function @julia.enter and the LLVM
pass LowerExcHandlers. (Note that local jumps caused by break and @goto
also abandon a try block and are also lowered to Expr(:leave).)

Throwing exceptions

From julia

When an exception is raised with jl_throw:

  • A backtrace is recorded into the thread local state
  • Some GC safe point stuff happens which I don't understand
  • The exception object is recorded in the thread local state
  • The jl_handler_t for the current task is used to get the approriate
    jmp_buf which will reset execution to the current exception handler up the
    stack.

From the runtime

The runtime can raise exceptions using jl_throw; this works the same way as
from julia.

There's also some special circumstances which require extra care:

  • Stack overflows
  • Divide by zero errors
  • User initiated interruptions (^C)
  • Attempts to write into read-only memory.

These are caught by the runtime using signal handlers (for example, stack
overflows are detected by protecting a memory page on the known stack boundary,
and catching the SIGSEGV which results if the stack spills onto this page).
Crucially, all code inside a signal handler must be async-signal-safe - for
example, we cannot allocate memory using malloc or call any function which
does so.

@c42f
Copy link
Member

c42f commented Jul 2, 2018

To follow on from the above, a possible implementation plan:

First, fix the task-vs-thread local state bugs in the current implementation. This should give experience going to the next step:

  • Fix thread local exception problems. Either
    • Move exception_in_transit from thread local to task local storage in _jl_task_t, or
    • Leave as is, but persist the exception state into the task right after the catch, or during task switching.
      What's the difference here? There's two things:
    • A potential perf difference, trading task switching perf for task startup time and memory usage.
    • Putting exception state directly on the task, we can't throw exceptions before the root task is initialized. Is this a problem?
  • Fix thread local backtrace buffer. See above.

To tie things off, replace the task-based exception/backtrace with a stack:

  • Add stack of (exception,backtrace) pairs to the task, perhaps based on a linked list? Allow the first node to be reused so that perf doesn't crater when try is entered but no exception occurs?

@c42f
Copy link
Member

c42f commented Jul 20, 2018

I've been slowly poking away at this. It looks reasonably easy to preserve the exception and backtrace data during task switching - that's just a few changes to the runtime C code.

The more tricky problem is to insert an exception stack pop intrinsic (which I've currently called Expr(:pop_exc)) into the lowered code and more particularly to get it flowing through the various layers of optimization correctly. The desired lowering is something along the lines of

Before()
try
    InTry()
catch
    InCatch()
end
After()

lowers to

  Before()
  enter c1
  InTry()
  leave 1
  goto e1
c1:
  leave 1
  InCatch()
  pop_exc    #<- Needs to pop exception stack back to the state at associated `enter`
e1:
  After()

The state of the stack at enter can be recorded in the jl_handler_t and will be dynamically available when the associated leave executes. We need a little of this state available in the associated pop_exc.

I've got three ideas:

  • I've tried storing the label of the associated enter in the pop_exc expression but it ended up feeling slightly unnatural to propagate that label during SSAIR processing, particularly since it's not used as a jump target.
  • Another idea is to use a token which is yielded by the enter, and make pop_exc take that token as an argument. As I understand it, this is the kind of thing that the LLVM Token type was introduced for, and would then have a nice lowering to LLVM IR before we get to the LowerExcHandlers pass.
  • We could just re-deduce the association from the nesting context in LowerExcHandlers, provided the IR isn't reordered by any optimizations. I'm starting to think this would be the easiest as LowerExcHandlers already has similar assumptions.

@Keno any thoughts?

@c42f
Copy link
Member

c42f commented Jul 27, 2018

It turns out that of the ideas above, producing a token from enter plays extremely nicely with the implicit SSA variables in SSAIR, but also is easy to work with in both the interpreter and in codegen. So I'm going with that for now.

@c42f
Copy link
Member

c42f commented Aug 24, 2018

Here's a PR which fixes this: #28878

@c42f
Copy link
Member

c42f commented Oct 24, 2018

Fixed in #28878

@c42f c42f closed this as completed Oct 24, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
error handling Handling of exceptions by Julia or the user
Projects
None yet
Development

No branches or pull requests

7 participants