Stack overflow not caught in Drop for TLS data #111272

ridiculousfish · 2023-05-06T00:19:27Z

Rust attempts to catch stack overflow by installing a guard page at the end of the valid stack range. This guard page is stored in TLS and is torn down when a thread exits (including the main thread).

However, other thread-local data may be dropped after this guard page is torn down. Stack overflows occurring in these drop implementations are not detected by Rust. (They may be backstopped by the OS, but this is system dependent.)

To reproduce:

use std::thread_local;
struct First;
struct Second;

impl Drop for First {
    fn drop(&mut self) {
        // First is materialized in TLS in main().
        // Second is materialized in First's dtor, thereby registering a
        // dtor for Second.
        // This dtor is guaranteed to run after the Thread data (including
        // its guard page) has been unmapped.
        SECOND.with(|_s| {} );
    }
}

impl Drop for Second {
    fn drop(&mut self) {
        // Trigger a stack overflow.
        recurse(0);
    }
}

thread_local! {
    static FIRST: First = First;
}

thread_local! {
    static SECOND: Second = Second;
}

// Triggers a stack overflow.
fn recurse(count: usize) {
    if count < usize::MAX {
        recurse(count + 1);
    }
    println!("{} bottles of beer on the wall", count);
}

fn main() {
    FIRST.with(|_f| {});
}

This causes a SIGILL on macOS and a SIGSEGV on Linux. In both cases I confirmed that Rust's stack overflow signal handler is not run.

Reproduced on rust stable and master branch.
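
The ordering the repro relies on can be observed without overflowing the stack. The following is a minimal sketch, not part of the original report: it reuses the First/Second arrangement on a spawned thread and records drop order in a global log. The names LOG, log, and observe_drop_order are made up for illustration, and the sketch assumes a platform where joining a thread waits for its TLS destructors (as pthread_join does).

```rust
use std::sync::Mutex;
use std::thread;

// Hypothetical global event log, shared by the spawned thread and the caller.
static LOG: Mutex<Vec<&'static str>> = Mutex::new(Vec::new());

fn log(event: &'static str) {
    LOG.lock().unwrap().push(event);
}

struct First;
struct Second;

impl Drop for First {
    fn drop(&mut self) {
        log("First::drop");
        // SECOND is first touched here, so its destructor is registered
        // while First's destructor is already running.
        SECOND.with(|_s| {});
    }
}

impl Drop for Second {
    fn drop(&mut self) {
        log("Second::drop");
    }
}

thread_local! {
    static FIRST: First = First;
    static SECOND: Second = Second;
}

/// Spawns a thread that materializes FIRST, waits for the thread to exit,
/// and returns the recorded order of events.
pub fn observe_drop_order() -> Vec<&'static str> {
    LOG.lock().unwrap().clear();
    thread::spawn(|| {
        FIRST.with(|_f| {});
        log("thread closure returned");
    })
    .join()
    .unwrap();
    LOG.lock().unwrap().clone()
}
```

Calling observe_drop_order() from main should show both destructors running after the thread closure has returned, with Second dropped after First. That window after the closure returns is where the overflow in the repro goes undetected.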

workingjubilee · 2023-05-06T00:24:20Z

cc @m-ou-se may be of interest to you as you clean up the thread local impl.

bstrie · 2023-05-06T01:20:42Z

@ridiculousfish Do you believe that there is potential for UB from safe code here? If you can explain the circumstances where UB can arise here, I'll tag this with I-unsound (which is the tag used for the ability to invoke UB from safe code).

ridiculousfish · 2023-05-06T02:00:46Z

@bstrie There is no unsafe code in the repro case. I don't know if a stack overflow which is not caught by Rust's handler is considered UB, but I imagine it should be, else there would be no reason for the handler. So (conservatively) yes.

workingjubilee · 2023-05-06T03:04:50Z

I don't think it's ever actually been definitively answered as to whether the stack protector is considered a soundness constraint or a quality of implementation detail.

thomcc · 2023-05-06T05:50:30Z

In general it's target-specific if we implement it (and even depends if Rust is in control of main) so I think it's considered quality of implementation (if for no other reason than pragmatism).

That said, I think this is worth fixing. On most targets (I don't think all, but it's been a while) we have enough control over the order dtors are run in already (since we run them manually most of the time) that we should be able to ensure the guard teardown happens last.

thomcc · 2023-05-06T05:53:45Z

I can look at this this weekend.

joboet · 2023-05-06T08:14:58Z

Duplicate of #109785. In short, the problem is that

1. the guard page range is stored in the same TLS variable as the current Thread, so it is destroyed before the other TLS destructors are run, and
2. the signal handler stack is deallocated just before the thread exits, so when the signal handler runs, it accesses the NULL pointer, resulting in the crash.

The first problem is quite easily handled by putting the guard page location in a second TLS variable, but the second problem is more complicated. I tried to resolve these by eagerly running the destructors in the thread itself (see #109858) but that appears to interact badly with the platform libc.
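
A rough sketch of the first idea, under stated assumptions: the guard range is kept in its own TLS variable whose type has no destructor, so it stays readable while other TLS destructors run. This is not std's implementation. GUARD_RANGE, Probe, fault_is_in_guard, and the addresses are all made up for illustration, and the destructor only imitates the lookup a SIGSEGV handler would perform.

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

thread_local! {
    // Hypothetical guard page range as (start, end) addresses.
    // Cell<(usize, usize)> needs no destructor, so this variable is never
    // torn down before other TLS destructors run.
    static GUARD_RANGE: Cell<(usize, usize)> = const { Cell::new((0, 0)) };
    // A TLS value with a destructor, standing in for user data.
    static PROBE: Probe = const { Probe };
}

// Records what the destructor observed, so the caller can inspect it.
static SEEN_IN_DTOR: AtomicBool = AtomicBool::new(false);

struct Probe;

impl Drop for Probe {
    fn drop(&mut self) {
        // Imitates a fault handler asking "is this address in the guard page?"
        // while TLS destruction is in progress.
        let hit = fault_is_in_guard(0x1800);
        SEEN_IN_DTOR.store(hit, Ordering::SeqCst);
    }
}

/// Returns true if `fault_addr` lies inside the current thread's guard range.
pub fn fault_is_in_guard(fault_addr: usize) -> bool {
    let (start, end) = GUARD_RANGE.with(|g| g.get());
    start <= fault_addr && fault_addr < end
}

/// Spawns a thread, sets a made-up guard range, and reports whether the
/// range was still visible from inside a TLS destructor.
pub fn guard_visible_during_tls_drop() -> bool {
    SEEN_IN_DTOR.store(false, Ordering::SeqCst);
    thread::spawn(|| {
        GUARD_RANGE.with(|g| g.set((0x1000, 0x2000)));
        PROBE.with(|_| {});
    })
    .join()
    .unwrap();
    SEEN_IN_DTOR.load(Ordering::SeqCst)
}
```

This only addresses the lookup. The second problem, the signal stack being freed before TLS destructors run, is untouched by this sketch.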

workingjubilee · 2023-05-06T22:30:48Z

Despite being a duplicate, I am going to close the other issue instead, because this one has more useful data.

bjorn3 · 2023-05-17T08:49:49Z

> I don't know if a stack overflow which is not caught by Rust's handler is considered UB, but imagine it should be, else there would be no reason for the handler. So (conservatively) yes.

It is not UB. AFAIK the reason we have this handler at all is to keep people from thinking they managed to violate memory safety, even though a stack overflow is guaranteed to result in an abort through SIGSEGV on every platform where we or the OS set up a guard page and stack probing is used (currently only x86/x86_64, but ideally every arch). The guard page doesn't get removed when dropping TLS data.

ridiculousfish · 2023-05-17T19:22:09Z

To clarify, there are potentially two guard pages at play here:

1. One that the OS creates, which is not unmapped when dropping TLS.
2. One that the Rust runtime creates (edit: I previously made a mistake here, I linked to the munmap for the sigaltstack. I am unclear where Rust's mmap gets cleaned up).

Whether this second page is created is platform dependent.

bjorn3 · 2023-05-17T19:33:43Z

Didn't think about the unmapping of the rust created guard page. Looks like it is dropped right between returning from the thread main function and dropping TLS:

rust/library/std/src/sys/unix/thread.rs, lines 104 to 109 in ad23942:

            // Next, set up our stack overflow handler which may get triggered if we run
            // out of stack.
            let _handler = stack_overflow::Handler::new();
            // Finally, let's run some code.
            Box::from_raw(main as *mut Box<dyn FnOnce()>)();
        }

RalfJung · 2024-03-31T11:24:40Z

> In general it's target-specific if we implement it (and even depends if Rust is in control of main) so I think it's considered quality of implementation (if for no other reason than pragmatism).

AFAIK this is just a soundness bug for some targets and environments (like when main is not Rust). Stack overflow can mean we start colliding with the heap and that's clearly unsound. It's a hard-to-fix soundness bug though. Not sure if it is explicitly tracked anywhere.

> Didn't think about the unmapping of the rust created guard page. Looks like it is dropped right between returning from the thread main function and dropping TLS:

So that does sound like a soundness issue then, if the guard page no longer exists when TLS dtors run?

joboet · 2024-04-04T09:33:29Z

> So that does sound like a soundness issue then, if the guard page no longer exists when TLS dtors run?

It's not, because the guard page is provided by the system and available during TLS destruction. What's not available is our signal stack, which we unregister when the thread main function returns. Therefore, the signal handler will be run on the overflowing stack (more specifically: on the guard page), resulting in an immediate, system-caused second SIGSEGV.

RalfJung · 2024-04-04T09:54:14Z

> It's not, because the guard page is provided by the system

There seems to be contradictory information here: above, @bjorn3 said that the guard page is unmapped. That was the Rust guard page, which I presume is not the same thing as the system guard page. But if there is guaranteed to be a system guard page, then why do we even have a Rust guard page?

joboet · 2024-04-04T10:00:09Z

There is no Rust guard page, except for some platforms where we protect the main thread (but that one we never reset).

We do need a stack for the signal handler though, and that stack we free at thread exit, otherwise we'd leak quite some memory.

…cupiver std: make `thread::current` available in all `thread_local!` destructors ... and thereby allow the panic runtime to always print the right thread name. This works by modifying the TLS destructor system to schedule a runtime cleanup function after all other TLS destructors registered by `std` have run. Unfortunately, this doesn't affect foreign TLS destructors, `thread::current` will still panic there. Additionally, the thread ID returned by `current_id` will now always be available, even inside the global allocator, and will not change during the lifetime of one thread (this was previously the case with key-based TLS). The mechanisms I added for this (`local_pointer` and `thread_cleanup`) will also allow finally fixing rust-lang#111272 by moving the signal stack to a similar runtime-cleanup TLS variable.

Fixes rust-lang#111272. With rust-lang#127912 merged, we now have all the infrastructure in place to support stack overflow detection in TLS destructors. This was not possible before because the signal stack was freed in the thread main function, thus a SIGSEGV afterwards would immediately crash. And on platforms without native TLS, the guard page address was stored in an allocation freed in a TLS destructor, so would not be available. rust-lang#127912 introduced the `local_pointer` macro which allows storing a pointer-sized TLS variable without allocation and the `thread_cleanup` runtime function which is called after all other code managed by the Rust runtime. This PR simply moves the signal stack cleanup to the end of `thread_cleanup` and uses `local_pointer` to store every necessary variable. And so, everything run under the Rust runtime is now properly protected against stack overflows.
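
The ordering described here can be illustrated with a toy model. The sketch below is not std's code and does not use `local_pointer` or `thread_cleanup`; DtorList, run_then, and simulate_thread_exit are hypothetical names. It only shows the shape of the idea: drain every registered destructor, including ones registered while draining, and run the runtime cleanup (where the signal stack would be freed) last.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Toy model of a per-thread destructor list (hypothetical, not std's code).
pub struct DtorList {
    pending: RefCell<Vec<Box<dyn FnOnce(&DtorList)>>>,
}

impl DtorList {
    pub fn new() -> Self {
        DtorList { pending: RefCell::new(Vec::new()) }
    }

    /// Registers a destructor. May be called from inside another destructor.
    pub fn register(&self, dtor: impl FnOnce(&DtorList) + 'static) {
        self.pending.borrow_mut().push(Box::new(dtor));
    }

    /// Runs destructors until none are left, then runs `runtime_cleanup`.
    pub fn run_then(&self, runtime_cleanup: impl FnOnce()) {
        loop {
            // The borrow ends with this statement, so the destructor below
            // is free to register further destructors.
            let next = self.pending.borrow_mut().pop();
            match next {
                Some(dtor) => dtor(self),
                None => break,
            }
        }
        // Only now is it safe to release resources the destructors relied on.
        runtime_cleanup();
    }
}

/// Simulates the repro's shape: First's destructor registers Second's.
pub fn simulate_thread_exit() -> Vec<&'static str> {
    let log = Rc::new(RefCell::new(Vec::new()));
    let list = DtorList::new();
    let (l1, l2, l3) = (log.clone(), log.clone(), log.clone());
    list.register(move |list| {
        l1.borrow_mut().push("First::drop");
        list.register(move |_| {
            l2.borrow_mut().push("Second::drop");
        });
    });
    list.run_then(move || {
        l3.borrow_mut().push("free signal stack");
    });
    let events = log.borrow().clone();
    events
}
```

In this model a fault inside either destructor would still find the signal stack in place, because it is released only after the list is empty. As the earlier PR description notes, destructors registered outside the Rust runtime are not covered by this ordering.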

ridiculousfish added the C-bug Category: This is a bug. label May 6, 2023

workingjubilee added A-thread-locals Area: Thread local storage (TLS) A-thread Area: `std::thread` A-stack-probe Area: Stack probing and guard pages labels May 6, 2023

thomcc self-assigned this May 6, 2023

workingjubilee linked a pull request May 6, 2023 that will close this issue: Eagerly run TLS destructors to properly handle stack overflows #109858 (Closed)

workingjubilee mentioned this issue May 6, 2023: Tracking issue for cleaning up std's thread_local implementation details #110897 (Open, 24 tasks)

workingjubilee added the A-runtime Area: std's runtime and "pre-main" init for handling backtraces, unwinds, stack overflows label May 6, 2023

workingjubilee mentioned this issue May 6, 2023: Stack overflow in thread local's drop rendered as a segmentation fault. #109785 (Closed)

This was referenced Jul 17, 2024:

unix: document unsafety for std sig{action,altstack} #127843 (Merged)

std: make thread::current available in all thread_local! destructors #127912 (Merged)

joboet linked a pull request Oct 5, 2024 that will close this issue: std: detect stack overflows in TLS destructors on UNIX #131282 (Open)
