Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stack overflow not caught in Drop for TLS data #111272

Open
Tracked by #110897
ridiculousfish opened this issue May 6, 2023 · 15 comments · May be fixed by #131282
Open
Tracked by #110897

Stack overflow not caught in Drop for TLS data #111272

ridiculousfish opened this issue May 6, 2023 · 15 comments · May be fixed by #131282
Assignees
Labels
A-runtime Area: std's runtime and "pre-main" init for handling backtraces, unwinds, stack overflows A-stack-probe Area: Stack probing and guard pages A-thread Area: `std::thread` A-thread-locals Area: Thread local storage (TLS) C-bug Category: This is a bug.

Comments

@ridiculousfish
Copy link
Contributor

ridiculousfish commented May 6, 2023

Rust attempts to catch stack overflow by installing a guard page at the end of the valid stack range. This guard page is stored in TLS and is torn down when a thread exits (including the main thread).

However other thread local data may run drop after this guard page is torn down. Stack overflows occurring in these drop implementations are not detected by Rust. (It may be backstopped by the OS, but this is system dependent.)

To reproduce:

use std::thread_local;
struct First;
struct Second;

impl Drop for First {
    fn drop(&mut self) {
        // First is materialized in TLS in main().
        // Second is materialized in First's dtor, thereby registering a
        // dtor for Second.
        // This dtor is guaranteed to run after the Thread data (including
        // its guard page) has been unmapped.
        SECOND.with(|_s| {} );
    }
}

impl Drop for Second {
    fn drop(&mut self) {
        // Trigger a stack overflow.
        recurse(0);
    }
}

thread_local! {
    static FIRST: First = First;
}

thread_local! {
    static SECOND: Second = Second;
}

// Triggers a stack overflow.
fn recurse(count: usize) {
    if count < usize::MAX {
        recurse(count + 1);
    }
    println!("{} bottles of beer on the wall", count);
}

fn main() {
    FIRST.with(|_f| {});
}

This causes a SIGILL on macOS and a SIGSEGV on Linux. In both cases I confirmed that Rust's stack overflow signal handler is not run.

Reproduced on rust stable and master branch.

@ridiculousfish ridiculousfish added the C-bug Category: This is a bug. label May 6, 2023
@workingjubilee workingjubilee added A-thread-locals Area: Thread local storage (TLS) A-thread Area: `std::thread` A-stack-probe Area: Stack probing and guard pages labels May 6, 2023
@workingjubilee
Copy link
Member

cc @m-ou-se may be of interest to you as you clean up the thread local impl.

@bstrie
Copy link
Contributor

bstrie commented May 6, 2023

@ridiculousfish Did you believe that there is potential for UB from safe code here? If you can explain the circumstances where UB can arise here, I'll tag this with I-unsound (which is the tag used for the ability to invoke UB from safe code).

@ridiculousfish
Copy link
Contributor Author

@bstrie There is no unsafe code in the repro case. I don't know if a stack overflow which is not caught by Rust's handler is considered UB, but imagine it should be, else there would be no reason for the handler. So (conservatively) yes.

@workingjubilee
Copy link
Member

I don't think it's ever actually been definitively answered as to whether the stack protector is considered a soundness constraint or a quality of implementation detail.

@thomcc
Copy link
Member

thomcc commented May 6, 2023

In general it's target-specific if we implement it (and even depends if Rust is in control of main) so I think it's considered quality of implementation (if for no other reason than pragmatism).

That said, I think this is worth fixing. On most targets (I don't think all, but it's been a while) we have enough control over the order dtors are run in already (since we run them manually most of the time) that we should be able to ensure the guard teardown happens last.

@thomcc
Copy link
Member

thomcc commented May 6, 2023

I can look at this this weekend.

@thomcc thomcc self-assigned this May 6, 2023
@joboet
Copy link
Member

joboet commented May 6, 2023

Duplicate of #109785. In short, the problem is that

  1. the guard page range is stored in the same TLS variable as the current Thread, so it is destroyed before the other TLS destructors are run
  2. the signal handler stack is deallocated just before the thread exits, so when the signal handler runs, it accesses the NULL pointer, resulting in the crash.

The first problem is quite easily handled by putting the guard page location in a second TLS variable, but the second problem is more complicated. I tried to resolve these by eagerly running the destructors in the thread itself (see #109858) but that appears to interact badly with the platform libc.

@workingjubilee
Copy link
Member

Despite being a duplicate, I am going to close the other issue instead, because this one has more useful data.

@bjorn3
Copy link
Member

bjorn3 commented May 17, 2023

I don't know if a stack overflow which is not caught by Rust's handler is considered UB, but imagine it should be, else there would be no reason for the handler. So (conservatively) yes.

It is not UB. The reason we have this handler at all AFAIK is to avoid people thinking they managed to violate memory safety even though a stack overflow is guaranteed to result in an abort through SIGSEGV on every platform where we or the OS sets up a guard page and stack probing is used. (currently only x86/x86_64, but ideally every arch) The guard page doesn't get removed when dropping TLS data.

@ridiculousfish
Copy link
Contributor Author

ridiculousfish commented May 17, 2023

To clarify there are potentially two guard pages at play here:

  1. One that the OS creates, which is not unmapped when dropping TLS
  2. One that the Rust runtime creates (edit: I previously made a mistake here, I linked to the munmap for the sigaltstack. I am unclear where Rust's mmap gets cleaned up).

Whether this second page is created is platform dependent.

@bjorn3
Copy link
Member

bjorn3 commented May 17, 2023

Didn't think about the unmapping of the rust created guard page. Looks like it is dropped right between returning from the thread main function and dropping TLS:

// Next, set up our stack overflow handler which may get triggered if we run
// out of stack.
let _handler = stack_overflow::Handler::new();
// Finally, let's run some code.
Box::from_raw(main as *mut Box<dyn FnOnce()>)();
}

@RalfJung
Copy link
Member

In general it's target-specific if we implement it (and even depends if Rust is in control of main) so I think it's considered quality of implementation (if for no other reason than pragmatism).

AFAIK this is just a soundness bug for some targets and environments (like when main is not Rust). Stack overflow can mean we start colliding with the heap and that's clearly unsound. It's a hard-to-fix soundness bug though. Not sure if it is explicitly tracked anywhere.

Didn't think about the unmapping of the rust created guard page. Looks like it is dropped right between returning from the thread main function and dropping TLS:

So that does sound like a soundness issue then, if the guard page no longer exists when TLS dtors run?

@joboet
Copy link
Member

joboet commented Apr 4, 2024

So that does sound like a soundness issue then, if the guard page no longer exists when TLS dtors run?

It's not, because the guard page is provided by the system and available during TLS destruction. What's not available is our signal stack, which we unregister when the thread main function returns. Therefore, the signal handler will be run on the overflowing stack (more specifically: on the guard page), resulting in an immediate, system-caused second SIGSEGV.

@RalfJung
Copy link
Member

RalfJung commented Apr 4, 2024

It's not, because the guard page is provided by the system

There seems to be contradicting information here as above @bjorn3 said that the guard page is unmapped. That was the Rust guard page, which I presume is not the same thing as the system guard page, but if there is guaranteed to be a system guard page then why do we even have a Rust guard page?

@joboet
Copy link
Member

joboet commented Apr 4, 2024

There is no Rust guard page, except for some platforms where we protect the main thread (but that one we never reset).

We do need a stack for the signal handler though, and that stack we free at thread exit, otherwise we'd leak quite some memory.

workingjubilee added a commit to workingjubilee/rustc that referenced this issue Aug 29, 2024
…cupiver

std: make `thread::current` available in all `thread_local!` destructors

... and thereby allow the panic runtime to always print the right thread name.

This works by modifying the TLS destructor system to schedule a runtime cleanup function after all other TLS destructors registered by `std` have run. Unfortunately, this doesn't affect foreign TLS destructors, `thread::current` will still panic there.

Additionally, the thread ID returned by `current_id` will now always be available, even inside the global allocator, and will not change during the lifetime of one thread (this was previously the case with key-based TLS).

The mechanisms I added for this (`local_pointer` and `thread_cleanup`) will also allow finally fixing rust-lang#111272 by moving the signal stack to a similar runtime-cleanup TLS variable.
workingjubilee added a commit to workingjubilee/rustc that referenced this issue Aug 29, 2024
…cupiver

std: make `thread::current` available in all `thread_local!` destructors

... and thereby allow the panic runtime to always print the right thread name.

This works by modifying the TLS destructor system to schedule a runtime cleanup function after all other TLS destructors registered by `std` have run. Unfortunately, this doesn't affect foreign TLS destructors, `thread::current` will still panic there.

Additionally, the thread ID returned by `current_id` will now always be available, even inside the global allocator, and will not change during the lifetime of one thread (this was previously the case with key-based TLS).

The mechanisms I added for this (`local_pointer` and `thread_cleanup`) will also allow finally fixing rust-lang#111272 by moving the signal stack to a similar runtime-cleanup TLS variable.
bors added a commit to rust-lang-ci/rust that referenced this issue Oct 2, 2024
…piver

std: make `thread::current` available in all `thread_local!` destructors

... and thereby allow the panic runtime to always print the right thread name.

This works by modifying the TLS destructor system to schedule a runtime cleanup function after all other TLS destructors registered by `std` have run. Unfortunately, this doesn't affect foreign TLS destructors, `thread::current` will still panic there.

Additionally, the thread ID returned by `current_id` will now always be available, even inside the global allocator, and will not change during the lifetime of one thread (this was previously the case with key-based TLS).

The mechanisms I added for this (`local_pointer` and `thread_cleanup`) will also allow finally fixing rust-lang#111272 by moving the signal stack to a similar runtime-cleanup TLS variable.
bors added a commit to rust-lang-ci/rust that referenced this issue Oct 3, 2024
…viper

std: make `thread::current` available in all `thread_local!` destructors

... and thereby allow the panic runtime to always print the right thread name.

This works by modifying the TLS destructor system to schedule a runtime cleanup function after all other TLS destructors registered by `std` have run. Unfortunately, this doesn't affect foreign TLS destructors, `thread::current` will still panic there.

Additionally, the thread ID returned by `current_id` will now always be available, even inside the global allocator, and will not change during the lifetime of one thread (this was previously the case with key-based TLS).

The mechanisms I added for this (`local_pointer` and `thread_cleanup`) will also allow finally fixing rust-lang#111272 by moving the signal stack to a similar runtime-cleanup TLS variable.
joboet added a commit to joboet/rust that referenced this issue Oct 5, 2024
Fixes rust-lang#111272.

With rust-lang#127912 merged, we now have all the infrastructure in place to support stack overflow detection in TLS destructors. This was not possible before because the signal stack was freed in the thread main function, thus a SIGSEGV afterwards would immediately crash. And on platforms without native TLS, the guard page address was stored in an allocation freed in a TLS destructor, so would not be available. rust-lang#127912 introduced the `local_pointer` macro which allows storing a pointer-sized TLS variable without allocation and the `thread_cleanup` runtime function which is called after all other code managed by the Rust runtime. This PR simply moves the signal stack cleanup to the end of `thread_cleanup` and uses `local_pointer` to store every necessary variable. And so, everything run under the Rust runtime is now properly protected against stack overflows.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-runtime Area: std's runtime and "pre-main" init for handling backtraces, unwinds, stack overflows A-stack-probe Area: Stack probing and guard pages A-thread Area: `std::thread` A-thread-locals Area: Thread local storage (TLS) C-bug Category: This is a bug.
Projects
None yet
7 participants