Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ACCESS_VIOLATION when dereferencing once_cell::Lazy in closure with LTO #81408

Closed
roblabla opened this issue Jan 26, 2021 · 18 comments · Fixed by #103353
Closed

ACCESS_VIOLATION when dereferencing once_cell::Lazy in closure with LTO #81408

roblabla opened this issue Jan 26, 2021 · 18 comments · Fixed by #103353
Labels
A-linkage Area: linking into static, shared libraries and binaries A-lto Area: Link Time Optimization C-bug Category: This is a bug. I-unsound Issue: A soundness hole (worst kind of bug), see: https://en.wikipedia.org/wiki/Soundness O-windows-msvc Toolchain: MSVC, Operating system: Windows P-high High priority T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@roblabla
Copy link
Contributor

roblabla commented Jan 26, 2021

The code in the following repository crashes with an ACCESS_VIOLATION on Windows: https://github.com/roblabla/reproduce-boom/

To reproduce, four things are needed:

  • Using LLD linker
  • Building with Thin LTO
  • Splitting the main binary and a library.

EDIT: Better reproducer can be found here.

The main.rs file:

fn main() {
    reproducer::run()();
}

The lib.rs file:

use once_cell::sync::Lazy;

static CHANNEL: Lazy<()> = Lazy::new(|| ());

pub fn run() -> impl FnOnce() {
    || {
        let _ = *CHANNEL;
        println!("Did not crash.")
    }
}

Running cargo run --release will trigger the following output:

   Finished release [optimized + debuginfo] target(s) in 0.02s
     Running `C:/Users/roblabla/cargo_target\release\reproducer.exe`
error: process didn't exit successfully: `C:/Users/roblabla/cargo_target\release\reproducer.exe` (exit code: 0xc0000005, STATUS_ACCESS_VIOLATION)

This issue seems related to ThinLTO. It only occurs on windows.

Meta

rustc --version --verbose:

rustc 1.49.0 (e1884a8e3 2020-12-29)
binary: rustc
commit-hash: e1884a8e3c3e813aada8254edfa120e85bf5ffca
commit-date: 2020-12-29
host: x86_64-pc-windows-msvc
release: 1.49.0
Backtrace

[0x0]   reproducer!core::sync::atomic::atomic_compare_exchange + 0x3
[0x1]   reproducer!core::sync::atomic::AtomicUsize::compare_exchange + 0x3
[0x2]   reproducer!core::sync::atomic::AtomicUsize::compare_and_swap + 0x3
[0x3]   reproducer!once_cell::imp::wait + 0x5c
[0x4]   reproducer!once_cell::imp::initialize_inner + 0xcc
[0x5]   reproducer!once_cell::imp::OnceCell<tuple<>>::initialize<tuple<>,closure-0,once_cell::sync::{{impl}}::get_or_init::Void> + 0x41
[0x6]   reproducer!once_cell::sync::OnceCell<tuple<>>::get_or_try_init + 0x19
[0x7]   reproducer!once_cell::sync::OnceCell<tuple<>>::get_or_init + 0x19
[0x8]   reproducer!once_cell::sync::Lazy<tuple<>, fn()>::force + 0x19
[0x9]   reproducer!once_cell::sync::{{impl}}::deref + 0x19
[0xa]   reproducer!reproducer::run::{{closure}} + 0x19
[0xb]   reproducer!core::future::from_generator::{{impl}}::poll + 0x19
[0xc]   reproducer!reproducer::run + 0x7d
[0xd]   reproducer!reproducer::main + 0xc6
[0xe]   reproducer!core::ops::function::FnOnce::call_once + 0x2
[0xf]   reproducer!std::sys_common::backtrace::__rust_begin_short_backtrace<fn(),tuple<>> + 0x6
[0x10]   reproducer!std::rt::lang_start::{{closure}}<tuple<>> + 0xc
[0x11]   reproducer!core::ops::function::impls::{{impl}}::call_once + 0x6
[0x12]   reproducer!std::panicking::try::do_call + 0x6
[0x13]   reproducer!std::panicking::try + 0x6
[0x14]   reproducer!std::panic::catch_unwind + 0x6
[0x15]   reproducer!std::rt::lang_start_internal + 0xd9
[0x16]   reproducer!main + 0x27
[0x17]   reproducer!invoke_main + 0x22
[0x18]   reproducer!__scrt_common_main_seh + 0x10c
[0x19]   KERNEL32!BaseThreadInitThunk + 0x14
[0x1a]   ntdll!RtlUserThreadStart + 0x21

@roblabla roblabla added the C-bug Category: This is a bug. label Jan 26, 2021
@roblabla roblabla changed the title ACCESS_VIOLATION when dereferencing once_cell::Lazy in an async block ACCESS_VIOLATION when dereferencing once_cell::Lazy in closure with LTO Jan 26, 2021
@CryZe
Copy link
Contributor

CryZe commented Jan 26, 2021

The reproducer::CHANNEL symbol gets mislinked. Here's where the symbol's address gets loaded into the register:

https://i.imgur.com/2mDT6oz.png

If you check what's at that address, you find it's the beginning of the binary that is loaded as read only into the RAM:

https://i.imgur.com/f175N4Y.png

which later segfaults as that is interpreted as the Lazy's internal AtomicUsize, at which point everything goes wrong:

https://i.imgur.com/fuyXtp1.png

The +crt-static is not necessary btw. It's a problem involving the fact that there's multiple crates, rust-lld and thin-lto.

@jonas-schievink jonas-schievink added the I-unsound Issue: A soundness hole (worst kind of bug), see: https://en.wikipedia.org/wiki/Soundness label Jan 26, 2021
@rustbot rustbot added the I-prioritize Issue: Indicates that prioritization has been requested for this issue. label Jan 26, 2021
@Stupremee Stupremee added A-linkage Area: linking into static, shared libraries and binaries A-lto Area: Link Time Optimization T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. O-windows Operating system: Windows O-windows-msvc Toolchain: MSVC, Operating system: Windows and removed O-windows Operating system: Windows labels Feb 2, 2021
@apiraino
Copy link
Contributor

apiraino commented Feb 3, 2021

@rustbot ping windows

@rustbot
Copy link
Collaborator

rustbot commented Feb 3, 2021

Hey Windows Group! This bug has been identified as a good "Windows candidate".
In case it's useful, here are some instructions for tackling these sorts of
bugs. Maybe take a look?
Thanks! <3

cc @arlosi @danielframpton @gdr-at-ms @kennykerr @luqmana @lzybkr @nico-abram @retep998 @rylev @sivadeilra

@rustbot rustbot added the O-windows Operating system: Windows label Feb 3, 2021
@apiraino apiraino added P-high High priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Feb 3, 2021
@apiraino
Copy link
Contributor

apiraino commented Feb 3, 2021

We think this warrants a P-high priority but feel free to adjust. Zulip thread at Prioritization Working Group.

Could be interesting checking since when we have this behaviour

@memoryruins
Copy link
Contributor

memoryruins commented May 9, 2021

With nearly the same set-up ( windows-msvc, rust-lld, thinlto, but without +crt-static), I also experienced (exit code: 0xc0000005, STATUS_ACCESS_VIOLATION) in a project.

After seeing this issue, I swapped once_cell out with the unstable std api std::lazy::SyncLazy, which is a slightly different implementation of once_cell::sync::Lazy. One difference is that the impl in nightly std uses std::sync::Once while the once_cell crate copies Once from std but makes some changes.

Rather than an access violation, the program panicked with the following

thread 'main' panicked at 'assertion failed: state_and_queue & STATE_MASK == RUNNING', library\std\src\sync\once.rs:425:21

// All other values must be RUNNING with possibly a
// pointer to the waiter queue in the more significant bits.
assert!(state_and_queue & STATE_MASK == RUNNING);

The following example minimizes the SyncLazy type somewhat and combines it the first example. With the combination of windows-msvc, rust-lld, and thinlto, the panic occurs in std::sync::once::Once::call_inner rather than from the unimplemented! stubs.

main.rs:

fn main() {
    example::run()();
}
lib.rs:
use std::{cell::Cell, marker::PhantomData, sync::Once};

static X: SyncLazy<()> = SyncLazy::new(|| ());

pub fn run() -> impl FnOnce() {
    || {
        let _ = SyncLazy::force(&X);
    }
}

struct SyncOnceCell<T> {
    once: Once,
    _phantom: PhantomData<T>,
}

impl<T> SyncOnceCell<T> {
    const fn new() -> Self {
        SyncOnceCell {
            once: Once::new(),
            _phantom: PhantomData,
        }
    }
    fn get_or_init<F>(&self, _f: F) -> &T
    where
        F: FnOnce() -> T,
    {
        self.once.call_once(|| unimplemented!());
        unimplemented!()
    }
}

struct SyncLazy<T, F = fn() -> T> {
    cell: SyncOnceCell<T>,
    _init: Cell<Option<F>>,
}

unsafe impl<T, F: Send> Sync for SyncLazy<T, F> where SyncOnceCell<T>: Sync {}

impl<T, F> SyncLazy<T, F> {
    const fn new(f: F) -> Self {
        SyncLazy {
            cell: SyncOnceCell::new(),
            _init: Cell::new(Some(f)),
        }
    }
    fn force(this: &Self) -> &T {
        this.cell.get_or_init(|| unimplemented!())
    }
}

cargo run --release

backtrace
     Running `target\release\example.exe`
thread 'main' panicked at 'assertion failed: state_and_queue & STATE_MASK == RUNNING', library\std\src\sync\once.rs:425:21
stack backtrace:
   0: std::panicking::begin_panic_handler
             at /rustc/881c1ac408d93bb7adaa3a51dabab9266e82eee8\/library\std\src\panicking.rs:493      
   1: core::panicking::panic_fmt
             at /rustc/881c1ac408d93bb7adaa3a51dabab9266e82eee8\/library\core\src\panicking.rs:92      
   2: core::panicking::panic
             at /rustc/881c1ac408d93bb7adaa3a51dabab9266e82eee8\/library\core\src\panicking.rs:50      
   3: std::sync::once::Once::call_inner
             at /rustc/881c1ac408d93bb7adaa3a51dabab9266e82eee8\/library\std\src\sync\once.rs:425      
   4: std::sync::once::Once::call_once
   5: example::lazy::SyncOnceCell<T>::get_or_init
   6: example::lazy::SyncLazy<T,F>::force
   7: core::ops::function::FnOnce::call_once{{vtable.shim}}
meta
rustc 1.54.0-nightly (881c1ac40 2021-05-08)
binary: rustc
commit-hash: 881c1ac408d93bb7adaa3a51dabab9266e82eee8
commit-date: 2021-05-08
host: x86_64-pc-windows-msvc
release: 1.54.0-nightly
LLVM version: 12.0.0

@memoryruins
Copy link
Contributor

memoryruins commented May 24, 2021

It looks like some variation of this issue was reported in #71504, but eventually went away. Guess some form of the issue returned, or this issue is finding another way to trigger the underlying issue. A similar set-up was required to trigger the access violation #71504 (comment)

edit: I am able to produce exit code: 0xc0000005, STATUS_ACCESS_VIOLATION with cargo run --release with the minimal example @Aaron1011 created in the previous issue, which I will copy over here for convenience:

// lib.rs
use std::sync::atomic::{AtomicPtr, Ordering};

#[inline(always)]
pub fn memrchr() {
    fn detect() {}
		
    static FN: AtomicPtr<()> = AtomicPtr::new(detect as *mut ());
	
    unsafe {
        let fun = FN.load(Ordering::SeqCst);
        std::mem::transmute::<*mut (), fn()>(fun)()
    }
}	

// main.rs
fn main() {
	badlld::memrchr();
}

@hkratz
Copy link
Contributor

hkratz commented Jul 20, 2021

To avoid duplicated work: Is anyone working on a concise bug report for the LLVM project?

@hkratz
Copy link
Contributor

hkratz commented Aug 22, 2021

That is unfortunately still an issue with LLD from LLVM 13.

@apiraino
Copy link
Contributor

apiraino commented Aug 23, 2021

thanks @hkratz for the reminder, adding I-nominated so this gets prioritized during the T-compiler meeting.

@rustbot label +I-nominated

@apiraino
Copy link
Contributor

apiraino commented Aug 26, 2021

So, this LLVM issue definitively needs to be reported upstream along with an LLVM-IR reproducible.

Seems that no on has yet owned this upstream bug report, so it's open for grab (and if you @hkratz want to own it, as suggested here please feel free to do so - thanks).

(Zulip discussion)

@rustbot label -I-nominated

Gohla added a commit to Gohla/ggg that referenced this issue Jan 23, 2022
- Disable thin LTO due to rust-lang/rust#81408
- Pin shaderc to 0.7.3 due to linker errors with newer version.
- Update dependencies.
@roblabla
Copy link
Contributor Author

Bumping. It's this bug's one-year anniversary 🎉.

I've tried to create a more minimal reproducer, but I don't have a whole lot of experience with LLVM stuff, so please forgive any blunders I might have made 😅 . The reproducer can be found here, and is essentially a reduction of this reproducer using no_std and panic=abort to remove as much unnecessary noise from the binary/asm/llvm. The LLVM IR is at https://github.com/roblabla/reproduce-boom/blob/nostd-minimal/reproducer.ll

The binary is very, very simple here. This is essentially the entire binary:

image

At the bottom, we see the memrchr::FN static that points to garbage, and the mainCRTStartup() loads it and calls it (which segfaults due to jumping to a garbage address).

@lqd
Copy link
Member

lqd commented Apr 20, 2022

This still reproduces with lld 14.0.0 (and therefore the current nightly's rust-lld). @nikic and @rust-lang/wg-llvm does that sound familiar to you ? (and if not, is the LLVM IR above enough for tracking this in a LLVM project issue ?)

@mati865
Copy link
Contributor

mati865 commented Apr 20, 2022

It does not reproduce with windows-gnu (LLD 14.0.0) so should it be labelled as O-windows instead of just O-windows-msvc?

@apiraino
Copy link
Contributor

apiraino commented Apr 20, 2022

It does not reproduce with windows-gnu (LLD 14.0.0) so should it be labelled as O-windows instead of just O-windows-msvc?

ok, probably the situation has changed compared to this comment.

@rustbot label -O-windows-msvc

@rustbot rustbot removed the O-windows-msvc Toolchain: MSVC, Operating system: Windows label Apr 20, 2022
@mati865
Copy link
Contributor

mati865 commented Apr 20, 2022

Oh sorry, I just put it in the wrong order 🤦🏻‍♂️

@rustbot label +O-windows-msvc -O-windows

@rustbot rustbot added O-windows-msvc Toolchain: MSVC, Operating system: Windows and removed O-windows Operating system: Windows labels Apr 20, 2022
@smmalis37
Copy link
Contributor

smmalis37 commented Oct 20, 2022

Has any progress been made here? We just ran into this issue too on rust 1.64.

@wesleywiser
Copy link
Member

I posted a fix in #103353. Huge thanks to @roblabla for creating the minimal repro which was instrumental in figuring out what was going on!

Manishearth added a commit to Manishearth/rust that referenced this issue Nov 8, 2022
…r=michaelwoerister

Fix Access Violation when using lld & ThinLTO on windows-msvc

Users report an AV at runtime of the compiled binary when using lld and ThinLTO on windows-msvc. The AV occurs when accessing a static value which is defined in one crate but used in another. Based on the disassembly of the cross-crate use, it appears that the use is not correctly linked with the definition and is instead assigned a garbage pointer value.

If we look at the symbol tables for each crates' obj file, we can see what is happening:

*lib.obj*:

```
COFF SYMBOL TABLE
...
00E 00000000 SECT2  notype       External     | _ZN10reproducer7memrchr2FN17h612b61ca0e168901E
...
```

*bin.obj*:

```
COFF SYMBOL TABLE
...
010 00000000 UNDEF  notype       External     | __imp__ZN10reproducer7memrchr2FN17h612b61ca0e168901E
...
```

The use of the symbol has the "import" style symbol name but the declaration doesn't generate any symbol with the same name. As a result, linking the files generates a warning from lld:

> rust-lld: warning: bin.obj: locally defined symbol imported: reproducer::memrchr::FN::h612b61ca0e168901 (defined in lib.obj) [LNK4217]

and the symbol reference remains undefined at runtime leading to the AV.

To fix this, we just need to detect that we are performing ThinLTO (and thus, static linking) and omit the `dllimport` attribute on the extern item in LLVM IR.

Fixes rust-lang#81408
Manishearth added a commit to Manishearth/rust that referenced this issue Nov 8, 2022
…r=michaelwoerister

Fix Access Violation when using lld & ThinLTO on windows-msvc

Users report an AV at runtime of the compiled binary when using lld and ThinLTO on windows-msvc. The AV occurs when accessing a static value which is defined in one crate but used in another. Based on the disassembly of the cross-crate use, it appears that the use is not correctly linked with the definition and is instead assigned a garbage pointer value.

If we look at the symbol tables for each crates' obj file, we can see what is happening:

*lib.obj*:

```
COFF SYMBOL TABLE
...
00E 00000000 SECT2  notype       External     | _ZN10reproducer7memrchr2FN17h612b61ca0e168901E
...
```

*bin.obj*:

```
COFF SYMBOL TABLE
...
010 00000000 UNDEF  notype       External     | __imp__ZN10reproducer7memrchr2FN17h612b61ca0e168901E
...
```

The use of the symbol has the "import" style symbol name but the declaration doesn't generate any symbol with the same name. As a result, linking the files generates a warning from lld:

> rust-lld: warning: bin.obj: locally defined symbol imported: reproducer::memrchr::FN::h612b61ca0e168901 (defined in lib.obj) [LNK4217]

and the symbol reference remains undefined at runtime leading to the AV.

To fix this, we just need to detect that we are performing ThinLTO (and thus, static linking) and omit the `dllimport` attribute on the extern item in LLVM IR.

Fixes rust-lang#81408
Manishearth added a commit to Manishearth/rust that referenced this issue Nov 8, 2022
…r=michaelwoerister

Fix Access Violation when using lld & ThinLTO on windows-msvc

Users report an AV at runtime of the compiled binary when using lld and ThinLTO on windows-msvc. The AV occurs when accessing a static value which is defined in one crate but used in another. Based on the disassembly of the cross-crate use, it appears that the use is not correctly linked with the definition and is instead assigned a garbage pointer value.

If we look at the symbol tables for each crates' obj file, we can see what is happening:

*lib.obj*:

```
COFF SYMBOL TABLE
...
00E 00000000 SECT2  notype       External     | _ZN10reproducer7memrchr2FN17h612b61ca0e168901E
...
```

*bin.obj*:

```
COFF SYMBOL TABLE
...
010 00000000 UNDEF  notype       External     | __imp__ZN10reproducer7memrchr2FN17h612b61ca0e168901E
...
```

The use of the symbol has the "import" style symbol name but the declaration doesn't generate any symbol with the same name. As a result, linking the files generates a warning from lld:

> rust-lld: warning: bin.obj: locally defined symbol imported: reproducer::memrchr::FN::h612b61ca0e168901 (defined in lib.obj) [LNK4217]

and the symbol reference remains undefined at runtime leading to the AV.

To fix this, we just need to detect that we are performing ThinLTO (and thus, static linking) and omit the `dllimport` attribute on the extern item in LLVM IR.

Fixes rust-lang#81408
@bors bors closed this as completed in 7521a97 Nov 9, 2022
@Zoxc
Copy link
Contributor

Zoxc commented Aug 13, 2024

I did some investigating on why ThinLTO is required to reproduce this. It turns out ThinLTO is not generating __imp__ZN10reproducer7CHANNEL17hdd18d4377648bcdfE (used for static linking) in the library unlike when ThinLTO is unused. This means that lld must generate it itself and it seems to struggle with that unlike link.

Another fix for this issue could be to also generate these for ThinLTO, though I don't know where these are generated. They don't appear in the LLVM bitcode at least.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-linkage Area: linking into static, shared libraries and binaries A-lto Area: Link Time Optimization C-bug Category: This is a bug. I-unsound Issue: A soundness hole (worst kind of bug), see: https://en.wikipedia.org/wiki/Soundness O-windows-msvc Toolchain: MSVC, Operating system: Windows P-high High priority T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.