Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ABA-problem with pointer provenance in lockless queues #121950

Open
joboet opened this issue Mar 3, 2024 · 11 comments
Open

ABA-problem with pointer provenance in lockless queues #121950

joboet opened this issue Mar 3, 2024 · 11 comments
Labels
A-atomic Area: Atomics, barriers, and sync primitives I-unsound Issue: A soundness hole (worst kind of bug), see: https://en.wikipedia.org/wiki/Soundness P-medium Medium priority T-libs Relevant to the library team, which will review and decide on the PR/issue.

Comments

@joboet
Copy link
Member

joboet commented Mar 3, 2024

Both lockless queue implementations inside std for Once and RwLock suffer from the ABA problem, resulting in unsoundness.

In both cases, the enqueue operation roughly looks like the following:

struct Node {
    next: *mut Node,
}

struct Queue {
    state: AtomicPtr<Node>,
}

let mut state = queue.state.load();
loop {
    node.next = to_node(state); // Some pointer masking
    let next = to_state(node); // Some bittagging
    match state.compare_exchange(state, next) {
        Ok(_) => break,
        Err(new) => state = next,
    }
}

This code is problematic, because it allows the following to occur:

  1. Thread 1: Append node 1 to the queue.
  2. Thread 2: Load the state (a pointer to node 1) and store it as the next pointer of node 2.
  3. Thread 1: Remove node 1 from the queue and deallocate it/modify it.
  4. Thread 1: Create node 3 at the same address as node 1 and append it to the queue.
  5. Thread 2: Perform the CAS. Because the state has the same bits, this succeeds.

Now, any traversal operation on the queue that attempts to dereference the next pointer of node 2 exhibits UB, as the pointer, while having the right address, only has provenance to node 1, which is no longer valid, but may not access node 3.

This is of little practical concern, as it is extremely unlikely that LLVM will perform a problematic optimization; but nevertheless the code is unsound, both under stacked-borrows and LLVM's semantics.

This is the cause of #121626, where this exact situation led to a (correct) UB report from miri.

@rustbot label +T-libs +I-unsound

Possible solutions

The only solution I see at the moment is to use fuzzy provenance and use from_exposed_provenance inside the traversal operation to make miri/LLVM guess the right provenance.

@rustbot rustbot added needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. I-unsound Issue: A soundness hole (worst kind of bug), see: https://en.wikipedia.org/wiki/Soundness T-libs Relevant to the library team, which will review and decide on the PR/issue. I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Mar 3, 2024
@the8472
Copy link
Member

the8472 commented Mar 3, 2024

Since this is mac-specific why not use a DWCAS? Afaik they don't have any hardware that doesn't have those.

@joboet
Copy link
Member Author

joboet commented Mar 3, 2024

It's not mac-specific, though. Once gets used on all platforms except the futex ones, I believe even Windows. RwLock gets used on NetBSD and other UNIXes as well, and I'd like to use it to get rid of the unsound SGX lock implementations. So we need to find something portable here.

Also, this is a provenance-only problem, so I'd like to avoid anything that makes the machine code less efficient, because in terms of what is actually generated, the current implementation is perfectly bug-free. It "just" violates the memory model.

@the8472
Copy link
Member

the8472 commented Mar 3, 2024

And using the return value from the CAS (the Ok(_)) and working on that instead of state doesn't work either?

@joboet
Copy link
Member Author

joboet commented Mar 3, 2024

No, because the old pointer is stored inside the node, and we can't retroactively update the pointer, because a traversal operation could access the old one immediately after the compare_exchange.

@saethlin
Copy link
Member

saethlin commented Mar 3, 2024

to make miri/LLVM guess the right provenance.

I don't think LLVM implements expose semantics in any sense, so I suspect doing the expose dance here wouldn't change LLVM's view of the code. (Happy to be corrected on this)

@Noratrieb Noratrieb removed the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Mar 3, 2024
@joboet
Copy link
Member Author

joboet commented Mar 3, 2024

LLVM's pointer aliasing rules mention that the pointer resulting from inttoptr is based on "all pointer values that contribute (directly or indirectly) to the computation of the pointer’s value". IIUC, that means that because the traversing thread reads the current pointer from state and uses it to load the integer that is then passed to inttoptr, the resulting pointer will also be based on the current pointer, making the code sound.

@joboet joboet added the A-atomic Area: Atomics, barriers, and sync primitives label Mar 3, 2024
@RalfJung
Copy link
Member

RalfJung commented Mar 3, 2024

Also, this is a provenance-only problem, so I'd like to avoid anything that makes the machine code less efficient, because in terms of what is actually generated, the current implementation is perfectly bug-free. It "just" violates the memory model.

That is assuming that we don't have optimizations that make full use of the provenance model. Or have you inspected the generated assembly to be sure?

This looks extremely related to rust-lang/unsafe-code-guidelines#480 to me. I wonder if the CAS semantics I proposed here would help.

@joboet
Copy link
Member Author

joboet commented Mar 3, 2024

I inspected the assembly of a simplified version (I removed the Thread handle and park operations). Also, the Once implementation has existed for ~9 years, with no issue encountered.

@saethlin
Copy link
Member

saethlin commented Mar 3, 2024

The age of an implementation is not an indicator of its soundness with respect to provenance rules. If anything, perhaps the correlation runs the other direction.

I'm willing to review codegen tests for suspicious things like this, if someone can find a way to write them that's both effective and doesn't make them break often. I don't think we usually check in codegen tests for structures as complex as this.

@apiraino
Copy link
Contributor

apiraino commented Mar 5, 2024

WG-prioritization assigning priority (Zulip discussion).

@rustbot label -I-prioritize +P-medium

@rustbot rustbot added P-medium Medium priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Mar 5, 2024
@RalfJung
Copy link
Member

RalfJung commented Mar 5, 2024

The only solution I see at the moment is to use fuzzy provenance and use from_exposed_provenance inside the traversal operation to make miri/LLVM guess the right provenance.

Just recording here that another possible solution has been sketched, also spelled out here.

matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Aug 8, 2024
rwlock: disable 'frob' test in Miri on macOS

Due to rust-lang#121950, Miri will sometimes complain about this test on macOS. Better disable the test, as otherwise it can fail for unrelated PRs.

r? `@joboet`
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Aug 8, 2024
rwlock: disable 'frob' test in Miri on macOS

Due to rust-lang#121950, Miri will sometimes complain about this test on macOS. Better disable the test, as otherwise it can fail for unrelated PRs.

r? ``@joboet``
rust-timer added a commit to rust-lang-ci/rust that referenced this issue Aug 9, 2024
Rollup merge of rust-lang#128640 - RalfJung:rwlock-macos-miri, r=joboet

rwlock: disable 'frob' test in Miri on macOS

Due to rust-lang#121950, Miri will sometimes complain about this test on macOS. Better disable the test, as otherwise it can fail for unrelated PRs.

r? ``@joboet``
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Nov 19, 2024
ignore an occasionally-failing test in Miri

This is like rust-lang#128640, the test [sometimes](rust-lang#133189) fails due to rust-lang#121950.
rust-timer added a commit to rust-lang-ci/rust that referenced this issue Nov 19, 2024
Rollup merge of rust-lang#133200 - RalfJung:miri-rwlock-test, r=tgross35

ignore an occasionally-failing test in Miri

This is like rust-lang#128640, the test [sometimes](rust-lang#133189) fails due to rust-lang#121950.
github-actions bot pushed a commit to rust-lang/miri that referenced this issue Nov 20, 2024
ignore an occasionally-failing test in Miri

This is like rust-lang/rust#128640, the test [sometimes](rust-lang/rust#133189) fails due to rust-lang/rust#121950.
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Nov 25, 2024
…=tgross35

miri: disable test_downgrade_observe test on macOS

Due to rust-lang#121950, this test can fail on Miri. The test is also quite slow on Miri (taking more than 30s) due to the high iteration count (a total of 2000), so let's reduce that a little.

Fixes rust-lang#133421
jhpratt added a commit to jhpratt/rust that referenced this issue Nov 26, 2024
…=tgross35

miri: disable test_downgrade_observe test on macOS

Due to rust-lang#121950, this test can fail on Miri. The test is also quite slow on Miri (taking more than 30s) due to the high iteration count (a total of 2000), so let's reduce that a little.

Fixes rust-lang#133421
compiler-errors added a commit to compiler-errors/rust that referenced this issue Nov 26, 2024
…=tgross35

miri: disable test_downgrade_observe test on macOS

Due to rust-lang#121950, this test can fail on Miri. The test is also quite slow on Miri (taking more than 30s) due to the high iteration count (a total of 2000), so let's reduce that a little.

Fixes rust-lang#133421
rust-timer added a commit to rust-lang-ci/rust that referenced this issue Nov 27, 2024
Rollup merge of rust-lang#133435 - RalfJung:test_downgrade_observe, r=tgross35

miri: disable test_downgrade_observe test on macOS

Due to rust-lang#121950, this test can fail on Miri. The test is also quite slow on Miri (taking more than 30s) due to the high iteration count (a total of 2000), so let's reduce that a little.

Fixes rust-lang#133421
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-atomic Area: Atomics, barriers, and sync primitives I-unsound Issue: A soundness hole (worst kind of bug), see: https://en.wikipedia.org/wiki/Soundness P-medium Medium priority T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

7 participants