
Make std::sync::Arc compatible with ThreadSanitizer #65097

Merged: 1 commit merged into rust-lang:master from tmiasko:arc on Mar 21, 2020

Conversation

tmiasko
Contributor

@tmiasko tmiasko commented Oct 4, 2019

The memory fences previously used in the Arc implementation are not properly
understood by ThreadSanitizer as synchronization primitives. This had the
unfortunate effect that running any non-trivial program compiled with
-Z sanitizer=thread would result in numerous false positives.

Replace acquire fences with acquire loads to address the issue.

Fixes #39608.
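
For illustration, a minimal sketch of the problematic pattern (a hypothetical flag/data pair, not code from Arc itself): this program is correctly synchronized through the fence pair, yet ThreadSanitizer, which does not model fences, reports the accesses to `data` as a race.

use std::cell::UnsafeCell;
use std::sync::atomic::{fence, AtomicBool, Ordering::{Acquire, Relaxed, Release}};
use std::thread;

struct Shared {
    data: UnsafeCell<u32>,
    ready: AtomicBool,
}
unsafe impl Sync for Shared {}

static SHARED: Shared = Shared { data: UnsafeCell::new(0), ready: AtomicBool::new(false) };

fn main() {
    let writer = thread::spawn(|| {
        unsafe { *SHARED.data.get() = 42 }; // plain write, published below
        fence(Release);                     // pairs with the acquire fence in the reader
        SHARED.ready.store(true, Relaxed);  // relaxed store, upgraded by the fence
    });
    while !SHARED.ready.load(Relaxed) {}    // spin until the flag is observed
    fence(Acquire);                         // happens-before with the writer is established
    assert_eq!(unsafe { *SHARED.data.get() }, 42);
    writer.join().unwrap();
}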

@rust-highfive
Collaborator

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @joshtriplett (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Oct 4, 2019
@jonas-schievink jonas-schievink added the A-sanitizers Area: Sanitizers for correctness and code quality. label Oct 4, 2019
@Centril
Contributor

Centril commented Oct 4, 2019

r? @RalfJung (take your time)

@RalfJung
Member

RalfJung commented Oct 5, 2019

Cc @jhjourdan, who did the formal verification of the current Arc

@RalfJung
Member

RalfJung commented Oct 5, 2019

I don't think I know the correctness argument of Arc well enough to review this on my own -- I have never tried to convince myself of its correctness. Some of my colleagues did, though, and I am trying to get them to help.

Do we have anyone else who is enough of an expert in weak memory to help? IIRC @aturon wrote most of this, but they are busy with other things these days.

@RalfJung
Member

RalfJung commented Oct 5, 2019

Looking at the changes, this seems to mostly undo deliberate optimizations that specifically keep synchronization to the absolute minimum necessary. I am not sure if "sanitizers do not properly support fences" is a good enough argument to change our code -- shouldn't sanitizers be fixed instead? Fences are an important part of Rust's and C's concurrency model.

I feel before we do a correctness review, we need a policy decision by some team (T-libs, I assume) whether we are willing to reduce code quality in order to help sanitizers.

@RalfJung RalfJung added I-nominated T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Oct 5, 2019
@tmiasko
Contributor Author

tmiasko commented Oct 5, 2019

AFAIK there are no known algorithmic approaches for supporting memory barriers
with practical runtime overhead, so presuming that ThreadSanitizer should be
fixed is not a solution. Other approaches might be; for example, conditional
compilation would be another thing to consider (if that is more agreeable?).

The changes here have no effect on tier 1 targets, since the generated code on
i686 / x86_64 is unaffected. On aarch64, the assembly changes roughly reflect
those made in the code, i.e., dmb ishld is removed and ldxr is replaced with ldaxr.
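
For reference, a sketch (illustrative functions, not the PR's actual code) of the two decrement flavors behind this comparison; the comments reflect the aarch64 lowering described above:

use std::sync::atomic::{AtomicUsize, Ordering::{AcqRel, Release}};

pub fn dec_release(count: &AtomicUsize) -> usize {
    // aarch64: LL/SC loop using ldxr (no acquire semantics on the load)
    count.fetch_sub(1, Release)
}

pub fn dec_acqrel(count: &AtomicUsize) -> usize {
    // aarch64: same loop, but the load becomes ldaxr (load-acquire)
    count.fetch_sub(1, AcqRel)
}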

I also mentioned other implementations (libc++ / libstdc++) as an implicit
reflection of their views on the trade-offs involved. Personally, I find
ThreadSanitizer invaluable, and if the only thing needed to make it work
correctly is to change a standalone memory fence into an operation on a
concrete memory location, the decision to me is clear. This is especially true
here, since the issue is limited to Arc.

@RalfJung
Member

RalfJung commented Oct 5, 2019

The changes here have no effect on tier 1 targets, since the generated code on
i686 / x86_64 is unaffected.

Did you test that? I know that lowering of atomic operations to x86 assembly is the same, but this change will affect compiler transformations even on x86. So there can still easily be differences.

@tmiasko
Contributor Author

tmiasko commented Oct 5, 2019

I extracted the code from Arc (before and after the changes) and examined the
assembly. On i686 / x86_64 the difference was the disappearance of the
#MEMBARRIER compiler marker after the changes, which is actually promising.

@RalfJung
Member

RalfJung commented Oct 9, 2019

AFAIK there are no known algorithmic approaches for supporting memory barriers
with practical runtime overhead.

Interesting. I thought I knew that these fences are equivalent to doing a release-acquire RMW on a global location, which would be trivial to check algorithmically (assuming a release/acquire checker is already implemented), but I may misremember.

I extracted the code from Arc (before and after the changes) and examined the
assembly. On i686 / x86_64 the difference was the disappearance of the
#MEMBARRIER compiler marker after the changes, which is actually promising.

Thanks, that is useful.


So, the nominated question here for @rust-lang/libs is: we have a trade-off between (a) keeping the existing, highly optimized, well-reviewed, and even formally analyzed code, which, however, at least current dynamic thread sanitizers cannot handle properly; and (b) changing this code to be less efficient in theory due to stronger synchronization, but ultimately simpler for the same reason (and with likely no perf change in practice) and better suited to thread sanitizers, at the cost of changing extremely subtle, very well-tested code (any change has some risk of introducing a bug) and losing the existing formal results. What do you think we should do? I am asking you because I don't think I should make such calls.

My personal feeling is: I am a big fan of sanitizers, so it seems worth sacrificing some entirely theoretical performance benefit of the current code for better testability. However, losing the existing formal analyses is an unfortunate regression. That said, comparing the assembly gives some assurance that at least for simple clients, the behavior did not change.

@Amanieu
Member

Amanieu commented Oct 9, 2019

I would like to benchmark this on ARM first, since that platform seems to be affected. The main issue here is that we will have to unconditionally emit an acquire fence even if the ref count isn't dropping to zero.

@RalfJung RalfJung added S-waiting-on-team Status: Awaiting decision from the relevant subteam (see the T-<team> label). and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Oct 10, 2019
@Amanieu
Member

Amanieu commented Oct 10, 2019

I ran a quick benchmark comparison of fetch_sub(Release) and fetch_sub(AcqRel) but couldn't find a difference in performance (they both report a consistent 20 ns). I did check the assembly: Release uses ldxr while AcqRel uses ldaxr, and that is the only difference.

In theory ldaxr is slower than ldxr since it acts as an acquire barrier which prevents loads after it from being executed before it in an out-of-order CPU. I guess this isn't visible in this benchmark since there are no other loads.
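
The shape of such a micro-benchmark might look as follows (an illustrative sketch, not Amanieu's actual harness; names and iteration counts are made up):

use std::hint::black_box;
use std::sync::atomic::{AtomicUsize, Ordering::{self, AcqRel, Release}};
use std::time::Instant;

const ITERS: u32 = 10_000_000;

fn bench(label: &str, order: Ordering) {
    let counter = AtomicUsize::new(usize::MAX);
    let start = Instant::now();
    for _ in 0..ITERS {
        // black_box consumes the result so the loop body isn't simplified away
        black_box(counter.fetch_sub(1, order));
    }
    println!("{label}: {:?}/op", start.elapsed() / ITERS);
}

fn main() {
    bench("fetch_sub(Release)", Release);
    bench("fetch_sub(AcqRel)", AcqRel);
}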

@joshtriplett
Member

joshtriplett commented Oct 12, 2019 via email

@alexcrichton
Member

The comments in the code indicate where all this logic originally came from (Boost), and its history also shows that this is an extremely performance-sensitive operation (see #41714 for example). Benchmarking this change to evaluate its performance impact would be quite difficult, but leaning on users who have previously benchmarked various changes here (such as Servo/Gecko) is a good start.

I don't think that the correctness of the current code is in question; this looks like a change intended to make Arc compatible with ThreadSanitizer. Given that it may be a regression to an operation that is quite performance-sensitive, I would personally prefer not to merge this PR as-is until sufficient data is gathered showing that this isn't a performance regression. If it is indeed a performance regression, or if no one wishes to gather that data, then landing this would require some form of conditional compilation. That is also somewhat hard to do, though, since this is already quite tricky code to read, and adding a conditional path for the rarely used thread sanitizer isn't necessarily great.

@JohnTitor
Member

Ping from triage: @rust-lang/libs @tmiasko any updates on this?

@pitdicker
Contributor

pitdicker commented Oct 30, 2019

To summarize, this PR changes a few cases of:

fn drop(&mut self) {
    if self.inner().strong.fetch_sub(1, Release) != 1 {
        return; // other strong references remain
    }
    atomic::fence(Acquire); // synchronize with all earlier Release decrements
    /* drop contents of Arc */
}

to:

fn drop(&mut self) {
    // AcqRel makes every decrement acquiring, even on the hot path
    if self.inner().strong.fetch_sub(1, AcqRel) != 1 {
        return;
    }
    /* drop contents of Arc */
}

This places an extra acquire synchronization requirement in what may be hot code (cloning and dropping Arc references should be cheap). Before, the acquire was in the colder path that drops the final reference of the Arc and drops the contents.

If avoiding fences is a goal, something like #41714 seems like the much better option to me. I.e. change the same code to:

fn drop(&mut self) {
    if self.inner().strong.fetch_sub(1, Release) != 1 {
        return;
    }
    let _ = self.inner().strong.load(Acquire); // acquire on the cold path only
    /* drop contents of Arc */
}

This keeps the hot path the same, and it adds at most an extra mov in the colder path.

I would actually feel better about an acquire load instead of a fence, because of the trickiness of fences (Atomic-fence synchronization). Using a fence just to optimize out one instruction in a colder path seems like a questionable optimization, especially because the processor knows just as well as we do that there are no other references to this atomic: it can't have been changed in the meantime, and it should still be in a register or cache. The mov should basically be free.

@pitdicker
Contributor

Interesting. I thought I knew that these fences are equivalent to doing a release-acquire RMW on a global location, which would be trivial to check algorithmically (assuming a release/acquire checker is already implemented), but I may misremember.

According to Atomic-fence synchronization, a fence binds to a nearby atomic operation.

atomic_a.load(Relaxed);     // the acquire fence below binds to this load
atomic_b.store(1, Relaxed);
fence(Acquire);
atomic_b.load(Relaxed);

An acquire fence works together with the atomic that last did a load operation; in the example, atomic_a.load(Relaxed). In theory it only has to synchronize data with the other threads that did a release on atomic_a.

In the same vein, fence(Release) works together with the store to an atomic right after it.

The reordering rules for fences are carefully worded not to talk about the fences themselves, but about the previous read or the next write (Acquire and Release Fences Don't Work the Way You'd Expect).

In all cases a fence needs operations on some atomic to bind to; fence-fence synchronization without atomics does not seem to be a thing.

@RalfJung
Member

RalfJung commented Nov 3, 2019

In all cases a fence needs operations on some atomic to bind to; fence-fence synchronization without atomics does not seem to be a thing.

Indeed, synchronization always arises from a reads-from edge. Fences can just "upgrade" relaxed reads/writes to still incur synchronization.

@JohnTitor
Member

Ping from triage: @tmiasko and @rust-lang/libs what do you think about the above comments?

@alexcrichton
Member

I think my previous comment about a performance investigation still stands.

@tmiasko
Contributor Author

tmiasko commented Nov 21, 2019

The approaches suggested so far (including the current implementation) are:

  1. Use an acquire fence (currently used in std::sync::Arc).
  2. Use an acquire load instead of an acquire fence (currently used in servo_arc::Arc, for example).
  3. Use acq-rel ordering in fetch_sub, removing the fence.
  4. Conditional compilation. I am not particularly fond of this approach, but
    it could be limited to a conditional definition of a single constant,
    i.e., the ordering used with fetch_sub, while leaving the memory fence in
    place (see the sketch below).
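
A sketch of what option 4 could look like (hypothetical names; assumes the nightly cfg(sanitize = "thread") flag behind the cfg_sanitize feature gate):

#![feature(cfg_sanitize)] // nightly: gates cfg(sanitize = "...")

use std::sync::atomic::{fence, AtomicUsize, Ordering::{self, AcqRel, Acquire, Release}};

// Under TSan, make the decrement itself acquiring so the tool observes
// the synchronization; otherwise keep the cheaper Release decrement.
#[cfg(sanitize = "thread")]
const DEC_ORDERING: Ordering = AcqRel;
#[cfg(not(sanitize = "thread"))]
const DEC_ORDERING: Ordering = Release;

pub fn release_ref(strong: &AtomicUsize) -> bool {
    if strong.fetch_sub(1, DEC_ORDERING) != 1 {
        return false; // other strong references remain
    }
    fence(Acquire); // left in place, as described above
    true // caller may now drop the contents
}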

I performed micro-benchmarks on x86-64 (https://github.com/tmiasko/arc):

  • 1 and 3 generate equivalent code, so there is nothing interesting to see there.
  • When only the fast path is involved (clone and drop; the refcount never reaches zero), the results of 1, 2, and 3 are indistinguishable, as the generated code is equivalent in that part.
  • When only the slow path is involved (create and immediately drop the Arc), the results are:
    • 30 ns for 1 and 3
    • 39 ns for 2 (additional mov instructions)

@pitdicker I agree that the second approach would be preferable to the third
one from the perspective of architectures with weaker memory models, while
still being compatible with TSan.

@alexcrichton I looked briefly at the Servo benchmarks (test-dromaeo,
test-perf), but as far as I can see the benchmarks are quite noisy, and
std::sync::Arc and servo_arc::Arc account for too little of the runtime, so
any changes like this are well within measurement error.

Unfortunately, neither of those approaches is a Pareto improvement over the
current state of affairs, so there is a trade-off involved.
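
For context, the rough shape of the fast-path and slow-path measurements described above (an illustrative sketch, not the code from the linked repository; note that the slow path here also includes allocation, which dominates its cost):

use std::hint::black_box;
use std::sync::Arc;
use std::time::Instant;

const ITERS: u32 = 10_000_000;

fn main() {
    // Fast path: clone and drop; the refcount never reaches zero.
    let keep = Arc::new(0u64);
    let start = Instant::now();
    for _ in 0..ITERS {
        drop(black_box(Arc::clone(&keep)));
    }
    println!("fast path: {:?}/op", start.elapsed() / ITERS);

    // Slow path: create and immediately drop; the refcount reaches zero
    // every iteration, so the final-decrement synchronization is exercised.
    let start = Instant::now();
    for _ in 0..ITERS {
        drop(black_box(Arc::new(0u64)));
    }
    println!("slow path: {:?}/op", start.elapsed() / ITERS);
}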

@alexcrichton
Member

Unfortunately, it's really difficult to measure the performance here, and I think that focusing on only one architecture may also miss the purpose of these atomics.

I also believe that (2) as you proposed is an incorrect solution, because I believe that an acquire load only synchronizes with one release store, and the reason we use a fence is that we want to synchronize with all of the release stores, not just one.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-team Status: Awaiting decision from the relevant subteam (see the T-<team> label). labels Mar 20, 2020
@joshtriplett
Member

joshtriplett commented Mar 20, 2020 via email

@tmiasko
Contributor Author

tmiasko commented Mar 20, 2020

Note that the synchronization used here is not really stronger, so there is no
risk of that kind, except in the trivially true sense that with ThreadSanitizer
you are de facto executing completely different machine code, with additional
calls into the runtime for each atomic operation, etc.

Furthermore, if you are actually trying to debug issues in Arc / Weak,
ThreadSanitizer has been of no use so far. With the changes proposed here, on
the other hand, it is able to detect previous bugs when they are reintroduced.

@RalfJung
Member

One could actually argue synchronization is weaker, because an acquire fence syncs with all release writes/fences, while an acquire load only syncs with release writes to the same location.
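
A hedged illustration of that distinction, with hypothetical flags rather than Arc's counters:

use std::sync::atomic::{fence, AtomicBool, Ordering::{Acquire, Relaxed}};

// After both relaxed loads observe `true`, a single acquire fence
// upgrades *both* of them: it synchronizes with the release stores to
// `a` and to `b`, whichever threads performed them.
fn wait_with_fence(a: &AtomicBool, b: &AtomicBool) {
    while !(a.load(Relaxed) && b.load(Relaxed)) {}
    fence(Acquire);
}

// An acquire load synchronizes only with release stores to its own
// location, so each flag needs its own acquire load here.
fn wait_with_loads(a: &AtomicBool, b: &AtomicBool) {
    while !a.load(Acquire) {}
    while !b.load(Acquire) {}
}

fn main() {
    let (a, b) = (AtomicBool::new(true), AtomicBool::new(true));
    wait_with_fence(&a, &b);
    wait_with_loads(&a, &b);
}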

Dylan-DPC-zz pushed a commit to Dylan-DPC-zz/rust that referenced this pull request Mar 20, 2020
Make std::sync::Arc compatible with ThreadSanitizer

The memory fences used previously in Arc implementation are not properly
understood by thread sanitizer as synchronization primitives. This had
unfortunate effect where running any non-trivial program compiled with
`-Z sanitizer=thread` would result in numerous false positives.

Replace acquire fences with acquire loads to address the issue.

Fixes rust-lang#39608.
Dylan-DPC-zz pushed a commit to Dylan-DPC-zz/rust that referenced this pull request Mar 21, 2020
Make std::sync::Arc compatible with ThreadSanitizer
Centril added a commit to Centril/rust that referenced this pull request Mar 21, 2020
Make std::sync::Arc compatible with ThreadSanitizer
bors added a commit to rust-lang-ci/rust that referenced this pull request Mar 21, 2020
Rollup of 16 pull requests

Successful merges:

 - rust-lang#65097 (Make std::sync::Arc compatible with ThreadSanitizer)
 - rust-lang#69033 (Use generator resume arguments in the async/await lowering)
 - rust-lang#69997 (add `Option::{zip,zip_with}` methods under "option_zip" gate)
 - rust-lang#70038 (Remove the call that makes miri fail)
 - rust-lang#70058 (can_begin_literal_maybe_minus: `true` on `"-"? lit` NTs.)
 - rust-lang#70111 (BTreeMap: remove shared root)
 - rust-lang#70139 (add delay_span_bug to TransmuteSizeDiff, just to be sure)
 - rust-lang#70165 (Remove the erase regions MIR transform)
 - rust-lang#70166 (Derive PartialEq, Eq and Hash for RangeInclusive)
 - rust-lang#70176 (Add tests for rust-lang#58319 and rust-lang#65131)
 - rust-lang#70177 (Fix oudated comment for NamedRegionMap)
 - rust-lang#70184 (expand_include: set `.directory` to dir of included file.)
 - rust-lang#70187 (more clippy fixes)
 - rust-lang#70188 (Clean up E0439 explanation)
 - rust-lang#70189 (Abi::is_signed: assert that we are a Scalar)
 - rust-lang#70194 (#[must_use] on split_off())

Failed merges:

r? @ghost
@bors
Contributor

bors commented Mar 21, 2020

☔ The latest upstream changes (presumably #70205) made this pull request unmergeable. Please resolve the merge conflicts.

@bors bors added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Mar 21, 2020
@bors bors merged commit 4b91729 into rust-lang:master Mar 21, 2020
@tmiasko tmiasko deleted the arc branch March 21, 2020 07:54
@choller
Contributor

choller commented Apr 9, 2020

Thank you @tmiasko for working on this. This should get us one step closer to finding races between C++ and Rust code in Firefox.

And FWIW, we have taken a similar measure, replacing a fence in our implementation with a load for TSan:

https://searchfox.org/mozilla-central/rev/4e228dc5f594340d35da7453829ad9f3c3cb8b58/mfbt/RefCounted.h#142-149
