Add LLVM atomic memcpy intrinsics, expose in core/std #58599

Open
joshlf opened this issue Feb 20, 2019 · 36 comments
Labels
  • A-atomic (Area: Atomics, barriers, and sync primitives)
  • A-LLVM (Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.)
  • C-feature-request (Category: A feature request, i.e: not implemented / a PR.)
  • T-lang (Relevant to the language team, which will review and decide on the PR/issue.)
  • T-libs-api (Relevant to the library API team, which will review and decide on the PR/issue.)

Comments

@joshlf
Contributor

joshlf commented Feb 20, 2019

Expose LLVM's "element wise atomic memory intrinsics". In particular:

  • Expose llvm.memcpy.element.unordered.atomic as atomic_element_unordered_copy_memory_nonoverlapping
  • Expose llvm.memmove.element.unordered.atomic as atomic_element_unordered_copy_memory
  • Expose llvm.memset.element.unordered.atomic as atomic_element_unordered_set_memory

Expose these through functions in the std::ptr module. Each function would be implemented by the corresponding intrinsic, or by a loop of relaxed atomic operations if the intrinsic is not available on the target platform; see the fallback sketch after the list below. (TODO: Given this platform-specific behavior, can this also be exposed in core::ptr?)

  • copy_nonoverlapping_atomic_unordered, backed by atomic_element_unordered_copy_memory_nonoverlapping
  • copy_atomic_unordered, backed by atomic_element_unordered_copy_memory
  • write_atomic_unordered, backed by atomic_element_unordered_set_memory
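
As a concrete illustration, here is a minimal sketch of the relaxed-atomic fallback for the first of these (the function name is from the list above; the byte-element signature is my assumption, not a settled design):

```rust
use core::sync::atomic::{AtomicU8, Ordering};

/// Sketch of the fallback: copy `count` bytes with one relaxed atomic load
/// and one relaxed atomic store per byte, so no byte is ever torn.
pub unsafe fn copy_nonoverlapping_atomic_unordered(
    src: *const u8,
    dst: *mut u8,
    count: usize,
) {
    for i in 0..count {
        let byte = (*src.add(i).cast::<AtomicU8>()).load(Ordering::Relaxed);
        (*dst.add(i).cast::<AtomicU8>()).store(byte, Ordering::Relaxed);
    }
}
```

On targets where the LLVM intrinsic is available, the same signature would instead lower to llvm.memcpy.element.unordered.atomic.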

Previously discussed on the internals forum here.

Folks with the authority to approve this: Let me know if this is OK to move forward; I'd like to post it to This Week in Rust's CFP.

@estebank estebank added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Feb 20, 2019
@Centril Centril added T-lang Relevant to the language team, which will review and decide on the PR/issue. C-feature-request Category: A feature request, i.e: not implemented / a PR. labels Feb 20, 2019
@joshlf
Contributor Author

joshlf commented Feb 22, 2019

@alexcrichton Do you have thoughts on this? Want to make sure I get the go-ahead before I send somebody on TWiR's CFP down a dead end.

@alexcrichton
Member

I suspect there'll be opinions on stabilization, but seems fine to add as unstable at least to me!

@joshlf
Contributor Author

joshlf commented Feb 22, 2019

OK cool, thanks!

@tmccombs
Contributor

tmccombs commented Mar 2, 2019

I'd like to take a stab at implementing this.

@joshlf
Contributor Author

joshlf commented Mar 2, 2019

Awesome, thanks @tmccombs ! This will unblock a lot of important work :)

@joshlf
Contributor Author

joshlf commented Mar 2, 2019

@alexcrichton On the subject of exposing in core::ptr: Is the ban on platform-specific logic in core a general one, or just for API-visible elements? If it's the latter, then we should be able to implement the facade in core::ptr as well.

@alexcrichton
Member

Currently it's largely just a "we strongly discourage platform-specific APIs" policy in libcore; there are already platform-specific pieces in the implementation, especially some math-related pieces on MSVC.

@tarcieri
Contributor

I saw there was some discussion around use cases on #59155, so I thought I'd bring up mine (and perhaps you can all cross-check me that this would be the proper solution).

I maintain a crate for zeroizing memory (for e.g. wiping cryptographic keys/plaintexts) which presently relies on undefined behavior - namely mixed volatile / non-volatile memory accesses (in conjunction with compiler and hardware fences):

https://github.com/iqlusioninc/crates/tree/develop/zeroize

I'm presently (ab)using core::ptr::write_volatile (and relying on undefined LLVM behavior / implementation details) in conjunction with core::sync::atomic::fence.

What I'd really like is an atomic memset-style operation, i.e. llvm.memset.element.unordered.atomic, which can write repeated copies of a value atomically. The closest thing that presently exists is volatile_set_memory, however that comes with undefined behavior when volatile and non-volatile operations are mixed.
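
For concreteness, the pattern looks roughly like this (a simplified sketch, not the actual zeroize implementation):

```rust
use core::ptr;
use core::sync::atomic::{compiler_fence, fence, Ordering};

/// Simplified sketch of the volatile-write-plus-fences zeroing pattern.
fn zeroize(buf: &mut [u8]) {
    for byte in buf.iter_mut() {
        // Volatile store: the compiler must not elide this "dead" write.
        unsafe { ptr::write_volatile(byte, 0) };
    }
    // Fences to keep the writes from being reordered past later code.
    compiler_fence(Ordering::SeqCst);
    fence(Ordering::SeqCst);
}
```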

@RalfJung
Member

undefined behavior when volatile and non-volatile operations are mixed.

I don't know why you think that mixing volatile and non-volatile accesses is UB. I see no issue with that per se.

That said, atomic memset might still be useful.

@tarcieri
Contributor

It is explicitly documented as UB:

https://doc.rust-lang.org/core/ptr/fn.write_volatile.html

Memory accessed with read_volatile or write_volatile should not be accessed with non-volatile operations.

Here is a thread on the subject of using volatile writes for zeroing sensitive memory:

https://internals.rust-lang.org/t/volatile-and-sensitive-memory/3188

@RalfJung
Member

RalfJung commented May 20, 2019

That sentence does not say "UB", nor does it say "must not". The section on when this function is UB does not mention mixed accesses.

AFAIK, it mostly refers to "normal" usage of volatile, namely interaction with IO registers: you don't want to mix volatile and non-volatile accesses on those. But I do not know of any form of UB there. @rkruppe, do you?

EDIT: Opened #60972

@RalfJung
Member

RalfJung commented May 20, 2019

Here is a thread on the subject of using volatile writes for zeroing sensitive memory:

Note that that thread does not reflect our current understanding of volatile accesses and the issues around that. See rust-lang/unsafe-code-guidelines#33 for the latest discussion. I'd love to amend that thread but it got auto-closed :(

@tarcieri
Contributor

tarcieri commented May 20, 2019

@RalfJung interesting (and also lengthy)! I'll have to read through it later.

Discrepancies between volatile and atomic aside, there's still no stable API for volatile memset either (only the unstable core::intrinsics::volatile_set_memory). It'd be nice if there were something along the lines of core::ptr::set_volatile which provides a stable wrapper around this intrinsic.
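
Something like this is what I have in mind (a sketch of the hypothetical wrapper; the byte loop merely stands in for the intrinsic, which a real implementation would call directly):

```rust
use core::{mem, ptr};

/// Hypothetical core::ptr::set_volatile: fill `count` values' worth of
/// bytes at `dst` with `val`, using volatile stores.
pub unsafe fn set_volatile<T>(dst: *mut T, val: u8, count: usize) {
    let p = dst.cast::<u8>();
    for i in 0..count * mem::size_of::<T>() {
        ptr::write_volatile(p.add(i), val);
    }
}
```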

Centril added a commit to Centril/rust that referenced this issue May 20, 2019
remove confusing remarks about mixed volatile and non-volatile accesses

These comments were originally added by @ecstatic-morse in rust-lang@911d35f and then later edited by me. The intention, I think, was to make sure people do both their reads and writes with these methods if the affected memory really is used for communication with external devices.

However, [people read this as saying that mixed volatile/non-volatile accesses are UB](rust-lang#58599 (comment)), which -- to my knowledge -- they are not. So better remove this.

Cc @rkruppe @rust-lang/wg-unsafe-code-guidelines
@htfy96

htfy96 commented Jun 2, 2019

This is also particularly useful to implement a seqlock. Currently all implementations read unprotected data on the reader side with or without volatile, which definitely depends on LLVM's implementation to synchronize with previous load-acquire semantics.

More details can be read at this C++ proposal.
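
For reference, a sketch of the reader side of a seqlock, with the data read that an atomic element-wise memcpy would replace (simplified; read_volatile stands in for the racy read):

```rust
use std::cell::UnsafeCell;
use std::ptr;
use std::sync::atomic::{fence, AtomicUsize, Ordering};

struct SeqLock<T: Copy> {
    seq: AtomicUsize, // even = stable, odd = writer in progress
    data: UnsafeCell<T>,
}

impl<T: Copy> SeqLock<T> {
    fn read(&self) -> T {
        loop {
            let s1 = self.seq.load(Ordering::Acquire);
            if s1 & 1 == 1 {
                continue; // a writer is mid-update; retry
            }
            // The load that wants to be an atomic memcpy. Today it is a racy
            // read, whose defined-ness rests on LLVM's treatment of races.
            let value = unsafe { ptr::read_volatile(self.data.get()) };
            fence(Ordering::Acquire);
            if self.seq.load(Ordering::Relaxed) == s1 {
                return value; // sequence unchanged: the read was consistent
            }
        }
    }
}
```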

@RalfJung
Member

RalfJung commented Jun 3, 2019

This is also particularly useful to implement a seqlock. Currently all implementations read unprotected data on the reader side with or without volatile, which definitely depends on LLVM's implementation to synchronize with previous load-acquire semantics.

It depends on LLVM semantics ensuring that read-write races are not UB, and instead yield undef/poison on the read side. But that is something LLVM guarantees (but Rust does not, neither do C/C++). I don't see any connection to release-acquire synchronization.

More details can be read at this C++ proposal.

That's interesting, thanks! I am surprised though that they don't even mention the LLVM semantics, which are an alternative approach to solve the issue with SeqLocks. Seems like relevant related work.

@hsivonen
Member

hsivonen commented Jun 4, 2019

I suspect there'll be opinions on stabilization, but seems fine to add as unstable at least to me!

Considering how much focus there is on Wasm, it would be great for LLVM "unordered" loads and stores to have a path to stabilization in order to have stable-Rust facilities for implementing host services for multi-threaded Wasm. (Since Wasm code is untrusted, if Wasm thread A calls into a host service and designates a range of Wasm heap for reading/writing, the Rust code implementing the host service can't be optimized with the assumption of freedom from data races in that memory region, because Wasm thread B could access the region without adhering to the Rust/C11/C++11 rules.)

@HadrienG2

HadrienG2 commented Jun 4, 2019

Wait a second.

Actually, unordered loads and stores do not resolve your problem if you have no way to guarantee that the adversary code interacts with memory using unordered operations. As far as I can tell from the LLVM manual, unordered loads from memory that is written with non-unordered stores are still UB.

It could be argued that no sane compiler optimization or known hardware behavior could ever turn this UB into something dangerous in practice. But well... UB is UB, and we never know what the future holds.

So if you want to strictly avoid UB, the correct thing to do in your case is to tweak the Wasm compiler so that all reads and writes to the Wasm heap that may be used to interact with the host are atomic. They may be unordered or relaxed; it doesn't really matter from a UB-correctness point of view (though it may matter from a performance point of view). But they must be atomic.

@jstarks

jstarks commented Jun 14, 2019

Without these intrinsics, is there a supported way to read or write shared memory that is being concurrently and arbitrarily mutated by another processor? If the other processor is executing code in a different security boundary, then there may be no way to prevent it from mutating the memory with an arbitrary mix of atomic, non-atomic, and unaligned accesses. But clearly this works at the processor level, because this is the foundation for most OS and virtualization communication models.

Right now, for example, Google's crosvm VMM uses an (arbitrary?) mix of ptr::{write_volatile, copy, write_bytes} while accessing memory that is shared between its process and a VM. It seems to me that at least some of this is UB today, and therefore future LLVM optimizations could break this code, but I wouldn't know how to fix this with what Rust offers today (especially while retaining good performance).

The requested intrinsics would seem to be what crosvm, and software like it, needs.

@RalfJung
Member

RalfJung commented Jun 14, 2019

Without these intrinsics, is there a supported way to read or write shared memory that is being concurrently and arbitrarily mutated by another processor?

Yes, use atomic accesses.

A data race of an atomic and a non-atomic access in one program is UB if at least one of them is a write; but if another separately compiled program races with your atomic accesses, the only side that has to expect problems is that side. In some way, the data race is UB "only on the non-atomic side", which is not a distinction that usually makes any sense because UB is a whole-program property, but it does if one side is encapsulated and compiled separately.

Mind you, this is informal. You are outside of the scope of any formal model I am aware of.
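
A minimal sketch of that suggestion, assuming the shared buffer is reachable through a raw pointer handed over by the shared-memory setup:

```rust
use std::sync::atomic::{AtomicU8, Ordering};

/// Copy out of memory shared with an untrusted party. On our side, every
/// access is atomic, so any data race is "on the other side".
unsafe fn read_shared(shared: *const AtomicU8, len: usize) -> Vec<u8> {
    (0..len)
        .map(|i| (*shared.add(i)).load(Ordering::Relaxed))
        .collect()
}
```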

@hsivonen
Member

Without these intrinsics, is there a supported way to read or write shared memory that is being concurrently and arbitrarily mutated by another processor?

Yes, use atomic accesses.

Over in the SharedArrayBuffer issue that motivates my comments here, I was told by a SpiderMonkey developer that C++11/C11 atomics are not a valid answer.

In some way, the data race is UB "only on the non-atomic side", which is not a distinction that usually makes any sense because UB is a whole-program property, but it does if one side is encapsulated and compiled separately.

Is there advice for programmers of virtual machines for how to use atomics safely with a non-conforming execution on the other processor such that LLVM devs agree that the advice may be safely relied upon?

@RalfJung
Member

The point seems to be

gcc is quite explicit about this in its documentation, saying that its __atomic intrinsics may not execute fences if gcc thinks they're not required.

Indeed compilers are allowed to optimize atomic accesses to non-atomic accesses if they can prove that that does not introduce a race or otherwise change behavior.

The problem we are having here is that the C/C++ memory model is inherently a "whole-program spec". The question we are asking though only makes any sense when considering "separate compilation". We want some kind of modular notion of correctness when linked with arbitrary code -- and worse, with the additional constraint that we have run-time mechanisms in place that ensure that the arbitrary code does not cause UB.

However, I maintain my claim that using atomic instructions will work for this. GCC can only remove synchronization when it knows all accesses done to a location. In case of a VM, I don't know how memory gets allocated and handed to the program inside the VM, but surely it happens in a way that the compiler must consider the pointer "escaped"? Even though the spec does not consider separate compilation, that's what compilers do in practice -- they have to compile a compilation unit in a way that works fine with any conforming other compilation unit. Hence they cannot remove fences from atomic accesses to locations that have "escaped".

But I will admit that this is all based on a somewhat idealized understanding of the VM and compilers, the details of which I am not extremely familiar with. I just cannot imagine a way to implement a compiler that supports separate compilation of concurrent code, such that using atomics for this use-case does not have the intended semantics.

@RalfJung
Member

One issue with these intrinsics is that they use LLVM's "unordered" atomic memory ordering, which Rust currently does not have, and I have some reservations to adding it.

Are there "relaxed" versions of these intrinsics that we could use instead?

@joshlf
Contributor Author

joshlf commented Oct 12, 2019

Are there "relaxed" versions of these intrinsics that we could use instead?

Not sure. Anecdotally, I benchmarked a loop of relaxed atomic loads (loading from a slice of AtomicU8s using the Relaxed ordering), and found it to be ~10x slower than a loop of unsynchronized loads (just accessing the elements of a slice of u8). I haven't benchmarked the unordered atomic memcpy intrinsics, but I suspect they'll be significantly faster.
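
Roughly, the two loops I compared (reconstructed from memory; the benchmark harness is omitted):

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// ~10x slower in my measurements: one relaxed atomic load per byte.
fn sum_atomic(src: &[AtomicU8]) -> u64 {
    src.iter().map(|b| u64::from(b.load(Ordering::Relaxed))).sum()
}

// The unsynchronized baseline, which LLVM can vectorize freely.
fn sum_plain(src: &[u8]) -> u64 {
    src.iter().map(|&b| u64::from(b)).sum()
}
```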

For my use case in particular, speed is paramount, which is why I'm interested in these intrinsics. That said, I completely understand your concern with introducing "unordered" into our memory model. Perhaps a middle-road path would be to add these as unstable with a big warning in their doc comments saying "we will probably never stabilize these because of memory model issues"? That way we could still use them for the time being while we let the various discussions about the semantics of inter-process shared memory play out, and hopefully switch to the "official" solution once such a solution exists.

@RalfJung
Member

Not sure. Anecdotally, I benchmarked a loop of relaxed atomic loads (loading from a slice of AtomicU8s using the Relaxed ordering), and found it to be ~10x slower than a loop of unsynchronized loads (just accessing the elements of a slice of u8). I haven't benchmarked the unordered atomic memcpy intrinsics, but I suspect they'll be significantly faster.

My current theory for why relaxed is so much slower is that LLVM is really bad at optimizing it.

add these as unstable

I am not opposed to any unstable experimentation. I am okay with giving people enough rope to hang themselves, at least on nightly. ;)
I am surprised, though, that this would be enough for you to actually use it.

@HadrienG2

HadrienG2 commented Oct 12, 2019

@joshlf The cause of this particular problem is known. LLVM's optimizer is currently unable to fuse together neighboring relaxed atomic loads and stores, so a loop of AtomicU8 operations will translate into actual byte operations in hardware assembly, which is stupidly slow.

According to @comex, that's an optimizer bug, and the relaxed atomic semantics should actually allow such code to be optimized into e.g. an efficient rep movsq or SSE2 memcpy loop on x86_64.

@RalfJung
Member

RalfJung commented Oct 12, 2019

inter-process shared memory play

Note that relaxed vs unordered has no effect on inter-process shared memory (between distrusting parties). General consensus is (to my knowledge) still that this does and will require volatile -- specifically, "atomic volatile", which LLVM has but Rust does not expose.

@joshlf
Contributor Author

joshlf commented Oct 12, 2019

inter-process shared memory play

Note that relaxed vs unordered has no effect on inter-process shared memory (between distrusting parties). General consensus is (to my knowledge) still that this does and will require volatile -- specifically, "atomic volatile", which LLVM has but Rust does not expose.

IIUC from the discussion in rust-lang/unsafe-code-guidelines#152, normal volatile may be sufficient if combined with a freeze operation (which doesn't yet exist, of course)?

@RalfJung
Member

IIUC from the discussion in rust-lang/unsafe-code-guidelines#152, normal volatile may be sufficient if combined with a freeze operation (which doesn't yet exist, of course)?

That is true if we adapt LLVM's handling of read-write races. So, we would be deviating from C++ concurrency there as well. This is slightly better studied than LLVM's handling of write-write races, but that doesn't say much (as the latter is entirely unstudied to my knowledge).

@steveklabnik
Member

Triage: I don't believe there have been any changes here.

@m-ou-se
Member

m-ou-se commented Jan 7, 2022

One issue with these intrinsics is that they use LLVM's "unordered" atomic memory ordering, which Rust currently does not have, and I have some reservations to adding it.

WG21's P1478R6 might be interesting. It proposes adding to C++ an atomic_load_per_byte_memcpy and an atomic_store_per_byte_memcpy which, unlike the LLVM intrinsics, both take a memory ordering as an argument. The only valid options are Relaxed and Acquire/Release. It comes down to an unordered read followed by (or a write preceded by) an atomic fence, which is exactly what you want in cases like a seqlock. We could use that design in Rust as well, to avoid the need for a new memory ordering.
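
In Rust terms, the load side might look something like this (a sketch: the name mirrors the C++ proposal, and the per-byte relaxed loop stands in for whatever a real implementation would emit):

```rust
use std::sync::atomic::{fence, AtomicU8, Ordering};

/// Sketch of a P1478R6-style API: per-byte atomic reads, then a fence with
/// the caller-supplied ordering (Relaxed or Acquire on the load side).
unsafe fn atomic_load_per_byte_memcpy(
    dst: *mut u8,
    src: *const AtomicU8,
    count: usize,
    order: Ordering,
) {
    for i in 0..count {
        // Racy with a concurrent writer, but never torn at byte granularity.
        *dst.add(i) = (*src.add(i)).load(Ordering::Relaxed);
    }
    if order != Ordering::Relaxed {
        fence(order); // e.g. Acquire, pairing with the writer's Release
    }
}
```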

@RalfJung
Member

RalfJung commented Jan 7, 2022

Yeah that sounds like a great solution. :)

@m-ou-se
Member

m-ou-se commented Jan 7, 2022

Cool. Time to add those intrinsics and add this API as unstable to play around with, then. :)

Not sure if I'll have time for that soon. So if anyone else has, please do! ^^

@m-ou-se
Member

m-ou-se commented Jan 8, 2022

@Amanieu mentions here (#59155 (comment)):

LLVM's element-wise atomic operations actually generate very poor code in practice since they are always lowered to a call to the intrinsic. You can get much better codegen by using volatile reads and writes instead, which have similar (but stricter) guarantees.

Does that mean we could provide these atomic_load_per_byte_memcpy and atomic_store_per_byte_memcpy but implement them without using those LLVM intrinsics, and instead use volatile operations since we know LLVM makes the guarantees we need on them? We'd still add a fence and take a memory ordering argument, to make sure it all fits in the existing memory model, but the implementation would make use of some extra guarantees LLVM makes.

@RalfJung
Member

RalfJung commented Jan 8, 2022

instead use volatile operations since we know LLVM makes the guarantees we need on them?

How do we know that?
I don't think LLVM makes guarantees about volatile accesses that race with atomic accesses.

@RalfJung
Member

RalfJung commented Jan 8, 2022

LLVM's element-wise atomic operations actually generate very poor code in practice since they are always lowered to a call to the intrinsic.

That sounds like something that is fixable in LLVM? (And if these operations are really adopted by C/C++ and used for seqlocks, they will want to do that anyway.)

@workingjubilee workingjubilee added the A-atomic Area: Atomics, barriers, and sync primitives label Aug 17, 2022
@joshlf
Contributor Author

joshlf commented Oct 14, 2022

Cool. Time to add those intrinsics and add this API as unstable to play around with, then. :)

@m-ou-se, could you clarify which intrinsics and what API surface you'd like to see added here? I'd like to post this issue on This Week in Rust's CFP section to see if we can find someone to implement it.
