-
Notifications
You must be signed in to change notification settings - Fork 273
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Document movnt needs sfence #1457
Conversation
r? @Amanieu (rustbot has picked a reviewer for you, use r? to override) |
57c4239
to
1341b75
Compare
For every intrinsic that may generate any of the MOVNT family of instructions, specify it must be followed by `_mm_sfence`. Also, ask people to not think too hard about what actually happens with write-combining memory buffers. They probably don't want to know, and in terms of the Rust abstract machine, we aren't actually entirely sure yet.
1341b75
to
dc9228d
Compare
/// After using this intrinsic, but before any atomic operations occur, a call | ||
/// to `_mm_sfence()` must be performed. A safe function that includes unsafe | ||
/// usage of this intrinsic must always end in `_mm_sfence()`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a bit too strict. There are niche scenarios where one would write out data without ever reading it again or at least without reading it again on another thread.
In those cases some later release write would be incidental and only meant to order other, regular writes.
Maybe the entire requirement could be conditional on "if the written memory is intended to be made accessible on another thread through a release operation", with the recommendation "if in doubt, add a fence".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, I don't think it's coherent to leave an unsatisfied obligation hanging unless it's still in unsafe
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That depends on the perspective. Yes, it's an obligation to... restore consistency with the current rust memory model which assumes that all writes must be ordered with a release operation.
But under some unspecified extended model it may be valid to leave some memory locations unordered.
AIUI the purpose of ordering is to avoid data races which are UB. If the another thread never accesses the memory then this is unspecified behavior but not necessarily UB.
E.g. if you're writing bytes to a framebuffer that's concurrently being scanned out by the GPU then the fence doesn't add anything. You're not synchronizing with anything. You're just racing against time. Either the write makes out to the pixels or it doesn't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the conservative definition is fine for now. But it could be phrased in a way that makes it clear that it may be replaced with a more refined definition at some point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to tell people to always discharge the obligation (even in the case you mention, the store-store fence would serialize any remaining deferred write-buffers, which is desired if you might run out of time otherwise!) and relax things when we mechanize a better spec.
/// | ||
/// After using this intrinsic, but before any atomic operations occur, a call | ||
/// to `_mm_sfence()` must be performed. A safe function that includes unsafe | ||
/// usage of this intrinsic must always end in `_mm_sfence()`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are currently moving into a different direction in rust-lang/rust#114582: if we follow @talchas' approach, the rule will be: after using this intrinsic, but (happens-)before any other read or write of this location in any thread, a fence must occur (the fence must be in the same thread that called the intrinsic).
Basically, the write actually happens in another thread via a non-atomic store, so accessing this location may cause a data race. Doing a fence waits for that other thread to complete its store, avoiding the data race.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. I tried to keep the rationale for the requirement slightly vague as we discuss it, and approach it as more of a "do this or your program explodes into flames" type of warning, which is why it goes on to explain further reads or writes to the location are discouraged. So with that in mind, what is this missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer if we'd call that UB instead of merely "discouraged".
Ideally even further nontemporal writes (before the next sfence) would be UB... I'd find it strange to have a situation where a nontemporal store would be allowed but a regular store would not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened #1534 with my alternative wording.
/// Reading and writing to the memory stored-to by any other means, after any | ||
/// nontemporal store has been used to write to that memory, but before the | ||
/// use of `_mm_sfence()`, is discouraged. Such reads can lead to pipeline | ||
/// stalls and yet-unspecified program behavior. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm. Well, both x86's movntdq and Arm's stnp retain local-thread serial ordering if you use them twice on the same location, i.e. that using _mm_stream_ps(ptr, a); _mm_stream_ps(ptr, b);
still will yield b
when you later read it, and coalescing such writes is traditionally allowed by compilers, so I would prefer to retain the more ambiguous verbiage I used here, @RalfJung. Mostly, I said "any other means" to cast a wide net so that people avoid it rather than inviting people to try to reason about in a way that might end in them deciding it's suddenly okay to perform an atomic store or atomic load on the memory from another thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do agree that we should canonically allow _mm_stream_ps(ptr, a); _mm_store_ps(ptr, b); _mm_sfence();
eventually but I would prefer it live in an in-between state until we are clear on our mechanics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing is that the proposed doc just doesn't align with the proposed spec in rust-lang/rust#114582. If you go with @talchas' original proposed spec, the doc should be something like
"After the _mm_stream_ps
until the next time this thread calls _mm_sfence
, any attempt to access (read or write) this memory from this thread by means other than _mm*_stream*
operations is UB. Furthermore, any access from another threads must synchronize-with that _mm_sfence
or else any access (read or write) from the other thread is UB, including _mm*_stream*
operations.
Basically, the nontemporal writes performed by streaming operations should be considered not ordered even with other operations in the thread they appear in, except with other nontemporal writes, until the next sfence which establishes synchronization with all nontemporal writes of the current thread."
That would allow your example. Crucially it disallows _mm_stream_ps(ptr, a); *ptr = a; _mm_sfence();
and _mm_stream_ps(ptr, a); println!("{:?}", *ptr); _mm_sfence();
which is disallowed by the proposed spec. We should probably have examples in the docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, _mm_store_ps
is morally equivalent to assignment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh sorry, I misread your example and thought it said stream
twice.
If we make _mm_store_ps
also inline assembly we might be able to come up with a mechanism that allows it. But for regular assignment I'm quite worried that LLVM optimizations will make deductions from seeing an assignment that are incompatible with the nontemporal state of that location.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a myth that "you cannot read uninitialized memory" and I think it is harmful. I've encountered tons of confusion caused by people axiomatically thinking "I cannot read uninit memory". Instead we should teach people what actually happens in our spec: when reading memory at a certain type, that memory must be sufficiently "valid" for this type. If you read at bool
, it must be 0 or 1. If you read at i32
, it must be initialized. If you read at MaybeUninit<_>
, it can be anything. I think telling people they can't read from uninit memory is a complete distraction and leads to an unnecessarily complicated mental model. There's a reason that "reading from uninit memory" does not show up in this list.
But anyway, it seems unlikely we will get to an agreement on terminology or teaching philosophy here and that's all rather off-topic anyway. The on-topic question is whether we should allow people to perform regular writes in between the nontemporal write and the fence. I don't see a good motivation for doing that, and it opens some tricky questions that we'd need to carefully figure out before allowing it. For instance, if I do a nontemporal write and then a regular write, do I even still need the fence or is it now guaranteed that the next "release" operation synchronizes everything properly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ehnn I mean yes, just...
...Anyways I think, dreadfully, to actually answer your question about the write series (i.e.
movnti [somewhere], something
mov [somewhere], something
mov [flag], 1
), that on x86 you kiiiinda still need the fence, from the ISA's point of view?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. The local results should be consistent absolutely, but I am concerned about code that may look more like moving overlapping sizes, so
vmovntdq [somewhere], ymm09
mov [somewhere], r09
mov [flag], 1
The first 8 bytes are guaranteed once the flag is set due to the two regular movs participating in TSO, but the vmovntdq means 24 bytes are in an ambiguous state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do I even still need the fence or is it now guaranteed that the next "release" operation synchronizes everything properly?
If the answer to that is yes then I don't think we can allow regular writes. A regular write in Rust is guaranteed to be properly released by a release operation, after all.
We can have intrinsics that do "regular write to something that might be in the nontemporal state", but those need the same inline-assembly-and-Rust-replacement-code-spec treatment as streaming writes.
I'd really prefer if we didn't have to do this... the usecases we are aware of don't need this, do they?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, alright then, I didn't realize that was the specific way the sequencing invariants were formed. It's possible the reality may be amenable to regular writes (assuming a sort of "wholesale adoption of the x86 mechanics into Rust" approach for this case) but I'd have to examine the rules... very closely.
I don't think the majority of use cases need this in practice, correct, so now that we have hashed out that issue as existing, then I am happy to let this one go. I will update the documentation changes here accordingly.
Will address the comments after the conversation settles a bit. |
☔ The latest upstream changes (presumably c8ae80e) made this pull request unmergeable. Please resolve the merge conflicts. |
For every intrinsic that may generate any of the MOVNT family of instructions, specify it must be followed by
_mm_sfence
.Also, ask people to not think too hard about what actually happens with write-combining memory buffers. They probably don't want to know, and in terms of the Rust abstract machine, we aren't actually entirely sure yet.