This repo wants an FAQ / some usage advice #40

For example: Over time I've received a couple of bug reports from people who have started to use this facility and who are surprised to find that racy programs don't work - usually this is a flag being communicated racily through memory, with the reader looping until the flag change is noted. For better or worse, IonMonkey hoists unsynchronized TA reads out of loops while Crankshaft appears not to, leading to confusion when the racy program happens to work in Chrome but not in Firefox.

Comments
@titzer maybe Chrome should do that optimization :-)
That the program "works" in SpiderMonkey's interpreter but not once the JIT kicks in only increases the user's frustration, of course. For the latest incarnation, see https://bugzil.la/1237410.
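For concreteness, the racy pattern from those bug reports looks roughly like the sketch below (names invented for illustration; `sab` stands for a SharedArrayBuffer the worker received via postMessage):

```js
// Reader worker: spin until a writer sets flag[0] = 1.
// The read of flag[0] is not atomic, so nothing stops a JIT from
// hoisting it out of the loop, and the loop may then never observe
// the other agent's write and spin forever.
const flag = new Int32Array(sab);
while (flag[0] !== 1) {
  // busy-wait on an unsynchronized read
}
```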
I recommend quoting @hboehm and @dvyukov in these circumstances and in the FAQ: the FAQ should also explain why such a system is the right design. Basically, the user's code would utterly fail if it were compiled to C++. We're bringing a similar model to JavaScript, but making it more of a gotcha because the interpreter -> JIT transition optimizes races differently.

This is a performance feature, and we shouldn't neuter the optimizations we can perform just because it's not intuitive. We should instead offer higher-level primitives (such as a mutex) built on top of these low-level primitives, so developers can use the feature intuitively without understanding races, and, if they want to, code tweakers can get the best performance out of it by understanding the model.
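To make that layering concrete, here is a minimal sketch (not part of the proposal) of a mutex built on the atomics, written with the Atomics.wait / Atomics.notify names that eventually shipped (the draft under discussion called these futexWait / futexWake):

```js
// One Int32 cell of shared memory holds the lock state: 0 = unlocked, 1 = locked.
class Mutex {
  constructor(sab, index = 0) {
    this.cell = new Int32Array(sab);
    this.index = index;
  }
  lock() {
    // Atomically try to flip 0 -> 1; if another agent holds the lock,
    // sleep until it notifies us, then retry.
    while (Atomics.compareExchange(this.cell, this.index, 0, 1) !== 0) {
      Atomics.wait(this.cell, this.index, 1); // returns at once if cell is no longer 1
    }
  }
  unlock() {
    Atomics.store(this.cell, this.index, 0);  // seq_cst store releases the lock
    Atomics.notify(this.cell, this.index, 1); // wake at most one waiter
  }
}
```

(Browsers disallow Atomics.wait on the main thread, so a lock like this is usable between workers only.)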
The current definition of GetValueFromBuffer and SetValueInBuffer seems to me to say that this optimization is illegal. Specifically, the note on GetValueFromBuffer currently says:

"If IsSharedMemory( arrayBuffer ) is true then two consecutive calls to GetValueFromBuffer with the same arguments in the same agent may not return the same value even if there is no write to the buffer in that agent between the read calls: another agent may have written to the buffer. This restricts compiler optimizations as follows. If a program loads a value, and then uses the loaded value several places, an ECMAScript implementation must not re-load the value for any of the uses even if it can prove the agent does not overwrite the value in memory. It must also prove that no concurrent agent overwrites the value."

(from http://lars-t-hansen.github.io/ecmascript_sharedmem/shmem.html#StructuredData.ArrayBuffer.abstract.GetValueFromBuffer). The definition of SetValueInBuffer also seems to imply that writes to the sharedarray will happen (there is nothing in there that would suggest you could optimize them away, or defer them indefinitely).

The sample code from https://bugzilla.mozilla.org/show_bug.cgi?id=1237410 essentially writes to a sharedarray in a loop in one thread, whilst another thread reads the same sharedarray in a loop. Surely that would boil down to one thread calling SetValueInBuffer in a loop and the other calling GetValueFromBuffer in a loop (and therefore would observe the writes)? Am I missing something?
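A two-line illustration of the quoted note (assuming `ta` is an Int32Array view over a SharedArrayBuffer):

```js
const a = ta[0];
const b = ta[0]; // may legitimately differ from a: another agent can
                 // write between the two loads, so an engine must not
                 // assume a === b and fold them into one load
```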
I agree; the class of optimizations that are valid only in a single-threaded context should be voided. Namely, the use of a SharedArrayBuffer instead of a regular typed array is in and of itself the opt-in to the removal of optimizations that break multi-threaded correctness.
I recommend reading Common Compiler Optimisations are Invalid in the C11 Memory Model and what we can do about it. Yes, some optimizations are invalid in a multithreaded world, but not all of them are. Specifically: many forms of loop-invariant code motion are still valid on non-atomic code if there are provably no intervening atomic accesses or fences (put another way, if there's no happens-before in the loop). The examples @lars-t-hansen alludes to often fall into this category: the developer expects non-atomic accesses to synchronize-with another thread, and they most definitely do not.
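In other words, with no atomic access or fence inside the loop, an engine may legally rewrite the racy polling loop from earlier into something like this (a sketch, reusing the invented `flag` view):

```js
const v = flag[0];  // non-atomic load hoisted out of the loop
while (v !== 1) {}  // spins forever if the flag was not already set

// The well-defined version performs a (seq_cst) atomic read each iteration:
while (Atomics.load(flag, 0) !== 1) {}
```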
I was figuring optimizations that would either cache results without propagation to the memory system, or factor out duplicate reads. I just want my forward-progress guarantee. I also agree that if the optimizations are within the bounds of the weak memory model, they should be allowed.
Forward progress is definitely not guaranteed if code tries to synchronize […]. Forward progress should definitely be guaranteed if you use atomics.
If one is using SharedArrayBuffers, then wouldn't the results of all reads/writes be submitted to the other threads at some indefinite point in time, even if unsynchronized? It's not a normal typed array; it's a typed array specified to be shared among workers. I get that the timeframe is undefined, but it shouldn't be infinite.
Basically, if results are cached at some point, I'd figure the compiler would find the most appropriate time, when scheduling instructions, to submit them to the memory system; that time is itself undefined, but it exists.
That's not the memory model offered by SAB, because doing so would require […]. The SAB model is similar to the C++ one. Right now it only supports seq_cst, […]
Oh, I'm not asking for ordering of reads and writes; I'm just asking for them to eventually happen at some undefined time in the future if not done atomically. The lockless ring-buffer FIFO queue I wrote in JS works with reads/writes being out of order with respect to the other thread and not occurring instantly. It just requires that non-atomic buffer writes be eventually propagated at some undefined point in the future.
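For reference, such a queue can be written against the model as specified by making the head/tail index accesses atomic; a minimal single-producer / single-consumer sketch with an invented memory layout:

```js
// Layout (invented for this sketch): idx[0] = head (next write),
// idx[1] = tail (next read), data = the ring storage.
// sab must be at least 8 + 4 * N bytes; indices grow monotonically
// (the sketch ignores 32-bit wraparound).
const N = 1024;
const idx  = new Int32Array(sab, 0, 2);
const data = new Int32Array(sab, 8, N);

function push(v) {                        // producer agent only
  const head = Atomics.load(idx, 0);
  if (head - Atomics.load(idx, 1) === N) return false; // full
  data[head % N] = v;                     // plain write, published by...
  Atomics.store(idx, 0, head + 1);        // ...this seq_cst index store
  return true;
}

function pop() {                          // consumer agent only
  const tail = Atomics.load(idx, 1);
  if (Atomics.load(idx, 0) === tail) return undefined; // empty
  const v = data[tail % N];               // plain read, ordered after the
  Atomics.store(idx, 1, tail + 1);        // atomic load of head above
  return v;
}
```

The difference from the all-non-atomic version is that the seq_cst index accesses create the happens-before edges that make the plain data accesses well-defined.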
That would be easier to do but would still be surprising for developers […]. It's also not clear what you'd win with what you want, versus what SAB […]
Given the definition of GetValueFromBuffer from the SAB specification ([…]):

while(1){ […]

and

var y=x[0]; […]

where x is a Shared Array. To me, the specification seems to say that they […]. In step 4, if IsSharedMemory([…]) […]. What is the definition of the "[[SharedArrayBufferData]] internal slot" in […]?
Hardware memory does not work how you are assuming.
I agree with @titzer, and would like to also address this:
It's indeed heavy, but the proofs aren't even something that compiler implementors need to fully understand, only the result of the proof (I find the result intuitive, though tricky). Developers using the memory model don't need to understand any of this! You can if you want to, but to use the memory model (not optimize it) you need to understand synchronization through seq_cst (and eventually acquire / release). Or even better, don't use those primitives, and instead use a mutex or other higher-level libraries implemented using those primitives.

You're trying to do something the memory model disallows, and you're frustrated it won't do what you think it should, without wanting to understand why it's designed that way. There are two solutions: understand why it's designed that way (read that paper and others, such as Mark Batty's thesis), or accept that your usage is invalid. If it's preventing you from getting full performance then we have more work to do, but right now SAB only supports seq_cst, so that's definitely the case.

An analogy: you don't need to understand how a car works to drive one. Are you trying to design a car (implement the memory model), or drive one (use SAB)? You can't be frustrated if the car won't drive in ways it's not designed to!
I guess lazily waiting for non-atomic memory updates in a separate thread is going to be a no-go, then, since the optimizations can tamper with the intended reads and writes. I'm still interested in knowing whether an Atomics.fence on the reader side would cause writes on the writer side to be published to the reader side.
There's currently no support for fences. If / when SAB supports fences, you'll still need to cause inter-thread synchronization by using a fence both on the reader and on the writer side. Note that using […]
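With what SAB does support (seq_cst atomics), the reliable way to publish plain writes from one agent to another is to pair the data with an atomic flag; a sketch with invented views (`flag` and `data` are Int32Array views over the same SharedArrayBuffer):

```js
// Writer agent: plain stores, then one seq_cst store to publish them.
data[0] = 42;
data[1] = 17;
Atomics.store(flag, 0, 1);

// Reader agent: an agent that observes the flag atomically is
// guaranteed to also observe the plain writes made before it.
if (Atomics.load(flag, 0) === 1) {
  console.log(data[0], data[1]); // 42, 17
}
```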
My assumption was that writes will eventually propagate between threads […]

#include <stdio.h>
[…]
void foo(void *y_void){ […]
int main(void){ […]

I compiled this with: gcc -m32 -g -O0 a.c -lpthread (Ubuntu 12.04, GCC 4.6.3 on a Core 2 Duo […]). Note I compiled with -O0, as otherwise the optimising compiler would optimise the while(1){ […] into a single test followed by an infinite loop (very much like the […]).

When I ran the program it printed the following to my terminal (the empty […]):

Iteration 1
Iteration 2
Iteration 3
[…]

which shows the writes from function foo were propagating to the main […]. Looking at the disassembly (objdump --prefix-addresses -S -M intel -d a.out):

void foo(void *y_void){ […]

You can see that foo is writing to the memory location y[0] every iteration […]. The polling while loop from the main function:

[…] }

turns into:

0804859a <main+0x9c> nop
[…]

so it repeatedly loads x[0] and checks if it is 1.

I did consider that synchronisation was happening elsewhere (maybe in […]):

#include <stdio.h>
[…]
void foo(void *y_void){ […]
int main(void){ […]

This program ran to completion, implying that the writes from foo were […]. Admittedly, I don't fully understand all the nuances of the x86 memory model […]
You're testing one compiler configuration, compiling one program, on a single version of an x86 machine. This isn't proof that hardware works that way. In fact, systems with weaker memory models do not offer such guarantees. We're designing a memory model which will work with common hardware implementations, not just x86. We can't "just ship x86" and call it a day.

What are you trying to get, and why is the current memory model not sufficient? Most developers should use higher-level abstractions such as a mutex; a few should use atomics, but they need to do quite a bit of work to get it right.
@jfbastien I get that now. You want guarantees that it works on ALL systems, not just the conventional ones usually intended for the general populace. I wrote some test code that relied on guarantees from the OS scheduler and common platform details. You are correct that there's added correctness some of us are leaving out. I was personally hell-bent on assuming that consumer-grade configurations would be the baseline.
Gotcha, that's worth adding to the FAQ as one of the design constraints, with references supporting design choices (including some of the links I provided). Otherwise the repo just says "this is how things are" without enough "and this is why it's this way".
@cosinusoidally Just say no to spinloops. Let the API spinloop behind the scenes for you, but do not write your own, as it's extremely bad practice. It's an anti-pattern against the OS scheduler, it's "room heater" code, and whether it works or not depends on the memory model of the system it's running on. GCC optimizing out your spinloop isn't a bug, it's a very deliberate feature; take it as a warning sign against such patterns.
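A sketch of what letting the API do the waiting looks like, using the Atomics.wait / Atomics.notify names that eventually shipped (futexWait / futexWake in the draft under discussion):

```js
// Reader: instead of a hand-rolled spinloop on flag[0], block in the API.
while (Atomics.load(flag, 0) !== 1) {
  Atomics.wait(flag, 0, 0); // sleep while flag[0] is still 0
}

// Writer: set the flag, then wake any sleeping waiters.
Atomics.store(flag, 0, 1);
Atomics.notify(flag, 0);
```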
Unfortunately this is not guaranteed by current memory models, which are […]. Hardware has limited cache space, causing it to eventually flush to lines […]
@titzer yeah, the goal of sharedmem is to target all systems, not just ones that automagically sync at the end of the day. I see that now.
I'm just trying to wrap my head around some of the implications of the […]

Yep, that's one of the things that does not sit right with me.
IIUC you're hitting a problem similar to one C++ also has, and which the standards committee is actively trying to address: http://wg21.link/p0019r0

You're trying to have epochs where an array is set up in a single-threaded manner, and then accessed in a read-only manner from multiple threads. The paper describes a slightly different use case, but the fundamental concept, epochs, is the same. In C++ today this is feasible through relaxed accesses when writing the array from a single thread, and fences to mark the epochs. In your use case you could also use relaxed accesses while the data is read-only, because that's not racy either.

The current SAB memory model doesn't have relaxed accesses, nor does it have fences. Things can instead work out if you access the array with non-atomic accesses, but hold a single-writer multiple-reader mutex between epochs (or perform a lower-level acquire / release between the writer and all readers). I think this approach is cleaner anyways :-)
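A minimal sketch of that single-writer / multiple-reader idea on top of SAB's seq_cst atomics (names and layout invented; a production version would block with Atomics.wait rather than spin):

```js
// state[0]: 0 = free, n > 0 = n active readers, -1 = writer holds the lock.
const state = new Int32Array(sab);

function lockShared() {                     // readers, during an epoch
  for (;;) {
    const n = Atomics.load(state, 0);
    // only join as a reader when no writer holds the lock
    if (n >= 0 && Atomics.compareExchange(state, 0, n, n + 1) === n) return;
  }
}
function unlockShared() { Atomics.sub(state, 0, 1); }

function lockExclusive() {                  // the single writer, between epochs
  while (Atomics.compareExchange(state, 0, 0, -1) !== 0) {} // spin until free
}
function unlockExclusive() { Atomics.store(state, 0, 0); }
```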
I updated the FAQ with more context from this discussion and linked to this issue for papers and background, so let's close this. Happy to accept answers to more FAQs, but let's phrase them as PRs against the FAQ document.