
This repo wants an FAQ / some usage advice #40

Closed
lars-t-hansen opened this issue Jan 7, 2016 · 28 comments


@lars-t-hansen
Collaborator

For example: Over time I've received a couple of bug reports from people who have started to use this facility and who are surprised to find that racy programs don't work. Usually this is a flag being communicated racily through memory, with the reader looping until the flag change is noticed. For better or worse, IonMonkey hoists unsynchronized TA reads out of loops while Crankshaft appears not to, leading to confusion when the racy program happens to work in Chrome but not in Firefox.
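In outline, the pattern looks something like this (a minimal sketch with illustrative names, not code from an actual report):

// worker.js -- spins on a plain, unsynchronized read of shared memory
var flag = new Int32Array(sab);   // sab: SharedArrayBuffer received via postMessage
while (flag[0] === 0) {
  // IonMonkey may hoist the unsynchronized read out of the loop, so this can
  // spin forever even after the other agent has stored flag[0] = 1
}

// main.js -- the racy "signal": a plain store with no synchronization
flag[0] = 1;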

@jfbastien
Contributor

@titzer maybe Chrome should do that optimization :-)

@lars-t-hansen
Collaborator Author

That the program "works" in SpiderMonkey's interpreter but not once the JIT kicks in only increases the user's frustration, of course. For the latest incarnation, see https://bugzil.la/1237410.

@jfbastien
Contributor

I recommend quoting @hboehm and @dvyukov in these circumstances and in the FAQ:
https://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
http://hboehm.info/boehm-hotpar11.pdf

The FAQ should also explain why such a system is the right design. Basically, the user's code would utterly fail if it were compiled to C++. We're bringing a similar model to JavaScript, but making it more of a gotcha because the interpreter -> JIT transition optimizes races differently.

This is a performance feature, and we shouldn't neuter the optimizations we can perform just because the result isn't intuitive. We should instead offer higher-level primitives (such as mutexes) built on top of these low-level primitives, so developers can use the feature intuitively without understanding races, and code-tweakers who want the best performance can get it by understanding the model.
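As an illustration, a mutex along these lines could be built from nothing but the proposal's seq_cst primitives (a minimal sketch with illustrative names, not a proposed API; i32 is an Int32Array on a SharedArrayBuffer, and the cell at index holds 0 = unlocked, 1 = locked):

function Mutex(i32, index) {
  this.i32 = i32;
  this.index = index;
}
Mutex.prototype.lock = function () {
  // seq_cst read-modify-write: loop until we flip the cell from 0 to 1
  while (Atomics.compareExchange(this.i32, this.index, 0, 1) !== 0) {
    // a real implementation would block (futex-style) here instead of spinning
  }
};
Mutex.prototype.unlock = function () {
  Atomics.store(this.i32, this.index, 0);  // seq_cst store releases the lock
};

Developers who use such a lock never have to reason about races on the data it protects.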

@cosinusoidally

The current definition of GetValueFromBuffer and SetValueInBuffer seems to me to say that this optimization is illegal. Specifically the note on GetValueFromBuffer currently says:

"If IsSharedMemory( arrayBuffer ) is true then two consecutive calls to GetValueFromBuffer with the same arguments in the same agent may not return the same value even if there is no write to the buffer in that agent between the read calls: another agent may have written to the buffer. This restricts compiler optimizations as follows. If a program loads a value, and then uses the loaded value several places, an ECMAScript implementation must not re-load the value for any of the uses even if it can prove the agent does not overwrite the value in memory. It must also prove that no concurrent agent overwrites the value."

(from http://lars-t-hansen.github.io/ecmascript_sharedmem/shmem.html#StructuredData.ArrayBuffer.abstract.GetValueFromBuffer)

The definition of SetValueInBuffer also seems to imply that writes to the shared array will happen (there is nothing in there that would suggest you could optimize them away, or defer them indefinitely).

The sample code from https://bugzilla.mozilla.org/show_bug.cgi?id=1237410 essentially writes to a sharedarray in a loop in one thread, whilst another thread reads the same sharedarray in a loop. Surely that would boil down to one thread calling SetValueInBuffer in a loop and the other calling GetValueFromBuffer in a loop (and therefore would observe the writes)?

Am I missing something?

@taisel

taisel commented Jan 7, 2016

I agree: the class of optimizations that are only valid in a single-threaded context should be voided, namely because using a SharedArrayBuffer instead of a regular typed array is, in and of itself, the opt-in to dropping those optimizations in favour of multi-threaded correctness.

@jfbastien
Contributor

I recommend reading Common Compiler Optimisations are Invalid in the C11 Memory Model and what we can do about it. Yes, some optimizations are invalid in a multithreaded world, but not all of them are.

Specifically:

  • Accesses not marked as atomic can be optimized quite a bit. The above paper documents where this fails, but most optimizations are still valid.
  • Accesses that are marked as atomic can and should still be optimized by the compiler.

Many forms of loop-invariant code motion are still valid on non-atomic code if there are provably no intervening atomic accesses or fences (put another way, if there's no happens-before edge in the loop). The examples @lars-t-hansen alludes to often fall into this category: the developer expects non-atomic accesses to synchronize-with another thread, and they most definitely do not.
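To make that concrete, here is an illustrative (hedged) example of a hoist that remains legal when the loop body contains no atomic access or fence:

// as written (ta is a non-atomic view on shared memory):
//   while (ta[0] !== 1) { doWork(); }
// may legally be compiled as if it were:
//   var t = ta[0];                    // single hoisted load
//   while (t !== 1) { doWork(); }     // never re-reads, never sees another agent's write
// whereas writing the condition as Atomics.load(ta, 0) !== 1 forbids the hoist.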

@taisel

taisel commented Jan 8, 2016

I was figuring optimizations that would either cache results without propagation to the memory system, or factor out duplicate reads. I just want my forward progression guarantee. I also agree if the optimizations are within the bounds of the weak memory model, they should be allowed.

@jfbastien
Contributor

I was figuring optimizations that would either cache results without
propagation to the memory system, or factor out duplicate reads. I just
want my forward progression guarantee. I also agree if the optimizations
are within the bounds of the weak memory model, they should be allowed.

Forward progress is definitely not guaranteed if code tries to synchronize
without using atomics. If that's what you want then you're in for
disappointment.

Forward progress should definitely be guaranteed if you use atomics
properly. There are separate bugs for this (#5 and #28). It's not well
defined in C++ either, but the committee is working on fixing this.

@taisel

taisel commented Jan 8, 2016

If one is using SharedArrayBuffers, then wouldn't the results of all reads and writes be submitted to the other threads at some indefinite point in time, even if unsynchronized? It's not a normal typed array, it's a typed array specified to be shared among workers. I get that the timeframe is undefined, but it shouldn't be infinite.

@taisel

taisel commented Jan 8, 2016

Basically, if results are cached at some point, I'd figure the compiler would find the most appropriate time while scheduling instructions to submit them to the memory system; that time is in and of itself undefined, but it exists.

@jfbastien
Contributor

If one is using SharedArrayBuffers, then wouldn't the results of all reads and writes be submitted to the other threads at some indefinite point in time, even if unsynchronized? It's not a normal typed array, it's a typed array specified to be shared among workers. I get that the timeframe is undefined, but it shouldn't be infinite.

That's not the memory model offered by SAB, because doing so would require fences before every read if we want to enforce ordering on weak memory model systems.

The SAB model is similar to the C++ one. Right now it only supports seq_cst, but the intent (mine anyway) is to at least support acquire and release.

@taisel

taisel commented Jan 8, 2016

Oh, I'm not for ordering reads and writes, I'm just for them eventually happening at some undefined time in the future if not done atomically. The lockless ring-buffer FIFO queue I wrote in JS works with reads and writes being out of order with respect to the other thread and not taking effect instantly. It just requires that non-atomic buffer writes eventually be propagated at some undefined point in the future.

@jfbastien
Contributor

Oh, I'm not for ordering reads and writes, I'm just for them eventually
happening at some undefined time in the future if not done atomically.

That would be easier to do but would still be surprising for developers other than you, and lead to missed optimization opportunities. Check out the "benign races" papers I linked to above: you think you don't want ordering, but you probably do.

It's also not clear what you'd gain with what you want versus what SAB does: SAB lets you express exactly to the compiler and CPU what you want. If you don't want to understand atomics (and that should be most developers) then you should use higher-level primitives built on top of SAB, such as a mutex. I'd appreciate details on this.
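For instance, a single-producer / single-consumer queue can say exactly where the synchronization matters by touching only the head and tail indices with seq_cst atomics (a hedged sketch with illustrative names, not the proposal's API; meta and data are Int32Array views on the same SharedArrayBuffer, meta[0] = head, meta[1] = tail):

var SIZE = 1024;  // illustrative capacity

function push(meta, data, value) {              // producer only
  var head = Atomics.load(meta, 0);
  var tail = Atomics.load(meta, 1);
  if ((head + 1) % SIZE === tail) return false; // full
  data[head] = value;                           // plain write to the slot
  Atomics.store(meta, 0, (head + 1) % SIZE);    // seq_cst store publishes the slot
  return true;
}

function pop(meta, data) {                      // consumer only
  var head = Atomics.load(meta, 0);
  var tail = Atomics.load(meta, 1);
  if (head === tail) return undefined;          // empty
  var value = data[tail];                       // plain read, ordered after the head load
  Atomics.store(meta, 1, (tail + 1) % SIZE);    // free the slot
  return value;
}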

@cosinusoidally

On 7 January 2016 at 23:34, JF Bastien notifications@github.com wrote:

I recommend reading Common Compiler Optimisations are Invalid in the C11
Memory Model and what we can do about it
http://www.di.ens.fr/%7Ezappa/readings/c11comp.pdf. Yes some
optimizations are invalid in a multithreaded world, but not all of them are.

That's pretty heavy going stuff. Without investing significant amounts of
time learning Coq, I have little chance of understanding the bulk of that
paper. Having said that, the SAB specification is not the C11
specification. The SAB specification may draw from the C11 specification,
but we should be able to understand the behaviour and permitted
optimization allowed by the SAB specification purely in terms of the SAB
specification.

Given the definition of GetValueFromBuffer from the SAB specification (
http://lars-t-hansen.github.io/ecmascript_sharedmem/shmem.html#StructuredData.ArrayBuffer.abstract.GetValueFromBuffer)
are the following code fragments equivalent?

while(1){
if(x[0]===1){break;}
}

and

var y=x[0];
if(y!==1){while(1){}}

Here x is a shared typed array. To me, the specification seems to say that they are not: each buffer access x[0] should become a call to GetValueFromBuffer. From the definition of GetValueFromBuffer:

In step 4, if IsSharedMemory
http://lars-t-hansen.github.io/ecmascript_sharedmem/shmem.html#StructuredData.SharedArrayBuffer.abstract.IsSharedMemory(
arrayBuffer ) is true then use the value of arrayBuffer's
[[SharedArrayBufferData]] internal slot.

What is the definition of the "[[SharedArrayBufferData]] internal slot" in terms of the SAB spec? I read it to mean a memory cell in a location shared between 2 or more threads. Is that correct? If so, are implementations permitted to cache those reads? If not, then implementations should not be able to hoist the read out of the loop. Implementations would therefore need to poll the memory cell and so, in turn, would observe writes to the cell in a cache-coherent multiprocessor system.

@titzer

titzer commented Jan 8, 2016

On Fri, Jan 8, 2016 at 2:05 PM, Liam Wilson notifications@github.com
wrote:

[...] Given the definition of GetValueFromBuffer from the SAB specification, are the following code fragments equivalent? [...]

The definition in this section needs some work. It says that the two
consecutive reads "may not" return the same value, which is poorly worded,
since "may not" could be construed as prescriptive. The given optimization
example is also wrong; it prohibits introducing loads, not eliminating
them. I am aware of memory models that prohibit introducing loads, but not
of ones that prohibit eliminating them.

[...] Are implementations permitted to cache those reads? If not, then implementations should not be able to hoist the read out of the loop. Implementations would therefore need to poll the memory cell and so, in turn, would observe writes to the cell in a cache coherent multiprocessor system.

If this is the behavior that you want, you are going to disallow load
elimination and further will require hardware fences for every read and
every write. This is going to be prohibitively expensive, even when the
locations are not contended and there is no sharing.

Hardware memory does not work how you are assuming.



@jfbastien
Contributor

I agree with @titzer, and would like to also address this:

That's pretty heavy going stuff. Without investing significant amounts of
time learning Coq, I have little chance of understanding the bulk of that
paper. Having said that, the SAB specification is not the C11
specification. The SAB specification may draw from the C11 specification,
but we should be able to understand the behaviour and permitted
optimization allowed by the SAB specification purely in terms of the SAB
specification.

It's indeed heavy, but the proofs aren't even something that compiler implementors need to fully understand, only the result of the proof (I find the result intuitive, though tricky).

Developers using the memory model don't need to understand any of this! You can if you want to, but to use the memory model (not to optimize under it) you need to understand synchronization through seq_cst (and eventually acquire / release). Or even better, don't use those primitives directly and instead use a mutex or other higher-level libraries implemented with them. You're trying to do something the memory model disallows, and you're frustrated that it won't do what you think it should, without wanting to understand why it's designed that way. There are two solutions: understand why it's designed that way (read that paper and others such as Mark Batty's thesis), or accept that your usage is invalid. If the model is preventing you from getting full performance then we have more work to do; right now SAB only supports seq_cst, so that's definitely the case.

An analogy: you don't need to understand how a car works to drive one. Are you trying to design a car (implement the memory model), or drive one (use SAB)? You can't be frustrated if the car won't drive in ways it's not designed to!

@taisel

taisel commented Jan 8, 2016

I guess lazily and non-atomically waiting for memory updates in a separate thread is a no-go, then, since optimizations could tamper with the intended reads and writes.

I'm still interested in knowing if an Atomics.fence on the reader side will cause writes on the writer side to be published to the reader side.

@jfbastien
Contributor

I guess lazily and non-atomically waiting for memory updates in a separate thread is a no-go, then, since optimizations could tamper with the intended reads and writes.

I'm still interested in knowing if an Atomics.fence on the reader side will cause writes on the writer side to be published to the reader side.

There's currently no support for Atomics.fence in SAB. This will only be useful once something like relaxed ordering is added. @hboehm has an interesting presentation on that topic.

If / when SAB supports fences, you'll still need to cause inter-thread synchronization by using a fence both on the reader and on the writer side. Note that using seq_cst load and store will do exactly that! It'll make the write visible to readers.
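Concretely, a hedged sketch of that flag hand-off (illustrative names; flag and data are Int32Array views on the same SharedArrayBuffer):

// writer agent:
data[0] = 42;                       // plain write, published by the store below
Atomics.store(flag, 0, 1);          // seq_cst store

// reader agent:
while (Atomics.load(flag, 0) === 0) {
  // the seq_cst load cannot be hoisted, so the loop will observe the store
}
var v = data[0];                    // sees 42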

@cosinusoidally

On 8 January 2016 at 14:00, titzer notifications@github.com wrote:

If this is the behavior that you want, you are going to disallow load
elimination and further will require hardware fences for every read and
every write. This is going to be prohibitively expensive, even when the
locations are not contended and there is no sharing.

It's not necessarily the behaviour I want. I understand that the
implications are not good from the point of view of an optimising compiler.
As you say, the specification could do with some wording improvements.

Hardware memory does not work how you are assuming.

My assumption was that writes will eventually propagate between threads
without necessarily needing explicit synchronisation. I wrote the following
program to test my assumption:

#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>

void *foo(void *y_void){
    char *y = (char *)y_void;
    unsigned int j;
    while(1){
        for(j = 0; j < 1000000000; j++){
            /* empty delay loop */
        }
        y[0] = 1;
    }
    return NULL;
}

int main(void){
    char *x;
    x = malloc(1);
    x[0] = 7;
    pthread_t other_thread;
    if(pthread_create(&other_thread, NULL, foo, x)) {
        printf("Thread creation failed\n");
        return 1;
    }
    int i = 1;
    while(1){
        x[0] = 7;
        printf("\nIteration %d\n", i);
        printf("x[0] at start %d\n", x[0]);
        while(1){                      /* poll until foo's write is observed */
            if(x[0] == 1) break;
        }
        printf("x[0] at finish %d\n", x[0]);
        i++;
    }
    return 0;
}

I compiled this with:

gcc -m32 -g -O0 a.c -lpthread (Ubuntu 12.04, GCC 4.6.3 on a Core 2 Duo machine)

Note I compiled at -O0, as otherwise the optimising compiler would optimise the following memory-polling loop

while(1){
if(x[0]==1)break;
};

into a single test followed by an infinite loop (very much like the
JavaScript example I gave in my previous post)

When I ran the program it printed the following to my terminal (the empty
loop in foo serves as a delay):

Iteration 1
x[0] at start 7
x[0] at finish 1

Iteration 2
x[0] at start 7
x[0] at finish 1

Iteration 3
x[0] at start 7
x[0] at finish 1

This shows that the writes from function foo were propagating to the main function. I also used taskset to set thread affinity to make sure each thread was running on a different core.

Looking at the disassembly:

objdump --prefix-addresses -S -M intel -d a.out

void *foo(void *y_void){
080484d4 push ebp
080484d5 <foo+0x1> mov ebp,esp
080484d7 <foo+0x3> sub esp,0x10
char *y=(char *)y_void;
080484da <foo+0x6> mov eax,DWORD PTR [ebp+0x8]
080484dd <foo+0x9> mov DWORD PTR [ebp-0x4],eax
unsigned int j;
while(1){
for(j=0;j<1000000000;j++){
080484e0 <foo+0xc> mov DWORD PTR [ebp-0x8],0x0
080484e7 <foo+0x13> jmp 080484ed <foo+0x19>
080484e9 <foo+0x15> add DWORD PTR [ebp-0x8],0x1
080484ed <foo+0x19> cmp DWORD PTR [ebp-0x8],0x3b9ac9ff
080484f4 <foo+0x20> jbe 080484e9 <foo+0x15>
}
y[0]=1;
080484f6 <foo+0x22> mov eax,DWORD PTR [ebp-0x4]
080484f9 <foo+0x25> mov BYTE PTR [eax],0x1
}
080484fc <foo+0x28> jmp 080484e0 <foo+0xc>
return NULL;
}

You can see that foo is writing to the memory location y[0] every iteration
of the while loop (y[0] is the same memory location as x[0]).

The polling while loop from the main function:

while(1){
  if(x[0]==1)break;
}

turns into:

0804859a <main+0x9c> nop
0804859b <main+0x9d> mov eax,DWORD PTR [esp+0x1c]
0804859f <main+0xa1> movzx eax,BYTE PTR [eax]
080485a2 <main+0xa4> cmp al,0x1
080485a4 <main+0xa6> jne 0804859a <main+0x9c>
080485a6 <main+0xa8> nop

so it repeatedly loads x[0] and checks if it is 1.

I did consider that synchronisation was happening elsewhere (maybe in
printf) so I also tried the following program:

#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>

void *foo(void *y_void){
    char *y = (char *)y_void;
    while(1){
        y[0] = 1;
    }
    return NULL;
}

int main(void){
    char *x;
    x = malloc(1);
    x[0] = 7;
    pthread_t other_thread;
    if(pthread_create(&other_thread, NULL, foo, x)) {
        printf("Thread creation failed\n");
        return 1;
    }
    int i = 1;
    while(1){
        x[0] = 7;
        while(1){                      /* poll until foo's write is observed */
            if(x[0] == 1) break;
        }
        if(i > 1000000000) break;
        i++;
    }
    printf("done\n");
    return 0;
}

This program ran to completion, implying that the writes from foo were
propagating to main without explicit synchronisation (again I verified this
by reading the disassembly).

Admittedly I don't fully understand all the nuances of the x86 memory model (and from reading https://www.di.ens.fr/~zappa/readings/cacm10.pdf there seem to be many), but to a first-order approximation I think my original assumption holds (though I'm aware of the dangers of drawing more general conclusions from that).

@jfbastien
Contributor

You're testing one compiler configuration, compiling one program, on a single version of an x86 machine. This isn't proof that hardware works that way. In fact weaker memory model systems do not offer such guarantees.

We're designing a memory model which will work with common hardware implementations, not just x86. We can't "just ship x86" and call it a day.

What are you trying to get, and why is the current memory model not sufficient? Most developers should use higher-level abstractions such as mutex, a few should use atomics but they need to do quite a bit of work to get it right.

@taisel

taisel commented Jan 11, 2016

@jfbastien I get that now. You want guarantees that it works on ALL systems, not just the conventional ones intended for the general populace. I wrote some test code that leaned on guarantees from the OS scheduler and common platform details. You are correct that there's added correctness work that some of us are leaving out. I was personally hell-bent on assuming that consumer-grade configurations would be the baseline.

@jfbastien
Contributor

Gotcha, that's worth adding to the FAQ as one of the design constraints, with references supporting design choices (including some of the links I provided). Otherwise the repo just says "this is how things are" without enough "and this is why it's this way".

@taisel

taisel commented Jan 11, 2016

@cosinusoidally Just say no to spinloops. Let the API spinloop behind the scenes for you, but do not write your own, as it's extremely bad practice. It's an anti-pattern against the OS scheduler, it's "room heater" code, and whether it works or not depends on the memory model of the system it's running on.

GCC optimizing out your spinloop isn't a bug; it's a very deliberate feature. Take it as a warning sign against such patterns.
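For what it's worth, blocking in the API rather than spinning would look roughly like this (a hedged sketch assuming the proposal's Atomics.futexWait / Atomics.futexWake; flag is an Int32Array view on a SharedArrayBuffer, and the waiter runs in a worker since blocking isn't allowed on the main thread):

// waiting worker: park in the runtime instead of burning a core
while (Atomics.load(flag, 0) === 0) {
  Atomics.futexWait(flag, 0, 0);    // sleeps only while flag[0] is still 0
}

// signalling agent:
Atomics.store(flag, 0, 1);
Atomics.futexWake(flag, 0, 1);      // wake one waiter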

@titzer

titzer commented Jan 11, 2016

On Mon, Jan 11, 2016 at 1:12 AM, Liam Wilson notifications@github.com
wrote:


My assumption was that writes will eventually propagate between threads
without necessarily needing explicit synchronisation. I wrote the following
program to test my assumption:

Unfortunately this is not guaranteed by current memory models which are
based on a model of a store buffer that is nondeterministically flushed to
memory by the writer. There is no operation that a user-level reader can
execute to force a writer to flush its writes; there must be a
synchronization operation (i.e. a memory fence) in the writer to guarantee
writes are flushed to memory. Thus my comment that guaranteeing write
propagation at the user level would require inserting a memory fence after
every write, which would be very expensive.

Hardware has limited cache space, causing it to eventually flush lines to memory, and today's cache coherency protocols will eventually flush dirty cache lines as well, but without explicit synchronization it is basically impossible to guarantee anything meaningful about when that happens with respect to the writer thread, let alone the reader thread.




@taisel

taisel commented Jan 11, 2016

@titzer yeah, the goal of sharedmem is to target all systems, not ones that automagically sync at the end of the day. I see that now.

@cosinusoidally

On 11 January 2016 at 00:21, JF Bastien notifications@github.com wrote:

What are you trying to get, and why is the current memory model not sufficient? Most developers should use higher-level abstractions such as mutex, a few should use atomics but they need to do quite a bit of work to get it right.

I’m just trying to wrap my head around some of the implications of the
memory model (which is something that could probably be clarified by a
FAQ). A specific use case I have is that of emulating the mapping of
read only memory into the address spaces of multiple threads. One
thread would write to a region of memory inside a SAB. From the point
of view of the other threads that memory would then be in an undefined
state. I would then need to use some kind of synchronisation operation
to both flush the writes from the original thread, and to get the
other threads to reload any stale data they may have. After this is
done all threads should be able to read that memory region without locking. From the specification I cannot see a straightforward way to do this (short of making every single write to the region atomic, and then (atomically) passing a pointer to the region to the readers, but that is likely to be inefficient).

On 11 January 2016 at 09:22, titzer notifications@github.com wrote:

Hardware has limited cache space, causing it to eventually flush lines to memory, and today's cache coherency protocols will eventually flush dirty cache lines as well, but without explicit synchronization it is basically impossible to guarantee anything meaningful about when that happens with respect to the writer thread, let alone the reader thread.

Yep, that’s one of the things that does not sit right with me
regarding “cache coherent” systems. Such systems seem to end up with
so many caveats that you may as well work under the assumption that
they are not cache coherent :)

@jfbastien
Contributor

@cosinusoidally:

I'm just trying to wrap my head around some of the implications of the memory model (which is something that could probably be clarified by a FAQ). A specific use case I have is that of emulating the mapping of read only memory into the address spaces of multiple threads. One thread would write to a region of memory inside a SAB. From the point of view of the other threads that memory would then be in an undefined state. I would then need to use some kind of synchronisation operation to both flush the writes from the original thread, and to get the other threads to reload any stale data they may have. After this is done all threads should be able to read that memory region without locking. From the specification I cannot see a straightforward way to do this (short of making every single write to the region atomic, and then (atomically) passing a pointer to the region to the readers, but that is likely to be inefficient).

IIUC you're hitting a problem similar to one C++ also has and the standards committee is actively trying to address: http://wg21.link/p0019r0

You're trying to have epochs where an array is written in a single-threaded manner, and then accessed in a read-only manner from multiple threads. The paper describes a slightly different use case, but the fundamental idea, epochs, is the same.

In C++ today this is feasible by using relaxed accesses when accessing the array from a single thread, plus fences to mark the epoch. In your use case you could also use relaxed accesses while the data is read-only, because that isn't racy either.

The current SAB memory model doesn't have relaxed accesses, nor does it have fences. Things can instead work out if you access the array with non-atomic accesses, but you hold a single-writer multiple-reader mutex between epochs (or perform a lower level acquire / release between writer and all readers). I think this approach is cleaner anyways :-)
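Roughly, under the current seq_cst-only model that epoch hand-off could look like this (a hedged sketch with illustrative names; region and gen are views on the same SharedArrayBuffer, and produce(), expectedGeneration and someIndex are stand-ins):

// writer, during its exclusive epoch:
for (var i = 0; i < region.length; i++) {
  region[i] = produce(i);                         // plain writes
}
Atomics.store(gen, 0, Atomics.load(gen, 0) + 1);  // seq_cst store publishes the epoch

// any reader, once per epoch:
if (Atomics.load(gen, 0) >= expectedGeneration) {
  // observing the bumped generation makes the plain writes above visible,
  // so for the rest of the epoch the region can be read without locking
  var v = region[someIndex];
}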

@lars-t-hansen
Collaborator Author

I updated the FAQ with more context from this discussion and linked to this issue for papers and background, so let's close this. Happy to accept answers to more FAQs, but let's phrase them as PRs against the FAQ document.
