Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Trial PR] Add ThreadSanitizer support #22

Closed
wants to merge 16 commits into from
Closed

Conversation

OlivierNicole
Copy link
Collaborator

@OlivierNicole OlivierNicole commented Dec 20, 2022

This PR introduces a new configure-time flag --enable-tsan to enable compilation with ThreadSanitizer (TSan) instrumentation. This is a joint work with Fabrice Buoro (@fabbing), based on the work of Anmol Sahoo (@anmolsahoo25). We also had help from @jhjourdan and @maranget on memory models, @shindere on the build system, and are grateful to @art-w and @gadmm for their useful feedback.

TSan is an approach developed by Google to locate data races in code bases. It works by instrumenting your executables to keep a history of previous memory accesses (at a certain performance cost), in order to detect data races, even when they have no visible effect on the execution. TSan instrumentation has been implemented in various compilers (gcc, clang, as well as the Go and Swift compilers), and has proved very effective in detecting hundreds of concurrency bugs in large projects.

Executables instrumented with TSan report data races without false positives. However, data races in code paths that are not visited will not be detected.

Example

A data race consists in two accesses to the same location, at least one of them being a write, without any order being enforced between them:

type t = { mutable x : int }

let v = { x = 0 }

let () =
  let t1 = Domain.spawn (fun () -> v.x <- 10; Unix.sleep 1) in
  let t2 = Domain.spawn (fun () -> ignore v.x; Unix.sleep 1) in
  Domain.join t1;
  Domain.join t2

Compiling this program with a TSan-enabled compiler is done by the usual command ocamlopt program.ml. Running it will output a report looking like this:

==================
WARNING: ThreadSanitizer: data race (pid=4170290)
  Read of size 8 at 0x7f072bbfe498 by thread T4 (mutexes: write M0):
    #0 camlSimpleRace__fun_524 /tmp/simpleRace.ml:7 (simpleRace.exe+0x43dc9d)
    #1 camlStdlib__Domain__body_696 /home/olivier/.opam/5.0.0+tsan/.opam-switch/build/ocaml-variants.5.0.0+tsan/stdlib/domain.ml:202 (simpleRace.exe+0x47b5dc)
    #2 caml_start_program ??:? (simpleRace.exe+0x4f51c3)
    #3 caml_callback_exn /home/olivier/.opam/5.0.0+tsan/.opam-switch/build/ocaml-variants.5.0.0+tsan/runtime/callback.c:168 (simpleRace.exe+0x4c2b93)
    #4 caml_callback /home/olivier/.opam/5.0.0+tsan/.opam-switch/build/ocaml-variants.5.0.0+tsan/runtime/callback.c:256 (simpleRace.exe+0x4c36e3)
    #5 domain_thread_func /home/olivier/.opam/5.0.0+tsan/.opam-switch/build/ocaml-variants.5.0.0+tsan/runtime/domain.c:1093 (simpleRace.exe+0x4c6ad1)

  Previous write of size 8 at 0x7f072bbfe498 by thread T1 (mutexes: write M1):
    #0 camlSimpleRace__fun_520 /tmp/simpleRace.ml:6 (simpleRace.exe+0x43dc45)
    #1 camlStdlib__Domain__body_696 /home/olivier/.opam/5.0.0+tsan/.opam-switch/build/ocaml-variants.5.0.0+tsan/stdlib/domain.ml:202 (simpleRace.exe+0x47b5dc)
    #2 caml_start_program ??:? (simpleRace.exe+0x4f51c3)
    #3 caml_callback_exn /home/olivier/.opam/5.0.0+tsan/.opam-switch/build/ocaml-variants.5.0.0+tsan/runtime/callback.c:168 (simpleRace.exe+0x4c2b93)
    #4 caml_callback /home/olivier/.opam/5.0.0+tsan/.opam-switch/build/ocaml-variants.5.0.0+tsan/runtime/callback.c:256 (simpleRace.exe+0x4c36e3)
    #5 domain_thread_func /home/olivier/.opam/5.0.0+tsan/.opam-switch/build/ocaml-variants.5.0.0+tsan/runtime/domain.c:1093 (simpleRace.exe+0x4c6ad1)

  Mutex M0 (0x000000567ad8) created at:
    #0 pthread_mutex_init /home/olivier/other_projects/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:1316 (libtsan.so.0+0x3cafb)
    [...]

SUMMARY: ThreadSanitizer: data race /tmp/simpleRace.ml:7 in camlSimpleRace__fun_524
==================
ThreadSanitizer: reported 1 warnings

Many other examples can be found in the test suite. More information about ThreadSanitizer usage are available in the readme of the ocaml-tsan repo, which has been released as a fork based on 5.0.0 (release date December 16, 2022). This approach already allowed to detect a number of data races in OCaml libraries:

The generated executables should not be impacted by this change when --enable-tsan is not set. Compilation times are the same as before without --enable-tsan.

Caveats

  • We inherit the portability limitations of ThreadSanitizer of not being available on Windows.
  • Currently, only instrumentation of x86_64 executables is supported; future work could include ARM64 support.

How TSan works

Executables are instrumented with calls to the TSan runtime library, which tracks accesses to shared data and reports races.

Internally the runtime library associates with each word of application memory at least 2 "shadow words". Each shadow word contains information about a recent memory access to that word, including a "scalar clock". Those clocks serve to establish a happens-before relation.

This information is maintained as a "shadow state" in a separate memory region, and updated at every (instrumented) memory access. A data race is reported every time two memory accesses are made to overlapping memory regions, and:

  • one of them is a write, and
  • there is no established happens-before relation between them.

Instrumentation pass

Instrumentation calls are inserted in several places:

  • Memory accesses are instrumented with calls to the TSan runtime, with functions of the form __tsan_read8, __tsan_atomic_store; those calls are inserted into the Cmm representation of code. We exploit mutability information to improve performance: immutable loads are translated to non-instrumented memory reads.

  • Function entries and exits are instrumented with calls to __tsan_func_entry and __tsan_func_exit. This is used by TSan to provide backtraces in data race reports (including backtraces of past memory accesses).

  • The C runtime is instrumented as well, using the built-in ThreadSanitizer support of GCC or Clang. This is necessary as

    1. many relevant memory accesses are made from the runtime, and
    2. the runtime handles thread-synchronizing operations such as locking and unlocking, operations which TSan needs to know about.

    Enabling tsan also causes C code to be built with -fno-omit-frame-pointers. The OCaml compiler's --enable-frame-pointers configure flag is however independent of --enable-tsan.

Instrumentation of function entries and exits is simple in C or C++ (where raising an exception always unwinds the stack), but challenging in OCaml which has non-local control flow due to exceptions and effect handlers. In order to keep correct backtraces in all circumstances, it is necessary to emit ad hoc calls to __tsan_func_entry and/or __tsan_func_exit when an exception is raised, an effect is performed, or a continuation is resumed, in order to keep TSan’s view of the call stack up-to-date.

Doing this requires unwinding the stack. When possible, we use the already-present mechanism of frame_descr descriptors, embedded in the executable, to do it. However, exceptions can be raised across C stack frames, and frame descriptors only exist for OCaml frames. In order to unwind the C parts of the stack, we use the libunwind library. As a consequence, one must install libunwind to build OCaml with --enable-tsan.

__tsan_func_entry takes as an argument the return address in the current stack frame. This required adding a constructor Creturn_addr to Cmm expressions and Ireturn_addr to the type Mach.operation, whose semantics is to fetch the return address from the stack frame.

Embedding of the OCaml memory model into the C11 one

TSan’s underlying memory model is the C11 model. Therefore, we have to map OCaml memory operations to C11 operations. The relations between the two models have been the subject of many discussions (ocaml/ocaml#10995, ocaml/ocaml#10992, ocaml/ocaml#10972 (comment)), which we took as a basis for our work. Conceptually, the instrumentation translates OCaml memory operations to C11 operations. Race-free OCaml programs are translated to race-free C programs, while OCaml programs containing races (in the OCaml sense) are translated to C programs that are racy (in the C11 sense).

The resulting runtime analysis does not report false positives, in the sense that all races reported by TSan are indeed data races in the OCaml sense1.

The mapping between OCaml memory operations and C11 operations signaled to TSan can be found here. In the same thread, the previous post contains a justification of the absence of false positives. In a meeting, we presented these to @maranget and @jhjourdan, who agreed that it should have the correctness properties we claim it has.

Performance cost of the instrumentation

First of all, there is no compilation or runtime cost associated with the change unless the compiler is configured with --enable-tsan.

When TSan instrumentation is enabled, it incurs a slowdown and increases memory consumption. Preliminary benchmarks show a slowdown in the range 7x-13x and a memory consumption increase in the range 2x-7x. Some pathological cases exist, such as testsuite/tests/lf_skiplist/test_parallel.ml for which time and memory usage blow up by 23x and 27x, respectively. (We haven’t investigated further yet.)

The memory consumption increase is better than for C/C++/Go (this conference talk reports 5x-10x). The slowdown is in range of what is observed in C/C++/Go (5x-15x), but since OCaml programs are expected to perform less mutation than C on average, there is probably room for improvement. There are probably low-hanging fruits in terms of optimization, notably removing instrumentation from the runtime where it is not strictly necessary to correctly detect races in user programs; and investigate the aforementioned pathological case.

Organization of the changes

This PR is best reviewed commit by commit.

It is based on the commits of ocaml/ocaml#12108 which introduces native-only C compiler flags. As a consequence, the changes actually belonging to this PR start at the 10th commit, “Add tsan configure flag”.

More than 60 % of the diff for this PR is due to a new set of tests in testsuite/tsan/. A small number of tests have to be disabled when --enable-tsan is set, due to the fact that a call tree of depth >64k causes the ThreadSanitizer runtime to crash. (This is a limitation on the ThreadSanitizer side.)

In the main “Add ThreadSanitizer support” commit, big changes are limited to a few files: the Cmm instrumentation pass is in asmcomp/thread_sanitizer.ml[i]. The new C file runtime/tsan.c handles the task of letting TSan know about function entries and exits when raising exceptions or handling effects. Finally, some of the instrumentation calls have to be inserted directly into the assembly routines of runtime/amd64.S.

In a separate commit, we disable TSan on a number of functions in the runtime. This is because TSan warns about data races on these functions. We reported those warnings in ocaml/ocaml#11040, and silence them to avoid facing users of TSan with data race reports that are not related to their code. To save performance, we also un-instrument some functions that are executed often, but should not create races with user code.

Footnotes

  1. Except maybe in some rare cases involving data dependencies or volatile, due to limitations of TSan itself. False positives due to volatile can only appear in legacy FFI code, not written with parallelism in mind.

@OlivierNicole
Copy link
Collaborator Author

This is a trial PR in preparation for a PR upstream @jmid @kayceesrk @stedolan @anmolsahoo25

@kayceesrk
Copy link
Collaborator

Thanks for the PR. I'll ask a few initial questions

  1. Is there a performance cost on code that is not compiled with --enable-tsan?
    • Does the compilation take longer or more memory?
    • Does the compiled program run longer?
    • Is there a code size impact for the compiled program?
    • Does the compiled program use more memory?
  2. I assume that, unlike C/C++, given that OCaml has some mutability information, instrumentation can be avoided on many accesses. Is this exploited? If so, I am surprised to see that the performance impact is on the same range as tsan instrumentation on C/C++. What am I missing?
  3. On a cursory look, the change looks fairly invasive. It adds 2640 lines and deletes 100 lines. There are 105 files changed. Can you comment on these changes? For example, how many of the 2640 lines and 105 files changed is new tests, for example?
  4. Lots of the CI runs are failing. Why?

ThreadSanitizer supports inserts calls to the ThreadSanitizer runtime library, which tracks accesses to shared data and reports races.

support

@kayceesrk
Copy link
Collaborator

Perhaps @sadiqj may also want to take a look as he has a good sense of runtime and instrumentation.

@OlivierNicole
Copy link
Collaborator Author

OlivierNicole commented Dec 21, 2022

1. Is there a performance cost on code that is not compiled with --enable-tsan?

  • Does the compilation take longer or more memory?
  • Does the compiled program run longer?
  • Is there a code size impact for the compiled program?
  • Does the compiled program use more memory?

There should be no added cost without --enable-tsan.

  • The compilation process should not change except for a few if Config.tsan tests on non-critical paths.
  • The generated code is the same.
  • There should be no size impact for executables statically linked with the runtime, as all of our changes are guarded by an #ifdef WITH_THREAD_SANITIZER directive.
  • The memory usage of the compiled program should be the same.

2. I assume that, unlike C/C++, given that OCaml has some mutability information, instrumentation can be avoided on many accesses. Is this exploited?

Yes, from OCaml code, we instrument only the accesses to mutable locations. On the other hand, the C runtime is also instrumented, and C instrumentation has no mutability information, so the instrumentation there is likely to be more costly.

If so, I am surprised to see that the performance impact is on the same range as tsan instrumentation on C/C++. What am I missing?

The memory usage is better than C/C++/Go: ours seems to be 2x-7x, whereas https://www.youtube.com/watch?v=5erqWdlhQLA reports 5x-10x. But the slowdown can indeed be seen as disappointing. I have three hypotheses:

  • the instrumented C runtime may suffer a bigger slowdown and drag down the performance of the program;
  • exception and effect handling is much more costly than in non-instrumented OCaml as we have to unwind the stack every time, even with raise_notrace;
  • maybe we are still instrumenting too many OCaml accesses due to incomplete mutability information; e.g. I recently substantially improved performance by no longer instrumenting header reads. It would require some work to look into improvement opportunities. It would also be interesting to investigate pathological cases such as the skiplist test mentioned above.

3. On a cursory look, the change looks fairly invasive. It adds 2640 lines and deletes 100 lines. There are 105 files changed. Can you comment on these changes? For example, how many of the 2640 lines and 105 files changed is new tests, for example?

Yes: the changes to the test suite only totalize 1583 insertions and 35 deletions on 45 files. So they constitute the bulk of the diff. The rest is split evenly between asmcomp/ and runtime/, where it is organized as described in the first message. The changes to the runtime are conditioned to the WITH_THREAD_SANITIZER constant. There are also a few non-invasive changes to the makefiles.

Finally, in the runtime, a portion of the diff (~100 lines) is due to this comment repeated several times:

+CAMLno_tsan /* FIXME Race reports from this function clog the tests of user
+               programs, so we disable instrumentation here. However, Further
+               investigation would be needed about the cause of these race
+               reports.
+               */
 static int try_update_object_header(value v, volatile value *p, value result,
                                     mlsize_t infix_offset) {

(As a side note, we had to silence many TSan reports internal to the runtime, which we intend to report separately.)

@OlivierNicole
Copy link
Collaborator Author

OlivierNicole commented Dec 21, 2022

the instrumented C runtime may suffer a bigger slowdown and drag down the performance of the program;

regarding this, a promising track of improvement is to un-instrument all C functions that do not perform any synchronization operations critical for TSan.

@OlivierNicole
Copy link
Collaborator Author

Investigating the CI failures now.

@kayceesrk
Copy link
Collaborator

Thanks. It would be useful to incorporate some of the answers to my questions into the PR message.

@OlivierNicole OlivierNicole force-pushed the tsan_patch branch 2 times, most recently from e034bdb to f27628d Compare December 21, 2022 13:33
@OlivierNicole
Copy link
Collaborator Author

Thank you for taking the time to look at this @kayceesrk. I updated the PR message with the new elements.

@OlivierNicole
Copy link
Collaborator Author

OlivierNicole commented Dec 21, 2022

The CI failures were due to me forgetting to push an updated configure.

@jmid
Copy link
Contributor

jmid commented Dec 21, 2022

Thanks for the trial PR! I think the PR text is well-written and gives a good intro with relevant background info.
I took a look at the diff last night, and spotted a few things. I'm not experienced in the run-time system so take the below with a grain of salt:

  • In the emitters: failwith "not implemented" seems a bit short. A non-suspecting user hitting this will not necessarily know what isn't implemented.
  • Perhaps you could add a high-level explanation of what C/Ireturn_addr is used/needed for?
  • I spotted a few needless changes that might as well be omitted to keep the diff minimal (I suspect they are a result of repeated rebasing):
    • the Makefile has an unrelated newline change in the clean target
    • runtime/caml/stack.h has an unrelated comment inserted

@kayceesrk
Copy link
Collaborator

The other thing that you should include is how TSAN instrumentation is used to detect data race under the Multicore OCaml primitives i.e, translation of OCaml memory model to C memory model. This came up in the discussion at https://discuss.ocaml.org/t/ann-threadsanitizer-support-for-ocaml-5-0-0-first-public-release/10975. It would be useful to add a section on how the translation works.

Given the last open question from Guillaume, do you think we're at a stage where we are sure that the technical work has been completed and there are no open questions? If there are open questions, mention them explicitly in the PR message.

@OlivierNicole
Copy link
Collaborator Author

OlivierNicole commented Dec 22, 2022

@jmid Thank you for your comments, I will follow the suggestions.

Given the last open question from Guillaume, do you think we're at a stage where we are sure that the technical work has been completed and there are no open questions? If there are open questions, mention them explicitly in the PR message.

Guillaume has pinpointed something that I overlooked. I completed the instrumentation of memory accesses when I was not aware of the subtleties of translating the OCaml memory model to C, and I did not go back to question this instrumentation later.

As a result, the current mapping from OCaml operations to C11 operations, that is used to inform TSan, does not quite reflect the correct mapping (ocaml/ocaml#10995):

OCaml operation Signaled to TSan as
Atomic load atomic_load_explicit(..., memory_order_acquire);
Atomic store atomic_thread_fence(memory_order_acquire);
atomic_exchange_explicit(..., memory_order_seq_cst);
atomic_thread_fence(memory_order_release);
Non-atomic load Plain read
Non-atomic store atomic_thread_fence(memory_order_acquire);
atomic_store_explicit(..., memory_order_release);
for OCaml values (through caml_modify), or
plain write for other values

In bold are the operations for which the mapping from OCaml to C11 memory operations is not faithful to the one described in ocaml/ocaml#10995.

In addition, Guillaume’s question is relevant and we did not lay it out clearly before: are we mapping OCaml memory operations to well-defined C programs (but at the risk that TSan does not report OCaml data races), or mapping OCaml memory operations to possibly buggy C programs (when there is an OCaml data race)?

I do not think any of this is a huge blocker, but the technical question should indeed be answered, and the implementation possibly corrected. I am leaning in favor of mapping OCaml data races to buggy C programs, but I need to verify that race-free programs will then always be mapped to race-free C programs.

@OlivierNicole
Copy link
Collaborator Author

I edited to fix some wrong lines in the table.

@kayceesrk
Copy link
Collaborator

or mapping OCaml memory operations to possibly buggy C programs (when there is an OCaml data race)?

Isn't this what we want? So that the race detector can tell us when things go wrong? Does that lead to more false positives?

More generally, can you comment on the soundness of TSAN for C++ programs and our encoding of TSAN for OCaml programs? A sound race detector has no false positives. If the underlying system is unsound, then there is no point in going for soundness with our encoding. Of course, reducing false positives is a worthwhile effort.

@OlivierNicole
Copy link
Collaborator Author

Yes, I think that mapping racy OCaml operations to buggy C programs is what we want.

can you comment on the soundness of TSAN for C++ programs and our encoding of TSAN for OCaml programs?

TSan is claimed to have no false positives (there are no papers describing TSan’s current algorithm, but see this slide: https://youtu.be/5erqWdlhQLA?t=1975), except sometimes when “there is tiny probability to miss a data race” (https://github.com/google/sanitizers/wiki/ThreadSanitizerAlgorithm#state-machine).

Does that lead to more false positives?

I think it might, or at least I cannot find a good mapping right now. We could just take the correct mapping (i.e., not the one above) and turn the operation for non-atomic loads into a plain write; that would ensure that a conflicting read & write to the same non-atomic location are turned into a C11 data race, and I think we don’t lose any happens-before edges that way; but then two conflicting writes do not translate to a data race.

OCaml operation Signaled to TSan as
Atomic load atomic_load_explicit(..., memory_order_seq_cst);
Atomic store atomic_thread_fence(memory_order_acquire);
atomic_exchange_explicit(..., memory_order_seq_cst);
atomic_tread_fence(memory_order_release);
Non-atomic load Plain read
Non-atomic store atomic_thread_fence(memory_order_acquire);
atomic_store_explicit(..., memory_order_release);

If we signal non-atomic stores as just plain writes, then I think we lose some happens-before edges, which can lead to false positives.

It would be useful to have the input of someone more expert than me in memory models on this question.

@anmolsahoo25
Copy link
Contributor

anmolsahoo25 commented Dec 22, 2022

I think I can chime in here.

What is ThreadSanitizer doing?

The objective of the ThreadSanitizer instrumentation of OCaml programs is to detect a race. A race is defined if there exist two memory events to a non-atomic location, with at least one being a write. A race exists if there is no happens-before (hb) edge between these two events. The ThreadSanitizer runtime mechanism builds up the hb-relation dynamically and signals if two events satisfying the above condition exist.

Why embed an OCaml program in a C program?

We use the C11 primitives provided by the ThreadSanitizer library to signal the hb-edges between OCaml memory events. For this, we use the existing hb-edges defined between RC11 events. For example, a e1: __tsan_atomic_write(x, release) and e2: __tsan_atomic_read(x, acquire) sets up an hb-edge, provided e2 reads from e1. This is a primitive hb-edge defined in RC11 and happens to coincide with what we need for OCaml as well.

But, since the OCaml memory model has additional hb-edges than RC11, we have to add more synchronization.

Mapping OCaml memory events to C11 primitives

The correct encoding of OCaml memory events to RC11 memory events is as follows. According to the OCaml Memory Model paper, there are 3 conditions for hb -

  1. E1 hb E2 whenever E1 is an initial write and E2 is not.
  2. E1 hb E2 whenever E1 po E2.
  3. E1 hb E2 whenever E1 and E2 access the same atomic location
    3.1. E1 co E2
    3.2. E1 rf E2.
OCaml operation Signaled to TSan as hb-edge created
Atomic.load x __tsan_atomic_load(x, memory_order_acq) rf edge causes hb (3.2)
Atomic.store x __tsan_atomic_exchange(x, memory_order_acqrel) co edge causes hb (3.1)
Non-atomic load __tsan_read(x) only po (2)
Non-atomic store (initializing) __tsan_write(x); atomic_thread_fence hb edge from initializing write (1)
Non-atomic store (assignment) __tsan_write(x) only po (2)

The current mapping is incorrect for the below reasons -

  1. The seq_cst ordering on atomic writes create a total order on all writes. But in the OCaml memory model, there is only an hb-order between writes to the same location
  2. The semantics of seq_cst in RC11 are broken (refer to Repairing Sequential Consistency in C/C++11)
  3. Non-atomic store may acquire from an atomic store (via the fence) which is not required by the OCaml memory model

@kayceesrk
Just to quickly answer question. We are sending OCaml programs defined under the OCaml Memory Model to C programs defined under the RC11 memory model with additional synchronization. As ThreadSanitizer is sound, if there are any races in the C program, there has to be a race in the OCaml program. Thus the procedure is sound since we are only adding hb-edges to the C11 program, we only eliminate potential races (which are not considered races in the OMM model) and do not create new ones. The procedure is complete, since every race under the OCaml memory model will be a race in the C11 + additional synchronization model. This is because adding hb-edges to the C program recreates exactly the hb-relation defined by the OCaml memory model, thus every race in the OCaml model will continue to be a race in the C model.

@OlivierNicole
Copy link
Collaborator Author

Thank you for helping with this.

1. The seq_cst ordering on atomic writes create a total order on all writes. But in the OCaml memory model, there is only an hb-order between writes to the same location

I am surprised by this. Does you mean that @stedolan’s mapping at ocaml/ocaml#10995 is wrong?

@anmolsahoo25
Copy link
Contributor

Sorry I edited the previous mapping. I hope that does not seem confusing.

@anmolsahoo25
Copy link
Contributor

anmolsahoo25 commented Dec 22, 2022

Thank you for helping with this.

  1. The seq_cst ordering on atomic writes create a total order on all writes. But in the OCaml memory model, there is only an hb-order between writes to the same location

I am surprised by this. Does you mean that @stedolan’s mapping at ocaml/ocaml#10995 is wrong?

No, no. I'm wrong there sorry. seq_cst only creates the total coherence order. hb-order may be different from the coherence order. My bad. But the point still stands that, without knowing how ThreadSanitizer induces hb-edges for seq_cst memory events, we may be going too strong and missing out on races that may otherwise be captured.

Edit: I am just trying to remember where C11 is broken when you mix seq_cst and ra accesses. Which is why I suggested the alternative mapping. Will let you know once I get around.

@kayceesrk
Copy link
Collaborator

kayceesrk commented Dec 23, 2022

There is a bug in the PLDI 2018 paper.

E1 hb E2 whenever E1 is an initial write and E2 is not.

hb is too strong here and not needed. This came up in the review of the memory model chapter: ocaml/ocaml#11280 (comment). CC @stedolan.


For non-atomic stores, it would be useful to sub-classify them as I have done in ocaml/ocaml#11418 (comment)

we need to discriminate between the following:

  1. Initialising writes
  2. non-initialising (non-atomic) writes of integers, where no publication of new objects can take place
  3. non-initialising (non-atomic) writes of pointers, with possible publication (caml_modify)
  4. writes to non-word-sized fields

Please write down the compilation strategy for each of these separately. It will be much clearer this way.


I am surprised by this. Does you mean that @stedolan’s mapping at ocaml/ocaml#10995 is wrong?

Atomic accesses should use seq_cst.


Overall, I think we need to have a correctness argument for compiling OCaml loads to plain loads. Here's my attempt at the soundness theorem

Assuming TSAN analysis for C programs is sound (no false positives), the compilation of OCaml accesses to TSAN instrumentation introduces no false positives.

What do you think? Does the current implementation have this property?

@OlivierNicole
Copy link
Collaborator Author

Thank you.

Slight digression, but it would be very useful to have a document that reflects the current state of the knowledge about the memory model. At least for unexperienced readers like me, it’s difficult to find all the information and keep it in mind. As you mention, the PLDI18 paper requires some errata, and there things it doesn’t cover, like the mapping with the C11 model, or other architectures, that are only covered in Github discussions.

I also have an increasing number of bookmarks of Github discussions that explain a particular point (like the publication safety of initializing writes at ocaml/ocaml#10995 (comment)) or explain the need to add a fence (ocaml/ocaml#10972 (comment)) or discuss potential issues (ocaml/ocaml#10992).

Assuming TSAN analysis for C programs is sound (no false positives), the compilation of OCaml accesses to TSAN instrumentation introduces no false positives.

What do you think? Does the current implementation have this property?

I think the current implementation probably doesn’t have it because e.g. OCaml atomic loads are signaled to TSan as acquired loads. I’m hoping that it can be easily fixed to have it by fixing the mapping, though.

@kayceesrk
Copy link
Collaborator

kayceesrk commented Dec 23, 2022

Memory model is likely to remain a complex topic given the nature of it. It may be useful to write an extended version of the PLDI 18 paper. The right time to write the paper would be when we have a few more native backends (s390x and power) and we’ve had lockfree data structures used for real. It is possible that the compilation strategy may evolve or new primitive operations may added.

@OlivierNicole
Copy link
Collaborator Author

To summarize, here is the current candidate mapping:

OCaml operation Signaled to TSan as hb-edge created
Atomic.load x __tsan_atomic_load(x, memory_order_seq_cst) rf edge causes hb (3.2)
Atomic.store x __tsan_atomic_exchange(x, memory_order_seq_cst) co edge causes hb (3.1)
Non-atomic load __tsan_read(x) only po (2)
Non-atomic store (initializing) __tsan_write(x); atomic_thread_fence(memory_order_seq_cst); hb edge from initializing write (1)
Non-atomic store (assignment, integer) __tsan_write(x) only po (2)
Non-atomic store (assignment, pointer) __tsan_write(x) only po (2)
Non-atomic store (non-word-sized field) __tsan_writeN(x), where N is one of 8, 16, 32, 64 only po (2)

@kayceesrk
Copy link
Collaborator

kayceesrk commented Jan 2, 2023

I have some more questions.

The optimized mapping in ocaml/ocaml#10995 assumes that you can transform

val = atomic_load_explicit(..., memory_order_acquire);

into:

val = atomic_load_explicit(..., memory_order_relaxed);
atomic_thread_fence(memory_order_acquire);

Does TSAN now understand atomic_thread_fence? See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97868.

There is also __tsan_atomic_thread_fence: https://github.com/llvm-mirror/compiler-rt/blob/69445f095c22aac2388f939bedebf224a6efcdaf/include/sanitizer/tsan_interface_atomic.h#L214. Not sure what this does.

@kayceesrk
Copy link
Collaborator

In the rest of the discussion, I'm assuming that __tsan_atomic_thread_fence actually does what the name indicates.

In OCaml, initialising writes are plenty. But these writes are not visible to other threads unless the newly allocated object is published. So we can optimise initialising writes by pushing the fence after that write to where the publication of the new object can happen. With that here is what I think the mappings will be:

Atomic load

__tsan_atomic_thread_fence(memory_order_acquire);
__tsan_atomic_load(x, memory_order_seq_cst);

See caml_atomic_load: https://github.com/ocaml/ocaml/blob/trunk/runtime/memory.c#L257-L258

Atomic store

__tsan_atomic_thread_fence(memory_order_acquire);
__tsan_atomic_exchange(x, memory_order_seq_cst);
__tsan_atomic_thread_fence(memory_order_release); /* for publication safety */

See caml_atomic_exchange: https://github.com/ocaml/ocaml/blob/trunk/runtime/memory.c#L272-L274

Non-atomic load

__tsan_read(x)

Non-atomic store (initialising)

__tsan_write(x)

Non-atomic store (assignment, integer)

__tsan_atomic_thread_fence(memory_order_acquire);
__tsan_write(x);

Non-atomic store (assignment, pointer)

__tsan_atomic_thread_fence(memory_order_acquire);
__tsan_write(x);
__tsan_atomic_thread_fence(memory_order_release); /* for publication safety */

See caml_modify: https://github.com/ocaml/ocaml/blob/trunk/runtime/memory.c#L154-L156

Non-atomic store (non-word-sized field)

No guarantees are provided for these accesses by the OCaml memory model. Hence, compiling them to

__tsan_writeN(x), where N is one of 8, 16, 32, 64

should be ok as happens-before will have to be established for them by other means (mutex, atomic variables, etc).


What do you think about the proposal above?

@OlivierNicole
Copy link
Collaborator Author

Does TSAN now understand atomic_thread_fence? See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97868.

Still not. A maintainer confirmed to me that it can be simulated using fake atomic accesses.

There is also __tsan_atomic_thread_fence: https://github.com/llvm-mirror/compiler-rt/blob/69445f095c22aac2388f939bedebf224a6efcdaf/include/sanitizer/tsan_interface_atomic.h#L214. Not sure what this does.

Looking at the implementation, it is a dummy function that does nothing regarding TSan’s knowledge of the HB relationships.

In OCaml, initialising writes are plenty. But these writes are not visible to other threads unless the newly allocated object is published. So we can optimise initialising writes by pushing the fence after that write to where the publication of the new object can happen.

Publication safety seems to also be established by address depedencies by some reports (reference ocaml/ocaml#10995 (comment)). I don’t know by what means TSan could take those into account.

What do you think about the proposal above?

Thank you for the updated proposal. I do not think that I, personally, can have a firm opinion about the absence of false positives with this mapping as of now. It certainly looks reasonable, but to completely convince myself that the theorem holds, I would first need to review all the relations needed in the OCaml memory model and verify that they are exactly the relations established by TSan with this mapping. This will take me a bit of work as, as I previously expressed, a lot of the arguments are currently flying over my head, and the information is scattered.

@OlivierNicole OlivierNicole force-pushed the tsan_patch branch 5 times, most recently from 5b320a7 to 10187d2 Compare March 3, 2023 16:13
This commit contains two changes to the root Makefile:

1. Removal of the following line:

$(DEPDIR)/runtime/%.bpic.$(D): OC_CFLAGS += $(SHAREDLIB_CFLAGS)

This line is useless because it makes no sense to alter any variable
in the CFLAGS category for the computation of dependencies, since only
the C preprocessor is involved in this stage and it does not take
C flags into account anyway, only C preprocessor flags.

2. When computing the dependencies for $(DEPDIR)/runtime/%.npic.$(D),
one should not refer to $(SHAREDLIB_CFLAGS), for a similar reason.

It has been verified that SHAREDLIB_CFLAGS is either empty, or contains
just -fPIC which is indeed not necessary for computing dependencies
(it is indeed a C flag rather than a C preprocessor flag).
In this target-specific definition, C and C preprocessor flags were mixed.
This commit distinguishes one form the other.
Given the convention that the OC_* build varialbes are reserved for
the build system, it seems better to make sure all of them are defined
in the private Makefile.build_config file, rather than in Makefile.config
which gets installed and thus becomes public.

This commit moves the definitions of OC_CFLAGS, OC_CPPFLAGS,
OC_LDFLAGS, OC_DLL_LDFLAGS and OC_EXE_LDFLAGS from Makefile.config.in to
Makefile.build_config.in. It also moves the defintion of MKEXE_VIA_CC,
since this variable relies on private build varables and does not seem
relevant or useful outside of the context of the build of the compiler itself.
The renamings done in this commit are:

OC_COMMON_CFLAGS -> OC_COMMON_COMPFLAGS
OC_COMMON_LDFLAGS -> OC_COMMON_LINKFLAGS
OC_BYTECODE_LDFLAGS -> OC_BYTECODE_LINKFLAGS
OC_NATIVE_CFLAGS -> OC_NATIVE_COMPFLAGS
OC_NATIVE_LDFLAGS -> OC_NATIVE_LINKFLAGS
This means moving its definition from Makefile.common to
Makefile.build_config.in
This is to let configure specify flags that will be used when
compiling C files to be linked with native code.

The variable is not used in the build system yet, just defined.
@OlivierNicole
Copy link
Collaborator Author

As discussed collectively, rebased on ocaml/ocaml#12108 which allowed a much needed cleaning of the build system aspects of TSan support.

@OlivierNicole OlivierNicole force-pushed the tsan_patch branch 2 times, most recently from c92436b to 9f01960 Compare March 15, 2023 17:22
shindere and others added 9 commits March 17, 2023 16:37
When compiled for linking with native code, C files use both the
common preprocessor flags and the native-specific cppflags.

The same should happen for assembly files and this commit makes sure
this is the case.
This flag is unused for now.

Co-authored-by: Olivier Nicole <olivier@chnik.fr>
Restore libunwind detection when TSan is enabled at configure time. This
is a cherry-picking of b694e84 by with some adaptations:

- libunwind detection is only attempted when tsan is enabled
- if tsan is enabled, libunwind is requested and not optional
- libunwind_include_dirs and libunwind_link_dirs become
  libunwind_cppflags and libunwind_ldflags, respectively

Co-authored-by: Olivier Nicole <olivier@chnik.fr>
ThreadSanitizer support is in two parts: instrumentation of the
generated native code, and runtime support.

The Cmm instrumentation pass is in asmcomp/thread_sanitizer.ml[i]. The
new C file runtime/tsan.c handles the task of letting TSan know about
function entries and exits when raising exceptions or handling effects.
Finally, some of the instrumentation calls have to be inserted directly
into the assembly routines of runtime/amd64.S.

Co-authored-by: Fabrice Buoro <fabrice@tarides.com>
Co-authored-by: Anmol Sahoo <anmol.sahoo25@gmail>
- Add a new set of tests in testsuite/tsan/
- A small number of tests have to be disabled when --enable-tsan is set,
  due to the fact that a call tree of depth >64k causes the
  ThreadSanitizer runtime to crash. (This is a limitation on the
  ThreadSanitizer side.)
- Add the no-tsan action to ocamltest, in order to test whether
  --enable-tsan is set.

Co-authored-by: Olivier Nicole <olivier@chnik.fr>
TSan warns about data races on these functions. We reported those
warnings in #11040, and silence them to avoid facing users of
TSan with data race reports that are not related to their code.
This is done in two ways: either un-instrumenting those functions, or
adding their name in __tsan_default_suppressions in tsan.c.

Co-authored-by: Fabrice Buoro <fabrice@tarides.com>
The functions that we un-instrumentation are called often but should not
contain data races with user code.

Co-authored-by: Olivier Nicole <olivier@chnik.fr>
@OlivierNicole
Copy link
Collaborator Author

Closing this PR now that the actual PR ocaml/ocaml#12114 has been opened.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants