-
Notifications
You must be signed in to change notification settings - Fork 2
Status of ThreadSanitizer for OCaml
Google is behind LLVM's ThreadSanitizer, an effective approach to locate data races in concurrent code bases. The goal of this work is
- to extend the approach to full OCaml 5.0 and
- to keep the run-time and instrumentation overhead as low as possible
There are two components to ThreadSanitizer (TSan):
- A run-time library to track accesses to shared data and report races
- Compiler instrumentation that emits calls to the run-time library
Internally the run-time library associates with each word of application memory at least 2 “shadow words”. Each shadow word contains information about a recent memory access to that word, including a “scalar clock”. Those clocks serve to establish a happens-before (HB) relation, i.e. an event orderings that we are certain of.
This information is maintained as a “shadow state” in a separate memory region, and updated at every (instrumented) memory access. A data race is reported every time two memory accesses are made to overlapping memory regions, and:
- one of them is a write, and
- there is no established happens-before relation between them. More information about TSan's algorithm on their wiki.
When a potential data race is detected, TSan prints a data race reports with the affected memory location, and two backtraces (one for each memory access). While the backtrace for the latest access is easy to obtain by the usual unwinding (like the debugger does), the backtrace of the previous access is not readily available. What TSan does is keeping a history of function entry and exit events for each thread, which allows to reconstruct past backtraces. This history is implemented as a large ring buffer, so stacktraces for events that are too far in the past are lost.
The run-time library is reusable across different programming languages (C,C++,Go, ...).
For OCaml there have been two supplementary efforts - ideally these should be combined:
- Anmol's initial prototype extending the OCaml compiler to emit TSan instrumentation
- Sadiq has been running TSan on the OCaml runtime, investigating reported data races, and annotating the C run-time to reduce false positives
Briefly, Anmol's prototype
- add's a new
cmm
-based compiler pass adding instrumentation - the new pass instruments
- functions with
__tsan_func_entry
and__tsan_func_exit
- loads and stores with
__tsan_readN
and__tsan_writeN
- functions with
- as an optimization the pass currently doesn’t instrument
- immutable values (since they won't have races) or
- initializing writes (since they are already protected by a barrier making sure they are visible to all threads)
Anmol's work has been taken up by Olivier (@OlivierNicole) and Fabrice (@fabbing). The added work consisted, among other tasks, in:
- Report backtrace correctly. This was tricky to implement as it relies on the compiler inserting calls to
__tsan_func_entry()
and__tsan_func_exit()
at the right places, which is hard to do in the presence of tail calls, exceptions and effects. We have implemented ad hoc calls to__tsan_func_entry()
and__tsan_func_exit()
upon exception raising/catching and effect performing/handling/resuming, so the backtraces should be correct. This stack trace reconstruction is performed for both OCaml stack frames and C frames. For OCaml frames - Some data race reports concern the runtime and are not the responsability of the programmer. It is very hard to eliminate all the reports in the runtime (see ocaml#11040). For this reason, in the future we may want to silence these reports by default.
As of now, memory accesses are properly instrumented and running with ThreadSanitizer already allows to detect data races. The observed slowdown is about 7-13x.
- Get TSan running smoothly for pure OCaml code ✔️
- Instrumentation of atomic variables ✔️
- Address and/or decide on the below challenges
- Measure overhead of approach. We expect the overhead to be significantly lower than, e.g., Go or C/C++ where mutability is the default. ✔️ (Overhead has been evaluated to roughly 7-13x.)
- Test approach on larger code bases ✔️
- Extend to support TSan-instrumented C-code called through the FFI ✔️ (Works pretty much out of the box)
- Extend with a suppression mechanism that disables tracking or silences warnings for selected parts of the code (similar to the C TSan)
- Extend to bytecode interpreter (not a priority ATM)
Challenges
- Anmol and Francois: In the OCaml run-time system values can change address which means that the run-time library looses track of any relations. If there is a race on such a value with two accesses happening before/after such an address change, this could mean that the race is not detected. This can be caused by a GC promotion from the minor to the major heap.
- Sadiq: Since a global lock is taken during garbage collections, there should be no data race in this situation.
- Jan: What happens when a heap memory location is reclaimed and reused for a new value? Does this potentially cause false alarms?
- Olivier: Not as long as TSan can establish a happens-before relation between accesses to the old value and accesses to the new value. For this we need to make sure that TSan considers collections as a synchronization event.
- Instrumentation is based on
__tsan_func_entry
and__tsan_func_exit
. Do tail-calls require special TSan treatment?- Yes, definitely. In fact, that was where the majority of our effort was spent.
-
Lazy
values gets special treatment in the compiler: they have internal state and are short-circuited once they have been forced. Do they need special TSan treatment?- The fact that the C runtime is instrumented should suffice to handle those.
- TSan investigates a particular execution and therefore will not detect races in code paths that are not visited.
- Repositories and PRs
- Anmol's repo: https://github.com/anmolsahoo25/ocaml/tree/multicore-pr+tsan
- Olivier's repo (state of current work): https://github.com/OlivierNicole/ocaml/tree/tsan
- ocaml/ocaml#11040 issue https://github.com/ocaml/ocaml/issues/11040
- ocaml/ocaml#11110 PR https://github.com/ocaml/ocaml/pull/11110
- Anmol's blog post: https://anmolsahoo25.github.io/blog/thread-sanitizer-ocaml/
- Clang/LLVM TSan documentation: https://clang.llvm.org/docs/ThreadSanitizer.html
- Google Sanitizer wiki:
- TSan C/C++ Manual: https://github.com/google/sanitizers/wiki/ThreadSanitizerCppManual
- TSan Algorithm: https://github.com/google/sanitizers/wiki/ThreadSanitizerAlgorithm
- Slides from GCC Cauldron 2012: https://gcc.gnu.org/wiki/cauldron2012?action=AttachFile&do=get&target=kcc.pdf
- Papers
- Serebryany and Iskhodzhanov: ThreadSanitizer – data race detection in practice, WBIA'09 https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35604.pdf. Note that the algorithm presented in this paper is not the one used in the new version of TSan.
- Chabby and Ramanathan: A Study of Real-World Data Races in Golang, PLDI'22 https://arxiv.org/pdf/2204.00764.pdf
- Ahmad et al.: Kard: Lightweight Data Race Detection with Per-Thread Memory Protection, ASPLOS'21 https://web.ics.purdue.edu/~ahmad37/papers/ahmad-kard.pdf
- ThreadSanitizer Google group: https://groups.google.com/g/thread-sanitizer