Status of ThreadSanitizer for OCaml

Google is behind LLVM's ThreadSanitizer, an effective approach to locate data races in concurrent code bases. The goal of this work is

to extend the approach to full OCaml 5.0 and
to keep the run-time and instrumentation overhead as low as possible

Background

There are two components to ThreadSanitizer (TSan):

A run-time library to track accesses to shared data and report races
Compiler instrumentation that emits calls to the run-time library

Internally the run-time library associates with each word of application memory at least 2 “shadow words”. Each shadow word contains information about a recent memory access to that word, including a “scalar clock”. Those clocks serve to establish a happens-before (HB) relation, i.e. an event orderings that we are certain of.

This information is maintained as a “shadow state” in a separate memory region, and updated at every (instrumented) memory access. A data race is reported every time two memory accesses are made to overlapping memory regions, and:

one of them is a write, and
there is no established happens-before relation between them. More information about TSan's algorithm on their wiki.

When a potential data race is detected, TSan prints a data race reports with the affected memory location, and two backtraces (one for each memory access). While the backtrace for the latest access is easy to obtain by the usual unwinding (like the debugger does), the backtrace of the previous access is not readily available. What TSan does is keeping a history of function entry and exit events for each thread, which allows to reconstruct past backtraces. This history is implemented as a large ring buffer, so stacktraces for events that are too far in the past are lost.

The run-time library is reusable across different programming languages (C,C++,Go, ...).

Status

For OCaml there have been two supplementary efforts - ideally these should be combined:

Anmol's initial prototype extending the OCaml compiler to emit TSan instrumentation
Sadiq has been running TSan on the OCaml runtime, investigating reported data races, and annotating the C run-time to reduce false positives

Briefly, Anmol's prototype

add's a new cmm-based compiler pass adding instrumentation
the new pass instruments
- functions with __tsan_func_entry and __tsan_func_exit
- loads and stores with __tsan_readN and __tsan_writeN
as an optimization the pass currently doesn’t instrument
- immutable values (since they won't have races) or
- initializing writes (since they are already protected by a barrier making sure they are visible to all threads)

Anmol's work has been taken up by Olivier (@OlivierNicole) and Fabrice (@fabbing). The added work consisted, among other tasks, in:

Report backtrace correctly. This was tricky to implement as it relies on the compiler inserting calls to __tsan_func_entry() and __tsan_func_exit() at the right places, which is hard to do in the presence of tail calls, exceptions and effects. We have implemented ad hoc calls to __tsan_func_entry() and __tsan_func_exit() upon exception raising/catching and effect performing/handling/resuming, so the backtraces should be correct. This stack trace reconstruction is performed for both OCaml stack frames and C frames. For OCaml frames
Some data race reports concern the runtime and are not the responsability of the programmer. It is very hard to eliminate all the reports in the runtime (see ocaml#11040). For this reason, in the future we may want to silence these reports by default.

As of now, memory accesses are properly instrumented and running with ThreadSanitizer already allows to detect data races. The observed slowdown is about 7-13x.

Status of the work items

Get TSan running smoothly for pure OCaml code ✔️
Instrumentation of atomic variables ✔️
Address and/or decide on the below challenges
Measure overhead of approach. We expect the overhead to be significantly lower than, e.g., Go or C/C++ where mutability is the default. ✔️ (Overhead has been evaluated to roughly 7-13x.)
Test approach on larger code bases ✔️
Extend to support TSan-instrumented C-code called through the FFI ✔️ (Works pretty much out of the box)
Extend with a suppression mechanism that disables tracking or silences warnings for selected parts of the code (similar to the C TSan)
Extend to bytecode interpreter (not a priority ATM)

Challenges

Anmol and Francois: In the OCaml run-time system values can change address which means that the run-time library looses track of any relations. If there is a race on such a value with two accesses happening before/after such an address change, this could mean that the race is not detected. This can be caused by a GC promotion from the minor to the major heap.
- Sadiq: Since a global lock is taken during garbage collections, there should be no data race in this situation.
Jan: What happens when a heap memory location is reclaimed and reused for a new value? Does this potentially cause false alarms?
- Olivier: Not as long as TSan can establish a happens-before relation between accesses to the old value and accesses to the new value. For this we need to make sure that TSan considers collections as a synchronization event.
Instrumentation is based on __tsan_func_entry and __tsan_func_exit. Do tail-calls require special TSan treatment?
- Yes, definitely. In fact, that was where the majority of our effort was spent.
Lazy values gets special treatment in the compiler: they have internal state and are short-circuited once they have been forced. Do they need special TSan treatment?
- The fact that the C runtime is instrumented should suffice to handle those.

Caveats

TSan investigates a particular execution and therefore will not detect races in code paths that are not visited.

Resources

Repositories and PRs
- Anmol's repo: https://github.com/anmolsahoo25/ocaml/tree/multicore-pr+tsan
- Olivier's repo (state of current work): https://github.com/OlivierNicole/ocaml/tree/tsan
- ocaml/ocaml#11040 issue https://github.com/ocaml/ocaml/issues/11040
- ocaml/ocaml#11110 PR https://github.com/ocaml/ocaml/pull/11110
Anmol's blog post: https://anmolsahoo25.github.io/blog/thread-sanitizer-ocaml/
Clang/LLVM TSan documentation: https://clang.llvm.org/docs/ThreadSanitizer.html
Google Sanitizer wiki:
- TSan C/C++ Manual: https://github.com/google/sanitizers/wiki/ThreadSanitizerCppManual
- TSan Algorithm: https://github.com/google/sanitizers/wiki/ThreadSanitizerAlgorithm
Slides from GCC Cauldron 2012: https://gcc.gnu.org/wiki/cauldron2012?action=AttachFile&do=get&target=kcc.pdf
Papers
- Serebryany and Iskhodzhanov: ThreadSanitizer – data race detection in practice, WBIA'09 https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35604.pdf. Note that the algorithm presented in this paper is not the one used in the new version of TSan.
- Chabby and Ramanathan: A Study of Real-World Data Races in Golang, PLDI'22 https://arxiv.org/pdf/2204.00764.pdf
- Ahmad et al.: Kard: Lightweight Data Race Detection with Per-Thread Memory Protection, ASPLOS'21 https://web.ics.purdue.edu/~ahmad37/papers/ahmad-kard.pdf
ThreadSanitizer Google group: https://groups.google.com/g/thread-sanitizer

Provide feedback

Saved searches

Use saved searches to filter your results more quickly