From 544f22d10f58585d4c6564398602a651c010faea Mon Sep 17 00:00:00 2001
From: Louis Ponet
Date: Sat, 8 Jun 2024 23:52:14 +0200
Subject: [PATCH] rewrite of seqlocks

---
 content/posts/icc_1_seqlock/index.md | 181 ++++++++++++++++-----------
 1 file changed, 108 insertions(+), 73 deletions(-)

diff --git a/content/posts/icc_1_seqlock/index.md b/content/posts/icc_1_seqlock/index.md
index d3893de..7e9b332 100644
--- a/content/posts/icc_1_seqlock/index.md
+++ b/content/posts/icc_1_seqlock/index.md
@@ -1,50 +1,58 @@
+++
-title = "Inter Core Communication Pt 1: SeqLock"
+title = "Inter Core Communication Pt 1: Seqlock"
date = 2024-05-25
-description = "A thorough investigation of the main synchronization primitive used by Mantra: the SeqLock"
+description = "A thorough investigation of the main synchronization primitive used by Mantra: the Seqlock"
[taxonomies]
tags = ["mantra", "icc", "seqlock"]
[extra]
comment = true
+++

-As the first technical topic in this blog, I will discuss the main method of inter core synchronization used in [**Mantra**](@/posts/hello_world/index.md): a `SeqLock`.
-It forms the fundamental building block for the "real" datastructures: `Queues` and `SeqLockVectors`, which will be the topic of the next blog post.
+As the first technical topic in this blog, I will discuss the main method used to synchronize inter core communication in [**Mantra**](@/posts/hello_world/index.md): a `Seqlock`.
+It forms the fundamental building block for the "real" communication datastructures: `Queues` and `SeqlockVectors`, which will be the topic of the next blog post.

-While designing the inter core communication (icc) layer, have taken a great deal of inspiration from
-- [Trading at light speed](https://www.youtube.com/watch?v=8uAW5FQtcvE) by David Gross
-- An amazing set of references: [Awesome Lockfree](https://github.com/rigtorp/awesome-lockfree) by Erik Rigtorp
+After first considering the requirements that made me choose the `Seqlock` for **Mantra**, I will spare those in a hurry and get straight to the final implementation.
+
+For those still interested after that, I will continue by discussing how to verify the correctness of a `Seqlock` implementation, and how memory barriers make it reliable.
+This will involve designing tests, observing the potential pitfalls of function inlining, looking at some assembly code (funky), and strong-arming the compiler to do our bidding.
+
+Finally, we go through a quick 101 on low-latency timing that we use to gauge and optimize the performance of the implementation.
+
+Before continuing, I would like to give major credit to everyone involved with creating the following inspirational material:
+- [Trading at light speed](https://www.youtube.com/watch?v=8uAW5FQtcvE)
+- An amazing set of references: [Awesome Lockfree](https://github.com/rigtorp/awesome-lockfree)
+- [C++ atomics, from basic to advanced. What do they really do?](https://www.youtube.com/watch?v=ZQFzMfHIxng)

# Design Goals and Considerations
- Achieve close to the ideal ~30-40ns core-to-core latency (see e.g.
[anandtech 13900k and 13600k review](https://www.anandtech.com/show/17601/intel-core-i9-13900k-and-i5-13600k-review/5) and the [fantastic core-to-core-latency tool](https://github.com/nviennot/core-to-core-latency))
- data `Producers` do not care about and are not impacted by data `Consumers`
-- `Consumers` should not impact eachother, or the system as a whole
+- `Consumers` should not impact each other

-# SeqLock
-The embodiment of the above goals in terms of synchronization techniques is the `SeqLock` (see [Wikipedia](https://en.wikipedia.org/wiki/Seqlock), [seqlock in the linux kernel](https://docs.kernel.org/locking/seqlock.html), and [Erik Rigtorp's C++11 implementation](https://github.com/rigtorp/Seqlock)).
+# Seqlock
+The embodiment of the above goals in terms of synchronization techniques is the `Seqlock` (see [Wikipedia](https://en.wikipedia.org/wiki/Seqlock), [seqlock in the linux kernel](https://docs.kernel.org/locking/seqlock.html), and [Erik Rigtorp's C++11 implementation](https://github.com/rigtorp/Seqlock)).

-The main considerations are:
+The key points are:
- A `Producer` (or writer) is never blocked by `Consumers` (readers)
-The `Producer` atomically increments a counter (the `Seq` in `SeqLock`) once before and once after writing the data
+The `Producer` atomically increments a counter (the `Seq` in `Seqlock`) once before and once after writing the data
- `counter & 1 == 0` (even) communicates to `Consumers` that they can read data
-- `counter_before_read == counter_after_read`: data was consistent while reading
+- `counter_before_read == counter_after_read`: data remained consistent while reading
-- Compare and swap can be used on the counter to allow multiple `Producers` to write to same `SeqLock`
+- Compare and swap could be used on the counter to allow multiple `Producers` to write to the same `Seqlock`
- Compilers and cpus in general can't be trusted, making it crucial to verify that the execution sequence indeed follows the steps we instructed. Memory barriers and fences are required to guarantee this in general

# TL;DR
-Out of solidarity with your scroll wheel, the final implementation is
+Out of solidarity with your scroll wheel and without further ado:
```rust
#[derive(Default)]
#[repr(align(64))]
-pub struct SeqLock<T> {
+pub struct Seqlock<T> {
    version: AtomicUsize,
    data: UnsafeCell<T>,
}

-unsafe impl<T> Send for SeqLock<T> {}
-unsafe impl<T> Sync for SeqLock<T> {}
+unsafe impl<T> Send for Seqlock<T> {}
+unsafe impl<T> Sync for Seqlock<T> {}

-impl<T: Copy> SeqLock<T> {
+impl<T: Copy> Seqlock<T> {
    pub fn new(data: T) -> Self {
        Self {version: AtomicUsize::new(0), data: UnsafeCell::new(data)}
    }
@@ -75,18 +83,31 @@ impl<T: Copy> SeqLock<T> {

That's it, till next time folks!

# Are barriers necessary?
-Most literature on `SeqLocks` focuses (rightly so) on guaranteeing correctness. The two main points are that
-- the `data` has no dependency on the `version` of the `SeqLock` which allows the compilerto merge or reorder the two increments to `self.version` in the `write` function, and the checks on `v1` and `v2` on the `read` side
-- depending on the model, the cpu can similarly reorder when the changes to `version` become visible to the other cores, and when `data` is actually copied in and out of the lock
+Most literature on `Seqlocks` focuses (rightly so) on guaranteeing correctness.

-For x86 cpus, the latter is less of a problem since they are "strongly memory ordered" and `version` is atomic. However, adding the barriers in this case leads to no-ops so they don't hurt either.
See the [Release-Acquire ordering section in the c++ reference](https://en.cppreference.com/w/cpp/atomic/memory_order#Release-Acquire_ordering) for further information.
+Because the stored `data` does not depend on the `version` of the `Seqlock`, the compiler is allowed to merge or reorder the two increments to the `version` in the `write` function.
+The same goes for the checks on `v1` and `v2` on the `read` side of things.
+It potentially gets worse, though:
+depending on the architecture of the cpu, read and write operations to `version` and `data` could be reordered **on the hardware level**.
+Given that the `Seqlock`'s correctness depends entirely on the sequence of `version` increments and checks around `data` writes and reads, these issues are big no-nos.

-## Torn data testing
+As we will investigate further below, memory barriers are the main solution to these issues.
+They keep the compiler in line by guaranteeing [certain things](https://en.cppreference.com/w/cpp/atomic/memory_order), forcing it to adhere to the sequence of instructions that we specified in the code.
+The same applies to the cpu itself.
+
+For x86 cpus, the memory barriers luckily do not require any *additional* cpu instructions, just that no instructions are reordered or omitted.
+These cpus are [strongly memory ordered](https://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf), meaning that they guarantee the following: writes to some memory (e.g. `version`) cannot be reordered with writes to other memory (e.g. `data`), and similarly for reads.
+Other cpu architectures might require additional cpu instructions to enforce these guarantees.
+As long as we include the barriers, the `rust` compiler can figure out for us whether these are needed.

-Before demonstrating how one would go about verifiying if and why barriers are needed, let's design some tests to validate a `SeqLock` implementation.
-The main concern is data consistency, i.e. that a `Consumer` does not read data that is being written to.
-We test this by making a `Producer` fill and write an array with an increasing counter into the `SeqLock`, while a `Consumer` reads and verifies that all entries in the array are identical (see the highlighted line below)
+See the [Release-Acquire ordering section in the c++ reference](https://en.cppreference.com/w/cpp/atomic/memory_order#Release-Acquire_ordering) for further information on the specific barrier construction that is used in the `Seqlock`.

## Torn data testing
+
+The first concern that we can relatively easily verify is data consistency.
+In the test below we verify that when a `Consumer` supposedly successfully reads `data`, the `Producer` was indeed not simultaneously writing to it.
+We do this by making a `Producer` fill and write an array with an increasing counter, while a `Consumer` reads and verifies that all entries in the array are identical (see the highlighted line below).
+If reading and writing were to happen at the same time, the `Consumer` would at some point see partially new and partially old data with differing counter values.
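+To picture what such a torn read looks like, suppose the `Producer` is halfway through bumping every element of a 4-element array from counter value 41 to 42 (purely illustrative values):
+```rust
+let torn = [42usize, 42, 41, 41]; // first half already new, rest still old
+let consistent = torn.iter().all(|&v| v == torn[0]);
+assert!(!consistent); // the highlighted check in the test below flags exactly this
+```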
```rust,linenos,hl_lines=14 15 17 26 27
#[cfg(test)]
mod tests {
@@ -95,7 +116,7 @@ mod tests {
    fn read_test() {
-        let lock = SeqLock::new([0usize; N]);
+        let lock = Seqlock::new([0usize; N]);
        let done = AtomicBool::new(false);
        std::thread::scope(|s| {
            s.spawn(|| {
@@ -146,7 +167,7 @@ mod tests {
}
```
-If I run these tests on an intel i9 14900k, using the following more naive implementation
+If I run these tests on an Intel i9 14900k, using the following simplified `read` and `write` implementations without memory barriers
```rust
pub fn read(&self, result: &mut T) {
    loop {
@@ -168,24 +189,36 @@ pub fn write(&self, val: &T) {
```
I find that they fail for array sizes of 64 (512 bytes) and up. This signals that either the compiler or the cpu did some reordering of operations.

-The issue in this specific case is that the compiler has inlined the `read` function and reordered `let first = msg[0]` on line (15) to be executed before
-the `read` on line (14), causing obvious problems.
-Just to highlight how fickle inlining is, adding `assert_ne!(msg[0], 0)` after line (19) makes all tests pass.
+## Inlining and Compiler Cleverness
+Funnily enough, barriers are not necessarily needed to fix these tests. Yeah, I was not happy either that my illustrating example does not, in fact, illustrate what I was trying to illustrate.

-While changing one of the `self.version.load(Ordering::Relaxed)` operations in `read` to `self.version.load(Ordering::Acquire)` solves the issue in this case, adding `#[inline(never)]` is much more safe and does not lead to any difference in performance.
-We do the same for the `write` function to avoid similar shenanigans there.
+Nonetheless, I chose to mention it because it highlights just how much the compiler will mangle your code if you let it.
+I will not paste the resulting assembly here as it is rather lengthy (see [assembly lines (23, 50-303) on godbolt](https://godbolt.org/z/7MYaW7Pba)).
+The crux is that the compiler chose to inline the `read` function and then decided to move the `let first = msg[0]` statement of line (15) entirely outside of the `while` loop...

-Even though the tests now pass without using any barriers, we proceed with a double check by analyzing the assembly that was produced by the compiler.
+Strange? Maybe not.
+The compiler's reasoning here is actually similar to the one that requires us to use memory barriers.
+The essential point is, again, that the `data` field inside the `Seqlock` is not an atomic variable like `version`.
+This allows the compiler to assume that only the current thread touches it. Meanwhile, the `Consumer` thread never writes to `data`, so it never changes, right?
+Ha, might as well just set `first = data[0]` once and for all before starting with the actual `read` & verify loop.
+Of course, the reality is that the `Producer` is actually changing `data`. Thus, as soon as the `Consumer` thread `reads` it into `msg`, `first != msg[i]`, causing our test to fail.

-## Deeper dive using [`cargo asm`](https://crates.io/crates/cargo-show-asm/0.2.34)
-I used `cargo asm` to perform the analysis presented below. It can be easily installed using `cargo` and has a very user friendly terminal based interface to hone in on the code of interest.
-[godbolt](https://godbolt.org) is another great choice, but when working on a larger codebase it become quite cumbersome to copy-paste all the supporting code for a given function.
-I recommend adding `#[inline(never)]` to force the compiler not to inline the function of interest.
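+To make the compiler's transformation concrete, the consumer loop effectively behaves as if it had been written as follows (a hand-written sketch of the observed codegen, not actual compiler output; `Seqlock`, `read` and the test setup are as above):
+```rust
+use std::sync::atomic::{AtomicBool, Ordering};
+
+fn consumer_as_compiled<const N: usize>(lock: &Seqlock<[usize; N]>, done: &AtomicBool) {
+    let mut msg = [0usize; N];
+    // Hoisted out of the loop: `data` is not atomic, so the compiler
+    // assumes no other thread can change it behind our back.
+    let first = unsafe { (*lock.data.get())[0] };
+    while !done.load(Ordering::Relaxed) {
+        lock.read(&mut msg); // fresh (possibly updated) data every iteration...
+        if msg[0] != 0 {
+            // ...compared against the stale, pre-loop `first`
+            assert!(msg.iter().all(|&v| v == first));
+        }
+    }
+}
+```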
+Interestingly, adding `assert_ne!(msg[0], 0)` after line (19) seems to make the compiler less sure about this code transformation, because suddenly all tests pass.
+Looking at the resulting assembly confirms this observation, as line (15) is now correctly executed on each loop iteration, after first reading the `Seqlock`.

-### `SeqLock::<[usize; 1024]>::read`
-When using a large array with 1024 elements, the `rust` code compiles down to
+The first step towards provable correctness of the `Seqlock` is thus to add `#[inline(never)]` to the `read` and `write` functions.
+
+## Deeper dive using [`cargo asm`](https://crates.io/crates/cargo-show-asm/0.2.34)
+I kind of jumped the gun above with respect to reading compiler-produced assembly. The tool I use by far the most for this is `cargo asm`.
+It can be easily installed using `cargo` and has a very user-friendly terminal-based interface.
+[godbolt](https://godbolt.org) is another great choice, but it can become tedious to copy-paste all the supporting code when working on a larger codebase.
+In either case, I recommend adding `#[inline(never)]` to the function of interest so its assembly can be more easily filtered out.
+
+Let's see what the compiler generates for the `read` function for a couple of different array sizes.
+### `Seqlock::<[usize; 1024]>::read`
+When using a large array with 1024 elements, the assembly reads
```asm, linenos, hl_lines=19 22 26 27 28 30
-code::SeqLock::read:
+code::Seqlock::read:
.cfi_startproc
push r15
.cfi_def_cfa_offset 16
@@ -228,22 +261,22 @@ code::SeqLock::read:
.cfi_def_cfa_offset 8
ret
```
-First thing we can observe in lines (19, 22, 27) is that `rust` chose to not adhere to our ordering of fields in `SeqLock`, moving `version` behind `data`.
+The first thing we observe in lines (19, 22, 27) is that the compiler chose not to adhere to the ordering of fields in our definition of the `Seqlock`, moving `version` behind `data`.
If needed, the order of fields can be preserved by adding `#[repr(C)]`.

-The lines that constitute the main operations of the `SeqLock` are highlighted, corresponding almost one-to-one with the `read` function:
+The operational part of the `read` function is instead translated almost one-to-one into assembly:
1. assign function pointer to `memcpy` to `r15` for faster future calling
-2. move `version` at `SeqLock start (r14) + 8192 bytes` into `r12`
+2. move `version` at `Seqlock start (r14) + 8192 bytes` into `r12`
3. perform the `memcpy`
-4. move `version` at `SeqLock start (r14) + 8192 bytes` into `rax`
+4. move `version` at `Seqlock start (r14) + 8192 bytes` into `rax`
5. check `r12 & 1 == 0`
6. check `r12 == rax`
7. Profit...

-### `SeqLock::<[usize; 1]>::read`
+### `Seqlock::<[usize; 1]>::read`
For smaller array sizes we get
```asm, linenos
-code::SeqLock::read:
+code::Seqlock::read:
.cfi_startproc
mov rax, qword ptr [rdi + 8]
.p2align 4, 0x90
@@ -258,14 +291,15 @@ code::SeqLock::read:
ret
```
Well at least it looks clean... I'm pretty sure I don't have to underline the issue with these steps:
-1. Do the copy into `rax`
-2. move the **version** into `rcx` and `rdx`
-3. test like before, I mean why even do this
+1. Do the copy of `data` into `rax`
+2. move `version` into `rcx`... and `rdx`?
+3. test `version & 1 != 1`
+4. test `rcx == rdx`... hol' on, what?
-4. copy from `rax` into the input
+5. copy from `rax` into the input
-5. No Stonks...
+6. wait a minute...

This is a good demonstration of why tests should not be blindly trusted and why double-checking the produced assembly is good practice.
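+For completeness, this is the shape of the simplified implementation with just the inlining fix applied (a reconstruction of the elided bodies above, assuming `T: Copy`; the barriers are still missing at this point):
+```rust
+impl<T: Copy> Seqlock<T> {
+    #[inline(never)] // keep the version checks opaque to the caller's optimizer
+    pub fn read(&self, result: &mut T) {
+        loop {
+            let v1 = self.version.load(Ordering::Relaxed);
+            *result = unsafe { *self.data.get() };
+            let v2 = self.version.load(Ordering::Relaxed);
+            if v1 == v2 && v1 & 1 == 0 {
+                return;
+            }
+        }
+    }
+
+    #[inline(never)]
+    pub fn write(&self, val: &T) {
+        let v = self.version.load(Ordering::Relaxed).wrapping_add(1);
+        self.version.store(v, Ordering::Relaxed); // odd: write in progress
+        unsafe { *self.data.get() = *val };
+        self.version.store(v.wrapping_add(1), Ordering::Relaxed); // even: done
+    }
+}
+```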
-In fact, I never got the tests to fail after adding the `#[inline(never)]` we discussed earlier, even though the assembly clearly shows that nothing stops a `read` while a `write` is happening.
+In fact, I never got the tests to fail after adding the `#[inline(never)]` discussed earlier, even though the assembly clearly shows that nothing stops a `read` while a `write` is happening.
This happens because the `memcpy` is done **inline/in cache** for small enough data, using moves between cache and registers (`rax` in this case).
If a single instruction is used (`mov` here), it is never possible that the data is partially overwritten while reading, and it remains highly unlikely even when multiple instructions are required.
@@ -287,7 +321,7 @@ pub fn read(&self, result: &mut T) {
    }
}
```
```asm, linenos
-code::SeqLock::read:
+code::Seqlock::read:
.cfi_startproc
.p2align 4, 0x90
.LBB6_1:
@@ -306,8 +340,8 @@ code::SeqLock::read:
It is interesting to see that the compiler chooses to reuse `rcx` for both the data copy in lines (5) and (6), and the second `version` load in line (8).
With the current `rust` compiler (1.78.0), I found that only adding `Ordering::Acquire` in lines (4) or (7) of the `rust` code already does the trick.
-However, they only guarantee the ordering of loads of the `version` atomic when combined with an `Ordering::Release` store in the `write` function, not when the actual data is copied.
-That is where the `compiler_fence` comes in guaranteeing also this ordering. I have not noticed a change in performance when adding these additional barriers.
+However, they only guarantee the ordering of the loads of the atomic `version` when combined with an `Ordering::Release` store in the `write` function, not how the actual data copy is ordered in relation to them.
+That is where the `compiler_fence` comes in, also guaranteeing this ordering. As discussed before, adding these extra barriers in the code did not change the performance on x86.

The corresponding `write` function becomes:
```rust
pub fn write(&self, val: &T) {
@@ -321,10 +355,11 @@ pub fn write(&self, val: &T) {
    self.version.store(v.wrapping_add(1), Ordering::Release);
}
```
-Our `SeqLock` implementation should now be correct and is in fact pretty much identical to others around.
+Our `Seqlock` implementation should now be correct and is in fact pretty much identical to others that can be found in the wild.

-We now turn to an aspect that's covered much less frequently: timing and potentially optimizing the implementation.
-There is not much room to play with here, but while going through potential optimizations we will touch on some key concepts that surround the business of timing low latency constructs. As a bonus we will get a first glimpse into the inner workings of the cpu.
+Having understood a thing or two about memory barriers while solidifying our `Seqlock`, we now turn to an aspect that's covered much less frequently: timing and potentially optimizing the implementation.
+Granted, there is not much room to play with here given the size of the functions.
+Nevertheless, some of the key concepts that I will discuss in the process will be used in many future posts.

P.S.: if memory models and barriers are really your schtick, live a little and marvel your way through [The Linux Kernel Docs on Memory Barriers](https://docs.kernel.org/core-api/wrappers/memory-barriers.html).
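+Since everything below leans on `rdtscp` timestamps, here is a minimal sketch of how the TSC can be read directly (an illustration only; the `Timer` and `Instant` used in the benchmark code further down are separate utilities, and the 3.0 GHz TSC frequency is an assumed value that needs calibrating per machine):
+```rust
+use core::arch::x86_64::__rdtscp;
+
+// `rdtscp` reads the time stamp counter and waits for all previous
+// instructions to execute first, which makes it handy for latency timing.
+fn rdtscp() -> u64 {
+    let mut aux = 0u32;
+    unsafe { __rdtscp(&mut aux) }
+}
+
+fn main() {
+    const TSC_GHZ: f64 = 3.0; // assumed; calibrate on your own machine
+    let start = rdtscp();
+    std::hint::black_box(42u64); // stand-in for the code under test
+    let cycles = rdtscp().wrapping_sub(start);
+    println!("{cycles} cycles ≈ {:.1} ns", cycles as f64 / TSC_GHZ);
+}
+```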
@@ -419,8 +454,8 @@ Combining this with branch prediction renders the final result quite dependent o
Anyway, the lower end of these measurements serves as a sanity check and target for our `Seqlock` latency.

-## SeqLock performance
-In all usecases of `Seqlocks` in [**Mantra**](@/posts/hello_world/index.md) there are one or many `Producers` which 99% of the time don't produce anything, while `Consumers` are busy spinning on the last `SeqLock` until it gets written to.
+## Seqlock performance
+In all use cases of `Seqlocks` in [**Mantra**](@/posts/hello_world/index.md) there are one or many `Producers` which 99% of the time don't produce anything, while `Consumers` are busy spinning on the last `Seqlock` until it gets written to.

We reflect this in the timing code's setup:
- a `Producer` writes an `rdtscp` timestamp into the `Seqlock` every 2 microseconds
@@ -434,7 +469,7 @@ struct TimingMessage {
    data: [u8; 1],
}

-fn contender(lock: &SeqLock<TimingMessage>)
+fn contender(lock: &Seqlock<TimingMessage>)
{
    let mut m = TimingMessage { rdtscp: Instant::now(), data: [0]};
    while m.data[0] == 0 {
@@ -442,7 +477,7 @@ fn contender(lock: &SeqLock<TimingMessage>)
    }
}

-fn timed_consumer(lock: &SeqLock<TimingMessage>)
+fn timed_consumer(lock: &Seqlock<TimingMessage>)
{
    let mut timer = Timer::new("read");
    core_affinity::set_for_current(CoreId { id: 1 });
@@ -459,7 +494,7 @@ fn timed_consumer(lock: &SeqLock<TimingMessage>)
    }
}

-fn producer(lock: &SeqLock<TimingMessage>)
+fn producer(lock: &Seqlock<TimingMessage>)
{
    let mut timer = Timer::new("write");
    core_affinity::set_for_current(CoreId { id: 2 });
@@ -478,7 +513,7 @@ fn producer(lock: &SeqLock<TimingMessage>)
}

fn consumer_latency(n_contenders: usize) {
-    let lock = SeqLock::default();
+    let lock = Seqlock::default();
    std::thread::scope(|s| {
        for i in 1..(n_contenders + 1) {
            let lck = &lock;
@@ -494,7 +529,7 @@ fn consumer_latency(n_contenders: usize) {
```

### Starting Point
-We use the `SeqLock` code we implemented above as the initial point, leading to the following latency timings for a single consumer (left) and 5 consumers (right):
+We use the `Seqlock` code we implemented above as the initial point, leading to the following latency timings for a single consumer (left) and 5 consumers (right):
![](consumer_latency_initial.png#noborder "initial_consumer_latency")
*Fig 3. Initial Consumer Latency*
@@ -528,16 +563,16 @@ Measuring again, we find that this leads to a serious improvement, almost halvin

Unfortunately, there is nothing that can be optimized on the `read` side of things.

-One final optimization we'll proactively do is to add `#[repr(align(64))]` the `SeqLocks`:
+One final optimization we'll proactively do is to add `#[repr(align(64))]` to the `Seqlocks`:
```rust
#[repr(align(64))]
-pub struct SeqLock<T> {
+pub struct Seqlock<T> {
    version: AtomicUsize,
    data: UnsafeCell<T>,
}
```
-This fixes potential [`false sharing`](https://en.wikipedia.org/wiki/False_sharing) issues by never having two or more `SeqLocks` on a single cache line.
-While it is not very important when using a single `SeqLock`, it becomes crucial when using them inside `Queues` and `SeqLockVectors`.
+This fixes potential [`false sharing`](https://en.wikipedia.org/wiki/False_sharing) issues by never having two or more `Seqlocks` on a single cache line.
+While it is not very important when using a single `Seqlock`, it becomes crucial when using them inside `Queues` and `SeqlockVectors`.

Looking back at our original design goals:
- close to minimum inter core latency
@@ -546,12 +581,12 @@ Looking back at our original design goals:

our implementation seems to be as good as it can be!
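+As a last sanity check on the false-sharing claim, the layout guarantees of `#[repr(align(64))]` can be asserted in a test (a sketch; since Rust rounds a type's size up to a multiple of its alignment, neighbouring `Seqlocks` can never share a cache line):
+```rust
+#[test]
+fn cache_line_layout() {
+    use std::mem::{align_of, size_of};
+    // `#[repr(align(64))]` pins the alignment to a full cache line...
+    assert_eq!(align_of::<Seqlock<[u8; 3]>>(), 64);
+    // ...and the size is padded to a multiple of it, so two Seqlocks
+    // in an array or `SeqlockVector` never straddle the same line.
+    assert_eq!(size_of::<Seqlock<[u8; 3]>>() % 64, 0);
+}
+```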
-We thus conclude our deep dive into `SeqLocks` here, also concluding this first technical blog post.
-We've laid the groundwork and have introduced some important concepts for the upcoming post on `Queues` and `SeqLockVectors` as Pt 2 on inter core communication.
+We thus conclude our deep dive into `Seqlocks`, and with it this first technical blog post.
+We've laid the groundwork and introduced some important concepts for the upcoming post on `Queues` and `SeqlockVectors` as Pt 2 on inter core communication.

See you then!

# Possible future investigations/improvements
-- Use the [`cldemote`](https://www.felixcloutier.com/x86/cldemote) to force the `Producer` to immediately flush the `SeqLock` data to the consumers
+- Use the [`cldemote`](https://www.felixcloutier.com/x86/cldemote) instruction to force the `Producer` to immediately flush the `Seqlock` data to the consumers
- [UMONITOR/UMWAIT spin-wait loop](https://stackoverflow.com/questions/74956482/working-example-of-umonitor-umwait-based-assembly-asm-spin-wait-loops-as-a-rep#)