rewrite of hello_world
louisponet committed Jun 6, 2024
1 parent 10be3e8 commit 68fb8e5
97 changes: 49 additions & 48 deletions content/posts/hello_world/index.md
comment = true
+++

I started work on **Mantra** in an effort to learn **rust** and explore the development process of a distributed, high(ish)-frequency/low-latency trading system in the language.
From the get-go, I've targeted an internal [tick-to-trade](https://beeksgroup.com/blog/tick-to-trade-is-it-the-new-must-have-metric-in-trading-performance/) latency in the low microsecond range.
While this has forced me to tread carefully while adding an ever-increasing number of capabilities and features, I have been able to keep **Mantra's** internal latency firmly below this target.

I've now reached a point where I feel like some of the solutions and designs I came up with are worth sharing, hence this blog.
While **Mantra** itself will remain closed source, for obvious reasons, it will serve as a _fil rouge_ tying the concepts of the deeper discussions together, and show how they can be applied to a real-world high-performance application.

Most of its core building blocks have been purposefully hand-crafted, focusing on pragmatism given the scope of the project.
As `rust` has been my language of choice for **Mantra**, many of the code snippets will be written in it.
The un-enlightened should not fret, however, since the focus will be on the concepts themselves, and most of them should be quite straightforwardly translatable into any capable programming language.

The intention of this initial blog post is to give an overview of the features and general design of **Mantra**, setting the stage for the more technical posts to come.

Enough with the intro, feature list time...
# Features and Capabilities
![](ui.png#noborder "ui")
***Mantra** ui during a backtest*
- **NO ASYNC**
- Low-latency inter-core communication using hand-crafted message **queues, seqlocks, and shared memory**
- Full internal observability (message tracking) and high-performance **in-situ telemetry** of latency and business logic
- Full **persistence of messages** for post execution analysis and replay
- [L2](https://centerpointsecurities.com/level-1-vs-level-2-market-data/) based orderbooks
- Concurrent handling of multiple trading algorithms
- Balance and order tracking across multiple Exchanges
- Continuous **ingestion and storage** of market data streams
- WebSocket-based connections to 5 crypto Exchanges (Binance, Bitstamp, Bitfinex, Coinbase and Kraken)
- "In production" **backtesting** by replaying historical market data streams while mimicking order execution with a **mock exchange**
- Real-time UI for observation of the system, market data and backtesting
- ~500k msgs/s throughput @ 0.5 - 10 microseconds internal latency
- A focus on code simplicity and **locality of behavior** (< 15K LOC for now)

# System Design Overview
The core design has changed very little since **Mantra's** inception.
Pragmatism has always been one of the guiding principles given the scope of creating a fully fledged and capable low-latency trading engine... while learning a new programming language.

This has led me to a very modular and disconnected design, allowing me to add features with minimal friction while also minimizing the impact of the inevitable refactorings as I learned more about `rust`.
The potentially lower-latency *single-function-hot-path* approach inevitably leads to more intertwined and hard-to-maintain code. I am also convinced that it is easier to build the latter on top of a solid distributed system than the other way around.

As the schematic below shows, **Mantra** is thus composed of a set of `Actors` that communicate with each other through lock-free `Queues`.
![](system_design.svg#noborder)
*Fig 1. High level design of **Mantra***

The main execution logic and data flow is quite straightforward:
1. incoming `L2Update` and `TradeUpdate` market data messages (green, top-left) get consumed by the `TradeModels` (in grey)
2. each `TradeModel` fills out a pre-defined set of ideal order positions
3. the `Overseer` actor continuously loops through these, and compares them to previously sent `OrderRequests` and live `Orders` on the target `Exchange`
4. if they don't match up and the necessary `balance` is available on the `Exchange`, the `Overseer` generates and publishes a new `OrderRequest`
5. the `AccountHandler` connecting with the target `Exchange` then sends these requests
6. `Order`, `Balance` and `OrderExecution` updates are fed back to the `Overseer` (a rough sketch of this reconciliation loop follows below)
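
To make the reconciliation step a little more concrete, here is a deliberately simplified sketch of what such a loop could look like. The types (`DesiredPosition`, `LiveOrder`, `OrderRequest`) and the balance check are illustrative stand-ins made up for this post, not **Mantra**'s actual data structures, and the real `Overseer` naturally does a lot more bookkeeping.

```rust
/// Illustrative stand-ins, not Mantra's real message types.
#[derive(Clone, Copy)]
struct DesiredPosition { instrument: u32, price: f64, volume: f64 }
#[derive(Clone, Copy)]
struct LiveOrder { instrument: u32, price: f64, volume: f64 }
struct OrderRequest { instrument: u32, price: f64, volume: f64 }

/// One pass of the reconciliation loop: compare each ideal position with the
/// corresponding live order (if any) and emit an `OrderRequest` when they
/// differ and enough balance is available on the target exchange.
fn reconcile(
    desired: &[DesiredPosition],
    live: &[Option<LiveOrder>],
    available_balance: &mut f64,
) -> Vec<OrderRequest> {
    let mut requests = Vec::new();
    for (ideal, current) in desired.iter().zip(live) {
        let matches = match current {
            Some(o) => o.price == ideal.price && o.volume == ideal.volume,
            None => false,
        };
        let cost = ideal.price * ideal.volume;
        if !matches && cost <= *available_balance {
            // Reserve the balance and request the new (or amended) order.
            *available_balance -= cost;
            requests.push(OrderRequest {
                instrument: ideal.instrument,
                price: ideal.price,
                volume: ideal.volume,
            });
        }
    }
    requests
}
```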

As will be discussed in future posts, centering the system around multi-consumer message `Queues` has many benefits.
They are set up to broadcast every message to every attached `Consumer`, and are designed such that `Consumers` impact neither each other nor the `Producers`.
This makes adding functionality through new `Actors` frictionless. Most will have realized by now that it is essentially a low-latency microservice design that actually works (one-person team, etc.).

Another benefit is that functionality can be switched on or off at will. `Queues` live in shared memory, meaning that these `Actors` can run in different processes, which is exactly how the **UI** and **telemetry** of **Mantra** work.

One final consideration against using a single-function-hot-path approach is that **Mantra** currently targets **crypto** markets.
While there is nothing particularly specific to **crypto** outside of the WebSocket connections, it does mean that the vast majority of latency (tens of milliseconds) actually originates from the connection between my PC and the exchanges.
I hope to be able to co-locate with some of them in the future, at which point achieving the absolute minimal latency becomes more critical (target of ~10-100ns). Again, pragmatism...

Having raved about `Queues` this much, let me now give an overview of the inter core communication layer.
## Inter Core Communication (ICC)

![](Queue.svg#noborder)
*Fig 2. Seqlocked Buffer*

The `Queues` are denoted in [Fig 1.](@/posts/hello_world/index.md#system-design-overview) by the red arrows and ovals that specify the message type of each `Queue`.
They are essentially [`Seqlocked`](https://en.wikipedia.org/wiki/Seqlock) ringbuffers that can be used both in *single-producer-multi-consumer* (SPMC) and *multi-producer-multi-consumer* (MPMC) modes.

Reiterating the main design considerations for their application in **Mantra**:
- Achieve a close to the ideal ~30ns core-to-core latency (see e.g. [anandtech 13900k and 13600k review](https://www.anandtech.com/show/17601/intel-core-i9-13900k-and-i5-13600k-review/5) and the [fantastic core-to-core-latency tool](https://github.com/nviennot/core-to-core-latency))
- Every attached `Consumer` gets every message, also known as *broadcast* mode
- `Producers` are "not" impacted by number of attached `Consumers` (difficult to achieve perfectly), mainly they don't care if `Consumers` can keep up
- `Consumers` should not impact eachother, and should know when they got sped past by `Producers`

As a result of the 3rd point, `Producers` and `Consumers` do not share any state other than the ringbuffer itself.
`Consumers` simply keep track of which version they expect the `Seqlocks` guarding the data to have.
That allows them to autonomously know when the next message is ready: a `Producer` has incremented the version of the next `Seqlock` to the one they expect.
If, while running through the `Queue`, they encounter a `Seqlock` with a version that is higher than the one they expect, they know they have been sped past: a `Producer` has written new data and incremented that version at least twice.
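
To illustrate that version-based hand-off, below is a heavily simplified sketch of a single seqlocked slot and its consumer-side read, assuming the odd-while-writing / even-when-done versioning described above. This is not **Mantra**'s implementation (that is what the next post is for), and it deliberately glosses over the memory-ordering and data-race subtleties that make a production-grade `Seqlock` interesting.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicU32, Ordering};

/// A single seqlocked slot: a producer bumps `version` to an odd value while
/// writing and to the next even value once the write is complete.
struct Seqlock<T> {
    version: AtomicU32,
    data: UnsafeCell<T>,
}
unsafe impl<T: Copy + Send> Sync for Seqlock<T> {}

enum ReadResult<T> {
    /// The producer has not yet published the version this consumer expects.
    NotReady,
    /// The slot already holds a newer version: the consumer has been lapped.
    SpedPast,
    /// A consistent copy of the payload at the expected version.
    Ready(T),
}

impl<T: Copy> Seqlock<T> {
    /// Consumer-side read; `expected` is the (even) version this consumer is
    /// waiting for at this position in the ring buffer.
    fn read(&self, expected: u32) -> ReadResult<T> {
        let v1 = self.version.load(Ordering::Acquire);
        if v1 < expected {
            return ReadResult::NotReady; // includes a write still in progress
        }
        if v1 > expected {
            return ReadResult::SpedPast;
        }
        // Copy the payload, then re-check that no producer started a new
        // write while we were copying.
        let data = unsafe { *self.data.get() };
        if self.version.load(Ordering::Acquire) == v1 {
            ReadResult::Ready(data)
        } else {
            ReadResult::SpedPast
        }
    }
}
```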

In **Mantra**, aside from the `Consumers` that handle the business logic, each `Queue` also has a `Consumer` that persists each message to disk. This allows for post-mortem analysis and replay.

The second part of the ICC layer consists of the `SeqlockVectors`, denoted by the blue rectangles in [Fig 1.](@/posts/hello_world/index.md#system-design-overview).
They are for now only used between the `TradeModels` and the `Overseer`.
The reason I've used these rather than another `Queue` is that `TradeModels` generally recalculate their ideal positions on each incoming market data message.
If a `TradeModel` recomputed the values for a given ideal `Order` multiple times while the `Overseer` was busy, the latter would still have to go through the messages from oldest to newest.
However, we really only want to potentially send `Orders` based on the latest information. Using `SeqlockVectors` allows the `TradeModels` to overwrite the different `Orders` in place, while the `Overseer` only ever reads the latest one for each.
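
Conceptually, a `SeqlockVector` is then little more than a fixed array of such slots in which the writer overwrites in place and the reader only ever grabs the latest consistent value instead of consuming every intermediate update. The sketch below only illustrates that idea under the same simplifying assumptions as before (single writer per slot, hand-waved memory ordering); it is not **Mantra**'s real code.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicU32, Ordering};

/// One slot of a conceptual SeqlockVector: a single TradeModel overwrites it
/// as often as it likes, and the Overseer only ever reads the latest value.
struct Slot<T> {
    version: AtomicU32,
    data: UnsafeCell<T>,
}
unsafe impl<T: Copy + Send> Sync for Slot<T> {}

impl<T: Copy> Slot<T> {
    /// Writer side: bump to odd (write in progress), store, bump to even.
    fn write(&self, value: T) {
        let v = self.version.load(Ordering::Relaxed);
        self.version.store(v + 1, Ordering::Release);
        unsafe { *self.data.get() = value };
        self.version.store(v + 2, Ordering::Release);
    }

    /// Reader side: retry until an even, unchanged version brackets the copy.
    fn read(&self) -> T {
        loop {
            let v1 = self.version.load(Ordering::Acquire);
            if v1 % 2 != 0 {
                continue; // a write is in progress
            }
            let data = unsafe { *self.data.get() };
            if self.version.load(Ordering::Acquire) == v1 {
                return data;
            }
        }
    }
}

/// The "vector" part: one slot per ideal order position of a TradeModel.
struct SeqlockVector<T> {
    slots: Box<[Slot<T>]>,
}
```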

I think that by now it is clear that the `Seqlock` is the main synchronization primitive that is used throughout **Mantra**, and the next blog post will be a deeper dive exactly on that.
While it is quite a well-understood concept, and I will indeed reference much of the literature that gave me inspiration, I will focus on how one would verify and time the implementation, which I think is not often discussed.

## Telemetry and Observability

As mentioned in the introduction, I have put great emphasis on in-situ telemetry from the very beginning to keep the performance of different parts of **Mantra** in check while adding features.

Given the low-latency nature of **Mantra** I decided:
- To use the hardware timer [`rdtscp`](https://www.felixcloutier.com/x86/rdtscp) for timestamping: more accurate, less costly than OS timestamps
- That each message that enters the system gets an **origin** timestamp which is **propagated** to all downstream messages that result from it
- That when a message gets **published** to a `Queue` its time `delta` w.r.t. the **origin** timestamp is stored together with the publisher's `id`
- To offload these timestamps to specific timing `Queues` in **shared memory** so external tools can do the timing analysis

This scheme allows me to automatically time the different parts of **Mantra** with minimal overhead and to a high degree of accuracy.
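
As a rough illustration of the scheme on x86_64, the sketch below stamps a message once on ingress using the `__rdtscp` intrinsic and records only the delta plus a publisher id at publish time. The surrounding types (`MsgHeader`, `LatencyRecord`) are hypothetical simplifications for this post, not **Mantra**'s actual message layout, and the conversion from TSC ticks to nanoseconds is left out.

```rust
/// Read the time stamp counter; x86_64 only.
fn rdtscp() -> u64 {
    let mut aux = 0u32;
    unsafe { core::arch::x86_64::__rdtscp(&mut aux) }
}

/// Hypothetical message header carrying the propagated origin timestamp.
struct MsgHeader {
    origin_ts: u64,
}

/// Illustrative telemetry record: which publisher produced a message and how
/// many TSC ticks have elapsed since the origin timestamp it descends from.
struct LatencyRecord {
    publisher_id: u16,
    delta_ticks: u64,
}

/// Stamp a message once, when it first enters the system; downstream messages
/// copy this origin timestamp.
fn on_ingress() -> MsgHeader {
    MsgHeader { origin_ts: rdtscp() }
}

/// At publish time only the delta w.r.t. the origin is recorded; the record
/// would then be pushed onto a shared-memory timing queue for external tools.
fn on_publish(header: &MsgHeader, publisher_id: u16) -> LatencyRecord {
    LatencyRecord {
        publisher_id,
        delta_ticks: rdtscp().wrapping_sub(header.origin_ts),
    }
}
```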
"But why gobble up all that data", you may wonder. Glad you asked.

I am of the very strong opinion that representative **backtests** should be performed as much as possible on the **"in-production"** system, rather than through some idealized transformations of DataFrames (although those are perfect for initial strategy exploration) or by assuming that different parts of the system take a fixed amount of time to execute.
This is especially true for *low-latency* systems, given how tightly coupled the performance and implementation of the system are with the trading algos and their parameters.

I have thus implemented a `MockExchange` which feeds the captured historic market data back into the system, and simultaneously uses it to mimic the behavior of a real `Exchange`.
Of course there are some approximations here, and backtests are not fully deterministic this way, but it nonetheless provides a successful strategy for backtesting the system as a whole.
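
As a purely illustrative sketch of that idea (and emphatically not **Mantra**'s actual `MockExchange`), the replayed historical trades themselves can double as the fill signal for resting orders: a crude model that nonetheless exercises the entire real pipeline end to end.

```rust
#[derive(Clone, Copy)]
struct Trade {
    price: f64,
    volume: f64,
}

#[derive(Clone, Copy)]
struct RestingOrder {
    price: f64,
    volume: f64,
    is_buy: bool,
}

/// Toy mock exchange: it holds the orders the system has "sent" and decides,
/// purely from the replayed market data, when they would have been filled.
struct MockExchange {
    open_orders: Vec<RestingOrder>,
}

impl MockExchange {
    /// Very rough fill model: a resting buy fills when a historical trade
    /// prints at or below its price, a sell when one prints at or above it.
    fn on_trade(&mut self, trade: Trade) -> Vec<RestingOrder> {
        let mut fills = Vec::new();
        self.open_orders.retain(|order| {
            let crossed = if order.is_buy {
                trade.price <= order.price
            } else {
                trade.price >= order.price
            };
            if crossed {
                fills.push(*order);
                false // drop the filled order
            } else {
                true
            }
        });
        fills
    }
}
```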

This gives only a very short glimpse into this relatively deep topic, and I have some interesting additional experiments planned for the multi-part series of blog posts on Market Data.
