rewrite of hello_world
louisponet committed Jun 6, 2024
1 parent 10be3e8 commit 68fb8e5
97 changes: 49 additions & 48 deletions content/posts/hello_world/index.md
comment = true
+++

I started work on **Mantra** in an effort to learn **rust** and explore the development process of a distributed, high(ish)-frequency/low-latency trading system in the language.
From the get-go, I've targeted an internal [tick-to-trade](https://beeksgroup.com/blog/tick-to-trade-is-it-the-new-must-have-metric-in-trading-performance/) latency in the low microsecond range.
While this has forced me to tread carefully while adding an ever-increasing number of capabilities and features, I have been able to keep **Mantra's** internal latency firmly below this target.

I've now reached a point where I feel like some of the solutions and designs I came up with are worth sharing, hence this blog.
While **Mantra** itself will remain closed source, for obvious reasons, it will serve as a _fil rouge_ tying the concepts of the deeper discussions together, and show how they can be applied to a real-world high-performance application.

Most of its core building blocks have been purposefully hand-crafted, focusing on pragmatism given the scope of the project.
As `rust` has been my language of choice for **Mantra**, many of the code snippets will be written in it.
The un-enlightened should not fret, however, since the focus will be on the concepts themselves, and most of them should be quite straightforwardly translatable into any capable programming language.

The intention of this initial blog post is to give an overview of the features and general design of **Mantra**, setting the stage for the more technical posts to come.

Enough with the intro, feature list time...
# Features and Capabilities
![](ui.png#noborder "ui")
***Mantra** ui during a backtest*
- **NO ASYNC**
- Low-latency inter-core communication using hand-crafted message **queues, seqlocks, and shared memory**
- Full internal observability (message tracking) and high-performance **in-situ telemetry** of latency and business logic
- Full **persistence of messages** for post execution analysis and replay
- [L2](https://centerpointsecurities.com/level-1-vs-level-2-market-data/) based orderbooks
- Concurrent handling of multiple trading algorithms
- Balance and order tracking across multiple Exchanges
- Continuous **ingestion and storage** of market data streams
- WebSocket-based connections to 5 crypto Exchanges (Binance, Bitstamp, Bitfinex, Coinbase and Kraken)
- "In production" **backtesting** by replaying historical market data streams while mimicking order execution with a **mock exchange**
- Real-time UI for observation of the system, market data and backtesting
- ~500k msgs/s throughput @ 0.5 - 10 microseconds internal latency
- A focus on code simplicity and **locality of behavior** (< 15K LOC for now)

# System Design Overview
The core design has changed very little since **Mantra's** inception.
Pragmatism has always been one of the guiding principles given the scope of creating a fully fledged and capable low-latency trading engine... while learning a new programming language.

This has led me to a very modular and disconnected design, allowing me to add features with minimal friction while also minimizing the impact of the inevitable refactorings as I learned more about `rust`.
The potentially lower-latency *single-function-hot-path* approach inevitably leads to more intertwined and hard-to-maintain code. I am also convinced that it is easier to build the latter on top of a solid distributed system than the other way around.

As the schematic below shows, **Mantra** is thus composed of a set of `Actors` that communicate with each other through lock-free `Queues`.
![](system_design.svg#noborder)
*Fig 1. High level design of **Mantra***

The main execution logic and data flow is quite straightforward:
1. incoming `L2Update` and `TradeUpdate` market data messages (green, top-left) get consumed by the `TradeModels` (in grey)
2. each `TradeModel` fills out a pre-defined set of ideal order positions
3. the `Overseer` actor continuously loops through these, and compares them to previously sent `OrderRequests` and live `Orders` on the target `Exchange`
4. if they don't match up and the necessary `balance` is available on the `Exchange`, the `Overseer` generates and publishes a new `OrderRequest`
5. the `AccountHandler` connecting with the target `Exchange` then sends these requests
6. `Order`, `Balance` and `OrderExecution` updates are fed back to the `Overseer` (a rough sketch of this reconciliation loop follows below)
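
To make the reconciliation step a little more concrete, here is a deliberately simplified sketch of what such a loop could look like. The types (`DesiredPosition`, `LiveOrder`, `OrderRequest`) and the balance check are illustrative stand-ins made up for this post, not **Mantra**'s actual data structures, and the real `Overseer` naturally does a lot more bookkeeping.

```rust
/// Illustrative stand-ins, not Mantra's real message types.
#[derive(Clone, Copy)]
struct DesiredPosition { instrument: u32, price: f64, volume: f64 }
#[derive(Clone, Copy)]
struct LiveOrder { instrument: u32, price: f64, volume: f64 }
struct OrderRequest { instrument: u32, price: f64, volume: f64 }

/// One pass of the reconciliation loop: compare each ideal position with the
/// corresponding live order (if any) and emit an `OrderRequest` when they
/// differ and enough balance is available on the target exchange.
fn reconcile(
    desired: &[DesiredPosition],
    live: &[Option<LiveOrder>],
    available_balance: &mut f64,
) -> Vec<OrderRequest> {
    let mut requests = Vec::new();
    for (ideal, current) in desired.iter().zip(live) {
        let matches = match current {
            Some(o) => o.price == ideal.price && o.volume == ideal.volume,
            None => false,
        };
        let cost = ideal.price * ideal.volume;
        if !matches && cost <= *available_balance {
            // Reserve the balance and request the new (or amended) order.
            *available_balance -= cost;
            requests.push(OrderRequest {
                instrument: ideal.instrument,
                price: ideal.price,
                volume: ideal.volume,
            });
        }
    }
    requests
}
```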

As will be discussed in future posts, centering the system around multi-consumer message `Queues` has many benefits.
They are set up to broadcast every message to every attached `Consumer`, and are designed such that `Consumers` impact neither each other nor the `Producers`.
This makes adding functionality through new `Actors` frictionless. Most will have realized by now that it is essentially a low-latency microservice design that actually works (one-person team, etc.).

Another benefit is that functionality can be switched on or off at will. `Queues` live in shared memory, meaning that these `Actors` can run in different processes, which is exactly how the **UI** and **telemetry** of **Mantra** work.

One final consideration against using a single-function-hot-path approach is that **Mantra** currently targets **crypto** markets.
While there is nothing particularly specific to **crypto** outside of the WebSocket connections, it does mean that the vast majority of latency (tens of milliseconds) actually originates from the connection between my PC and the exchanges.
I hope to be able to co-locate with some of them in the future, at which point achieving the absolute minimal latency becomes more critical (target of ~10-100ns). Again, pragmatism...

Having raved about `Queues` this much, let me now give an overview of the inter core communication layer.
## Inter Core Communication (ICC)

![](Queue.svg#noborder)
*Fig 2. Seqlocked Buffer*

The `Queues` are denoted in [Fig 1.](@/posts/hello_world/index.md#system-design-overview) by the red arrows and ovals that specify the message type of each `Queue`.
They are essentially [`Seqlocked`](https://en.wikipedia.org/wiki/Seqlock) ringbuffers that can be used both in *single-producer-multi-consumer* (SPMC) and *multi-producer-multi-consumer* (MPMC) modes.

Reiterating the main design considerations for their application in **Mantra**:
- Achieve a close to the ideal ~30ns core-to-core latency (see e.g. [anandtech 13900k and 13600k review](https://www.anandtech.com/show/17601/intel-core-i9-13900k-and-i5-13600k-review/5) and the [fantastic core-to-core-latency tool](https://github.com/nviennot/core-to-core-latency))
- Every attached `Consumer` gets every message, also known as *broadcast* mode
- `Producers` are "not" impacted by number of attached `Consumers` (difficult to achieve perfectly), mainly they don't care if `Consumers` can keep up
- `Consumers` should not impact eachother, and should know when they got sped past by `Producers`

As a result of the 3rd point, `Producers` and `Consumers` do not share any state other than the ringbuffer itself.
`Consumers` simply keep track of which version they expect the `Seqlocks` guarding the data to have.
That allows them to autonomously know when the next message is ready: a `Producer` has incremented the version of the next `Seqlock` to the one they expect.
If, while running through the `Queue`, they encounter a `Seqlock` with a version that is higher than the one they expect, they know they have been sped past: a `Producer` has written new data and incremented that version at least twice.
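
To illustrate that version-based hand-off, below is a heavily simplified sketch of a single seqlocked slot and its consumer-side read, assuming the odd-while-writing / even-when-done versioning described above. This is not **Mantra**'s implementation (that is what the next post is for), and it deliberately glosses over the memory-ordering and data-race subtleties that make a production-grade `Seqlock` interesting.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicU32, Ordering};

/// A single seqlocked slot: a producer bumps `version` to an odd value while
/// writing and to the next even value once the write is complete.
struct Seqlock<T> {
    version: AtomicU32,
    data: UnsafeCell<T>,
}
unsafe impl<T: Copy + Send> Sync for Seqlock<T> {}

enum ReadResult<T> {
    /// The producer has not yet published the version this consumer expects.
    NotReady,
    /// The slot already holds a newer version: the consumer has been lapped.
    SpedPast,
    /// A consistent copy of the payload at the expected version.
    Ready(T),
}

impl<T: Copy> Seqlock<T> {
    /// Consumer-side read; `expected` is the (even) version this consumer is
    /// waiting for at this position in the ring buffer.
    fn read(&self, expected: u32) -> ReadResult<T> {
        let v1 = self.version.load(Ordering::Acquire);
        if v1 < expected {
            return ReadResult::NotReady; // includes a write still in progress
        }
        if v1 > expected {
            return ReadResult::SpedPast;
        }
        // Copy the payload, then re-check that no producer started a new
        // write while we were copying.
        let data = unsafe { *self.data.get() };
        if self.version.load(Ordering::Acquire) == v1 {
            ReadResult::Ready(data)
        } else {
            ReadResult::SpedPast
        }
    }
}
```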

In **Mantra**, aside from the `Consumers` that handle the business logic, each `Queue` also has a `Consumer` that persists each message to disk. This allows for post-mortem analysis and replay.

The second part of the ICC layer consists of the `SeqlockVectors`, denoted by the blue rectangles in [Fig 1.](@/posts/hello_world/index.md#system-design-overview).
They are for now only used between the `TradeModels` and the `Overseer`.
The reason I've used these rather than another `Queue` is that `TradeModels` generally recalculate their ideal positions on each incoming market data message.
If a `TradeModel` recomputed the values for a given ideal `Order` multiple times while the `Overseer` was busy, the latter would still have to go through the messages from oldest to newest.
However, we really only want to potentially send `Orders` based on the latest information. Using `SeqlockVectors` allows the `TradeModels` to overwrite the different `Orders` in place, while the `Overseer` only ever reads the latest one for each.
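
Conceptually, a `SeqlockVector` is then little more than a fixed array of such slots in which the writer overwrites in place and the reader only ever grabs the latest consistent value instead of consuming every intermediate update. The sketch below only illustrates that idea under the same simplifying assumptions as before (single writer per slot, hand-waved memory ordering); it is not **Mantra**'s real code.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicU32, Ordering};

/// One slot of a conceptual SeqlockVector: a single TradeModel overwrites it
/// as often as it likes, and the Overseer only ever reads the latest value.
struct Slot<T> {
    version: AtomicU32,
    data: UnsafeCell<T>,
}
unsafe impl<T: Copy + Send> Sync for Slot<T> {}

impl<T: Copy> Slot<T> {
    /// Writer side: bump to odd (write in progress), store, bump to even.
    fn write(&self, value: T) {
        let v = self.version.load(Ordering::Relaxed);
        self.version.store(v + 1, Ordering::Release);
        unsafe { *self.data.get() = value };
        self.version.store(v + 2, Ordering::Release);
    }

    /// Reader side: retry until an even, unchanged version brackets the copy.
    fn read(&self) -> T {
        loop {
            let v1 = self.version.load(Ordering::Acquire);
            if v1 % 2 != 0 {
                continue; // a write is in progress
            }
            let data = unsafe { *self.data.get() };
            if self.version.load(Ordering::Acquire) == v1 {
                return data;
            }
        }
    }
}

/// The "vector" part: one slot per ideal order position of a TradeModel.
struct SeqlockVector<T> {
    slots: Box<[Slot<T>]>,
}
```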

I think that by now it is clear that the `Seqlock` is the main synchronization primitive that is used throughout **Mantra**, and the next blog post will be a deeper dive exactly on that.
While it is quite a well-understood concept, and I will indeed reference much of the literature that gave me inspiration, I will focus on how one would verify and time the implementation, which I think is not often discussed.

## Telemetry and Observability

As mentioned in the introduction, I have put great emphasis on in-situ telemetry from the very beginning to keep the performance of different parts of **Mantra** in check while adding features.

Given the low-latency nature of **Mantra** I decided:
- To use the hardware timer [`rdtscp`](https://www.felixcloutier.com/x86/rdtscp) for timestamping: more accurate, less costly than OS timestamps
- That each message that enters the system gets an **origin** timestamp which is **propagated** to all downstream messages that result from it
- That when a message gets **published** to a `Queue` its time `delta` w.r.t. the **origin** timestamp is stored together with the publisher's `id`
- To offload these timestamps to specific timing `Queues` in **shared memory** so external tools can do the timing analysis

This scheme allows me to automatically time the different parts of **Mantra** with minimal overhead and to a high degree of accuracy.
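
As a rough illustration of the scheme on x86_64, the sketch below stamps a message once on ingress using the `__rdtscp` intrinsic and records only the delta plus a publisher id at publish time. The surrounding types (`MsgHeader`, `LatencyRecord`) are hypothetical simplifications for this post, not **Mantra**'s actual message layout, and the conversion from TSC ticks to nanoseconds is left out.

```rust
/// Read the time stamp counter; x86_64 only.
fn rdtscp() -> u64 {
    let mut aux = 0u32;
    unsafe { core::arch::x86_64::__rdtscp(&mut aux) }
}

/// Hypothetical message header carrying the propagated origin timestamp.
struct MsgHeader {
    origin_ts: u64,
}

/// Illustrative telemetry record: which publisher produced a message and how
/// many TSC ticks have elapsed since the origin timestamp it descends from.
struct LatencyRecord {
    publisher_id: u16,
    delta_ticks: u64,
}

/// Stamp a message once, when it first enters the system; downstream messages
/// copy this origin timestamp.
fn on_ingress() -> MsgHeader {
    MsgHeader { origin_ts: rdtscp() }
}

/// At publish time only the delta w.r.t. the origin is recorded; the record
/// would then be pushed onto a shared-memory timing queue for external tools.
fn on_publish(header: &MsgHeader, publisher_id: u16) -> LatencyRecord {
    LatencyRecord {
        publisher_id,
        delta_ticks: rdtscp().wrapping_sub(header.origin_ts),
    }
}
```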
"But why gobble up all that data", you may wonder. Glad you asked.

I am of the very strong opinion that representative **backtests** should be performed as much as possible on the **"in-production"** system, rather than through some idealized transformations of DataFrames (although those are perfect for initial strategy exploration) or by assuming that different parts of the system take a fixed amount of time to execute.
This is especially true for *low-latency* systems, given how tightly coupled the performance and implementation of the system are with the trading algos and their parameters.

I have thus implemented a `MockExchange` which feeds the captured historic market data back into the system, and simultaneously uses it to mimic the behavior of a real `Exchange`.
Of course there are some approximations here, and backtests are not fully deterministic this way, but it nonetheless provides a successful strategy for backtesting the system as a whole.
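
As a purely illustrative sketch of that idea (and emphatically not **Mantra**'s actual `MockExchange`), the replayed historical trades themselves can double as the fill signal for resting orders: a crude model that nonetheless exercises the entire real pipeline end to end.

```rust
#[derive(Clone, Copy)]
struct Trade {
    price: f64,
    volume: f64,
}

#[derive(Clone, Copy)]
struct RestingOrder {
    price: f64,
    volume: f64,
    is_buy: bool,
}

/// Toy mock exchange: it holds the orders the system has "sent" and decides,
/// purely from the replayed market data, when they would have been filled.
struct MockExchange {
    open_orders: Vec<RestingOrder>,
}

impl MockExchange {
    /// Very rough fill model: a resting buy fills when a historical trade
    /// prints at or below its price, a sell when one prints at or above it.
    fn on_trade(&mut self, trade: Trade) -> Vec<RestingOrder> {
        let mut fills = Vec::new();
        self.open_orders.retain(|order| {
            let crossed = if order.is_buy {
                trade.price <= order.price
            } else {
                trade.price >= order.price
            };
            if crossed {
                fills.push(*order);
                false // drop the filled order
            } else {
                true
            }
        });
        fills
    }
}
```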

This gives only a very short glimpse into this relatively deep topic, and I have some interesting additional experiments planned for the multi-part series of blog posts on Market Data.
