Design
Mizar is a collection of XDP programs that run in kernel space. The XDP programs process packets and implement all the back-plane functionality, including the transit agent, switch, and routers. The XDP programs shall also perform some of the data-plane features.
A user-space daemon, called the transit daemon, is responsible for configuring the XDP programs through eBPF maps. The transit daemon also exposes a simple RPC interface, which defines a contract for the control-plane to pass configuration information. Users can also interact directly with the transit daemon through a CLI program, which may run on the same host as the daemon or on a remote machine. The CLI program is used primarily for testing and troubleshooting.
The following diagram illustrates the high-level interactions between the daemon, the XDP programs, and the control-plane.
+--------------------------------------+
| |
| Control-plane |
| |
+--------------------------------------+
|
|
v
+--------------------------------------+
| |
| Data Distribution (Kafka) |
| |
+--------------------------------------+
|
+---------------------------------------v---------------------------+
| Host +-------------------+ |
| | Control-Plane | |
| | Agent | |
| +-------------------+ |
| | |
| RPC| |
| | |
| v |
|+----------------+ +-------------------+ |
|| CLI Interface |---RPC---->| user-space daemon | |
|+----------------+ +-------------------+ |
| | User Space |
| --------------+---------------------------|
| v Kernel |
| +--------------------+ |
| | ebpf maps | |
| +--------------------+ |
| | XDP Programs | |
| +--------------------+ |
| |
+-------------------------------------------------------------------+
Several tunneling protocols can be used to implement an overlay network. We choose Geneve as the tunneling protocol for the back-plane. Geneve is an emerging overlay network standard that supports all the capabilities of VXLAN, GRE, and STT. It carries a large amount of metadata, which data-plane features will require, while remaining extensible. Unlike VXLAN and NVGRE, which provide simple tunneling for underlay network partitioning, Geneve provides the needed connectivity between multiple components of distributed systems. This connectivity is required in order to implement various functions of Mizar. Several NICs already support Geneve encapsulation/decapsulation offloading. Geneve also supports a randomized source port over UDP, which allows consistent flow routing and processing across multiple paths. STT provides similar functionality but runs over TCP, so Geneve remains preferred for its lower overhead.
The RFC draft NVO3-Geneve defines the Geneve packet format.
The Geneve header has a minimum length of 8 bytes and a maximum length of 260 bytes. The fields to note in the Geneve header are:
- Virtual Network Identifier (VNI): In the standard, this can define an identifier for a unique element of a virtual network. In Mizar, the VNI shall have the semantics of the VPC ID.
- Opt Len: The length of the options fields, expressed in multiples of four bytes, not including the eight bytes fixed tunnel header.
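As a minimal sketch of how these two fields sit in the fixed header, the following Python snippet packs and decodes the 8-byte Geneve base header using the layout from the Geneve specification (2-bit version, 6-bit Opt Len, flags, 16-bit protocol type, 24-bit VNI, 8 reserved bits). The function names are hypothetical and are not part of Mizar.

```python
import struct

GENEVE_FIXED_HEADER_LEN = 8  # bytes


def build_geneve_header(vni: int, opt_len_words: int = 0, protocol: int = 0x6558) -> bytes:
    """Pack a version-0 Geneve base header (hypothetical helper, no options appended)."""
    if not 0 <= vni < (1 << 24):
        raise ValueError("VNI must fit in 24 bits")
    if not 0 <= opt_len_words <= 0x3F:
        raise ValueError("Opt Len must fit in 6 bits")
    # Byte 0: version (2 bits, zero here) + Opt Len (6 bits); byte 1: flags (zero here).
    return struct.pack("!BBHI", opt_len_words, 0, protocol, vni << 8)


def parse_geneve_header(data: bytes) -> dict:
    """Decode the fixed Geneve header found at the start of `data`."""
    if len(data) < GENEVE_FIXED_HEADER_LEN:
        raise ValueError("buffer shorter than the fixed Geneve header")
    ver_optlen, flags, protocol, vni_rsvd = struct.unpack("!BBHI", data[:8])
    opt_len_words = ver_optlen & 0x3F  # option length in multiples of four bytes
    return {
        "version": ver_optlen >> 6,
        "opt_len_bytes": opt_len_words * 4,
        "protocol": protocol,
        "vni": vni_rsvd >> 8,  # upper 24 bits; the low 8 bits are reserved
        "header_len": GENEVE_FIXED_HEADER_LEN + opt_len_words * 4,
    }


header = parse_geneve_header(build_geneve_header(vni=0x123456, opt_len_words=2))
print(header["vni"], header["header_len"])
```

With the largest Opt Len value (63 words), the computed header length is 8 + 63 * 4 = 260 bytes, which matches the maximum stated above.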
We shall use Geneve TLV options to inject packet tunnel metadata that supports the implementation of various network functions, including network policy, QoS, load balancing, and NAT.
For example, in the case of container networking, we shall represent a label-selector as one Geneve option, which policy enforcement points shall use to make decisions based on policy rules. A Geneve option shall also encode the type of endpoint, which shall alter the routing decisions.
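A minimal sketch of how such an option could be encoded is shown below. It follows the generic Geneve TLV layout (16-bit option class, 8-bit type, 5-bit length in four-byte words, then the payload). The option class `0xFFF0`, the type `0x01`, and the 4-byte label value are assumptions made for illustration only; the source does not define the actual option values Mizar uses.

```python
import struct

# Hypothetical values, chosen only for this example.
EXAMPLE_OPTION_CLASS = 0xFFF0
EXAMPLE_LABEL_OPTION_TYPE = 0x01


def encode_geneve_option(opt_class: int, opt_type: int, payload: bytes) -> bytes:
    """Encode one Geneve TLV option; the payload must be a multiple of 4 bytes."""
    if len(payload) % 4 != 0:
        raise ValueError("option payload must be padded to a multiple of 4 bytes")
    length_words = len(payload) // 4
    if length_words > 0x1F:
        raise ValueError("option payload exceeds the 5-bit length field")
    # The fourth byte holds 3 reserved bits (zero) followed by the 5-bit length.
    return struct.pack("!HBB", opt_class, opt_type, length_words) + payload


# A label-selector carried as a 32-bit label identifier (assumed representation).
label_option = encode_geneve_option(
    EXAMPLE_OPTION_CLASS, EXAMPLE_LABEL_OPTION_TYPE, struct.pack("!I", 42)
)
print(label_option.hex())
```

The total length of all encoded options, divided by four, is the value that goes into the Opt Len field of the fixed header.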
Mizar adopts a simple data model that is essential to extend its functionality. The model consists of endpoint, network, and VPC constructs.
A VPC is the conventional Virtual Private Cloud construct that is primarily defined by a CIDR block within a region. The VNI of the Geneve header uniquely identifies the VPC. A network function must belong to a VPC, where the VNI provides the primary logical separation mechanism used to support multi-tenancy. One or more transit routers route traffic within a VPC. The following fields define a VPC data-model.
- vni: Unique ID of the VPC that shall represent the Geneve VNI. At the moment, the most significant 64 bits of the vni uniquely identify an administrative domain (e.g., single-tenant), and the least significant 64 bits uniquely identify a VPC within an administrative domain.
- cidr: The CIDR block of the VPC.
- routers IP: A list of the IP addresses of the transit routers of the VPC.
An endpoint is a logical representation of an overlay IP within a network and a VPC. The IP must belong to the CIDR of a network, and hence to the CIDR of the VPC. The endpoint is also identified by a type, which determines how transit switches route traffic from and to the endpoint. The following fields define an endpoint:
- type: {Simple, Scaled, Proxied}
- IP: Endpoint IP (V4/V6)
- tunnel protocol: {VxLan, Geneve}
- Remote IPs: A list of IP addresses that represent the host(s) of the endpoint. In the case of a Simple endpoint, this is the IP of the endpoint's host.
- Endpoint Geneve Options: A list of custom Geneve options that shall be attached to the tunnel packets of the endpoint to realize an application.
- Remote Selection Function: The function used to select the remote IP mapping the endpoint {hash, colocated}.
- Bypass Decapsulation: A flag indicating that the endpoint is allowed to receive tunnel packets as is, without decapsulation.
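The VPC and endpoint fields above can be sketched as plain Python data classes. This is an illustrative model only: the class names, attribute names, and the `endpoint_in_vpc` helper are hypothetical, and the exact definitions used by Mizar may differ.

```python
from dataclasses import dataclass, field
from enum import Enum
from ipaddress import ip_address, ip_network
from typing import List


class EndpointType(Enum):
    SIMPLE = "simple"
    SCALED = "scaled"
    PROXIED = "proxied"


@dataclass
class Vpc:
    vni: int                      # unique ID of the VPC (Geneve VNI semantics)
    cidr: str                     # CIDR block of the VPC
    routers_ip: List[str] = field(default_factory=list)  # transit router IPs


@dataclass
class Endpoint:
    ep_type: EndpointType
    ip: str                       # overlay IP of the endpoint (V4/V6)
    tunnel_protocol: str = "Geneve"
    remote_ips: List[str] = field(default_factory=list)   # host(s) of the endpoint
    geneve_options: List[bytes] = field(default_factory=list)
    remote_selection: str = "hash"                         # or "colocated"
    bypass_decap: bool = False


def endpoint_in_vpc(endpoint: Endpoint, vpc: Vpc) -> bool:
    """Check the constraint that the endpoint IP falls inside the VPC CIDR."""
    return ip_address(endpoint.ip) in ip_network(vpc.cidr)


vpc0 = Vpc(vni=1, cidr="10.0.0.0/16", routers_ip=["192.168.0.1"])
ep0 = Endpoint(EndpointType.SIMPLE, "10.0.1.5", remote_ips=["192.168.0.10"])
print(endpoint_in_vpc(ep0, vpc0))
```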
A simple endpoint is the fundamental endpoint type, analogous to a conventional virtual interface of a container or a virtual machine. A simple endpoint has a 1:1 mapping to a host or a network function (tunnel interface). Traffic ingressing to a simple endpoint is decapsulated and forwarded to a single tunnel interface or a network function. The following figure illustrates the remote association of a simple endpoint.
+-----------------+
| |
+-----------------+ | |
| Simple Endpoint |----1:1--->| Host. |
+-----------------+ | |
| |
+-----------------+
A scaled endpoint has a 1:N mapping to N end-hosts or network functions. A transit switch routes traffic ingressing to a scaled endpoint to one of its remote IPs, typically by hashing the 5-tuple of the inner packet. The control-plane may configure other selection functions to determine the final packet destination. This is useful for implementing scalable network functions such as Layer-4 load balancers or NAT devices. The following figure illustrates the remote association of a scaled endpoint.
+-----------------+
| +---------------+-+
| | +---------------+-+
+-----------------+ +----->| | | |
| Scaled Endpoint +-1:N--+-+----+>| | |
+-----------------+ +----+-+>| Substrate Hosts |
+-+ | |
+-+ |
+-----------------+
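The hash-based selection for a scaled endpoint can be sketched as follows. This is an assumption-laden illustration: the `select_remote` function is hypothetical, and SHA-256 stands in for whatever lightweight hash an in-kernel XDP program would actually use. The point it shows is that all packets of one flow map to the same remote IP.

```python
import hashlib
from typing import List, Tuple

# (source IP, destination IP, protocol, source port, destination port)
FiveTuple = Tuple[str, str, int, int, int]


def select_remote(flow: FiveTuple, remote_ips: List[str]) -> str:
    """Pick one remote IP for a flow by hashing the inner packet's 5-tuple."""
    if not remote_ips:
        raise ValueError("a scaled endpoint needs at least one remote IP")
    digest = hashlib.sha256(repr(flow).encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(remote_ips)
    return remote_ips[index]


remotes = ["192.168.0.11", "192.168.0.12", "192.168.0.13"]
flow = ("10.0.0.5", "10.0.0.100", 6, 43512, 443)
# Repeated calls for the same flow return the same remote.
print(select_remote(flow, remotes) == select_remote(flow, remotes))
```

Note that a plain modulo over the list reshuffles many flows whenever the set of remote IPs changes; a deployment that needs stable mappings during scaling would likely want a consistent-hashing scheme instead.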
A proxied endpoint has a 1:1 mapping to another endpoint. The other endpoint can be a simple, scaled, or proxied endpoint. Fundamentally, a proxied endpoint provides the underlying packet forwarding mechanisms required to implement VPC endpoints. The following figure illustrates the remote association of a proxied endpoint.
+-----------------+ +-----------------+
|Proxied Endpoint |--1:1--->|Another Endpoint |
+-----------------+ +-----------------+
Mizar defines a network in broader terms as a compartment of multiple data points. Conventionally, a network is a subset of a specific CIDR block from the VPC CIDR block, but the data-model allows defining a network as a group of endpoints that don't necessarily share IP addresses from the same CIDR block. A network represents the logical separation where flow switching occurs with a minimal number of hops. To support use cases for conventional VMs, containers, and future compute types, Mizar primarily supports two types of networks:
- subnet: A classical VPC subnet defined by a CIDR block, which must fall within the CIDR space of the VPC. An endpoint belongs to the subnet that has the longest prefix match.
- group: This is a new logical network defined by a label. Endpoints can join and leave a group-network dynamically according to group policies. When an endpoint is permitted to join a group-network, the outer header of the encapsulated packet will have a Geneve option that contains the group-label which allows network functions to make decisions based on group-network memberships.
The following fields define a network:
- cidr: The CIDR block of the subnet (a subset of the VPC CIDR).
- switch IP: The IP addresses of the transit switches of the network.
- group ID: The group ID of a network of type group (zero otherwise).
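The longest-prefix-match rule for subnet membership can be sketched with Python's standard `ipaddress` module. The `find_subnet` function and the mapping from network name to CIDR are hypothetical and serve only to illustrate the rule.

```python
from ipaddress import ip_address, ip_network
from typing import Dict, Optional


def find_subnet(endpoint_ip: str, subnets: Dict[str, str]) -> Optional[str]:
    """Return the name of the subnet with the longest prefix containing the IP."""
    addr = ip_address(endpoint_ip)
    best_name, best_len = None, -1
    for name, cidr in subnets.items():
        net = ip_network(cidr)
        if addr in net and net.prefixlen > best_len:
            best_name, best_len = name, net.prefixlen
    return best_name


# Two overlapping subnets inside an assumed VPC CIDR of 10.0.0.0/16.
subnets = {"net0": "10.0.0.0/16", "net1": "10.0.1.0/24"}
print(find_subnet("10.0.1.7", subnets))  # the /24 is the longer match
print(find_subnet("10.0.2.7", subnets))  # only the /16 contains this address
```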
The following diagram illustrates the software design of the Mizar back-plane. An endpoint is realized as a simple veth pair, where one end of the pair (veth0) represents the virtual machine, container, or network function endpoint. The other end (veth) remains in the root namespace. A transit agent is an XDP program that processes the ingress packets of veth in the kernel.
+-----------------------------------------+
| Userspace Control and Management Agent |
+--------^--------------------------------+
| User-space
*--------+-----------------------------------------*
| +---------------+ Kernel-space
| | veth0 of |
| | endpoint |
| | (egress) |
| +-------+-------+
v |
+----------------+ |
| eBPF maps | |
| Agent | |
|(Config/Monitor)| |
+--------^-------+ |
| |
| v
+----+---+-------+ REDIRECT +-----+
| | veth |---encap -----------| |
| |ingress| Packets | |
| +-------+-------+ | NIC |
| Transit Agent | veth | REDIRECT | |
| XDP program |egress |<--decap ---+ |
+----------------+---+---+ Packets | |
| +-----+
+---------v---------+
| veth0 of endpoint |
| (ingress) |
+-------------------+
Another XDP program is running on the ingress path of the physical NIC of the host, and this program is capable of assuming either the transit switch or the transit router roles depending on the context of the flow processing. This is the main program that implements the back-plane processing pipeline and will be referred to as the back-plane XDP program.
Mizar uses eBPF maps to manage the configuration and monitoring of both XDP programs. A userspace control process populates the eBPF maps with endpoint, switch, or router configuration. It also reads monitoring data for both programs, including metrics, health status, and fault conditions.
+-----------------------+
| Userspace Control and |
| Management Agent |
+-----------^-----------+
User-space |
*---------------------------------+--------------------*
Kernel-space v
+----------------+ +-----+
| eBPF maps | | |
| Switch/Router | | |
|(Config/Monitor)| | |
+--------^-------+ | |
| | |
+-----+ | | Net |
| |REDIRECT +--------+--------+-------+ |Stack|
| +-encap --> NIC | Switch/Router | | |
|veth |Packets | egress | XDP program | PASS |
| | +--------+-------+ |--non-tunnel|
| | | NIC | | packets |
| |--------REDIRECT--|ingress| | | |
+-----+ decap +-------+--------+ +-----+
Packets ^
|
*----------------------------+-------------------------*
|
| HW
The transit agent XDP program is a lightweight program that primarily tunnels packets on the ingress path towards a transit switch of the endpoint's network. If the destination endpoint is not hosted on the same host as the agent, the agent simply redirects the packets to the NIC egress. Otherwise, the packet shall be routed directly to the veth interface of the destination endpoint.
The back-plane XDP program runs on the NIC ingress and processes all outer header information of the tunnel packets. The following diagram illustrates the ingress packet processing pipeline for the back-plane XDP program.
|
RX
|
v
+-----------+
| Parse | +--------------+
| Headers |------------------>|Network Stack |
+-----------+ +--------------+
|
|
v
+-----------+ +------------+
| (Local) | | In-network |
| Process | | endpoint | +--------------+
|Destination|-->| functional |-->|endpoint agent|
| endpoint | | processing | +--------------+
+-----------+ +------------+
|
+-------------+ +-----------+ |
| In-network | | (Switch) | |
| switch |<--|Forward to |<--------+
| processing | | Remote IP |
+-------------+ +-----------+
|
| +-----------+ +------------+
| | (Router) | | In-network |
+--------->|Forward to |-->| router |
| Network | | processing |
+-----------+ +------------+
|
+------------+ |
| (Switch) | |
| Forward to |<-------+
| VPC Router |
+------------+
|
|
TX
|
v
The role the program plays, transit switch or transit router, is determined by the packet context during processing. The processing pipeline follows these rules:
- If the ingress packet is not Geneve encapsulated, the program passes the packet to the Linux network stack for normal processing.
- If the destination endpoint belongs to a known endpoint IP address on the same host, activate an in-network packet processing stage. This stage can implement in-network functions for an endpoint, such as NAT. The in-network processing can be bypassed or followed by redirecting the packet to the corresponding veth egress. The redirection can apply to the entire packet or to the decapsulated packet, according to the processing context or the type of the endpoint.
- If the destination endpoint's remote IP (host) is known, forward the packet to the remote IP (Transit Switch Function).
- If the destination endpoint's network is known, forward the packet to one of the transit switches of the destination network (Transit Router Function), or to one of its transit routers in the case of VPC-to-VPC peering, selected by hashing the source/destination words. If the network is not known and the program is the transit router of the VPC, drop the packet.
- Otherwise, forward the packet to one of the transit routers of the VPC (Transit Switch Function).
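The decision order of the pipeline can be sketched in user-space Python as below. This is not the XDP implementation: the `BackplaneState` structure, the `process` function, the use of CRC32 as the flow hash, and the action names are all assumptions made for illustration. In-network processing stages and the VPC peering case are omitted to keep the sketch short.

```python
import zlib
from dataclasses import dataclass, field
from ipaddress import ip_address, ip_network
from typing import Dict, List, Optional, Tuple


@dataclass
class BackplaneState:
    local_endpoints: Dict[Tuple[int, str], str] = field(default_factory=dict)  # (vni, ip) -> veth
    endpoints: Dict[Tuple[int, str], List[str]] = field(default_factory=dict)  # (vni, ip) -> hosts
    networks: Dict[Tuple[int, str], List[str]] = field(default_factory=dict)   # (vni, cidr) -> switches
    vpcs: Dict[int, List[str]] = field(default_factory=dict)                   # vni -> routers
    is_vpc_router: bool = False


def pick(candidates: List[str], flow_key: str) -> str:
    """Select one candidate by hashing a flow key (CRC32 is a stand-in hash)."""
    return candidates[zlib.crc32(flow_key.encode()) % len(candidates)]


def switches_for(state: BackplaneState, vni: int, dst_ip: str) -> Optional[List[str]]:
    """Find the transit switches of the destination network, if it is known."""
    addr = ip_address(dst_ip)
    best, best_len = None, -1
    for (net_vni, cidr), switches in state.networks.items():
        net = ip_network(cidr)
        if net_vni == vni and addr in net and net.prefixlen > best_len:
            best, best_len = switches, net.prefixlen
    return best


def process(state: BackplaneState, is_geneve: bool, vni: int,
            dst_ip: str, flow_key: str) -> Tuple[str, str]:
    if not is_geneve:
        return ("pass", "network-stack")
    key = (vni, dst_ip)
    if key in state.local_endpoints:           # endpoint hosted on this host
        return ("redirect", state.local_endpoints[key])
    if key in state.endpoints:                 # transit switch function
        return ("forward", pick(state.endpoints[key], flow_key))
    switches = switches_for(state, vni, dst_ip)
    if switches:                               # transit router function
        return ("forward", pick(switches, flow_key))
    if state.is_vpc_router or not state.vpcs.get(vni):
        # Dropping when no router is known is an assumption of this sketch.
        return ("drop", "")
    return ("forward", pick(state.vpcs[vni], flow_key))
```

Each rule is tried in order, so a packet for a locally hosted endpoint never reaches the switch or router branches.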
The control-plane agent programs the data-plane. The data-plane supports an internal protocol to facilitate data-plane programming through remote RPC.
Fundamentally, users perform nine operations that shall be supported by the control-plane workflow:
- Create/Update/Delete VPC
- Create/Update/Delete Network
- Create/Update/Delete Endpoint
The workflows for the create/update/delete APIs are similar (though not necessarily identical). For simplicity of presentation, this section focuses on the create workflow to illustrate the requirements of the control-plane business logic.
Conceptually, and at a high level, there are three lookup tables in each transit XDP program.
- VPC table: A key/value table where the key is the tunnel id of the VPC, and the value is the list of the transit routers of the VPC.
- Network table: A key/value table where the key is the network IP, and the value is the list of the transit switches of the network.
- Endpoint table: A key/value table where the key is the endpoint IP, and the value is the host of the endpoint.
Note: See the trn_datamodel.h file for the exact definition of the table keys and data values.
On creating a VPC, the control-plane workflow shall be as follows:
- Trigger the placement algorithm to allocate multiple transit routers to the VPC. For example, VPC0->[R1, R2], where R1 and R2 are the IP addresses of the substrate node (host) functioning as a transit router.
- Call an update_vpc API on R1 and R2 that populates their VPC table. For example, the VPC table in both R1 and R2 shall have the following information:
Tunnel ID | VPC |
---|---|
VPC0 | [R1, R2] |
Table: VPC table in R1 and R2
When the user creates networks within a VPC, for example net0 and net1 in VPC0 (where net0 and net1 also denote the network addresses), the control-plane workflow shall be as follows:
- Trigger the placement algorithm to allocate multiple transit switches for net0 and net1. For example, net0 -> [S00, S01], where S00 and S01 are the IP addresses of the transit switches serving net0. Similarly, net1 -> [S10, S11], where S10 and S11 are the IP addresses of the transit switches serving net1.
- Call an update_net API first on R1 and R2 that populates their Network table. In our example, the network table of R1 and R2 shall look like:
Network Key | Network |
---|---|
{net0, VPC0} | [S00, S01] |
{net1, VPC0} | [S10, S11] |
Table: Network table in R1 and R2
- Call an update_vpc API for the switches of net0 and the switches of net1. Accordingly, the VPC tables in S00, S01, S10, S11 shall look like:
Tunnel ID | VPC |
---|---|
VPC0 | [R1, R2] |
Table: VPC table in S00, S01, S10, S11
When the user **attaches** an endpoint to a container, the control plane needs to perform four main actions (through the data-plane):
- Create the actual virtual interface on the host.
- Execute the transit agent XDP program on the virtual interface.
- Populate the endpoint metadata on the transit agent XDP program.
- Populate the endpoint table on all the transit switches of the network with routing entries of the new endpoint.
Consider, for example, a user attaching a simple endpoint ep0 in net0 to a VM on host0. Here ep0 is the virtual IP of the endpoint, and host0 is the IP address of the host. The control-plane workflow shall be as follows:
- Call a create_endpoint API for host0. Accordingly, the network control-plane agent shall create a virtual interface pair veth0 -> veth0_root, attach veth0 to the compute resource (e.g., container namespace), and load the transit agent program on veth0_root.
- Call an update_network API for the transit agent running on veth0_root. The network table on the transit agent shall look like:
Network Key | Network |
---|---|
{net0, VPC0} | [S00, S01] |
Table: Network table of Transit agent of veth0_root
- Call an update_endpoint API for the net0 transit switches, S00 and S01. Accordingly, the endpoint table on the net0 transit switches shall look like:
Endpoint Key | Endpoint |
---|---|
{VPC0, ep0} | [host0] |
Table: Endpoint table of S00 and S01
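The three create workflows above can be simulated with in-memory tables to check that each node ends up with the entries shown. This is a sketch under stated assumptions: the `Node` class and the `create_vpc`, `create_network`, and `attach_endpoint` helpers are hypothetical, the placement decisions are passed in by the caller rather than computed, and the real RPCs and virtual interface creation are replaced by direct method calls.

```python
from typing import Dict, List, Tuple


class Node:
    """A host (or transit agent) holding the three conceptual lookup tables."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.vpc_table: Dict[str, List[str]] = {}
        self.network_table: Dict[Tuple[str, str], List[str]] = {}
        self.endpoint_table: Dict[Tuple[str, str], List[str]] = {}

    def update_vpc(self, vpc: str, routers: List[str]) -> None:
        self.vpc_table[vpc] = list(routers)

    def update_net(self, net: str, vpc: str, switches: List[str]) -> None:
        self.network_table[(net, vpc)] = list(switches)

    def update_endpoint(self, vpc: str, ep: str, hosts: List[str]) -> None:
        self.endpoint_table[(vpc, ep)] = list(hosts)


names = ["R1", "R2", "S00", "S01", "S10", "S11", "agent_veth0_root"]
nodes = {name: Node(name) for name in names}


def create_vpc(vpc: str, routers: List[str]) -> None:
    for r in routers:
        nodes[r].update_vpc(vpc, routers)


def create_network(vpc: str, net: str, routers: List[str], switches: List[str]) -> None:
    for r in routers:                      # routers learn the switches of the network
        nodes[r].update_net(net, vpc, switches)
    for s in switches:                     # switches learn the routers of the VPC
        nodes[s].update_vpc(vpc, routers)


def attach_endpoint(vpc: str, net: str, ep: str, host: str,
                    switches: List[str], agent: str) -> None:
    nodes[agent].update_net(net, vpc, switches)
    for s in switches:
        nodes[s].update_endpoint(vpc, ep, [host])


create_vpc("VPC0", ["R1", "R2"])
create_network("VPC0", "net0", ["R1", "R2"], ["S00", "S01"])
create_network("VPC0", "net1", ["R1", "R2"], ["S10", "S11"])
attach_endpoint("VPC0", "net0", "ep0", "host0", ["S00", "S01"], "agent_veth0_root")
print(nodes["S00"].endpoint_table)
```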
The following table summarizes the primary control-plane workflows as well as the table to be populated during each operation.
User API/Events | Control-Plane Algorithm | Data-plane RPCs | Data-plane eBPF map |
---|---|---|---|
Create VPC | Allocate transit routers for the VPC | Call update_vpc on all transit routers of the VPC (optional) | Populate VPC map (transit XDP program) |
Create Network | Allocate transit switches for the network | 1) Call update_vpc on all transit switches of the network, 2) Call update_network on all transit switches (optional) | Populate VPC map (transit XDP program) |
Attach Endpoint | Invoke virtual interface creation procedures (e.g., netns, libvirt) on host | 1) Call create_endpoint on host, 2) Call update_network on transit agent of the host, 3) Call update_endpoint on all transit switches of the network | Populate Endpoint map (transit XDP program), Network map (of transit agent) |
Update VPC | Update the VPC data | 1) Call update_vpc on all transit switches of networks inside the VPC, 2) Call update_vpc in transit routers in the VPC (optional) | Populate VPC map (transit XDP program) |
Update Network | Update network data in all endpoint's transit agents | Call update_network in all endpoints of a network | Populate Network map of transit agent |
Update Endpoint | Update endpoint data in transit switches of the network | Call update_endpoint in all the transit switches of an endpoint | Populate Endpoint map (transit XDP program) |
Delete VPC | (Applicable only if VPC has no network) | 1) Call delete_vpc on all transit switches of all networks in the VPC, 2) Call delete_vpc on all transit routers in the VPC (optional) | Populate VPC map (transit XDP program) |
Delete Network | Applicable only if a network has no endpoints | 1) Call delete_network on all transit routers of the VPC, 2) Call delete_network on transit switches of the network (optional) | Populate Network map (transit XDP program) |
Delete Endpoint | Delete the endpoint and unload the transit agent program on the host | 1) Call delete_endpoint on all transit switches of a network, 2) Unload the transit agent XDP program on the peer virtual interface | Populate Endpoint map (transit XDP program) |
Table: Summary of workflows
Mizar provides simple, efficient, and flexible mechanisms to create overlay networks that support various types of endpoints at scale. The control-plane continuously optimizes the placement of the transit switches and transit routers.
Smart Scaling and Placement algorithms are the primary mechanisms to achieve smart decisions, including:
- Smart Routing
- Smart Configuration
- Smart Congestion control
Placement and Scaling entail determining which transit XDP program (physical host) shall be configured to function as a transit router or a transit switch.
Placement objectives of transit switches and routers include, but are not limited to:
- Accelerate Endpoint Provisioning: For example, in serverless applications, endpoint provisioning is very frequent. Thus the placement algorithm shall minimize the number of transit switches used within a network while ensuring that the switches are placed on nodes with sufficient bandwidth to accommodate the application's traffic.
- Minimize Communication Latency: For example, in applications where the VPC configuration does not change frequently, or in latency-sensitive applications (e.g., conventional IT applications, predefined compute clusters), the placement algorithm shall minimize latency by ensuring that the transit switches are placed in proximity to, or co-located with, the endpoints.
- Improve Network Resiliency: Typically, transit switches shall be placed within the same availability zone as their network to confine the latency boundary of the network traffic. The transit routers shall span multiple availability zones to ensure availability. That said, the placement algorithm shall be flexible enough to also place the transit switches across multiple availability zones to improve resiliency. This is particularly important if the transit switches are serving scaled or proxied endpoints. Such endpoints are usually used to implement network functions that are regional or global, and placing the transit switches that serve these endpoints across multiple availability zones is essential for availability.
The scaling objectives for transit switches and routers are to continuously optimize the placement decisions in order to:
- Control Network Congestion: By increasing the number of transit switches as needed to serve the network traffic.
- Optimize Configuration Overhead and Cost Minimization: By continuously evaluating opportunities to scale down the number of transit switches and routers within a network or VPC. Scaling down the number of transit switches shall not impact ongoing or expected traffic.
All of these objectives shall be satisfied while ensuring that the number of transit switches and routers can always absorb an unexpected surge in traffic that occurs faster than the reaction time of the scaling decision algorithm.