For maximum performance, several hardware, operating system and Snabb parameters need to be tuned. We start with some generic system-oriented tips and move towards application configuration.
Go into your BIOS and verify that you or your hardware vendor have not enabled aggressive power-saving modes that could downclock your processors. A CPU in a power-saving mode typically takes some time to return to peak performance, and this latency can cause packet loss. Servers are often shipped configured to prioritize energy efficiency over performance, which is not the best choice if avoiding packet loss is a goal.
Note that each vendor offers different BIOS settings or profiles to enable or disable energy-efficiency features, typically named "Max performance", "Energy Efficiency" and "Custom". Select "Max performance" for latency-sensitive Snabb use.
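As a quick sanity check after changing BIOS settings, you can inspect the current clock speeds and the active frequency driver from Linux. This is a minimal sketch using standard sysfs/procfs interfaces plus the optional cpupower utility (shipped with the kernel's linux-tools package), which may not be installed on every system:
# Current clock speed of each core; heavily downclocked idle cores
# suggest aggressive power saving is still active.
grep "cpu MHz" /proc/cpuinfo
# Show the frequency driver, governor and available frequency range
# (requires the cpupower utility).
cpupower frequency-info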
Hyperthreads are a way of maximizing resource utilization on a CPU core, driven by the observation that a CPU is often waiting on memory or some external event, and might as well be doing something else while it's waiting. In such a situation, it can be advantageous to run a second thread on that CPU.
However for Snabb that's exactly what we don't want. We do not want another thread competing for compute and cache resources on our CPU and increasing our latency. For best results, lowest latency, and to avoid a source of packet loss, disable hyperthreading via the BIOS settings.
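To confirm that hyperthreading is really off after changing the BIOS, check how many hardware threads Linux sees per core. A small sketch; note that the /sys/devices/system/cpu/smt/active file is only present on reasonably recent kernels:
# "Thread(s) per core: 1" means hyperthreading is disabled.
lscpu | grep -i "thread(s) per core"
# On recent kernels, 0 means SMT is inactive.
cat /sys/devices/system/cpu/smt/active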
Intel Turbo Boost Technology allows processor cores to run faster than the rated operating frequency if they're operating below power, current, and temperature specification limits. The conservative thing to do is to turn off Turbo Boost when doing performance tests of Snabb data planes, because that way you reduce variance due to CPU clock speed changing.
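Turbo Boost can usually be disabled in the BIOS alongside the other power settings. If your system happens to use the intel_pstate frequency driver, it can also be turned off at runtime; a sketch, assuming that driver is in use:
# 1 disables Turbo Boost, 0 re-enables it (intel_pstate driver only; run as root).
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo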
To avoid power-saving heuristics causing decreased throughput and higher latency, set the CPU frequency governor to performance:
for CPUFREQ in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
[ -f $CPUFREQ ] || continue;
echo -n performance > $CPUFREQ;
done
When running a Snabb data plane, we don't want interference from the
Linux kernel. In normal operation, a Snabb data plane won't even make
any system calls at all. You can prevent the Linux kernel from
pre-empting your Snabb application to schedule other processes on its
CPU by reserving CPUs via the isolcpus
kernel boot setting.
To isolate CPUs, boot your Linux kernel with the isolcpus
parameter.
Under NixOS, edit /etc/nixos/configuration.nix
to add this parameter:
boot.kernelParams = [ "isolcpus=1-5,7-11" ];
The line above prevents the kernel from scheduling processes on CPUs 1 through 5 and 7 through 11. That leaves CPUs 0 and 6 for the Linux kernel.
By default, the kernel will deliver interrupts to the first CPU on a socket, so this isolcpus setting should also isolate the data plane from interrupt handling.
After adding the isolcpus flag, run nixos-rebuild switch and then reboot the machine to enable the changes.
For a standard Ubuntu or other system, edit /etc/default/grub
to add the appropriate isolcpus
line.
GRUB_CMDLINE_LINUX_DEFAULT="... isolcpus=1-5,7-11 ..."
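On Ubuntu, remember to run update-grub and reboot after editing the file. After rebooting (on either distribution), you can verify that the isolation took effect; a quick check using standard kernel interfaces (the isolated file is available on recent kernels):
# The isolcpus= setting should appear on the kernel command line.
cat /proc/cmdline
# Lists the isolated CPUs, e.g. 1-5,7-11.
cat /sys/devices/system/cpu/isolated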
Normally Linux will handle hardware interrupts on the first core on a
socket. In our case above, that would be cores 0 and 6. That works
well with our isolcpus
setting as well: interrupts like timers and so
on will only get delivered to the cores which Linux is managing already,
and won't interrupt the data planes.
However, some distributions (notably Ubuntu) enable irqbalance, a daemon whose job it is to configure the system to deliver interrupts to all cores. This can increase interrupt-handling throughput, but that's not what we want in a Snabb scenario: we want low latency for the data plane, and handling interrupts on data-plane CPUs is undesirable. When deploying on Ubuntu, be sure to disable irqbalance by setting ENABLED to 0 in /etc/default/irqbalance:
$ cat /etc/default/irqbalance
#Configuration for the irqbalance daemon
#Should irqbalance be enabled?
ENABLED="0"
...
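On systems where irqbalance runs as a systemd service, you can also stop and disable it directly; a sketch assuming systemd is in use:
# Stop the running daemon and prevent it from starting at boot.
sudo systemctl stop irqbalance
sudo systemctl disable irqbalance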
Various interrupt counters per CPU core can be retrieved via
$ cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 CPU8 CPU9 CPU10 CPU11 CPU12 CPU13 CPU14 CPU15 CPU16 CPU17 CPU18 CPU19
0: 41 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 IR-IO-APIC 2-edge timer
8: 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 IR-IO-APIC 8-edge rtc0
...
The oldest CPU supported by Snabb is the Sandy Bridge series, which happens to have a bug in its IOMMU support that can manifest itself either as total packet loss or as a severe performance penalty. Because of this bug we never test Snabb data planes with the IOMMU enabled, even on later CPU models, so in production you should also disable the IOMMU via a kernel boot parameter:
# Ubuntu
GRUB_CMDLINE_LINUX_DEFAULT="... intel_iommu=off ... "
# NixOS
boot.kernelParams = [ ... "intel_iommu=off" ... ];
It is said that newer CPUs don't have this issue and that we should be
using iommu=pt
for "passthrough" support. See
http://dpdk.org/ml/archives/dev/2014-October/007411.html and
https://lwn.net/Articles/329174/ for further details. The IOMMU
sounds like a nice idea and we would like to be able to take advantage
of it eventually.
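To check whether the IOMMU is actually disabled after booting with the parameter above, the kernel log is a reasonable place to look; a sketch (the exact messages vary by kernel version):
# With intel_iommu=off there should be no DMAR/IOMMU initialization messages.
dmesg | grep -i -e dmar -e iommu
# The parameter should also be visible on the kernel command line.
grep intel_iommu /proc/cmdline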
By default on a Xeon machine, the virtual memory system manages its allocations in 4096-byte "pages". It has a "page table" which maps virtual page addresses to physical memory addresses. Frequently-used parts of the page table are cached in the "translation lookaside buffer" (TLB) for fast access. A virtual memory mapping that describes 500 MB of virtual memory would normally require about 128,000 entries for 4096-byte pages. However, a TLB only has a limited amount of space and can't hold all those entries. Each missing entry causes a "TLB miss": an additional trip out to memory to fetch the page table entry, which slows down memory access.
To mitigate this problem, it's possible for a Xeon machine to have some "huge pages", which can be either 2 megabytes or 1 gigabyte in size. The same 500MB address space would then require only 250 entries for 2MB hugepages, or just 1 for 1GB hugepages. That's a big win! Also, memory within a huge page is physically contiguous, which is required to interact with some hardware devices, notably the Intel 82599 NICs.
However, because huge pages are bigger and need to be physically contiguous, it may be necessary to pre-allocate them at boot time. To do that, add the default_hugepagesz, hugepagesz, and hugepages parameters to your kernel boot. In NixOS, we use the following, adding on to the isolcpus setting mentioned above:
boot.kernelParams = [ "default_hugepagesz=2048K" "hugepagesz=2048K"
"hugepages=10000" "isolcpus=1-5,7-11" ];
For Ubuntu, edit the Grub configuration as described above.
To use one-gigabyte huge pages, replace 2048K above with 1G and reduce the hugepages count down to something more reasonable, perhaps 64.
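Huge pages can also be allocated (or topped up) at runtime through the standard procfs/sysfs interfaces, though allocation may fail once memory is fragmented, which is why boot-time reservation is preferred; a sketch (run as root, node and counts are examples):
# Reserve 10000 2MB pages system-wide (may allocate fewer if memory is fragmented).
echo 10000 > /proc/sys/vm/nr_hugepages
# Reserve 2MB pages on a specific NUMA node (here node 0).
echo 5000 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages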
Actual use of hugepages can be monitored with
$ cat /proc/meminfo |grep Huge
AnonHugePages: 12310528 kB
HugePages_Total: 64
HugePages_Free: 58
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 1048576 kB
On NUMA systems, the allocation and usage per node can be seen with
$ cat /sys/devices/system/node/node*/meminfo|grep Huge
Node 0 AnonHugePages: 12288 kB
Node 0 HugePages_Total: 32
Node 0 HugePages_Free: 32
Node 0 HugePages_Surp: 0
Node 1 AnonHugePages: 12298240 kB
Node 1 HugePages_Total: 32
Node 1 HugePages_Free: 26
Node 1 HugePages_Surp: 0
(The above example shows six 1G pages in use on Node 1 with 2 Snabb processes serving one 10GE port each).
Many Snabb network functions take a --cpu argument, which will arrange for the Snabb process to run on a particular CPU. It will also make sure that all memory used by that Snabb process is on the same NUMA node as that CPU, check that any PCI device used by the process has affinity to that NUMA node, and issue a warning if anything is amiss.
If the Snabb network function that you are using does not have a --cpu option, file a bug. Binding Snabb to a CPU and NUMA node can also be done using the numactl --membind and taskset -c commands, but we recommend the --cpu argument as it is easiest and it also runs various consistency checks, such as checking that PCI devices that you use are local to your NUMA node.
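For reference, here is a manual pinning sketch using those two commands; the snabb invocation is just a placeholder for whichever network function you run, and the node and CPU numbers are examples:
# Bind memory allocation to NUMA node 0 and pin execution to (isolated) CPU 2.
numactl --membind=0 taskset -c 2 ./snabb <program> <args...>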
In a machine with multiple sockets, you usually have Non-Uniform
Memory Access, or NUMA. On such a system, a PCI device or a range of
memory might be "closer" to one node than another. The --cpu
argument to a network function, described above, will issue a warning
if you use a PCI device that is not local to the NUMA node that
corresponds to the chosen CPU.
To determine what PCI devices are local to what NUMA nodes, you need to grovel around in /sys. For example, if you are going to be working with NICs 0000:01:00.0, 0000:01:00.1, 0000:02:00.0, and 0000:02:00.1, check:
$ for device in 0000:0{1,2}:00.{0,1}; do \
echo $device; cat /sys/bus/pci/devices/$device/numa_node; \
done
0000:01:00.0
0
0000:01:00.1
0
0000:02:00.0
0
0000:02:00.1
0
So all of these are on NUMA node 0. Then you can check your CPUs:
$ numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5
node 1 cpus: 6 7 8 9 10 11
So for these devices we should pass --cpu with a CPU from NUMA node 0 (0 through 5 in this example), which binds the process to that CPU and arranges to use only memory that is local to it.
Note that the hwloc package also offers the lstopo command to get an overall picture in text and graphical form:
lstopo --of pdf > lstopo.pdf
Example topology diagram from a Lenovo RD650 (Intel(R) Xeon(R) CPU E5-2650 v3).
The way that Snabb interfaces with a NIC is that it will configure the NIC to receive incoming packets into a "ring buffer". This ring buffer is allocated by Snabb (incidentally, on a huge page; see above) and will be filled by the NIC. It has to be a power of 2 in size: it can hold space for 64 packets, 128 packets, 256 packets, and so on. The default size is 1024 packets and the maximum is 65536. The NIC will fill this buffer with packets as it receives them: first slot 0, then slot 1, all the way up to the last slot in the ring, then back to slot 0, then slot 1, and so on. Snabb will periodically take some packets out of this ring buffer (currently 128 at a time), process them, then come back and take some more: wash, rinse, repeat.
Many network functions offer the ability to configure the ring buffer
size via a --ring-buffer-size
argument. (Again, if the network
function you are using does not provide this option, file a bug.)
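As an illustration, such an invocation might look like the following; the program name and other arguments are placeholders, and the exact set of flags depends on the network function you are running:
# Pin to CPU 2 and use a 2048-entry receive ring.
./snabb <program> --cpu 2 --ring-buffer-size 2048 <other-args...>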
What is the right size? Well, there are a few trade-offs. If the
buffer is too big, it will take up a lot of memory and start to have
too much of a cache footprint. The ring buffer is mapped into Snabb's
memory as well, and the NIC arranges for the ring buffer elements that
it writes to be directly placed in L3 cache. This means that
receiving packets can evict other entries in L3 cache. If your ring
buffer is too big, it can evict other data resident in cache that you
might want.
Another downside of having a big buffer is latency. The bigger your buffer, the more buffer bloat. Usually, Snabb applications run faster than the incoming packets, so the buffer size isn't an issue when everything is going well; but in a traffic spike where packets come in faster than Snabb can process them, the buffer can remain full the whole time, which just adds latency for the packets that Snabb does manage to process.
However, too small a buffer exposes the user to a higher risk of packet loss due to jitter in the Snabb breath time. A "breath" is one cycle of processing a batch of packets, and as we mentioned it is currently up to 128 packets at a time. This processing doesn't always take the same amount of time: the contents of the packets can obviously be different, causing different control flow and different data flow. Even well-tuned applications can exhibit jitter in their breath times due to differences between what data is in cache (and which cache) and what data has to be fetched from memory. Infrequent events like reconfiguration can also cause one breath to take longer than another. Poorly-tuned applications might have other sources of latency such as garbage collection, though this is not usually the case in production-ready Snabb network functions like the NFV virtual switch or the lwAFTR.
So, a bigger ring buffer insulates packet processing from breath jitter. You want your ring buffer to be big enough to not drop packets due to jitter during normal operation, but not bigger than that. In our testing we usually use the default ring buffer size. In your operations you might want to increase this up to 2048 entries. We have not found that bigger ring buffer sizes are beneficial, but it depends very much on the environment.
Remember that buffers are fundamentally mechanisms for smoothing over variations in latency. The real problem is the latency variation -- focus your effort on getting that right and you won't have to fiddle much with ring buffer sizes.
Running network functions under a Snabb NFV virtual switch is great for isolation, but you do have to get a number of things right or you are not going to have a good experience. Really we should link out to another guide here, but until that happens, here are some brief notes.
All of the above considerations apply to running a virtual switch -- give each virtual switch instance a dedicated CPU core on the host, bind it to the NUMA node that is local to both the NIC and that CPU, and make sure there are enough huge pages.
Additionally, for good performance, the host switch must run on the same NUMA node as the guest.
Vhost-User is used to connect Snabb to a high-performance virtual interface attached to a QEMU-based virtual machine. This requires huge pages to be made available to QEMU:
cd <qemu-dir>/bin/x86_64-softmmu/
qemu-system-x86_64 -enable-kvm -m 8000 -smp 2 \
-chardev socket,id=char0,path=./xe0.socket,server \
-netdev type=vhost-user,id=net0,chardev=char0 \
-device virtio-net-pci,netdev=net0,mac=02:cf:69:15:0b:00 \
-object memory-backend-file,id=mem,size=8000M,mem-path=/hugetlbfs,share=on \
-numa node,memdev=mem -mem-prealloc \
-realtime mlock=on \
/path/to/img
The allocated memory (-m) must match the memory-backend-file size (the example shows 8 GB). While QEMU will fail to boot if there aren't enough huge pages allocated, it is recommended to have some spare, and note that the pages are split amongst the NUMA nodes; see the section on NUMA in this document. It is also recommended to specify the QEMU option '-realtime mlock=on', despite it being the default: this ensures memory doesn't get swapped out.
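The mem-path in the example above assumes a hugetlbfs filesystem mounted at /hugetlbfs; if your system does not already provide one, a minimal sketch for creating it (the mount point name is arbitrary):
# Mount a hugetlbfs instance for QEMU's memory-backend-file.
mkdir -p /hugetlbfs
mount -t hugetlbfs nodev /hugetlbfs
# Optional /etc/fstab entry to make the mount persistent:
# nodev /hugetlbfs hugetlbfs defaults 0 0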
Virtual switches like Snabb NFV and Snabb vMX also share memory with one or more QEMU processes running virtual machines via Vhost-User. These QEMU processes must also be pinned to the same NUMA node with numactl, with optional CPU pinning:
numactl --membind=1 /usr/local/bin/qemu-system-x86_64 ...
numactl --membind=1 --physcpubind=16-17 /usr/local/bin/qemu-system-x86_64 ...
Actual memory usage per node can be displayed with 'numastat':
$ sudo numastat -c snabb
Per-node process memory usage (in MBs)
PID Node 0 Node 1 Total
--------------- ------ ------ -----
...
6049 (snabb) 0 3731 3731
6073 (snabb) 0 3732 3732
...
--------------- ------ ------ -----
Total 5 7753 7758
$ sudo numastat -c qemu
Per-node process memory usage (in MBs)
PID Node 0 Node 1 Total
--------------- ------ ------ -----
1899 (qemu-syste 0 7869 7869
1913 (qemu-syste 0 4171 4171
...
--------------- ------ ------ -----
Total 4 12040 12044
The above example shows two Snabb processes (6049 & 6073) using memory only from node 1, as desired given the NIC ports served by node 1. The two QEMU-based virtual machines also use memory only from node 1. If some memory is still used by another node for a given process, investigate its source and fix it. Possible candidates are the SHM-based filesystems used by Snabb in /var/run/snabb and the SNMP SHM location, if enabled.
CPU pinning via taskset and numactl also works within Docker containers running in privileged mode. It is important to note that the container can use all CPU cores, including the ones specifically excluded by the kernel option isolcpus:
$ taskset -cp $$
pid 9822's current affinity list: 0,4-8,12-15
$ docker run --name ubuntu -ti ubuntu:14.04.4
root@c819e0f106c4:/# taskset -cp $$
pid 1's current affinity list: 0-15
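If you want the container itself to be restricted to particular cores rather than relying on taskset inside it, Docker's --cpuset-cpus option can be used; a sketch with example core numbers:
# Limit the container to cores 1-5 (e.g. isolated cores on NUMA node 0).
docker run --cpuset-cpus="1-5" --name ubuntu -ti ubuntu:14.04.4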
Fragment reassembly is a costly operation. Make sure that network MTUs are set such that fragmentation is rare.
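Checking and, where the network supports it, raising the MTU on the relevant host interfaces is a quick way to reduce fragmentation; a sketch in which eth0 is a placeholder for your actual interface:
# Show the current MTU.
ip link show dev eth0 | grep mtu
# Raise it to 9000 bytes if your network supports jumbo frames.
ip link set dev eth0 mtu 9000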
Simply enabling ingress and/or egress filtering has a cost. Enabling all filters adds 4 apps to the Snabb graph, and there is a cost for every additional Snabb app. In our tests, while a normal run can do 10 Gbps over two interfaces in full duplex, enabling filters drops that to 8.2 Gbps before dropping packets. However we have not found that the performance depends much on the size of the filter, as the filter's branches are well-biased.
Here are a couple of notes for people writing their first Snabb data plane.
The engine.busywait flag is defined in src/core/app.lua and is enabled before calling engine.main() via
engine.busywait = true
engine.main(...)
If true then the engine will poll for new data in a tight loop (100% CPU) instead of sleeping according to the Hz setting. This will reduce overall packet latency and increase throughput at the cost of utilizing the CPU hosting Snabb at 100%.
As we mentioned above, a Snabb data plane that is not running fast
enough will fail to handle all packets incoming from the network. If
the network hardware supports it, Snabb can monitor these drops.
Install an "IngressDropMonitor" timer from
lib.timers.ingress_drop_monitor
, to issue warnings when your app is
too slow.
The TLB (translation lookaside buffer) is a cache of the translations from virtual memory addresses to physical memory addresses. When a processor changes the virtual-to-physical mapping of an address, it needs to tell the other processors to invalidate that mapping in their caches. One processor causing the TLBs on other processors to be flushed is what is called a TLB shootdown.
Occasional packet loss at ingress has been observed while TLB shootdowns happened on the system. There are per-CPU counters of TLB shootdowns available in the output of 'cat /proc/interrupts':
$ cat /proc/interrupts |grep TLB
TLB: 2012 2141 1778 1942 2357 2076 2041 2129 2209 1960 486 0 0 0 0 0 0 145 0 0 TLB shootdowns
They should remain 0 for the CPUs serving Snabb (pinned via taskset or numactl).
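To keep an eye on this during a test run, you can periodically sample the counters and look for changes; a sketch using the standard watch utility:
# Highlight any TLB shootdown counters that change between samples.
watch -d "grep TLB /proc/interrupts"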