
VMA Parameters

This wiki page will help you choose the correct VMA parameters for your application. For the full list of VMA parameters, their default values, and explanations, please see README.txt.

Do not ignore any VMA errors or warnings that appear as a result of changing VMA parameters! Some of your settings are probably problematic.


VMA supports a number of specs with pre-configured settings:

Latency profile spec - optimized for latency in basic use cases. The system is tuned to keep a balance between the kernel and VMA. Note: it may not reach the maximum bandwidth. Example: VMA_SPEC=latency

Multi ring latency spec - optimized for latency-sensitive use cases where two applications communicate using send-only and receive-only TCP sockets. Example: VMA_SPEC=multi_ring_latency

Recommended configuration

$ VMA_SPEC=latency LD_PRELOAD=$VMA_LOAD numactl --cpunodebind=1 taskset -c 7,9 sockperf...........


Follow the descriptions that fit your application:

I only have offloaded traffic

If all of your traffic is offloaded, you can gain a performance improvement by instructing VMA not to poll the OS for incoming traffic.

Use the following parameters:

  • VMA_RX_UDP_POLL_OS_RATIO=0
  • VMA_SELECT_POLL_OS_RATIO=0
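
For example, a minimal launch line combining these settings, following the LD_PRELOAD convention used above (the application name and $VMA_LOAD path are placeholders):

$ VMA_RX_UDP_POLL_OS_RATIO=0 VMA_SELECT_POLL_OS_RATIO=0 LD_PRELOAD=$VMA_LOAD <your_application>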

Some of my traffic is not offloaded

If the non-offloaded traffic has high priority, set the following parameters to a low ratio (high polling rate):

  • VMA_RX_UDP_POLL_OS_RATIO=1
  • VMA_SELECT_POLL_OS_RATIO=1

I need to offload only specific traffic flows

You have several options to blacklist/whitelist traffic flows:

  • Use libvma.conf to specify which traffic flows should be offloaded.
  • Use VMA extra API - add_conf_rule() - for adding libvma.conf rules at run-time.
  • Use VMA extra API - thread_offload() - to specify for each thread if sockets should be created as offloaded sockets or not.
  • Use the environment variable VMA_OFFLOADED_SOCKETS to specify if sockets should be created as offloaded/not-offloaded by default. It is recommended to use this with thread_offload() extra API.
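
For example, a minimal sketch of the environment-variable approach (the application name is a placeholder): sockets are created as not-offloaded by default, and the application itself selectively enables offloading, e.g. per thread via the thread_offload() extra API:

$ VMA_OFFLOADED_SOCKETS=0 LD_PRELOAD=$VMA_LOAD <your_application>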

My application is multithreaded

Currently, VMA does not handle more than one thread per CPU core well. The following instructions are mainly for applications with fewer threads than cores. Please see the end of this section for the case of several threads on the same core.

Currently, these instructions will mainly benefit applications that use epoll. Applications that use select/poll will not benefit as much, and might even suffer performance degradation.

Please read first the explanation about VMA buffers in "I need low memory usage".

If you have less networking threads than cores, use the following parameters:

  • VMA_RING_ALLOCATION_LOGIC_RX=20
  • VMA_RING_ALLOCATION_LOGIC_TX=20
  • VMA_RX_BUFS= #networking_threads X 50000

Now, if VMA_RX_BUFS is too large (note that each buffer is of size VMA_MTU [=1500]), you might want to lower VMA_RX_WRE, which will allow you to consume less memory. A good VMA_RX_BUFS value is VMA_RX_WRE x #rings + some spare buffers. The number of rings is one per offloaded interface per thread (with VMA_RING_ALLOCATION_LOGIC_RX/TX=20). For example, if you are accessing two interfaces (ib0, eth2) from each thread, and you have 8 threads, set VMA_RX_BUFS = (VMA_RX_WRE x (2 x 8)) x 1.5.

If you want fewer rings in the system in order to save memory, you can use VMA_RING_LIMIT_PER_INTERFACE to limit the number of rings, but it is recommended to have a separate ring per thread/core. You might have to use it if you have too many threads/cores.
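
As a worked instance of the example above (the thread count, interfaces, and the default VMA_RX_WRE=16000 are illustrative): 8 threads accessing 2 interfaces give 2 x 8 = 16 RX rings, so VMA_RX_BUFS = 16000 x 16 x 1.5 = 384000:

$ VMA_RING_ALLOCATION_LOGIC_RX=20 VMA_RING_ALLOCATION_LOGIC_TX=20 VMA_RX_WRE=16000 VMA_RX_BUFS=384000 LD_PRELOAD=$VMA_LOAD <your_application>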

You might also want to set VMA_INTERNAL_THREAD_AFFINITY to a core on the NUMA node close to the NIC that is not running anything heavy (the default is core 0).

If you have more threads than cores, and you don't bind threads to cores, we suggest using:

  • VMA_RING_ALLOCATION_LOGIC_TX=31
  • VMA_RING_ALLOCATION_LOGIC_RX=31
  • VMA_RX_BUFS= #cores X 50000

If you do bind threads to cores, use 30 instead of 31.

You may try option 30 or 31 even with fewer threads than cores.
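
For example, a sketch of an unbound run with more threads than cores on a 16-core machine (the core count and application name are illustrative; VMA_RX_BUFS follows the #cores x 50000 rule above):

$ VMA_RING_ALLOCATION_LOGIC_RX=31 VMA_RING_ALLOCATION_LOGIC_TX=31 VMA_RX_BUFS=800000 LD_PRELOAD=$VMA_LOAD <your_application>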

Also, the parameter VMA_THREAD_MODE=3 might help, but it might also make things worse, so you need to test and verify.

WARNING: This feature might hurt performance for applications whose main processing loop is based on select() and/or poll().

My application has thousands of sockets

You will need to increase the number of VMA buffers.

On the RX side, each socket might hold up to 2 X VMA_RX_WRE_BATCHING (default: 64) unused buffers for each interface it is receiving from. Therefore, VMA_RX_BUFS (default: 200000) should be at least #sockets X VMA_RX_WRE_BATCHING X 2 X #interfaces.

On the TX side, each UDP socket might hold up to 8 unused buffers for each traffic flow it is sending to, and each TCP socket might hold up to 16 unused buffers. Therefore, VMA_TX_BUFS (default: 16000) should be at least #udp_sockets X 8 X #traffic_flows + #tcp_sockets X 16.
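
As a worked instance of these formulas (the socket counts are illustrative): 5000 TCP sockets receiving from a single interface with the default VMA_RX_WRE_BATCHING=64 need at least 5000 x 64 x 2 x 1 = 640000 RX buffers, and at least 5000 x 16 = 80000 TX buffers:

$ VMA_RX_BUFS=640000 VMA_TX_BUFS=80000 LD_PRELOAD=$VMA_LOAD <your_application>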

Many connections from a single client to a server listen socket

If you have a listen socket that accepts many connections from the same source IP, you can improve performance by using VMA_TCP_3T_RULES=1.
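
For example (the application name is a placeholder):

$ VMA_TCP_3T_RULES=1 LD_PRELOAD=$VMA_LOAD <your_server_application>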

My machine has several NUMA nodes

You will probably see better performance if you bind your application to the NUMA node closest to the NIC.

First, check which NUMA node your NIC is on (you can use lspci), and which CPU cores are associated with this NUMA node (you can use numactl --hardware).

Then:

  • Set VMA_INTERNAL_THREAD_AFFINITY=core_id_on_the_numa_node
  • Use taskset to bind the application to the rest of the cores on this numa node
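
For example, a sketch of the whole flow (the interface name eth2, NUMA node 1, and the core numbers are illustrative; on most Linux systems /sys/class/net/<interface>/device/numa_node is an alternative to lspci for finding the node):

$ cat /sys/class/net/eth2/device/numa_node     # e.g. prints 1
$ numactl --hardware                           # e.g. node 1 cpus: 8-15
$ VMA_INTERNAL_THREAD_AFFINITY=8 LD_PRELOAD=$VMA_LOAD taskset -c 9-15 <your_application>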

My application was written in Java

Java runs over IPv6 by default. VMA currently does not support IPv6, so make sure you are running over IPv4:

java -Djava.net.preferIPv4Stack=true …

I need low memory usage

You can control the amount of buffers in VMA using the following parameters:

  • VMA_RX_BUFS
  • VMA_TX_BUFS

Use VMA_RX_BUFS to specify the number of RX buffers for all interfaces and sockets in VMA.

The value of VMA_RX_BUFS must be greater than the number of offloaded interfaces multiplied by VMA_RX_WRE (VMA_RX_BUFS > num_offloaded_interfaces X VMA_RX_WRE). You will need to set VMA_RX_WRE accordingly.

The same goes for VMA_TX_BUFS and VMA_TX_WRE.

The size of each buffer is determined by VMA_MTU.

If your application is only dealing with small packet sizes, you can lower VMA_MTU in order to save memory.

The total amount of memory consumed by VMA buffers is calculated by this formula: MEMORY = (VMA_RX_BUFS + VMA_TX_BUFS) x MTU.

The default values of the involved parameters are:

  • VMA_RX_BUFS=200000
  • VMA_TX_BUFS=200000
  • VMA_RX_WRE=16000
  • VMA_TX_WRE=16000
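
Plugging the defaults into the formula above gives (200000 + 200000) x 1500 bytes, roughly 600 MB of buffer memory. A lower-memory configuration might therefore look like the following sketch (the values are illustrative and must still satisfy the VMA_RX_WRE/VMA_TX_WRE constraints described above; with a single offloaded interface this comes to roughly 150 MB):

$ VMA_RX_WRE=4000 VMA_TX_WRE=4000 VMA_RX_BUFS=50000 VMA_TX_BUFS=50000 LD_PRELOAD=$VMA_LOAD <your_application>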

I need low CPU utilization

Although VMA is built to fully utilize the CPU on its RX flows, you can use the following parameters to reduce CPU utilization:

  • VMA_RX_POLL=0
  • VMA_SELECT_POLL=0

I want to extract VMA logs

Use VMA_TRACELEVEL to set the log level of messages printed to the screen. The default is 3; use 4 or 5 (or even 6, which is not recommended) for more output. Each level also prints the messages of all lower levels.

Use VMA_LOG_FILE to redirect the log to a file. If you use a multithreaded application or several VMA applications, use "%d" in the name of the log file so that it will contain the relevant process/thread id. For example: VMA_LOG_FILE="/tmp/vmalog.%d".

Use VMA_LOG_DETAILS to control the details (thread-id, process-id and time) in every log message. Values are 0-3.

Note that using high log levels will slow down your application. Redirecting the log to a file can help with this.
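
For example (the log path follows the %d convention mentioned above; the application name is a placeholder):

$ VMA_TRACELEVEL=4 VMA_LOG_DETAILS=2 VMA_LOG_FILE="/tmp/vmalog.%d" LD_PRELOAD=$VMA_LOAD <your_application>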

I want to extract VMA statistics

Use VMA_STATS_FILE to output VMA statistics to a file.

You can also see the statistics using vma_stats.
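
For example (the stats file path and application name are placeholders; the vma_stats utility can be run from another shell while the application is running):

$ VMA_STATS_FILE=/tmp/vma_stats.txt LD_PRELOAD=$VMA_LOAD <your_application>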

Tuning IOMUX - epoll/select/poll

VMA_SELECT_POLL

While blocking, this controls the duration for which to poll the hardware on the RX path before going to sleep (waiting for an interrupt or a timeout in the OS select(), poll(), or epoll_wait() call).

For best latency, use -1 for infinite polling.

Default value is 100000 (usec).

VMA_SELECT_POLL_OS_RATIO

Controls the ratio at which VMA will poll the not-offloaded sockets.

Disable with 0 if all traffic is offloaded.

Set it to high polling (a low ratio) if the not-offloaded traffic has high priority.

Default value is 10.
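
For example, a latency-oriented sketch for fully offloaded traffic, combining infinite polling with OS polling disabled (the application name is a placeholder):

$ VMA_SELECT_POLL=-1 VMA_SELECT_POLL_OS_RATIO=0 LD_PRELOAD=$VMA_LOAD <your_application>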

Tuning RX calls

VMA_RX_POLL

For blocking sockets only, this controls the number of times to poll the RX path for ready packets before going to sleep (waiting for an interrupt in blocked mode).

For best latency, use -1 for infinite polling.

For low CPU usage, use 1 for a single poll.

Default value is 100000.

General tricks

  • Set VMA_MTU / VMA_MSS to your maximum packet size (sent and received). If your maximum packet size is small, this will save memory, might improve latency, and give better cache utilization.