VMA Basic Usage

This guide gathers best-practice steps for basic VMA usage. It is recommended to review the installation guide and user manual here: http://www.mellanox.com/page/software_vma?mtag=vma
These are the most up-to-date documents and provide all the information needed to use VMA.

The measurements were taken on two HPE ProLiant DL360 Gen9 servers (Intel Xeon E5-2697 v3 @ 2.60GHz, max turbo 3.6GHz) running CentOS 7.2 x86_64, with two Mellanox ConnectX-4 adapters connected back to back. The Ethernet speed was configured to 10GbE.

Preconditions:

  • It is very important to use a tuned machine and the correct NUMA node & cores (more on that in the VMA Tuning Guide)
  • Two machines – one in the server role and the second as a client
  • Management interfaces configured with an IP, so the machines can ping each other
  • Physical installation of a Mellanox card in both machines
    • Verify with "lspci | grep Mellanox" that your system recognizes the Mellanox HCA

Example:

$ lspci |grep Mellanox
03:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
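
Optionally, confirm the configured link speed as well. A minimal check, assuming the interface is named ens1 (substitute your own interface name; the output shown is illustrative):

$ ethtool ens1 | grep Speed
        Speed: 10000Mb/s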

Basic Ping-Pong TCP Test:

  • VMA Version: 8.2.10
  • OFED version: MLNX_OFED_LINUX-4.0-2.0.0.1
  • SockPerf version: 2.8-0
  • Using NUMA node 1 and cores 13,19 (see the verification below)
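
To verify which NUMA node the NIC sits on (and therefore which --cpunodebind value and cores to use), query sysfs; ens1 below is a placeholder interface name:

$ cat /sys/class/net/ens1/device/numa_node
1
$ lscpu | grep NUMA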

1. Installing OFED drivers and VMA
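
A minimal sketch of this step, assuming the matching MLNX_OFED tarball has been downloaded and extracted (the directory name and installer flags vary between OFED versions; check the release notes for your version):

$ cd MLNX_OFED_LINUX-4.0-2.0.0.1-rhel7.2-x86_64
$ ./mlnxofedinstall --vma
$ /etc/init.d/openibd restart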

2. Kernel Performance without VMA (for reference)

  • First machine (Server side):
    $ numactl --cpunodebind=1 taskset -c 19,13 sockperf sr --msg-size 14 --ip 11.4.3.3 --port 19140 --tcp

    Server side example output:
    sockperf: == version #2.8-0.git3dd5971d7d7a ==
    sockperf: [SERVER] listen on:
    [ 0] IP = 11.4.3.3        PORT = 19140 # TCP
    sockperf: Warmup stage (sending a few dummy messages)...

  • Second machine (Client side):
    $ numactl --cpunodebind=1 taskset -c 19,13 sockperf pp --time 4 --msg-size 14 --ip 11.4.3.3 --port 19140 --tcp

Client side example output:

sockperf: == version #2.8-0.git3dd5971d7d7a ==
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)  
  
[ 0] IP = 11.4.3.3        PORT = 19140 # TCP  
sockperf: Warmup stage (sending a few dummy messages)...  
sockperf: Starting test...  
sockperf: Test end (interrupted by timer)  
sockperf: Test ended  
sockperf: [Total Run] RunTime=4.100 sec; SentMessages=469124; ReceivedMessages=469123
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=4.000 sec; SentMessages=457948; ReceivedMessages=457948
sockperf: ====> avg-lat=  4.349 (std-dev=0.336)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 4.349 usec
sockperf: Total 457948 observations; each percentile contains 4579.48 observations
sockperf: ---> <MAX> observation =   42.027
sockperf: ---> percentile 99.999 =    6.944
sockperf: ---> percentile 99.990 =    6.493
sockperf: ---> percentile 99.900 =    5.543
sockperf: ---> percentile 99.000 =    5.027
sockperf: ---> percentile 90.000 =    4.753
sockperf: ---> percentile 75.000 =    4.604
sockperf: ---> percentile 50.000 =    4.401
sockperf: ---> percentile 25.000 =    4.041
sockperf: ---> <MIN> observation =    3.576

3. VMA Performance

VMA performance was measured by running the same sockperf tests with libvma preloaded (via LD_PRELOAD) and the VMA_SPEC=latency environment variable set.
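
The commands below assume VMA_LOAD points to the installed libvma shared object. A typical definition, assuming the default RPM install path on CentOS (adjust to where libvma.so lives on your system):

$ export VMA_LOAD=/usr/lib64/libvma.so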

  • First machine (Server side):
    $ VMA_SPEC=latency LD_PRELOAD=$VMA_LOAD numactl --cpunodebind=1 taskset -c 19,13 sockperf sr --msg-size 14 --ip 11.4.3.3 --port 19140 --tcp

  • Second machine (Client side):
    $ VMA_SPEC=latency LD_PRELOAD=$VMA_LOAD numactl --cpunodebind=1 taskset -c 19,13 sockperf pp --time 4 --msg-size 14 --ip 11.4.3.3 --port 19140 --tcp

Client side example output (trimmed):

VMA INFO: ---------------------------------------------------------------------------
VMA INFO: VMA_VERSION: 8.2.10-0 Release built on Mar 28 2017 03:35:42
VMA INFO: Cmd Line: taskset -c 19,13 sockperf pp --time 4 --msg-size 14 --ip 11.4.3.3 --port 19140 --tcp
VMA INFO: OFED Version: MLNX_OFED_LINUX-4.0-2.0.0.1:
VMA INFO: Spec                           Latency                    [VMA_SPEC]
VMA INFO: ---------------------------------------------------------------------------
.
.
.
sockperf: == version #2.8-0.git3dd5971d7d7a ==
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 11.4.3.3        PORT = 19140 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=4.100 sec; SentMessages=1492229; ReceivedMessages=1492228
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=4.000 sec; SentMessages=1455879; ReceivedMessages=1455879
sockperf: ====> avg-lat=  1.359 (std-dev=0.031)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 1.359 usec
sockperf: Total 1455879 observations; each percentile contains 14558.79 observations
sockperf: ---> <MAX> observation =    6.271
sockperf: ---> percentile 99.999 =    2.085
sockperf: ---> percentile 99.990 =    1.569
sockperf: ---> percentile 99.900 =    1.463
sockperf: ---> percentile 99.000 =    1.428
sockperf: ---> percentile 90.000 =    1.396
sockperf: ---> percentile 75.000 =    1.378
sockperf: ---> percentile 50.000 =    1.359
sockperf: ---> percentile 25.000 =    1.338
sockperf: ---> <MIN> observation =    1.253

Note that some additional VMA and sockperf headers on both the client and server sides were trimmed from the outputs above.

VMA shows roughly a 3.2x latency improvement compared to the kernel (4.349 / 1.359 ≈ 3.2).
Average latency:

  • Using kernel: 4.349 usec
  • Using VMA: 1.359 usec

Further Reading / Tuning

Now that VMA is working, it is important to implement any server-manufacturer and Linux-distribution tuning recommendations to achieve the lowest latency.

A few examples:
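
On CentOS 7, two commonly recommended starting points are a latency-oriented tuned profile and stopping the IRQ balancing daemon (illustrative suggestions only; defer to the server-vendor and distribution tuning guides):

$ tuned-adm profile latency-performance
$ systemctl stop irqbalance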