Performance Tuning for Mellanox Adapters

  • Mellanox aims to provide the best out-of-the-box performance possible; however, in some cases, achieving optimal performance may require additional system and/or network adapter configuration.

  • As a starting point, it is always recommended to download and install the latest MLNX_OFED drivers for your OS.

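    If MLNX_OFED is already installed, the currently installed version can be checked with the ofed_info utility that ships with it (the version string below is only an illustrative example):

    $ ofed_info -s
    MLNX_OFED_LINUX-4.4-2.0.7.0:
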
  • If you are using an Ethernet fabric, be sure to correctly configure flow control, as described in the following page: "Flow Control and QoS configuration for RoCE fabrics".

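    As a quick sanity check, the current global pause settings of the Ethernet interface can be inspected with ethtool (ens2 and the values shown are only illustrative; use your own interface and the settings recommended in the page above):

    $ ethtool -a ens2
    Pause parameters for ens2:
    Autonegotiate:  off
    RX:             on
    TX:             on
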
  • Prior to running any SparkRDMA jobs, you should first test the point-to-point performance between nodes to ensure you are achieving the expected results. This can be done with the ib_send_bw utility provided in the MLNX_OFED package.

    ib_send_bw is a client/server utility. First, start the server instance. The utility defaults to the first adapter it finds in the system; if you have multiple devices in the same system, use the -d flag to specify the correct one. The ibdev2netdev command provides the mapping from RDMA device name to Ethernet device name:

    $ ibdev2netdev
    mlx5_0 port 1 ==> ens2 (Up)
    

    Start the server side:

    $ ib_send_bw -d mlx5_0
    
    ************************************
    * Waiting for client to connect... *
    ************************************
    

    Run the client to connect to the IP address of the server:

    $ ib_send_bw -d mlx5_0 192.168.1.14
    ---------------------------------------------------------------------------------------
                    Send BW Test
     Dual-port       : OFF              Device         : mlx5_0
     Number of qps   : 1                Transport type : IB
     Connection type : RC               Using SRQ      : OFF
     TX depth        : 128
     CQ Moderation   : 100
     Mtu             : 1024[B]
     Link type       : Ethernet
     GID index       : 5
     Max inline data : 0[B]
     rdma_cm QPs         : OFF
     Data ex. method : Ethernet
    ---------------------------------------------------------------------------------------
     local address: LID 0000 QPN 0x01b6 PSN 0x29626e
     GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:01:15
     remote address: LID 0000 QPN 0x02be PSN 0x1c5dfe
     GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:01:14
    ---------------------------------------------------------------------------------------
     #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
     65536      1000             8125.36            8125.06                0.130001
    ---------------------------------------------------------------------------------------
    
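    To make the result easier to compare against the link speed, recent perftest packages also accept a --report_gbits switch that reports bandwidth in Gb/s instead of MB/s (availability depends on the perftest version bundled with your MLNX_OFED):

    $ ib_send_bw -d mlx5_0 --report_gbits 192.168.1.14
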
  • It is also recommended to run the mlnx_tune utility, which runs several system checks and flags any settings that may cause performance degradation. Running mlnx_tune requires superuser privileges.

    $ sudo mlnx_tune
    2017-08-16 14:47:17,023 INFO Collecting node information
    2017-08-16 14:47:17,023 INFO Collecting OS information
    2017-08-16 14:47:17,026 INFO Collecting CPU information
    2017-08-16 14:47:17,104 INFO Collecting IRQ Balancer information
    2017-08-16 14:47:17,107 INFO Collecting Firewall information
    2017-08-16 14:47:17,111 INFO Collecting IP table information
    2017-08-16 14:47:17,115 INFO Collecting IPv6 table information
    2017-08-16 14:47:17,118 INFO Collecting IP forwarding information
    2017-08-16 14:47:17,122 INFO Collecting hyper threading information
    2017-08-16 14:47:17,122 INFO Collecting IOMMU information
    2017-08-16 14:47:17,124 INFO Collecting driver information
    2017-08-16 14:47:18,281 INFO Collecting Mellanox devices information
    
    Mellanox Technologies - System Report
    
    ConnectX-5 Device Status on PCI 03:00.0
    FW version 12.18.2000
    OK: PCI Width x16
    OK: PCI Speed 8GT/s
    PCI Max Payload Size 256
    PCI Max Read Request 512
    Local CPUs list [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
    
    ens2 (Port 1) Status
    Link Type eth
    OK: Link status Up
    Speed 100GbE
    MTU 1500
    OK: TX nocache copy 'off'
    
    2017-08-16 14:47:18,777 INFO System info file: /tmp/mlnx_tune_170816_144716.log
    
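    mlnx_tune can also apply a predefined tuning profile via its -p option; HIGH_THROUGHPUT is shown here as an example, but check mlnx_tune --help for the exact profile names available in your version:

    $ sudo mlnx_tune -p HIGH_THROUGHPUT
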
  • After running the mlnx_tune command, it is highly recommended to set the cpuList parameter (described in Configuration Properties) in the spark.conf file to the NUMA-local cores associated with the Mellanox device:

    Local CPUs list [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
    
    spark.shuffle.rdma.cpuList 0-15
    
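    The same locality information can be read directly from sysfs, which is convenient when scripting the configuration (ens2 is the interface name from the examples above; adjust for your system):

    $ cat /sys/class/net/ens2/device/local_cpulist
    0-15
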
  • More in-depth performance tuning resources can be found in the Mellanox Community post: Performance Tuning for Mellanox Adapters