This repository has been archived by the owner on Dec 20, 2022. It is now read-only.

Troubleshooting SparkRDMA

Peter Rudenko edited this page Mar 30, 2018 · 2 revisions

Troubleshooting

  • If you encounter Spark job failures or inconsistent performance when using the SparkRDMA plugin, start by checking the job logs for RDMA-related messages:

    $ grep Rdma <your log file>
    
  • The output will contain many informational messages, not all of which indicate an actual error. A common performance issue is oversubscription of a QP (queue pair). If you see the following warning, follow its recommendation and increase the spark.shuffle.io.rdmaSendDepth parameter.

    17/08/14 14:33:38 WARN RdmaChannel: RDMA channel org.apache.spark.shuffle.rdma.RdmaChannel@7608ffc9 oversubscription detected. RDMA send queue depth is too small. To improve performance, please set set spark.shuffle.io.rdmaSendDepth to a higher value (current depth: 1024
    
  • If you see the error "Failed to bind. Make sure your NIC supports RDMA.", add the following to spark-env.sh so that Spark binds to the IP address of the RDMA interface:

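    # Replace RDMA_INTERFACE_NAME with the name of your RDMA-capable network interface.
    RDMA_INTERFACE="RDMA_INTERFACE_NAME"
    # Extract the interface's IPv4 address, without the prefix length.
    RDMA_IP=$(ip addr show "$RDMA_INTERFACE" | grep "inet\b" | awk '{print $2}' | cut -d/ -f1)

    export SPARK_LOCAL_IP="$RDMA_IP"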
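
The log checks described on this page can be combined into a small helper. The following is only a sketch: the function name check_rdma_log and the hint texts it prints are hypothetical and not part of SparkRDMA, and the patterns it searches for are taken from the two messages quoted above, so they may need adjusting if your plugin version words them differently.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of SparkRDMA): scan a Spark job log for the
# two messages discussed on this page and print a hint for each one found.
# Returns 0 if neither message is present, 1 otherwise.
check_rdma_log() {
    local log_file="$1"
    local status=0

    # Send queue oversubscription warning logged by RdmaChannel.
    if grep -qF "oversubscription detected" "$log_file"; then
        echo "oversubscription: consider increasing spark.shuffle.io.rdmaSendDepth"
        status=1
    fi

    # Bind failure, typically fixed by pointing SPARK_LOCAL_IP at the RDMA interface.
    if grep -qF "Failed to bind" "$log_file"; then
        echo "bind failure: set SPARK_LOCAL_IP to the RDMA interface address in spark-env.sh"
        status=1
    fi

    if [ "$status" -eq 0 ]; then
        echo "no known RDMA issues found"
    fi
    return "$status"
}
```

After sourcing the file that defines it, call the function as check_rdma_log <your log file>. A non-zero exit status means at least one of the two messages was found, which makes the helper usable in scripts that run after a job finishes.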