This repository has been archived by the owner on Dec 20, 2022. It is now read-only.

Getting lower RDMA perf than TCP/IP perf #21

Open
gilbertc777 opened this issue Feb 13, 2019 · 16 comments

Comments

@gilbertc777

Hi all,

I am running the HiBench suite TeraSort benchmark against RDMA Spark. My current results show TCP performing better than RDMA. Any thoughts on configs to look into? Happy to post the current config settings as well to help diagnose.

@petro-rudenko
Member

Hi, sure, glad to assist. Can you please describe your setup (number of nodes, hardware type, what workload you are running, the Spark/HiBench config you are using, etc.)?

@gilbertc777
Author

Hi Peter,

I am using HiBench to run the TeraSort and WordCount benchmarks.

The setup is as follows:

10 nodes with 36 cores each @ 3 GHz and 384 GB RAM

100 Gb RoCE over Mellanox ConnectX-4 cards

I will upload the configs in a few.

Thanks for the help!

@gilbertc777
Author

gilbertc777 commented Feb 14, 2019

SPARK_MASTER_HOST=hpc-cluster-cn1-rdma
SPARK_MASTER_IP=192.18.18.1
SPARK_LOCAL_DIRS=/localvolume/hadoop/spark/tmp
SPARK_WORKER_DIR=/localvolume/hadoop/spark/worker
SPARK_LOCAL_DIRS=/localvolume/hadoop/spark/local
RDMA_INTERFACE=mlx5_0
--

HiBench spark.conf

# Spark home
hibench.spark.home      $SPARK_HOME
# Spark master
#   standalone mode: spark://xxx:7077
#   YARN mode: yarn-client 
hibench.spark.master    spark://hpc-cluster-cn1-rdma:7077
#hibench.spark.master yarn
spark.shuffle.compress          false
spark.shuffle.spill.compress    false
spark.driver.extraClassPath /localvolume/nfs-share/hadoop-rdma/applications/mellanox/spark-2.4.0-bin-hadoop2.7/rdma/spark-rdma-3.1-for-spark-2.4.0-jar-with-dependencies.jar
spark.executor.extraClassPath /localvolume/nfs-share/hadoop-rdma/applications/mellanox/spark-2.4.0-bin-hadoop2.7/rdma/spark-rdma-3.1-for-spark-2.4.0-jar-with-dependencies.jar
spark.shuffle.manager org.apache.spark.shuffle.rdma.RdmaShuffleManager
spark.roce.enabled false
spark.ib.enabled false
spark.shuffle.rdma.device.name mlx5_0
spark.shuffle.rdma.pconn.enabled true
spark.broadcast.compress false
spark.broadcast.checksum false
spark.locality.wait 0 
#spark.shuffle.rdma.chunk.size 2097152
#spark.shuffle.rdma.memory.slab.size 134217728
# executor number and cores when running on Yarn
hibench.yarn.executor.num     2
hibench.yarn.executor.cores   2
# executor and driver memory in standalone & YARN mode
spark.executor.memory  4g
spark.driver.memory    4g
--

[opc@hpc-cluster-cn1-rdma conf]$ cat hadoop.conf
# Hadoop home
hibench.hadoop.home    $HADOOP_HOME
# The path of hadoop executable
hibench.hadoop.executable     ${hibench.hadoop.home}/bin/hadoop
# Hadoop configuration directory
hibench.hadoop.configure.dir  ${hibench.hadoop.home}/etc/hadoop
# The root HDFS path to store HiBench data
hibench.hdfs.master       hdfs://hpc-cluster-cn1-rdma:9000
# Hadoop release provider. Supported value: apache, cdh5, hdp
hibench.hadoop.release    apache
--
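
As a quick sanity check of the RDMA device settings above (RDMA_INTERFACE / spark.shuffle.rdma.device.name), a rough sketch assuming the Mellanox OFED tools are installed:

# confirm the device exists and its port is ACTIVE with link_layer Ethernet (RoCE)
ibstat mlx5_0

# map RDMA devices to their Ethernet netdevs and show link state (Mellanox OFED utility)
ibdev2netdev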

@gilbertc777
Author

gilbertc777 commented Feb 14, 2019

Top of the hibench.conf

[opc@hpc-cluster-cn1-rdma conf]$ cat hibench.conf
# Data scale profile. Available value is tiny, small, large, huge, gigantic and bigdata.
# The definition of these profiles can be found in the workload's conf file, i.e. conf/workloads/micro/wordcount.conf
hibench.scale.profile                bigdata 
# Mapper number in hadoop, partition number in Spark
hibench.default.map.parallelism         1000
# Reducer number in hadoop, shuffle partition number in Spark
hibench.default.shuffle.parallelism     15000

@petro-rudenko
Member

@gilbertc777 thanks, what's the result you get for TCP and RDMA?

@gilbertc777
Author

Type               Date       Time     Input_data_size  Duration(s)  Throughput(bytes/s)  Throughput/node
ScalaSparkTerasort 2019-02-12 23:58:52 32000000000      112.184      285245667            28524566
ScalaSparkTerasort 2019-02-13 00:02:02 32000000000      122.759      260673351            26067335

The first test is running TCP, the second is running with RDMA enabled. Note that the RDMA runtime is higher.

@petro-rudenko
Member

I see.

  1. Do you have flow control configured? (See the sketch after this list.)
  2. Can you please try to run the pagerank workload?
  3. Can you please share the Spark executor logs?
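
On the flow-control point, a minimal sketch of checking and enabling RoCE priority flow control on Mellanox NICs, assuming the OFED mlnx_qos tool is installed, RoCE traffic uses priority 3, and enp1s0 is a placeholder interface name:

# show the current QoS/PFC configuration for the RoCE interface
mlnx_qos -i enp1s0

# enable priority flow control on priority 3 only (lossless class for RoCE traffic)
mlnx_qos -i enp1s0 --pfc 0,0,0,1,0,0,0,0

# alternatively, enable global pause frames instead of PFC
ethtool -A enp1s0 rx on tx on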

@jiameixie

Hi, I have the same problem. The configuration is almost the same. The following is part of the bench log. Can you help me?

19/07/11 14:55:11 INFO cluster.YarnClientSchedulerBackend: Stopped
19/07/11 14:55:11 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/07/11 14:55:11 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(11)
19/07/11 14:55:11 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(14)
19/07/11 14:55:11 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(3)
19/07/11 14:55:11 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(12)
19/07/11 14:55:11 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(7)
19/07/11 14:55:11 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(6)
19/07/11 14:55:11 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(16)
19/07/11 14:55:11 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(8)
19/07/11 14:55:11 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(1)
19/07/11 14:55:11 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(15)
19/07/11 14:55:11 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(13)
19/07/11 14:55:11 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(5)
19/07/11 14:55:11 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(2)
19/07/11 14:55:11 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(4)
19/07/11 14:55:11 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(0)
19/07/11 14:55:11 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(9)
19/07/11 14:55:11 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(17)
19/07/11 14:55:11 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(10)
19/07/11 14:55:11 INFO ibm.disni: disconnect, id 0
19/07/11 14:55:11 INFO ibm.disni: disconnect, id 0
19/07/11 14:55:11 INFO ibm.disni: disconnect, id 0
19/07/11 14:55:11 INFO ibm.disni: disconnect, id 0
java.lang.NullPointerException
at org.apache.spark.shuffle.rdma.RdmaChannel.processRdmaCmEvent(RdmaChannel.java:345)
at org.apache.spark.shuffle.rdma.RdmaChannel.stop(RdmaChannel.java:894)
at org.apache.spark.shuffle.rdma.RdmaNode.lambda$new$0(RdmaNode.java:203)
at java.lang.Thread.run(Thread.java:748)
Exception in thread "RdmaNode connection listening thread" java.lang.RuntimeException: Exception in RdmaNode listening thread java.lang.NullPointerException
at org.apache.spark.shuffle.rdma.RdmaNode.lambda$new$0(RdmaNode.java:210)
at java.lang.Thread.run(Thread.java:748)
19/07/11 14:55:11 INFO ibm.disni: disconnect, id 0
19/07/11 14:55:11 INFO ibm.disni: disconnect, id 0
19/07/11 14:55:11 INFO ibm.disni: disconnect, id 0
19/07/11 14:55:11 INFO ibm.disni: disconnect, id 0
19/07/11 14:55:11 INFO ibm.disni: disconnect, id 0
19/07/11 14:55:11 INFO ibm.disni: disconnect, id 0
19/07/11 14:55:11 INFO ibm.disni: disconnect, id 0
19/07/11 14:55:11 INFO ibm.disni: disconnect, id 0
19/07/11 14:55:11 INFO ibm.disni: disconnect, id 0
19/07/11 14:55:11 INFO ibm.disni: disconnect, id 0
19/07/11 14:55:11 INFO ibm.disni: disconnect, id 0
19/07/11 14:55:11 INFO ibm.disni: disconnect, id 0
19/07/11 14:55:11 INFO ibm.disni: disconnect, id 0
19/07/11 14:55:11 INFO ibm.disni: disconnect, id 0
19/07/11 14:55:11 INFO rdma.RdmaBufferManager: Rdma buffers allocation statistics:
19/07/11 14:55:11 INFO rdma.RdmaBufferManager: Pre allocated 0, allocated 4611 buffers of size 4 KB
19/07/11 14:55:11 INFO ibm.disni: deallocPd, pd 1
19/07/11 14:55:11 INFO ibm.disni: destroyCmId, id 0
19/07/11 14:55:11 INFO ibm.disni: destroyEventChannel, channel 0
19/07/11 14:55:11 INFO memory.MemoryStore: MemoryStore cleared
19/07/11 14:55:11 INFO storage.BlockManager: BlockManager stopped
19/07/11 14:55:11 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
19/07/11 14:55:11 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/07/11 14:55:11 INFO spark.SparkContext: Successfully stopped SparkContext
19/07/11 14:55:11 INFO util.ShutdownHookManager: Shutdown hook called

@jiameixie

Some of the YARN log:
19/07/11 16:19:43 INFO rdma.RdmaNode: Starting RdmaNode Listening Server, listening on: /8.8.8.4:50189
19/07/11 16:19:43 INFO rdma.RdmaNode: Established connection to /8.8.8.4:60996 in 253 ms
19/07/11 16:19:43 INFO rdma.RdmaNode: Established connection to /8.8.8.4:56128 in 153 ms
19/07/11 16:19:43 INFO rdma.RdmaNode: Established connection to /8.8.8.4:43667 in 153 ms
19/07/11 16:19:43 INFO rdma.RdmaNode: Established connection to /8.8.8.4:49283 in 161 ms
19/07/11 16:19:43 INFO rdma.RdmaNode: Established connection to /8.8.8.4:48335 in 154 ms
19/07/11 16:19:43 INFO rdma.RdmaNode: Established connection to /8.8.8.4:35004 in 154 ms
19/07/11 16:19:43 INFO rdma.RdmaNode: Established connection to /8.8.8.5:53523 in 156 ms
19/07/11 16:19:43 INFO rdma.RdmaNode: Established connection to /8.8.8.4:57867 in 172 ms
19/07/11 16:19:43 INFO rdma.RdmaNode: Established connection to /8.8.8.4:35765 in 181 ms
19/07/11 16:19:45 INFO rdma.RdmaShuffleManager: RDMA write map task output for mapId: 68 to driver took 6 ms
19/07/11 16:19:45 INFO rdma.RdmaShuffleManager: RDMA write map task output for mapId: 16 to driver took 0 ms
19/07/11 16:19:45 INFO rdma.RdmaShuffleManager: RDMA write map task output for mapId: 49 to driver took 0 ms
19/07/11 16:19:45 INFO rdma.RdmaShuffleManager: RDMA write map task output for mapId: 32 to driver took 1 ms
19/07/11 16:19:45 INFO rdma.RdmaShuffleManager: RDMA write map task output for mapId: 87 to driver took 2 ms
19/07/11 16:19:46 INFO rdma.RdmaShuffleManager: RDMA write map task output for mapId: 35 to driver took 0 ms
19/07/11 16:19:46 INFO rdma.RdmaShuffleManager: RDMA write map task output for mapId: 47 to driver took 1 ms
19/07/11 16:19:46 INFO rdma.RdmaShuffleManager: RDMA write map task output for mapId: 26 to driver took 0 ms
19/07/11 16:19:46 INFO rdma.RdmaShuffleManager: RDMA write map task output for mapId: 61 to driver took 0 ms
19/07/11 16:19:46 INFO rdma.RdmaShuffleManager: RDMA write map task output for mapId: 53 to driver took 1 ms
19/07/11 16:19:47 INFO rdma.RdmaShuffleManager: RDMA write map task output for mapId: 98 to driver took 1 ms
19/07/11 16:19:47 INFO rdma.RdmaShuffleManager: RDMA write map task output for mapId: 108 to driver took 1 ms
19/07/11 16:19:47 INFO rdma.RdmaShuffleManager: RDMA write map task output for mapId: 112 to driver took 0 ms
19/07/11 16:19:48 INFO rdma.RdmaShuffleManager: Getting MapTaskOutput table for shuffleId: 0 of size 4096 from driver
19/07/11 16:19:48 INFO rdma.RdmaShuffleManager: RDMA read mapTaskOutput table for shuffleId: 0 took 3 ms
19/07/11 16:19:57 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(7)
19/07/11 16:19:57 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(6)
19/07/11 16:19:57 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(0)
19/07/11 16:19:57 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(8)
19/07/11 16:19:57 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(2)
19/07/11 16:19:57 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(3)
19/07/11 16:19:57 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(9)
19/07/11 16:19:57 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(1)
19/07/11 16:19:57 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(5)
19/07/11 16:19:57 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(12)
19/07/11 16:19:57 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(4)
19/07/11 16:19:57 ERROR rdma.RdmaNode: Failed to stop RdmaChannel during 50 ms
19/07/11 16:19:57 WARN rdma.RdmaChannel: Failed to get RDMA_CM_EVENT_DISCONNECTED: getCmEvent() failed
19/07/11 16:19:57 WARN rdma.RdmaChannel: Failed to get RDMA_CM_EVENT_DISCONNECTED: getCmEvent() failed
19/07/11 16:19:57 WARN rdma.RdmaChannel: Failed to get RDMA_CM_EVENT_DISCONNECTED: getCmEvent() failed
19/07/11 16:19:57 WARN rdma.RdmaChannel: Failed to get RDMA_CM_EVENT_DISCONNECTED: getCmEvent() failed
19/07/11 16:19:57 WARN rdma.RdmaChannel: Failed to get RDMA_CM_EVENT_DISCONNECTED: getCmEvent() failed
19/07/11 16:19:57 WARN rdma.RdmaChannel: Failed to get RDMA_CM_EVENT_DISCONNECTED: getCmEvent() failed
19/07/11 16:19:57 WARN rdma.RdmaChannel: Failed to get RDMA_CM_EVENT_DISCONNECTED: getCmEvent() failed
19/07/11 16:19:57 WARN rdma.RdmaChannel: Failed to get RDMA_CM_EVENT_DISCONNECTED: getCmEvent() failed
19/07/11 16:19:57 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(14)
19/07/11 16:19:57 ERROR rdma.RdmaNode: Failed to stop RdmaChannel during 50 ms
19/07/11 16:19:58 INFO rdma.RdmaNode: Exiting RdmaNode Listening Server
19/07/11 16:19:58 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(16)
19/07/11 16:19:58 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(13)
19/07/11 16:19:58 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(15)
19/07/11 16:19:58 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(11)
19/07/11 16:19:58 INFO rdma.RdmaChannel: Stopping RdmaChannel RdmaChannel(10)
19/07/11 16:19:58 ERROR rdma.RdmaNode: Failed to stop RdmaChannel during 50 ms
19/07/11 16:19:58 INFO rdma.RdmaBufferManager: Rdma buffers allocation statistics:
19/07/11 16:19:58 INFO rdma.RdmaBufferManager: Pre allocated 0, allocated 342 buffers of size 4 KB
19/07/11 16:19:58 INFO rdma.RdmaBufferManager: Pre allocated 0, allocated 535 buffers of size 512 KB
SLF4J: Found binding in [jar:file:/share/spark_rdma/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class

@petro-rudenko
Member

So these errors are during the shutdown phase. Does the job itself complete successfully?

@jiameixie

Does this mean it completed successfully?
19/07/11 16:33:52 INFO DAGScheduler: ResultStage 2 (runJob at SparkHadoopMapReduceWriter.scala:88) finished in 8.207 s
19/07/11 16:33:52 INFO DAGScheduler: Job 1 finished: runJob at SparkHadoopMapReduceWriter.scala:88, took 13.799956 s
19/07/11 16:33:55 INFO SparkHadoopMapReduceWriter: Job job_20190711163338_0006 committed.

@petro-rudenko
Member

Yes, so the job finished. You can see the job history by setting:

spark.eventLog.enabled true
spark.eventLog.dir PATH_FOR_HISTORY_FILES

@jiameixie

In spark-defaults.conf I set the lines below, but there were no logs in the directory. I run in YARN client mode. Do you have another way to check this issue?
spark.eventLog.enabled true
spark.eventLog.dir /tmp/spark_logs

@petro-rudenko
Member

https://spark.apache.org/docs/latest/monitoring.html#viewing-after-the-fact
You need to start the history server and set spark.eventLog.dir hdfs:///tmp/spark_logs
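
A minimal sketch of that setup, assuming the event logs go to an HDFS directory (the path below is just an example):

# spark-defaults.conf
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///tmp/spark_logs
spark.history.fs.logDirectory    hdfs:///tmp/spark_logs

# create the log directory, then start the history server (UI on port 18080 by default)
hdfs dfs -mkdir -p /tmp/spark_logs
$SPARK_HOME/sbin/start-history-server.sh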

@jiameixie

jiameixie commented Jul 18, 2019

Hi Petro, I still have the problem. Can you help me analyze it? Thanks. I use 2 ARM machine nodes to run TeraSort with the "huge" profile (about 29 GB of data).

cpu info:
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 56
On-line CPU(s) list: 0-55
Thread(s) per core: 1
Core(s) per socket: 28
Socket(s): 2
NUMA node(s): 2
Vendor ID: Cavium
Model: 1
Model name: ThunderX2 99xx
Stepping: 0x1
BogoMIPS: 400.00
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 32768K
NUMA node0 CPU(s): 0-27
NUMA node1 CPU(s): 28-55
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid asimdrdm

Net card:

Infiniband device 'mlx5_1' port 1 status:
default gid: fe80:0000:0000:0000:9a03:9bff:fe68:4e47
base lid: 0x0
sm lid: 0x0
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 40 Gb/sec (4X QDR)
link_layer: Ethernet

driver: mlx5_core
version: 4.6-1.0.1
firmware-version: 16.23.1020 (MT_0000000090)
expansion-rom-version:
bus-info: 0000:05:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

OS: ubuntu 18.04
java version "1.8.0_211"
hadoop version: 2.7.4
spark version: 2.2.3
disni version: v1.7

top of hibench.conf
hibench.scale.profile huge

# Mapper number in hadoop, partition number in Spark

hibench.default.map.parallelism 200

# Reducer number in hadoop, shuffle partition number in Spark

hibench.default.shuffle.parallelism 120

hadoop.conf

# Hadoop home

hibench.hadoop.home /home/linux/xjm/spark_rdma/hadoop-2.7.4

# The path of hadoop executable

hibench.hadoop.executable ${hibench.hadoop.home}/bin/hadoop

# Hadoop configuration directory

hibench.hadoop.configure.dir ${hibench.hadoop.home}/etc/hadoop

# The root HDFS path to store HiBench data

hibench.hdfs.master hdfs://wls-arm-cavium05:9000

# Hadoop release provider. Supported value: apache, cdh5, hdp

hibench.hadoop.release apache

The test results (duration in seconds):
no-rdma: 405
rdma: 401

The following are the configuration files and executor logs. Does my configuration have problems?

stderr-excutor1-no-rdma.txt
stderr-excutor2-rdma.txt
stderr_excutor1-rdma.txt
bench.log
spark.conf.txt
core-site.xml.txt
hdfs-site.xml.txt
spark-defauts.conf.txt
spark-env.sh.txt

@petro-rudenko
Member

@jiameixie, two executors are too few to see a real advantage of RDMA over TCP. Try increasing the number of executors, the number of cores, and the workload size (at least gigantic or bigdata). A sketch of such scaled-up settings is shown below.
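
For example, a sketch of the kind of scaled-up settings meant here, with placeholder numbers for this 2-node, 56-core-per-node setup:

# conf/hibench.conf
hibench.scale.profile                 bigdata
hibench.default.map.parallelism       400
hibench.default.shuffle.parallelism   400

# conf/spark.conf
hibench.yarn.executor.num     8
hibench.yarn.executor.cores   6
spark.executor.memory         16g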
