
Running HiBench with SparkRDMA


HiBench is a big data benchmark suite that can be used to evaluate different big data frameworks in terms of speed, throughput, and system resource utilization. It contains a set of Hadoop, Spark, and streaming workloads, including Sort, WordCount, TeraSort, Sleep, SQL, PageRank, Nutch indexing, Bayes, Kmeans, NWeight, and enhanced DFSIO.

Experiment #1: TeraSort

  1. Environment: 17 nodes, each with 2x Intel Xeon E5-2697 v3 @ 2.60GHz (30 cores per Worker), 256GB RAM, HDD storage, and a Mellanox ConnectX-4 network adapter on a 100GbE RoCE fabric, connected through a Mellanox Spectrum switch
  2. Apache hadoop-2.7.4 HDFS (1 namenode, 16 datanodes)
  3. Spark-2.2 standalone on 17 nodes
  4. Set up HiBench (a build sketch follows the hibench.conf settings below)
  5. Configure Hadoop and Spark settings in the HiBench conf directory
  6. In HiBench/conf/hibench.conf set:
hibench.scale.profile bigdata
# Mapper number in hadoop, partition number in Spark
hibench.default.map.parallelism         1000

# Reducer number in hadoop, shuffle partition number in Spark
hibench.default.shuffle.parallelism     15000
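Step 4 glosses over the HiBench build itself. A minimal sketch, assuming a Maven build of a HiBench release that supports Spark 2.2 / Scala 2.11 (the exact profile flags vary between HiBench versions, so check its build docs):

git clone https://github.com/Intel-bigdata/HiBench.git
cd HiBench
# Build only the Spark benchmarks against Spark 2.2 / Scala 2.11
mvn -Psparkbench -Dspark=2.2 -Dscala=2.11 clean package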
  7. Set in HiBench/conf/workloads/micro/terasort.conf:
hibench.terasort.bigdata.datasize               1890000000
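Note: hibench.terasort.*.datasize is a record count, and TeraSort records are 100 bytes each, so 1890000000 records produce the 189000000000 bytes (~189 GB) of Input_data_size shown in the report below.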
  8. Run HiBench/bin/workloads/micro/terasort/prepare/prepare.sh and HiBench/bin/workloads/micro/terasort/spark/run.sh
  9. Open HiBench/report/hibench.report:
Type               Date          Time      Input_data_size      Duration(s)          Throughput(bytes/s)  Throughput/node
ScalaSparkTerasort 2018-03-26    19:13:52  189000000000         79.931               2364539415            2364539415
  10. Add to HiBench/conf/spark.conf:
spark.driver.extraClassPath /PATH/TO/spark-rdma-3.1-for-spark-SPARK_VERSION-jar-with-dependencies.jar
spark.executor.extraClassPath /PATH/TO/spark-rdma-3.1-for-spark-SPARK_VERSION-jar-with-dependencies.jar
spark.shuffle.manager org.apache.spark.shuffle.rdma.RdmaShuffleManager
spark.shuffle.compress false
spark.shuffle.spill.compress false
spark.broadcast.compress false
spark.broadcast.checksum false
spark.locality.wait 0
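Replace /PATH/TO/ and SPARK_VERSION with the actual jar location and your Spark version. Disabling shuffle/broadcast compression and checksums trades network volume for CPU time, which pays off on a 100GbE RoCE fabric, and spark.locality.wait 0 stops the scheduler from holding tasks for data-local slots, since remote fetches over RDMA are cheap. As a quick sanity check (a sketch, not part of the original steps; it assumes spark-shell is on the PATH and a Spark 2.2.0 jar name):

spark-shell \
  --conf spark.driver.extraClassPath=/PATH/TO/spark-rdma-3.1-for-spark-2.2.0-jar-with-dependencies.jar \
  --conf spark.executor.extraClassPath=/PATH/TO/spark-rdma-3.1-for-spark-2.2.0-jar-with-dependencies.jar \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.rdma.RdmaShuffleManager
scala> sc.getConf.get("spark.shuffle.manager")
// expect: org.apache.spark.shuffle.rdma.RdmaShuffleManager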
  11. Run HiBench/bin/workloads/micro/terasort/spark/run.sh again
  12. Open HiBench/report/hibench.report:
Type               Date          Time      Input_data_size      Duration(s)          Throughput(bytes/s)  Throughput/node
ScalaSparkTerasort 2018-03-26    19:13:52  189000000000         79.931               2364539415            2364539415
ScalaSparkTerasort 2018-03-26    19:17:13  189000000000         52.166               3623049495            3623049495
  13. Overall improvement: job duration drops from 79.931 s to 52.166 s (~35% shorter), a ~1.53x gain in throughput (3623049495 vs. 2364539415 bytes/s)

Experiment #2: PageRank

  1. Environment: 6 nodes, each with 2x Intel Xeon E5-2697 v3 @ 2.60GHz (30 cores per Worker), 256GB RAM, HDD storage, and a Mellanox ConnectX-4 network adapter on a 100GbE RoCE fabric, connected through a Mellanox Spectrum switch
  2. Apache hadoop-2.7.4 HDFS (1 namenode, 5 datanodes)
  3. Spark-2.2 in YARN mode on 6 nodes
  4. Set up HiBench (as in Experiment #1)
  5. Configure Hadoop and Spark settings in the HiBench conf directory; the executor settings below typically belong in HiBench/conf/spark.conf:
spark.executor.memory  160g
hibench.yarn.executor.num     5
hibench.yarn.executor.cores   20
  6. In HiBench/conf/hibench.conf set:
hibench.scale.profile gigantic
hibench.default.map.parallelism         8000
hibench.default.shuffle.parallelism     8000
spark.default.parallelism               15000
  7. Run HiBench/bin/workloads/websearch/pagerank/prepare/prepare.sh and HiBench/bin/workloads/websearch/pagerank/spark/run.sh
  8. Open HiBench/report/hibench.report:
Type               Date          Time      Input_data_size      Duration(s)          Throughput(bytes/s)  Throughput/node
ScalaSparkPagerank 2018-10-03    15:07:28  19933175464          2184.023             9126815              1825363
ScalaSparkPagerank 2018-10-03    15:07:28  19933175464          1086.578             18344899             3668979
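As in Experiment #1, the first row is presumably the baseline run and the second row the run with the SparkRDMA settings from Experiment #1 step 10 added to HiBench/conf/spark.conf; the page does not repeat that step here.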
  9. Overall improvement: job duration drops from 2184.023 s to 1086.578 s (~50% shorter), a ~2.01x gain in throughput (18344899 vs. 9126815 bytes/s)

[Figure: PageRank results]