Skip to content

TPC-H queries in spark SQL using native DataFrames API

License

Notifications You must be signed in to change notification settings

AgilData/tpch-spark

 
 

Repository files navigation

tpch-spark

TPC-H queries implemented in Spark using the DataFrames API (introduced in Spark 1.3.0)

Savvas Savvides

ssavvides@us.ibm.com

savvas@purdue.edu

Running

First compile using:

sbt package

Make sure you set the INPUT_DIR and OUTPUT_DIR in TpchQuery class before compiling to point to the location the of the input data and where the output should be saved.

You can then run a query using:

spark-submit --class "main.scala.TpchQuery" --master MASTER target/scala-2.10/spark-tpc-h-queries_2.10-1.0.jar ##

where ## is the number of the query to run e.g 1, 2, ..., 22 and MASTER specifies the spark-mode e.g local, yarn, standalone etc...

Other Implementations

  1. Data generator (http://www.tpc.org/tpch/)

  2. TPC-H for Hive (https://issues.apache.org/jira/browse/hive-600)

  3. TPC-H for PIG (https://github.com/ssavvides/tpch-pig)

About

TPC-H queries in spark SQL using native DataFrames API

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • C 54.8%
  • Scala 27.1%
  • Roff 6.5%
  • Shell 5.9%
  • Makefile 2.3%
  • C++ 2.3%
  • Other 1.1%