tpch-spark

TPC-H queries implemented in Spark using the DataFrames API (introduced in Spark 1.3.0)

Savvas Savvides

ssavvides@us.ibm.com

Running

First compile using:

sbt package

Before compiling, make sure you set INPUT_DIR and OUTPUT_DIR in the TpchQuery class to point to the location of the input data and to where the output should be saved.

You can then run a query using:

spark-submit --class "main.scala.TpchQuery" --master MASTER target/scala-2.10/spark-tpc-h-queries_2.10-1.0.jar ##

where ## is the number of the query to run (1, 2, ..., 22) and MASTER specifies the Spark mode, e.g. local, yarn, or standalone.
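To run all of the queries one after another, the command above can be wrapped in a small shell loop. The following is only a sketch: the function names run_query and run_all and the DRY_RUN switch are hypothetical helpers, not part of this repository, and the MASTER and JAR defaults are assumptions taken from the example command, so adjust them to your setup.

```shell
# Assumed defaults, copied from the spark-submit example; override via the environment.
MASTER="${MASTER:-local}"
JAR="${JAR:-target/scala-2.10/spark-tpc-h-queries_2.10-1.0.jar}"

# Run a single query by number (hypothetical helper).
# With DRY_RUN=1 the command is only printed, not executed.
run_query() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo spark-submit --class main.scala.TpchQuery --master "$MASTER" "$JAR" "$1"
  else
    spark-submit --class main.scala.TpchQuery --master "$MASTER" "$JAR" "$1"
  fi
}

# Run queries 1 through 22 in order, stopping at the first failure.
run_all() {
  q=1
  while [ "$q" -le 22 ]; do
    run_query "$q" || { echo "query $q failed" >&2; return 1; }
    q=$((q + 1))
  done
}
```

After sourcing these definitions, setting DRY_RUN=1 and calling run_all prints the 22 commands without launching Spark, which is a cheap way to check the master and jar path before a real run.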

Other Implementations

Data generator (http://www.tpc.org/tpch/)
TPC-H for Hive (https://issues.apache.org/jira/browse/hive-600)
TPC-H for PIG (https://github.com/ssavvides/tpch-pig)

