The code in this repo and the instructions below use the new Cassandra/Spark integration open sourced by DataStax.
See https://github.com/riptano/spark-driver for more details.
It uses a set of CSV files containing historical MLB statistics, loads them into Cassandra using the cqlsh COPY command, and then demonstrates the use of Spark from Scala and joins from Shark (using DataStax Enterprise).
If the Spark driver is not yet on Maven:
git clone https://github.com/riptano/spark-driver
cd spark-driver
sbt publishLocal # publishes it to the local Ivy repository, allowing sbt to resolve it as if it were publicly published (use publishM2 if you need it in the local .m2 for Maven)
cd ..
Install Cassandra or DataStax Enterprise.
git clone git@github.com:DataStaxCodeSamples/spark-driver-demo.git
cd spark-driver-demo
cd data
wget http://seanlahman.com/files/database/lahman-csv_2014-02-14.zip
unzip lahman-csv_2014-02-14.zip
cqlsh < ../src/main/resources/cql/schema.cql
cqlsh < ../src/main/resources/cql/load_data.cql
cd ..
./bin/sbt run # runs src/main/scala/test.scala
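For orientation, test.scala amounts to something like the following (a minimal sketch against the early com.datastax.driver.spark API; the actual file in the repo may differ):

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.driver.spark._ // adds the cassandraTable method to SparkContext

object Test extends App {
  // connect to the local Cassandra node started above
  val conf = new SparkConf(true).set("cassandra.connection.host", "127.0.0.1")
  val sc = new SparkContext("local[2]", "spark-driver-demo", conf)

  // read a Cassandra table as an RDD of CassandraRow and count the rows
  val allstarful = sc.cassandraTable("cassandra_spark_mlbdata", "allstarfull")
  println("count: " + allstarful.count)

  sc.stop()
}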
If you have DSE:
dse shark
use cassandra_spark_mlbdata;
shark> select master.*, schoolsplayers.schoolid from master join schoolsplayers where schoolsplayers.playerid = master.playerid limit 50;
dse spark
val allstarful = sc.cassandraTable("cassandra_spark_mlbdata", "allstarfull")
println("count: " + allstarful.count)
With either DSE or Cassandra (the large LIMIT works around cqlsh's default limit of 10,000 rows):
cqlsh> select count(*) from cassandra_spark_mlbdata.appearances_by_player_year limit 10000000;
count
-------
98146
(1 rows)
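As a cross-check, counting the same table from the Spark shell (keyspace and table names as above) should agree with cqlsh:

val appearances = sc.cassandraTable("cassandra_spark_mlbdata", "appearances_by_player_year")
println("count: " + appearances.count) // expect 98146, matching the cqlsh count above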
sbt consoleQuick
:paste
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.driver.spark._ // adds the cassandraTable method to SparkContext

val sparkConf = new SparkConf(true)
  .set("cassandra.connection.host", "127.0.0.1")
  .setMaster("local[2]")
  .setAppName("spark-driver-demo")
val sc = new SparkContext(sparkConf) // master and app name are already set on the conf
val allstarful = sc.cassandraTable("cassandra_spark_mlbdata", "allstarfull")
^D
How is my data divided up?
allstarful.partitions
res5: Array[org.apache.spark.Partition] = Array(CassandraPartition(0,Set(/127.0.0.1),Vector(CqlTokenRange(token("playerid") > -1759687600836599613), CqlTokenRange(token("playerid") <= -1759687600836599613)),1920))
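Each CassandraPartition above pairs a set of CQL token ranges with the replica address(es) that own them, which is what lets Spark schedule tasks near the data. To see just how many partitions were created:

allstarful.partitions.length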