Spark is an open-source distributed computing system written in Scala. The project was started by Ph.D. students from the AMPLab and is an integral part of the Berkeley Data Analytics Stack.
Like Hadoop MapReduce, Spark is designed to run functions over large collections of data by supporting a simplified set of high-level data processing operations akin to the iterators we've been learning about in class. One of the most common uses of such systems is to implement parallel query processing in high-level languages such as SQL. In fact, much of the recent research and development effort in Spark has gone toward supporting a scalable and interactive relational database abstraction.
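To make this concrete, here is a minimal sketch of the style of high-level operations Spark exposes. The application name, local master URL, and toy dataset are purely illustrative; only the `parallelize`, `filter`, `map`, and `reduce` calls are standard Spark API.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordLengths {
  def main(args: Array[String]): Unit = {
    // Run locally using all available cores; a real deployment would point
    // the master URL at a cluster instead.
    val conf = new SparkConf().setAppName("WordLengths").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // A small in-memory collection stands in for a large distributed dataset.
    val words = sc.parallelize(Seq("spark", "scala", "iterator", "sql"))

    // filter and map are the same high-level operators as the iterators
    // seen in class; they are evaluated lazily and in parallel across
    // partitions, and reduce triggers the actual computation.
    val totalLength = words
      .filter(_.length > 3)
      .map(_.length)
      .reduce(_ + _)

    println(s"Total length of long words: $totalLength")
    sc.stop()
  }
}
```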
We'll be using, modifying, and studying aspects of Spark in this class to understand key concepts of modern data systems. More importantly, you will see that the ideas we're covering in class -- some of which are decades old -- are still very relevant today. Specifically, we will be adding features to Spark SQL.
One key limitation of Spark SQL is that it is currently a main-memory-only system. As part of this class, we will extend it to include some out-of-core algorithms as well.
Scala is a statically-typed language that supports many different programming paradigms. Its flexibility, power, and portability have become especially useful in distributed-systems research.
Scala resembles Java, but it provides a much broader set of syntax features to support multiple paradigms. Knowing Java will help you understand some Scala code, but only some of it, and without knowing Scala you will not be able to take full advantage of its expressive power. Because you must write code in Scala, we strongly recommend that you acquire at least a passing familiarity with the language.
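As a small taste of that expressiveness, here is a self-contained sketch combining object-oriented and functional features in plain Scala; the `Employee` class and the sample data are hypothetical, chosen only to illustrate case classes, higher-order functions, and pattern matching.

```scala
// A case class defines an immutable record type with structural equality
// and pattern-matching support for free (object-oriented and functional at once).
case class Employee(name: String, salary: Double)

object ScalaTour {
  def main(args: Array[String]): Unit = {
    val staff = List(Employee("Ada", 120000.0), Employee("Grace", 135000.0))

    // Higher-order functions and anonymous functions, the same style of
    // collection operators that Spark's API builds on.
    val totalPayroll = staff.map(_.salary).sum

    // Pattern matching deconstructs values declaratively, with guards.
    staff.foreach {
      case Employee(name, s) if s > 125000 => println(s"$name is senior")
      case Employee(name, _)               => println(s"$name is on staff")
    }

    println(s"Total payroll: $totalPayroll")
  }
}
```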
You might find the following tutorials to be useful: