Update README w/some notes on motivation. Fixes apache#20.
Josh Wills authored and peihe committed Mar 14, 2016
1 parent 937a2de commit 60f3284
Showing 1 changed file with 27 additions and 5 deletions.
runners/spark/README.md: 27 additions & 5 deletions
@@ -1,13 +1,34 @@
spark-dataflow
==============

-Spark-dataflow is an early prototype. If this project interests you, you should know that we
-encourage outside contributions. So, hack away! To get an idea of what we have already identified as
+## Intro
+
+Spark-dataflow allows users to execute data pipelines written against the Google Cloud Dataflow API
+with Apache Spark. Spark-dataflow is an early prototype, and we'll be working on it continuously.
+If this project interests you, we welcome issues, comments, and (especially!) pull requests.
+To get an idea of what we have already identified as
areas that need improvement, check out the issues listed in the GitHub repo.

-Spark-dataflow allows users to execute dataflow pipelines with Spark. Executing a pipeline on a
-spark cluster is easy: Depend on spark-dataflow in your project and execute your pipeline in a
-program by calling `SparkPipelineRunner.run`.
+## Motivation
+
+We had two primary goals when we started working on Spark-dataflow:
+
+1. *Provide portability for data pipelines written for Google Cloud Dataflow.* Google makes
+it really easy to get started writing pipelines against the Dataflow API, but they wanted
+to be sure that creating a pipeline using their tools would not lock developers into their
+platform. A Spark-based implementation of Dataflow means that you can take your pipeline
+logic with you wherever you go. This also means that any new machine learning and anomaly
+detection algorithms that are developed against the Dataflow API are available to everyone,
+regardless of their underlying execution platform.
+
+2. *Experiment with new data pipeline design patterns.* The Dataflow API has a number of
+interesting ideas, especially with respect to the unification of batch and stream data
+processing into a single API that maps onto two separate engines. The Dataflow streaming
+engine, based on Google's [MillWheel](http://research.google.com/pubs/pub41378.html), does
+not have a direct open source analogue, and we wanted to understand how to replicate its
+functionality using frameworks like Spark Streaming.
+
## Getting Started

The Maven coordinates of the current version of this project are:

@@ -35,3 +56,4 @@ would do the following:
SparkPipelineOptions options = SparkPipelineOptionsFactory.create();
options.setSparkMaster("spark://host:port");
EvaluationResult result = SparkPipelineRunner.create(options).run(p);

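The three context lines above assume an already-built `Pipeline p`. As a minimal, self-contained sketch of what a complete program might look like, the following wraps them in a `main` method. The pipeline body, the file paths, and the Spark runner's package name are illustrative assumptions, not part of this commit; only the three runner lines come from the README itself.

```java
// Hypothetical end-to-end example around the README snippet above.
// Assumes the Google Cloud Dataflow SDK of early 2016
// (com.google.cloud.dataflow.sdk.*). The Spark runner's package name
// below is an assumption; adjust it to your version of spark-dataflow.
import com.cloudera.dataflow.spark.EvaluationResult;
import com.cloudera.dataflow.spark.SparkPipelineOptions;
import com.cloudera.dataflow.spark.SparkPipelineOptionsFactory;
import com.cloudera.dataflow.spark.SparkPipelineRunner;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;

public class SparkRunnerExample {
  public static void main(String[] args) {
    // Point the runner at a Spark cluster, as in the README.
    SparkPipelineOptions options = SparkPipelineOptionsFactory.create();
    options.setSparkMaster("spark://host:port");

    // The pipeline is written purely against the Dataflow API; only the
    // choice of runner ties its execution to Spark.
    Pipeline p = Pipeline.create(options);
    p.apply(TextIO.Read.from("/tmp/input.txt"))    // illustrative input path
     .apply(TextIO.Write.to("/tmp/output"));       // illustrative output prefix

    // Execute the Dataflow pipeline on Spark.
    EvaluationResult result = SparkPipelineRunner.create(options).run(p);
  }
}
```

Swapping `SparkPipelineRunner` for Google's hosted runner should, per the motivation above, leave the pipeline body untouched.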