Update README w/some notes on motivation. Fixes apache#20.
Josh Wills authored and peihe committed Mar 14, 2016
1 parent 937a2de commit 60f3284
Showing 1 changed file with 27 additions and 5 deletions.
runners/spark/README.md: 27 additions & 5 deletions
@@ -1,13 +1,34 @@
spark-dataflow
==============

-Spark-dataflow is an early prototype. If this project interests you, you should know that we
-encourage outside contributions. So, hack away! To get an idea of what we have already identified as
+## Intro
+
+Spark-dataflow allows users to execute data pipelines written against the Google Cloud Dataflow API
+with Apache Spark. Spark-dataflow is an early prototype, and we'll be working on it continuously.
+If this project interests you, we welcome issues, comments, and (especially!) pull requests.
+To get an idea of what we have already identified as
areas that need improvement, check out the issues listed in the GitHub repo.

-Spark-dataflow allows users to execute dataflow pipelines with Spark. Executing a pipeline on a
-spark cluster is easy: Depend on spark-dataflow in your project and execute your pipeline in a
-program by calling `SparkPipelineRunner.run`.
+## Motivation
+
+We had two primary goals when we started working on Spark-dataflow:
+
+1. *Provide portability for data pipelines written for Google Cloud Dataflow.* Google makes
+it really easy to get started writing pipelines against the Dataflow API, but they wanted
+to be sure that creating a pipeline using their tools would not lock developers into their
+platform. A Spark-based implementation of Dataflow means that you can take your pipeline
+logic with you wherever you go. This also means that any new machine learning and anomaly
+detection algorithms that are developed against the Dataflow API are available to everyone,
+regardless of their underlying execution platform.
+
+2. *Experiment with new data pipeline design patterns.* The Dataflow API has a number of
+interesting ideas, especially with respect to the unification of batch and stream data
+processing into a single API that maps onto two separate engines. The Dataflow streaming
+engine, based on Google's [MillWheel](http://research.google.com/pubs/pub41378.html), does
+not have a direct open source analogue, and we wanted to understand how to replicate its
+functionality using frameworks like Spark Streaming.
+
## Getting Started

The Maven coordinates of the current version of this project are:

@@ -35,3 +56,4 @@ would do the following:
SparkPipelineOptions options = SparkPipelineOptionsFactory.create();
options.setSparkMaster("spark://host:port");
EvaluationResult result = SparkPipelineRunner.create(options).run(p);

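The three context lines above assume an already-built `Pipeline p`. As a minimal, self-contained sketch of what a complete program might look like, the following wraps them in a `main` method. The pipeline body, the file paths, and the Spark runner's package name are illustrative assumptions, not part of this commit; only the three runner lines come from the README itself.

```java
// Hypothetical end-to-end example around the README snippet above.
// Assumes the Google Cloud Dataflow SDK of early 2016
// (com.google.cloud.dataflow.sdk.*). The Spark runner's package name
// below is an assumption; adjust it to your version of spark-dataflow.
import com.cloudera.dataflow.spark.EvaluationResult;
import com.cloudera.dataflow.spark.SparkPipelineOptions;
import com.cloudera.dataflow.spark.SparkPipelineOptionsFactory;
import com.cloudera.dataflow.spark.SparkPipelineRunner;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;

public class SparkRunnerExample {
  public static void main(String[] args) {
    // Point the runner at a Spark cluster, as in the README.
    SparkPipelineOptions options = SparkPipelineOptionsFactory.create();
    options.setSparkMaster("spark://host:port");

    // The pipeline is written purely against the Dataflow API; only the
    // choice of runner ties its execution to Spark.
    Pipeline p = Pipeline.create(options);
    p.apply(TextIO.Read.from("/tmp/input.txt"))    // illustrative input path
     .apply(TextIO.Write.to("/tmp/output"));       // illustrative output prefix

    // Execute the Dataflow pipeline on Spark.
    EvaluationResult result = SparkPipelineRunner.create(options).run(p);
  }
}
```

Swapping `SparkPipelineRunner` for Google's hosted runner should, per the motivation above, leave the pipeline body untouched.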