From 60f32842b70d8aee9dd3974f582c00a3b5f727ca Mon Sep 17 00:00:00 2001
From: Josh Wills
Date: Tue, 27 Jan 2015 09:17:16 -0800
Subject: [PATCH] Update README w/some notes on motivation. Fixes #20.

---
 runners/spark/README.md | 32 +++++++++++++++++++++++++++-----
 1 file changed, 27 insertions(+), 5 deletions(-)

diff --git a/runners/spark/README.md b/runners/spark/README.md
index 708ed1fc894a7..611c062e50b43 100644
--- a/runners/spark/README.md
+++ b/runners/spark/README.md
@@ -1,13 +1,34 @@
 spark-dataflow
 ==============
 
-Spark-dataflow is an early prototype. If this project interests you, you should know that we
-encourage outside contributions. So, hack away! To get an idea of what we have already identified as
+## Intro
+
+Spark-dataflow allows users to execute data pipelines written against the Google Cloud Dataflow API
+with Apache Spark. Spark-dataflow is an early prototype, and we'll be working on it continuously.
+If this project interests you, we welcome issues, comments, and (especially!) pull requests.
+To get an idea of what we have already identified as
 areas that need improvement, checkout the issues listed in the github repo.
 
-Spark-dataflow allows users to execute dataflow pipelines with Spark. Executing a pipeline on a
-spark cluster is easy: Depend on spark-dataflow in your project and execute your pipeline in a
-program by calling `SparkPipelineRunner.run`.
+## Motivation
+
+We had two primary goals when we started working on Spark-dataflow:
+
+1. *Provide portability for data pipelines written for Google Cloud Dataflow.* Google makes
+it really easy to get started writing pipelines against the Dataflow API, but they wanted
+to be sure that creating a pipeline using their tools would not lock developers into their
+platform. A Spark-based implementation of Dataflow means that you can take your pipeline
+logic with you wherever you go. This also means that any new machine learning and anomaly
+detection algorithms that are developed against the Dataflow API are available to everyone,
+regardless of their underlying execution platform.
+
+2. *Experiment with new data pipeline design patterns.* The Dataflow API has a number of
+interesting ideas, especially with respect to the unification of batch and stream data
+processing into a single API that maps onto two separate engines. The Dataflow streaming
+engine, based on Google's [MillWheel](http://research.google.com/pubs/pub41378.html), does
+not have a direct open-source analogue, and we wanted to understand how to replicate its
+functionality using frameworks like Spark Streaming.
+
+## Getting Started
 
 The Maven coordinates of the current version of this project are:
 
@@ -35,3 +56,4 @@ would do the following:
     SparkPipelineOptions options = SparkPipelineOptionsFactory.create();
    options.setSparkMaster("spark://host:port");
     EvaluationResult result = SparkPipelineRunner.create(options).run(p);
+
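
Note on the final hunk: the `p` passed to `SparkPipelineRunner.create(options).run(p)` is a Dataflow `Pipeline` constructed earlier in the README, outside this diff. A minimal end-to-end sketch of what such a program could look like, assuming the Google Cloud Dataflow SDK classes (`Pipeline`, `TextIO`, `Count`) and a `com.cloudera.dataflow.spark` package for the runner classes; the package name, input path, and master URL are illustrative assumptions, not taken from this patch:

    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.io.TextIO;
    import com.google.cloud.dataflow.sdk.transforms.Count;

    // Package name assumed for illustration; adjust to the actual
    // spark-dataflow artifact your project depends on.
    import com.cloudera.dataflow.spark.EvaluationResult;
    import com.cloudera.dataflow.spark.SparkPipelineOptions;
    import com.cloudera.dataflow.spark.SparkPipelineOptionsFactory;
    import com.cloudera.dataflow.spark.SparkPipelineRunner;

    public class WordCountOnSpark {
      public static void main(String[] args) {
        // Point the runner at a Spark master; "spark://host:port" is a placeholder.
        SparkPipelineOptions options = SparkPipelineOptionsFactory.create();
        options.setSparkMaster("spark://host:port");

        // Build an ordinary Dataflow pipeline; nothing below is Spark-specific.
        Pipeline p = Pipeline.create(options);
        p.apply(TextIO.Read.from("hdfs:///path/to/input.txt"))
         .apply(Count.<String>perElement());

        // Execute on Spark instead of the Google Cloud Dataflow service.
        EvaluationResult result = SparkPipelineRunner.create(options).run(p);
      }
    }

The pipeline-construction code stays runner-agnostic, so the same program could target the Dataflow service by swapping the runner, which is the portability point made in the Motivation section above.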