From 60f32842b70d8aee9dd3974f582c00a3b5f727ca Mon Sep 17 00:00:00 2001
From: Josh Wills
Date: Tue, 27 Jan 2015 09:17:16 -0800
Subject: [PATCH] Update README w/some notes on motivation. Fixes #20.

---
 runners/spark/README.md | 32 +++++++++++++++++++++++++++-----
 1 file changed, 27 insertions(+), 5 deletions(-)

diff --git a/runners/spark/README.md b/runners/spark/README.md
index 708ed1fc894a7..611c062e50b43 100644
--- a/runners/spark/README.md
+++ b/runners/spark/README.md
@@ -1,13 +1,34 @@
 spark-dataflow
 ==============
 
-Spark-dataflow is an early prototype. If this project interests you, you should know that we
-encourage outside contributions. So, hack away! To get an idea of what we have already identified as
+## Intro
+
+Spark-dataflow allows users to execute data pipelines written against the Google Cloud Dataflow API
+with Apache Spark. Spark-dataflow is an early prototype, and we'll be working on it continuously.
+If this project interests you, we welcome issues, comments, and (especially!) pull requests.
+To get an idea of what we have already identified as
 areas that need improvement, checkout the issues listed in the github repo.
 
-Spark-dataflow allows users to execute dataflow pipelines with Spark. Executing a pipeline on a
-spark cluster is easy: Depend on spark-dataflow in your project and execute your pipeline in a
-program by calling `SparkPipelineRunner.run`.
+## Motivation
+
+We had two primary goals when we started working on Spark-dataflow:
+
+1. *Provide portability for data pipelines written for Google Cloud Dataflow.* Google makes
+it really easy to get started writing pipelines against the Dataflow API, but they wanted
+to be sure that creating a pipeline using their tools would not lock developers into their
+platform. A Spark-based implementation of Dataflow means that you can take your pipeline
+logic with you wherever you go. This also means that any new machine learning and anomaly
+detection algorithms that are developed against the Dataflow API are available to everyone,
+regardless of their underlying execution platform.
+
+2. *Experiment with new data pipeline design patterns.* The Dataflow API has a number of
+interesting ideas, especially with respect to the unification of batch and stream data
+processing into a single API that maps onto two separate engines. The Dataflow streaming
+engine, based on Google's [MillWheel](http://research.google.com/pubs/pub41378.html), does
+not have a direct open-source analogue, and we wanted to understand how to replicate its
+functionality using frameworks like Spark Streaming.
+
+## Getting Started
 
 The Maven coordinates of the current version of this project are:
 
@@ -35,3 +56,4 @@ would do the following:
     SparkPipelineOptions options = SparkPipelineOptionsFactory.create();
    options.setSparkMaster("spark://host:port");
     EvaluationResult result = SparkPipelineRunner.create(options).run(p);
+
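
Note on the final hunk: the `p` passed to `SparkPipelineRunner.create(options).run(p)` is a Dataflow `Pipeline` constructed earlier in the README, outside this diff. A minimal end-to-end sketch of what such a program could look like, assuming the Google Cloud Dataflow SDK classes (`Pipeline`, `TextIO`, `Count`) and a `com.cloudera.dataflow.spark` package for the runner classes; the package name, input path, and master URL are illustrative assumptions, not taken from this patch:

    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.io.TextIO;
    import com.google.cloud.dataflow.sdk.transforms.Count;

    // Package name assumed for illustration; adjust to the actual
    // spark-dataflow artifact your project depends on.
    import com.cloudera.dataflow.spark.EvaluationResult;
    import com.cloudera.dataflow.spark.SparkPipelineOptions;
    import com.cloudera.dataflow.spark.SparkPipelineOptionsFactory;
    import com.cloudera.dataflow.spark.SparkPipelineRunner;

    public class WordCountOnSpark {
      public static void main(String[] args) {
        // Point the runner at a Spark master; "spark://host:port" is a placeholder.
        SparkPipelineOptions options = SparkPipelineOptionsFactory.create();
        options.setSparkMaster("spark://host:port");

        // Build an ordinary Dataflow pipeline; nothing below is Spark-specific.
        Pipeline p = Pipeline.create(options);
        p.apply(TextIO.Read.from("hdfs:///path/to/input.txt"))
         .apply(Count.<String>perElement());

        // Execute on Spark instead of the Google Cloud Dataflow service.
        EvaluationResult result = SparkPipelineRunner.create(options).run(p);
      }
    }

The pipeline-construction code stays runner-agnostic, so the same program could target the Dataflow service by swapping the runner, which is the portability point made in the Motivation section above.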