History and Motivation

Before Summingbird, users that wanted to write production streaming aggregations would typically write their logic using a Hadoop DSL like Pig or Scalding. These tools offer nice distributed system abstractions; Pig resembles familiar SQL, while Scalding takes Summingbird’s tack by mimicking the Scala collections API. By running these jobs on some regular schedule (typically hourly or daily), users could build timeseries dashboards with very reliable error bounds, at the cost of high latency. Hadoop jobs often take hours to complete, and ingesting the results of an aggregation into a Hadoop-aware serving layer like ElephantDB adds another hour or so the process.

What about Realtime?

The ability to easily process realtime versions of the same data streams entered the picture in late 2011 with Twitter’s release of Storm. Storm makes it easy to process data with very low latencies by sacrificing Hadoop’s lovely fault tolerant guarantees.

For a number of reasons, it turns out to be quite painful to run a fully realtime system.

Recomputation over months of historical logs is tough
Random-write databases are hard to maintain

The types of aggregations one can perform in Storm are very similar to what’s possible in Hadoop, but the systems issues are all different.

Enter Summingbird

Summingbird began as an investigation into a hybrid system that could run a streaming aggregation in both Hadoop and Storm. The hybrid model allows most data to be processed by Hadoop and served out of a read-only store like ElephantDB. Only data that Hadoop hasn’t yet been able to process, data that falls within the latency window, would be served out of a datastore populated in realtime by Storm. The error of the realtime layer is bounded, as Hadoop will eventually get around to processing the same data and smoothing out any error introduced.

The hybrid approach has the following practical problems:

Two sets of aggregation logic have to be kept in sync in two different systems.
Keys and Values must be serialized consistently between each system and the client.
The client is responsible for reading from both datastores, performing a final aggregation and serving the combined results.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

History and Motivation

What about Realtime?

Enter Summingbird

Getting Help

Background

Documentation

Implementation Details

Clone this wiki locally