Automatic Orderings, Monoids and Arbitraries
Scalding uses implicit Ordering[K] instances to sort keys. If your key type is a case class, a primitive, a collection, or a nested combination of these types, you can generate the Ordering automatically using a macro:
import com.twitter.scalding.serialization.macros.impl.BinaryOrdering._
If you are working with Scrooge Thrift data, you might instead use:
import com.twitter.scalding.thrift.macros.Macros._
from the scalding-thrift-macros package.
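As a rough sketch of how the wildcard import tends to be used, the example below groups a small in-memory TypedPipe by a case class key. The key type `UserDay`, the object `OrderingSketch`, and the sample data are hypothetical; the assumption is that the import places the macro-generated ordering in implicit scope so that `.group` can pick it up.

```scala
import com.twitter.scalding.typed.TypedPipe
import com.twitter.scalding.serialization.macros.impl.BinaryOrdering._

// Hypothetical key type, defined at the top level so the macro can inspect it.
case class UserDay(userId: Long, day: String)

object OrderingSketch {
  // Small in-memory input, for illustration only.
  val events: TypedPipe[(UserDay, Long)] =
    TypedPipe.from(List(
      (UserDay(1L, "2015-01-01"), 3L),
      (UserDay(1L, "2015-01-01"), 4L),
      (UserDay(2L, "2015-01-02"), 1L)
    ))

  // .group requires an implicit Ordering[UserDay]; here it is assumed to come
  // from the macro brought in by the BinaryOrdering._ import.
  val totals: TypedPipe[(UserDay, Long)] = events.group.sum.toTypedPipe
}
```

No Ordering[UserDay] is written by hand here; if the key contained a type the macro cannot handle, the implicit would not be found and the code would fail to compile.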
To just create an ordering for a particular type:
import com.twitter.scalding.serialization.macros.impl.BinaryOrdering
case class MyClass(i: Int, s: String)
implicit val myOrd = BinaryOrdering.ordSer[MyClass]
The macro above actually creates an OrderedSerialization[T], which extends Ordering[T] with a Serialization[T] and a means of comparing serialized data directly, without allocating objects during the sort. When these are used with Scalding, we have seen 20-80% decreases in the running time of jobs with bigger keys (the bigger the key and the more strings in the key, the bigger the win).
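To make the idea of comparing serialized data directly more concrete, here is a minimal standard-library sketch. It is not Scalding's actual encoding or implementation; `Key`, `ByteCompareSketch`, `serialize`, and `compareBytes` are hypothetical names chosen for illustration. The key is encoded so that a plain byte-by-byte comparison gives the same answer as comparing the fields, which means no `Key` objects need to be rebuilt while sorting.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical key type, used only for this illustration.
case class Key(i: Int, s: String)

object ByteCompareSketch {
  // Encode so that unsigned, left-to-right byte comparison agrees with
  // comparing (i, s). Flipping the sign bit makes negative Ints sort first.
  // ByteBuffer writes big-endian by default.
  def serialize(k: Key): Array[Byte] = {
    val str = k.s.getBytes(UTF_8)
    ByteBuffer.allocate(4 + str.length)
      .putInt(k.i ^ Int.MinValue)
      .put(str)
      .array()
  }

  // Compare two serialized keys without deserializing either one.
  def compareBytes(a: Array[Byte], b: Array[Byte]): Int = {
    val n = math.min(a.length, b.length)
    var idx = 0
    while (idx < n) {
      val c = (a(idx) & 0xff) - (b(idx) & 0xff)
      if (c != 0) return c
      idx += 1
    }
    // All shared bytes are equal: the shorter array (shorter string) sorts first.
    a.length - b.length
  }
}
```

In this sketch strings are ordered by their UTF-8 bytes, which matches `String.compareTo` for text without supplementary characters. The string must stay the last field for the length tie-break to be valid; a real encoding with several variable-length fields would need length prefixes or field-by-field comparison.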
Algebird has similar macros to provide automatic instances for case classes:
import com.twitter.algebird.macros.caseclass._ // get the algebras if all the elements of the case class have one
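As an illustration, the sketch below derives a Semigroup for a small case class and combines two values. It assumes that the `caseclass` object exposes a `semigroup[T]` macro that can be called explicitly; `Stats` and `CaseClassAlgebraSketch` are hypothetical names. Both fields have Algebird semigroups (addition for Int, concatenation for String), so the derived instance is expected to combine the values field by field.

```scala
import com.twitter.algebird.Semigroup
import com.twitter.algebird.macros.caseclass

// Hypothetical case class; every field type has an Algebird Semigroup.
case class Stats(count: Int, label: String)

object CaseClassAlgebraSketch {
  // Explicit macro call (assumed to exist under this name), mirroring the
  // explicit BinaryOrdering.ordSer[MyClass] style used for orderings.
  implicit val statsSemigroup: Semigroup[Stats] = caseclass.semigroup[Stats]

  // Combine two values using the derived instance.
  val combined: Stats = Semigroup.plus(Stats(1, "a"), Stats(2, "b"))
}
```

Naming the instance explicitly, rather than relying on the wildcard import, makes it easier to see which algebra is in use when a case class could support more than one.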
If you add the algebird-test package to your dependencies, you can also access:
import org.scalacheck.Arbitrary
import com.twitter.algebird.macros.ArbitraryCaseClassMacro
case class Foo(i: Int, s: String)
implicit val fooArb: Arbitrary[Foo] = ArbitraryCaseClassMacro.arbitrary[Foo]
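A generated Arbitrary is mostly useful inside ScalaCheck properties. The sketch below wires the instance into `Prop.forAll`; the object `FooProps` and the property itself are hypothetical and deliberately trivial, and only show that random `Foo` values can now be requested by type.

```scala
import org.scalacheck.{Arbitrary, Prop}
import com.twitter.algebird.macros.ArbitraryCaseClassMacro

case class Foo(i: Int, s: String)

object FooProps {
  implicit val fooArb: Arbitrary[Foo] = ArbitraryCaseClassMacro.arbitrary[Foo]

  // forAll picks up fooArb implicitly to generate random Foo values.
  val rebuildIsIdentity: Prop = Prop.forAll { (f: Foo) =>
    Foo(f.i, f.s) == f
  }
}
```

Calling `FooProps.rebuildIsIdentity.check()` runs the property against generated values and prints the result; a real test would replace the body with a law worth checking, such as associativity of a derived Semigroup.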