Home
Sizzle is an open source implementation of the Sawzall programming language designed for interoperation with the Hadoop MapReduce and DFS stack. It is implemented in pure Java and is easily extensible, and the programs it produces will run on any machine with a recent Hadoop installation, even if Sizzle itself is not installed there.
Up until a few days ago, there was no publicly available implementation of Sawzall.
About six months ago, I asked some of the authors of "Interpreting the Data: Parallel Analysis with Sawzall" (http://code.google.com/p/szl/wiki/Interpreting_the_Data) for more specific details about how Sawzall works than that high-level document provides. Mr. Pike explained that he intended to open the source to Sawzall; however, when I hadn't heard from him for several months, I started my own implementation.
The Sizzle release v0.0 available at https://github.com/anthonyu/Sizzle has:
- 100% compatibility with the syntax described in the Sawzall paper
- Nearly 0% compatibility with the syntax not described in that paper. It just wasn't available to me until the szl release this week.
- Pretty much all of the useful Sawzall intrinsic functions described by http://szl.googlecode.com/svn/doc/sawzall-intrinsics.html are implemented, with the most useful of them thoroughly tested. Currently missing are:
  - the protobuf, resourcestats and additionalinput functions, because I haven't yet personally found a need for them
  - the convert function, because explicit and implicit casting works just as well
  - the sortx, new and regex functions, because I didn't have time to finish them
- All of the aggregators discussed in http://code.google.com/p/szl/wiki/Sawzall_Table_Types have been implemented and thoroughly tested, with the exception of:
  - the sample aggregators, as they require an initial statistics-generation pass over the data that Sizzle doesn't yet support
  - the set and recordio aggregators, as I have no idea what they are supposed to do yet
- A complete runtime, allowing you to run Sawzall programs on any recent Hadoop cluster
If you are looking to run Sawzall programs on a single machine, then Sizzle won't be the best choice: szl is currently more complete and better tested. However, szl does not come with a MapReduce system and does not interoperate with Hadoop, so you won't easily be running szl on more than one machine at a time for now.
For those who use Hadoop, on the other hand, Sizzle is the only game in town, because it makes it possible to run non-trivial Sawzall programs on large computing clusters today, without needing access to the MapReduce clusters down at the Googleplex.
The Sizzle compiler and runtime were designed from the start to interoperate with Hadoop, and do so seamlessly.
In the long term, even after szl is integrated with Hadoop, Sizzle will still be a better choice for most as it is more easily extended, and since it is native Java, more easily modified by its user base of Java developers.
To build Sizzle, run ant in the top-level directory.
E.g.:
bash$ ant
Compiling a Sawzall program is as simple as running:
java -jar <location of the sizzle compiler jar> -h <location of hadoop distribution> -i <a file containing Sawzall source code>
E.g.:
bash$ java -jar /path/to/sizzle/dist/sizzle-compiler.jar -h /path/to/hadoop-0.21.0 -i Simple.szl
This compilation step will output a jar file, in this case named 'Simple.jar', which contains everything necessary to run your Sawzall program on your local machine or a multi-node Hadoop cluster.
See also: Compiling Sizzle Programs
Running the compiled program is as simple as running:
hadoop jar <output of the Sizzle compiler> <main class> <input file> <output file>
E.g., to continue the previous example:
bash$ hadoop jar Simple.jar sizzle.Simple input output
This runs the program Simple on the file input and places its results in the file output.
See also: Running Sizzle Programs
Adding a new function is as simple as writing a public static Java method and decorating it with the sizzle.functions.FunctionSpec annotation. For example, the following code implements and exports a function named 'getenv' that takes a single 'string' argument and returns a 'string':
@FunctionSpec(name = "getenv", returnType = "string", formalParameters = { "string" })
public static String getenv(String variable) {
    // Look the variable up in the JVM's view of the process environment.
    return System.getenv(variable);
}
Specify the jar containing that function's enclosing class whenever you compile a Sizzle program, and the function will be made available to your Sawzall code.
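How Sizzle itself discovers annotated methods is not described on this page. As a rough sketch of the idea only, the following self-contained Java uses a local stand-in for the FunctionSpec annotation (with the same three attributes as in the example above) and plain reflection to list the functions a class exports. StringFunctions, reverse, helper and FunctionRegistry are hypothetical names made up for this illustration; they are not part of Sizzle.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Map;
import java.util.TreeMap;

// Stand-in for sizzle.functions.FunctionSpec so this sketch compiles on its own.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface FunctionSpec {
    String name();
    String returnType();
    String[] formalParameters() default {};
}

// Hypothetical user library: two exported functions and one private helper.
class StringFunctions {
    @FunctionSpec(name = "getenv", returnType = "string", formalParameters = { "string" })
    public static String getenv(String variable) {
        return System.getenv(variable);
    }

    @FunctionSpec(name = "reverse", returnType = "string", formalParameters = { "string" })
    public static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // No annotation, so the scan below does not treat it as exported.
    public static String helper(String s) {
        return s.trim();
    }
}

// Hypothetical registry; an assumption about how such annotations could be read.
class FunctionRegistry {
    // Collect every public static method carrying a FunctionSpec, keyed by exported name.
    static Map<String, Method> scan(Class<?> library) {
        Map<String, Method> exported = new TreeMap<>();
        for (Method m : library.getDeclaredMethods()) {
            FunctionSpec spec = m.getAnnotation(FunctionSpec.class);
            int mods = m.getModifiers();
            if (spec != null && Modifier.isPublic(mods) && Modifier.isStatic(mods)) {
                exported.put(spec.name(), m);
            }
        }
        return exported;
    }

    // Render the declared Sawzall-level signature, e.g. "reverse(string): string".
    static String signature(Method m) {
        FunctionSpec spec = m.getAnnotation(FunctionSpec.class);
        return spec.name() + "(" + String.join(", ", spec.formalParameters()) + "): " + spec.returnType();
    }

    // Invoke an exported static method, wrapping reflection failures.
    static Object call(Method m, Object... args) {
        try {
            return m.invoke(null, args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Calling FunctionRegistry.scan(StringFunctions.class) yields entries for getenv and reverse but not helper, which mirrors the rule above: only public static methods carrying the annotation are exported.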
See also: Extending Sizzle
Adding a new aggregator is as simple as writing a class that extends sizzle.aggregators.Aggregator and decorating it with the sizzle.aggregators.AggregatorSpec annotation. For example, the following code implements and exports an aggregator named 'log' that logs any data emitted to it via Log4J:
import java.io.IOException;

import org.apache.log4j.Logger;

import sizzle.aggregators.Aggregator;
import sizzle.aggregators.AggregatorSpec;

@AggregatorSpec(name = "log")
public class LogAggregator extends Aggregator {
    private static final Logger logger = Logger.getLogger(LogAggregator.class);

    // Log each emitted value; the metadata argument is ignored.
    @Override
    public void aggregate(final String data, final String metadata) throws IOException {
        logger.info(data);
    }
}
Specify the jar containing that aggregator class whenever you compile a Sizzle program, and the aggregator will be made available to your Sawzall code.
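The 'log' example keeps no state, but an aggregator can also accumulate something across emits. The sketch below is a minimal, self-contained illustration of that idea. It defines its own stand-ins for Aggregator and AggregatorSpec, modelling only the aggregate method shown above (the real Sizzle base class may well require more). The 'longest' aggregator, its result accessor and the longestOf helper are hypothetical and exist only for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Stand-in for sizzle.aggregators.AggregatorSpec so this sketch compiles on its own.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface AggregatorSpec {
    String name();
}

// Stand-in for sizzle.aggregators.Aggregator; only aggregate() is modelled here.
abstract class Aggregator {
    public abstract void aggregate(String data, String metadata) throws IOException;
}

// Hypothetical aggregator that remembers the longest value emitted to it.
@AggregatorSpec(name = "longest")
class LongestAggregator extends Aggregator {
    private String longest = "";

    @Override
    public void aggregate(final String data, final String metadata) throws IOException {
        // Keep the first value seen at any given maximum length.
        if (data.length() > longest.length()) {
            longest = data;
        }
    }

    // Accessor invented for this sketch; not part of the interface shown above.
    public String result() {
        return longest;
    }

    // Convenience driver: emit each value in turn and return the final state.
    static String longestOf(String... values) {
        LongestAggregator aggregator = new LongestAggregator();
        try {
            for (String value : values) {
                aggregator.aggregate(value, null);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return aggregator.result();
    }
}
```

For instance, LongestAggregator.longestOf("szl", "sawzall", "sizzle") returns "sawzall". How a real Sizzle aggregator hands its final value back to the runtime is not covered on this page, so treat result() purely as a placeholder.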
See also: Extending Sizzle
Want to help? You name it: Sizzle needs your bug reports, test cases, documentation, examples and implementations of any missing features. Stake your claim by filing an issue on GitHub, then send me a pull request when you are ready.
Your contributions will be greatly appreciated!
Happy sizzling!