From 76a96ba0439ecb86fb4795adaa8a57e88bdd73ee Mon Sep 17 00:00:00 2001
From: Paco Zamora Martinez
Date: Sat, 3 May 2014 17:48:02 +0200
Subject: [PATCH] Added minor doc in readme
---
 README.md | 194 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 194 insertions(+)

diff --git a/README.md b/README.md
index f58090e..38fee75 100644
--- a/README.md
+++ b/README.md
@@ -14,3 +14,197 @@ This software depends on:

- [pakozm/luamongo](https://github.com/pakozm/luamongo/), a fork of
  [moai/luamongo](https://github.com/moai/luamongo) for Lua 5.2, with minor
  improvements.

Installation
------------

Copy the `mapreduce` directory to a location visible from your `LUA_PATH`
environment variable. Likewise, in order to run the example, the `examples`
directory must be visible through your `LUA_PATH`. You can add the current
directory by typing in the terminal:

```
$ export LUA_PATH='?.lua;?/init.lua'
```

Usage
-----

Two Lua scripts are provided to get the software running quickly.

- `execute_server.lua` runs the master server for your map-reduce operation.
  Only **one instance** of this script is needed. Note that the **map-reduce
  task** is split into several Lua modules, which must be visible in the
  `LUA_PATH` of the server and of every worker you execute. This script
  receives 7 mandatory arguments:

  1. The connection string, normally `localhost` or `localhost:27017`.
  2. The name of the database where the work will be done.
  3. A Lua module which contains the **task** function.
  4. A Lua module which contains the **map** function.
  5. A Lua module which contains the **partition** function.
  6. A Lua module which contains the **reduce** function.
  7. A Lua module which contains the **final** function.

- `execute_worker.lua` runs a worker, which by default executes one map-reduce
  task and then finishes. One task does not mean one job: a **map-reduce
  task** is performed as several individual **map/reduce jobs**. A worker
  waits until all pending map or reduce jobs are completed before considering
  a task finished. This script receives two arguments:

  1. The connection string, as above.
  2. The name of the database where the work will be done, as above.

A simple word-count example is available in the repository. Two shell scripts,
`execute_example_server.sh` and `execute_example_worker.sh`, are ready to run
the word-count example on a single machine, with one or more worker instances.
The execution of the example looks like this:

**SERVER**
```
$ ./execute_example_server.sh > output
# Preparing MAP
# MAP execution
  100.0 %
# Preparing REDUCE
# MERGE AND PARTITIONING
  100.0 %
# CREATING JOBS
# STARTING REDUCE
# REDUCE execution
  100.0 %
# FINAL execution
```

**WORKER**
```
$ ./execute_example_worker.sh
# NEW TASK READY
# EXECUTING MAP JOB _id: "1"
#   FINISHED
# EXECUTING MAP JOB _id: "2"
#   FINISHED
# EXECUTING MAP JOB _id: "3"
#   FINISHED
# EXECUTING MAP JOB _id: "4"
#   FINISHED
# EXECUTING REDUCE JOB _id: "121"
#   FINISHED
# EXECUTING REDUCE JOB _id: "37"
#   FINISHED
...
```

Map-reduce task example: word-count
-----------------------------------

The example is composed of one Lua module for each of the map-reduce
functions, available in the `examples/WordCount/` directory. All the modules
have the same structure: they return a Lua table with two fields:

- an **init** function, which receives a table of arguments and allows you to
  configure your module options, in case you need any (see the sketch after
  this list).

- a **func** function, which implements the necessary Lua code.
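
None of the word-count modules shown below needs configuration, so their
`init` functions are empty. As a minimal sketch of how a parametrized module
could look (this skeleton and its `verbose` option are hypothetical, not part
of the shipped example), `init` usually stores the received options in
upvalues that `func` reads later:

```Lua
-- Hypothetical module skeleton (not part of the word-count example):
-- init(arg) keeps the received options in an upvalue, and func reads
-- them later. The 'verbose' option is illustrative only.
local verbose = false

local init = function(arg)
  arg = arg or {}
  verbose = (arg.verbose == true)
end

return {
  init = init,
  func = function(...)
    if verbose then print("func called with", ...) end
    -- the real map/reduce work would go here
  end
}
```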

A map-reduce task is divided, at least, into the following modules:

- **taskfn.lua** is the script which defines how the data is divided in order
  to create **map jobs**. The **func** field is executed as a Lua *coroutine*,
  so every map job is created by a call to `coroutine.yield(key,value)`.

```Lua
-- arg is for configuration purposes; it is available in any of the scripts
local init = function(arg)
  -- do whatever you need for initialization, parametrized by the arg table
end
return {
  init = init,
  func = function()
    coroutine.yield(1,"mapreduce/server.lua")
    coroutine.yield(2,"mapreduce/worker.lua")
    coroutine.yield(3,"mapreduce/test.lua")
    coroutine.yield(4,"mapreduce/utils.lua")
  end
}
```

- **mapfn.lua** is the script where the map function is implemented. The
  **func** field is executed as a standard Lua function, and receives two
  arguments `(key,value)` generated by one of the yields in your `taskfn`
  script. Map results are produced by calling the global function
  `emit(key,value)`.

```Lua
return {
  init = function() end,
  func = function(key,value)
    for line in io.lines(value) do
      for w in line:gmatch("[^%s]+") do
        emit(w,1)
      end
    end
  end
}
```

- **partitionfn.lua** is the script which describes how the map results are
  grouped and partitioned in order to create **reduce jobs**. The **func**
  field is a hash function which receives an emitted key and returns an
  integer number. Depending on your hash function, more or fewer reducers will
  be needed (an alternative sketch is shown at the end of this section).

```Lua
return {
  init = function() end,
  func = function(key)
    return key:byte(#key) -- last character (numeric byte)
  end
}
```

- **reducefn.lua** is the script which implements the reduce function. The
  **func** field is a function which receives a pair `(key,values)`, where
  `key` is one of the emitted keys and `values` is a Lua array (a table with
  sequential integer keys starting at 1) containing all the available map
  values for the given key. The system may reuse the reduce function several
  times, so it must be idempotent. The reduce results will be grouped
  following the partition function. For each possible partition, a GridFS file
  will be created in a collection called `dbname_fs`, where `dbname` is the
  database name defined above.

```Lua
return {
  init = function() end,
  func = function(key,values)
    local count = 0
    for _,v in ipairs(values) do count = count + v end
    return count
  end
}
```

- **finalfn.lua** is the script which defines how to take the results produced
  by the system. The **func** field is a function which receives a Lua pairs
  iterator, and returns a boolean indicating whether to destroy the GridFS
  collection data. If the returned value is `true`, the results will be
  removed. If the returned value is `false` or `nil`, the results will remain
  available after the execution of your map-reduce task.

```Lua
return {
  init = function() end,
  func = function(it)
    for key,value in it do
      print(value,key)
    end
    return true -- remove the MongoDB GridFS result files
  end
}
```
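
The partition function shown above can create up to 256 different partitions,
one per possible value of the key's last byte. As a purely illustrative
alternative, not part of the shipped example, a partition function can fold
the whole key into a fixed number of partitions; the value of 8 partitions
below is an arbitrary assumption:

```Lua
-- Illustrative alternative to partitionfn.lua (not part of the example):
-- hash the whole key and fold it into a fixed number of partitions, so
-- at most num_partitions reduce jobs are created per task.
local num_partitions = 8 -- arbitrary value, tune it to your setup

return {
  init = function() end,
  func = function(key)
    local h = 0
    for i=1,#key do
      h = (h * 31 + key:byte(i)) % 2^24 -- keep h small to avoid precision loss
    end
    return h % num_partitions
  end
}
```

Fewer partitions mean fewer and larger reduce jobs; more partitions allow more
reduce jobs to run in parallel across workers.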

Last notes
----------

This software is under development. More documentation will be added to the
wiki pages as time permits. Collaboration is open, and all your contributions
are welcome.