Merged with devel
pakozm committed May 6, 2014
2 parents c37de86 + e2fc4ee commit 4178cf7
Showing 15 changed files with 590 additions and 417 deletions.
71 changes: 43 additions & 28 deletions README.md
@@ -154,10 +154,21 @@ return {
 number. Depending on your hash function, more or fewer reducers will be needed.

 ```Lua
+-- string hash function: http://isthe.com/chongo/tech/comp/fnv/
+local NUM_REDUCERS = 10
+local FNV_prime = 16777619
+local offset_basis = 2166136261
+local MAX = 2^32
 return {
   init = function() end,
   func = function(key)
-    return key:byte(#key) -- last character (numeric byte)
+    -- compute hash
+    local h = offset_basis
+    for i=1,#key do
+      h = (h * FNV_prime) % MAX
+      h = bit32.bxor(h, key:byte(i))
+    end
+    return h % NUM_REDUCERS
   end
 }
 ```
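
A minimal sketch, not part of the commit: the partition function above follows the FNV-1 recipe (multiply by the prime, then XOR in the next byte). Assuming Lua numbers are IEEE doubles, the product `h * FNV_prime` can exceed 2^53, so the low bits may be rounded. The result is still deterministic and usable for partitioning, but it may differ from reference FNV-1 values. The variant below keeps every intermediate value exact by splitting the multiplication into 16-bit halves. The helper names `mul32`, `bxor8`, `fnv1_32` and `partition` are hypothetical, and the XOR is done arithmetically so the sketch does not depend on `bit32`.

```lua
-- Exact 32-bit FNV-1 sketch (hypothetical helpers, not lua-mapreduce code).
local NUM_REDUCERS = 10
local FNV_PRIME = 16777619
local OFFSET_BASIS = 2166136261
local MAX = 2^32

-- (a * b) mod 2^32 without exceeding 2^53: split a into 16-bit halves.
local function mul32(a, b)
  local lo = a % 65536
  local hi = (a - lo) / 65536
  return (lo * b + ((hi * b) % 65536) * 65536) % MAX
end

-- XOR of two values in [0, 255], using arithmetic only.
local function bxor8(a, b)
  local result, weight = 0, 1
  for _ = 1, 8 do
    local abit, bbit = a % 2, b % 2
    if abit ~= bbit then result = result + weight end
    a, b = (a - abit) / 2, (b - bbit) / 2
    weight = weight * 2
  end
  return result
end

-- FNV-1: multiply by the prime, then XOR the next byte into the low 8 bits.
local function fnv1_32(key)
  local h = OFFSET_BASIS
  for i = 1, #key do
    h = mul32(h, FNV_PRIME)
    local low = h % 256
    h = h - low + bxor8(low, key:byte(i))
  end
  return h
end

-- Same contract as the README partition function: key -> reducer index.
local function partition(key)
  return fnv1_32(key) % NUM_REDUCERS
end

print(partition("hello"))
```

Whether exactness matters here is a judgment call: any deterministic function that spreads keys evenly over `NUM_REDUCERS` buckets serves as a partitioner, so the rounding in the committed version mainly matters if hashes must match another FNV-1 implementation.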
@@ -210,24 +221,27 @@ with *1,965,734 lines* and *49,158,635 running words*. The data has been splitte
 in 197 files with a maximum of *10,000* lines per file. The task is executed
 on *one machine* with *four cores*. The machine runs a MongoDB server, a
 lua-mapreduce server and four lua-mapreduce workers. **Note** that this task
-is not fair because the process could be done in main memory.
+is not fair because the data could be stored in the local filesystem.

 The output of lua-mapreduce was:

 ```
 $ ./execute_BIG_server.sh > output
 # Iteration 1
-# Preparing MAP
-# MAP execution
-100.0 %
-# Preparing REDUCE
-# Merge and partitioning
-100.0 %
-# Creating jobs
-# REDUCE execution
-100.0 %
-# FINAL execution
-# 70 seconds
+# Preparing Map
+# Map execution, size= 197
+100.0 %
+# Preparing Reduce
+# Reduce execution, num_files= 1970 size= 10
+100.0 %
+# Final execution
+# Map sum(cpu_time) 99.278813
+# Reduce sum(cpu_time) 57.789231
+# Sum(cpu_time) 157.068044
+# Map real time 42
+# Reduce real time 22
+# Real time 64
+# Total iteration time 66 seconds
 ```

 **Note:** using only one worker takes: 117 seconds
@@ -246,23 +260,24 @@ A naive word-count version implemented in Lua takes:

 ```
 $ time cat /home/experimentos/CORPORA/EUROPARL/en-splits/* | \
-lua naive.lua > output-naivetime
-real 0m17.604s
-user 0m17.064s
-sys 0m1.445s
+lua misc/naive.lua > output-naivetime
+real 0m26.125s
+user 0m17.458s
+sys 0m0.324s
 ```
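
The script `misc/naive.lua` is not shown in this diff. Purely as an illustration of what a naive in-memory word count could look like, here is a minimal sketch. The function name `count_words` and the whitespace-based tokenisation are assumptions, not the repository's code.

```lua
-- Hypothetical naive word count: everything is kept in one in-memory table.
-- `lines` is any iterator function that returns one line per call.
local function count_words(lines)
  local counts = {}
  for line in lines do
    for word in line:gmatch("%S+") do
      counts[word] = (counts[word] or 0) + 1
    end
  end
  return counts
end

-- Demo on an in-memory list of lines.
local sample = { "a rose is a rose", "is it" }
local i = 0
local function next_line()
  i = i + 1
  return sample[i]
end

local counts = count_words(next_line)
for word, n in pairs(counts) do
  print(n, word) -- same "count word" order as print(value,key) in finalfn.lua
end
```

To read from a pipe as in the timing command, `count_words(io.lines())` could be used instead of the in-memory iterator.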

-Looking at these numbers, it is clear that it is better to work in
-main memory, as in the naive Lua implementation, which needs only
-18 seconds. The map-reduce approach takes 70 seconds with four
-workers and 117 seconds with only one worker. These last two numbers
-are comparable with the naive shell-script implementation using pipes,
-which takes 141 seconds. In conclusion, the preliminary lua-mapreduce
-implementation, using MongoDB for communication and disk files as
-auxiliary storage, is between **1.2** and **2** times faster than a
-shell-script implementation using pipes. In the future, a larger
-data task will be chosen to compare this implementation with raw
-map-reduce in MongoDB and/or Hadoop.
+Looking at these numbers, it is clear that it is better to work in main memory
+and on the local filesystem, as in the naive Lua implementation, which needs
+only 17 seconds (user time) but uses local disk files. The map-reduce approach
+takes 64 seconds (real time) with four workers and 146 seconds (user time) with
+only one worker. These last two numbers are comparable with the naive
+shell-script implementation using pipes, which takes 143 seconds (user
+time). In conclusion, the preliminary lua-mapreduce implementation, using MongoDB
+for communication and GridFS for auxiliary storage, is up to **2** times faster
+than a shell-script implementation using pipes. Both implementations sort the
+data in order to aggregate the results. In the future, a larger data task will
+be chosen to compare this implementation with raw map-reduce in MongoDB and/or
+Hadoop.
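
The ratios behind these statements can be checked with a few lines of arithmetic. The sketch below only reuses figures quoted above: 64 s real time with four workers, 143 s user time for the shell-script pipeline, and the summed CPU time of 157.068044 s from the server log. The variable names are hypothetical, and comparing real time with user time is only a rough guide.

```lua
local mapreduce_real = 64    -- seconds, four workers (real time)
local pipes_user = 143       -- seconds, shell-script pipeline (user time)
local cpu_sum = 157.068044   -- seconds, Map + Reduce sum(cpu_time)

-- How many times faster the four-worker run is than the pipeline.
local speedup = pipes_user / mapreduce_real   -- about 2.2

-- Summed CPU time divided by wall-clock time: a rough estimate of how many
-- of the four cores were busy on average.
local busy_cores = cpu_sum / mapreduce_real   -- about 2.5

print(string.format("speedup vs pipes: %.2f", speedup))
print(string.format("average busy cores: %.2f", busy_cores))
```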

Last notes
----------
6 changes: 4 additions & 2 deletions examples/WordCount/finalfn.lua
@@ -1,7 +1,9 @@
+local it = 0
 return {
   init = function() end,
-  func = function(it)
-    for key,value in it do
+  func = function(pairs_iterator)
+    it = it + 1
+    for key,value in pairs_iterator do
       print(value,key)
     end
     return true -- indicates to remove mongo gridfs result files
13 changes: 12 additions & 1 deletion examples/WordCount/partitionfn.lua
@@ -1,6 +1,17 @@
+-- string hash function: http://isthe.com/chongo/tech/comp/fnv/
+local NUM_REDUCERS = 10
+local FNV_prime = 16777619
+local offset_basis = 2166136261
+local MAX = 2^32
 return {
   init = function() end,
   func = function(key)
-    return key:byte(#key) -- last character (numeric byte)
+    -- compute hash
+    local h = offset_basis
+    for i=1,#key do
+      h = (h * FNV_prime) % MAX
+      h = bit32.bxor(h, key:byte(i))
+    end
+    return h % NUM_REDUCERS
   end
 }
2 changes: 1 addition & 1 deletion execute_BIG_server.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-LUA_PATH="?.lua"
+LUA_PATH="?.lua;?/init.lua"
 lua execute_server.lua django wordcountBIG \
 examples.WordCountBig.taskfn \
 examples.WordCount.mapfn \
2 changes: 1 addition & 1 deletion execute_BIG_worker.sh
@@ -1,3 +1,3 @@
 #!/bin/bash
-LUA_PATH="?.lua"
+LUA_PATH="?.lua;?/init.lua"
 lua execute_worker.lua django wordcountBIG
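
The added `?/init.lua` template is what lets `require "mapreduce"` find `mapreduce/init.lua`. As a rough illustration of how Lua expands such a path, the sketch below lists the candidate file names for a module. The helper `candidates` is hypothetical and only mimics the substitution (dots in the module name become directory separators, then each `?` is replaced by the result), assuming `/` as the separator.

```lua
-- Hypothetical helper: list the file names a LUA_PATH-style string would
-- produce for a module name, without touching the filesystem.
local function candidates(path, modname)
  local fname = modname:gsub("%.", "/")   -- "mapreduce.server" -> "mapreduce/server"
  local out = {}
  for template in path:gmatch("[^;]+") do
    out[#out + 1] = (template:gsub("%?", fname))
  end
  return out
end

for _, file in ipairs(candidates("?.lua;?/init.lua", "mapreduce")) do
  print(file)
end
```

With the old value `?.lua`, only `mapreduce.lua` would be tried for `require "mapreduce"`. On Lua 5.2 or later, `package.searchpath(name, path)` performs the real lookup and returns the first readable file.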
7 changes: 3 additions & 4 deletions execute_server.lua
@@ -33,9 +33,8 @@ local function normalize(name)
   return name:gsub("/","."):gsub("%.lua$","")
 end
 --
-local server = require "mapreduce.server"
-local utils = require "mapreduce.utils"
-local s = server.new(connection_string, dbname)
+local mapreduce = require "mapreduce"
+local s = mapreduce.server.new(connection_string, dbname)
 s:configure{
   taskfn = normalize(taskfn),
   mapfn = normalize(mapfn),
@@ -49,5 +48,5 @@ s:configure{
   final_args = arg,
   result_ns = result_ns,
 }
-utils.sleep(4)
+mapreduce.utils.sleep(4)
 s:loop()
4 changes: 2 additions & 2 deletions execute_worker.lua
@@ -6,6 +6,6 @@
 --
 local connection_string = arg[1]
 local dbname = arg[2]
-local worker = require "mapreduce.worker"
-local w = worker.new(connection_string, dbname)
+local mapreduce = require "mapreduce"
+local w = mapreduce.worker.new(connection_string, dbname)
 w:execute()
4 changes: 4 additions & 0 deletions mapreduce/cnn.lua
@@ -21,6 +21,10 @@ function cnn:gridfs()
   return gridfs
 end

+function cnn:grid_file_builder()
+  return mongo.GridFileBuilder.New(self:connect(), self.gridfs_dbname)
+end
+
 function cnn:get_dbname()
   return self.dbname
 end
2 changes: 2 additions & 0 deletions mapreduce/init.lua
@@ -1,11 +1,13 @@
 local worker = require "mapreduce.worker"
 local server = require "mapreduce.server"
+local utils = require "mapreduce.utils"

 local mapreduce = {
   _VERSION = "0.1",
   _NAME = "mapreduce",
   worker = worker,
   server = server,
+  utils = utils,
 }

 -- integrity test