return {
}
```

Performance notes
-----------------

Word-count example using the [Europarl v7 English data](http://www.statmt.org/europarl/),
with *1,965,734 lines* and *49,158,635 running words*. The data has been split
into 197 files with a maximum of *10,000* lines per file. The task is executed
on *one machine* with *four cores*, which runs a MongoDB server, a
lua-mapreduce server and four lua-mapreduce workers. **Note** that this
comparison is not entirely fair, because the whole task fits in main memory.

The output of lua-mapreduce was:

```
$ ./execute_BIG_server.sh > output
# Iteration 1
# Preparing MAP
# MAP execution
100.0 %
# Preparing REDUCE
# Merge and partitioning
100.0 %
# Creating jobs
# REDUCE execution
100.0 %
# FINAL execution
# 70 seconds
```

**Note:** with only one worker the same task takes 117 seconds.

A naive word-count version implemented with pipes and shell scripts takes:

```
$ time cat /home/experimentos/CORPORA/EUROPARL/en-splits/* | \
tr ' ' '\n' | sort | uniq -c > output-pipes
real 2m21.272s
user 2m23.339s
sys 0m2.951s
```

A naive word-count version implemented in Lua takes:

```
$ time cat /home/experimentos/CORPORA/EUROPARL/en-splits/* | \
lua naive.lua > output-naivetime
real 0m17.604s
user 0m17.064s
sys 0m1.445s
```
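
The `naive.lua` script itself is not listed in this README; purely as an
illustration, a minimal in-memory counter of this kind, assuming it reads from
standard input and writes `count word` pairs, could look like:

```lua
-- illustrative sketch (not the actual naive.lua): count words from stdin in memory
local counts = {}
for line in io.lines() do               -- io.lines() with no argument reads stdin
  for word in line:gmatch("%S+") do     -- split on whitespace
    counts[word] = (counts[word] or 0) + 1
  end
end
for word, n in pairs(counts) do         -- dump "count word" pairs, like uniq -c
  io.write(("%7d %s\n"):format(n, word))
end
```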

Looking at these numbers, it is clear that the best option is to work in
main memory, as the naive Lua implementation does, needing only
18 seconds. The map-reduce approach takes 70 seconds with four
workers and 117 seconds with only one worker. These last two numbers
are comparable with the naive shell-script implementation using pipes,
which takes 141 seconds. In conclusion, the preliminary lua-mapreduce
implementation, using MongoDB for communication and disk files as
auxiliary storage, is between **17%** and **50%** faster than a
shell-script implementation using pipes. In the future, a larger
data task will be chosen to compare this implementation with raw
map-reduce in MongoDB and/or Hadoop.
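
As a quick sanity check on those percentages (using the timings reported
above), the speedups work out as follows:

```lua
-- speedup relative to the 141-second shell-script pipeline
print(("one worker:   %.0f%% faster"):format((1 - 117/141) * 100))  -- ~17%
print(("four workers: %.0f%% faster"):format((1 -  70/141) * 100))  -- ~50%
```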

Last notes
----------
