This is an implementation of MIT's MapReduce lab.
We design a mechanism for data processing on large-scale distributed systems. Detailed background for the problem is explained in this paper. In addition to the basic design, we added our own load-balancing mechanism to optimize the process.
One "master" server contains all tasks contained in a queue. "Worker" servers constantly send requests through Remote Procedural Call (RPC) to master asking for tasks, then process tasks independently and concurrently, then each reports the results back. If the reply received from the master sets “tasksAllDone” to be true, the worker exits.
Each worker works on one "task", which is an input file. On a large dataset with unevenly sized input files, the time required to finish a task depends on the size of its file. Since the total time to finish the entire job is determined by the last worker to finish, workers with lighter tasks must wait for that last worker. The more uneven the file sizes, the longer the wait.
Based on the assumption that uneven input files degrade performance, we implemented a basic load balancer. The idea is to split the input files into evenly sized chunks, so that each worker takes on a similar amount of work. This makes the situation where a few workers sit idle waiting for one straggler less frequent.
Chunk implementation:
- Scan the input files and compute the total number of bytes.
- Split the files into chunks, each described by a start offset and a chunk size.
- Pass the chunk info as part of each Task; when a worker receives a task, it processes only the byte range given by the offset and chunk size (see the sketch after this list).
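A sketch of the chunking step, assuming a fixed chunk size and hypothetical names `Chunk`, `splitIntoChunks`, and `readChunk` (the actual implementation may differ):

```go
package main

import (
	"fmt"
	"os"
)

// Chunk identifies a byte range within one input file.
type Chunk struct {
	Filename string
	Offset   int64 // start offset in bytes
	Size     int64 // chunk size in bytes
}

// splitIntoChunks scans the input files and slices each into
// chunkSize-byte pieces (the last piece of a file may be smaller).
func splitIntoChunks(files []string, chunkSize int64) ([]Chunk, error) {
	var chunks []Chunk
	for _, f := range files {
		info, err := os.Stat(f)
		if err != nil {
			return nil, err
		}
		for off := int64(0); off < info.Size(); off += chunkSize {
			size := chunkSize
			if remaining := info.Size() - off; remaining < size {
				size = remaining
			}
			chunks = append(chunks, Chunk{Filename: f, Offset: off, Size: size})
		}
	}
	return chunks, nil
}

// readChunk shows how a worker would read only its assigned byte range.
func readChunk(c Chunk) ([]byte, error) {
	f, err := os.Open(c.Filename)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	buf := make([]byte, c.Size)
	_, err = f.ReadAt(buf, c.Offset)
	return buf, err
}

func main() {
	chunks, err := splitIntoChunks(os.Args[1:], 32000)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, c := range chunks {
		fmt.Printf("%s offset=%d size=%d\n", c.Filename, c.Offset, c.Size)
	}
}
```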
To test the effectiveness of chunking, we generated unevenly distributed input files from a long source text file: we split the source file into 4 uneven files, which are fed to the master.
The ratio of the 4 files is decided by the last four numbers of a Fibonacci-like sequence, where the last number is determined by user input. The four files always conform to the ratio a : b : a+b : a+2b. The larger the user-supplied input, the more disproportionate a and b are, and therefore the more uneven the file sizes.
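To illustrate the ratio structure, here is a small sketch under the assumption that the sequence is seeded with a = 1 and b = n, where n is the user input; the actual generator may seed differently.

```go
package main

import "fmt"

// fileRatios returns the four file-size weights a, b, a+b, a+2b,
// assuming (hypothetically) the seeds a = 1 and b = n.
func fileRatios(n int) [4]int {
	a, b := 1, n
	return [4]int{a, b, a + b, a + 2*b}
}

func main() {
	for _, n := range []int{4, 50, 500} {
		r := fileRatios(n)
		fmt.Printf("n=%d -> sizes in ratio %d:%d:%d:%d\n", n, r[0], r[1], r[2], r[3])
	}
}
```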
We score unevenness on the following scale: [4, 15, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500], with 4 being the most even file size distribution and 500 the most uneven. For each unevenness score, we ran 10 trials and recorded the median time. We chose the median over the mean to avoid skew from irregular run times.
First, we recorded the total runtime of MapReduce without load balancing for each file size distribution. Below is the median of 10 trials.
Then, we ran MapReduce with load balancing: for each chunk size, we ran 10 trials and calculated the median runtime. Results are recorded in the graph and table below. Results with chunk size 0 are taken from Graph 1 (no load balancing) for comparison.
When running MapReduce without a load balancer, runtime varies across file size distributions. When file sizes are most even (unevenness 4), the program takes the least time, while unevenness scores of 150, 300, and 400 tend to be local maxima in runtime. We hypothesize that this is because when file sizes are unbalanced, workers assigned shorter files must wait for workers assigned longer files. When file sizes are naturally balanced, no worker is idle for long and all finish their tasks at around the same time.
When we add load balancing, the runtime of the program decreases significantly. As chunk size increases, the runtime of MapReduce decreases, bottoming out at around 5.01 seconds at a chunk size of 32000 bytes, beyond which no further improvement is seen. At the optimal chunk size, runtime is up to four times lower than that of MapReduce without load balancing.
With load balancing added, runtime also becomes uniform across all file size distributions. Since the total amount of data is the same in all cases, this shows the load balancer successfully distributed work evenly across all workers. This experiment demonstrates that load balancing by chunking input files is an effective strategy for optimizing a distributed system.