Catla-HS (Catla for Hadoop and Spark) is a self-tuning system for Hadoop parameters that improves the performance of MapReduce jobs on both Hadoop and Spark clusters, and ships with advanced tools such as machine learning support and a performance visualization tool. Catla-HS is an improved version of Catla, our previous work, which focused only on Hadoop clusters.
This redesigned project is template-driven, making it flexible enough to perform complex job execution, monitoring, and self-tuning of MapReduce performance, and extending coverage to more modern platforms such as Spark. In addition, the project provides easy-to-use prediction and visualization tools for designing jobs and for analyzing, visualizing, and predicting the performance of MapReduce jobs.
- Task Runner: To submit a single MapReduce job to a Hadoop or Spark cluster and obtain its analysis results and logs after the job completes.
- Project Runner: To submit a group of MapReduce jobs organized in a project folder and monitor their status until completion; afterwards, all analysis results and logs, which record the running time of every MapReduce phase, are downloaded to the specified location in the project folder.
- Optimizer Runner: To create a series of MapReduce jobs with different combinations of parameter values according to parameter configuration files and obtain the parameter values with the least time cost once the tuning process finishes. Two tuning approaches are supported: direct search and derivative-free optimization (DFO).
- Predictor Runner: To provide multiple prediction models that help fit the tuning results and predict future performance changes of MapReduce jobs. (New)
- Performance visualization tool: A tool that helps users analyze, visualize, and make decisions based on the data collected from tuning jobs. (New)
- Performance analysis tool: To aggregate MapReduce job profiles and provide a summary of the time cost of each phase in the job. (New)
- Machine learning mining tool: To support modeling with existing machine learning techniques using tuning data and metric data collected during the tuning process. (New)
- CatlaUI: Provides a user-friendly GUI for the most important functions of Catla-HS.
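To give an intuition for what the Optimizer Runner's direct (exhaustive) search does, here is a minimal, self-contained sketch in plain Java. It grid-searches two hypothetical Hadoop parameters against a stand-in cost function; in the real system the "cost" is the measured running time of an actual MapReduce job, and none of the names below are Catla-HS APIs.

```java
import java.util.Arrays;

// Illustrative sketch of exhaustive parameter search (NOT the Catla-HS API).
// A synthetic cost function stands in for a real job's measured running time.
public class GridSearchSketch {
    // Hypothetical cost surface over two parameters, e.g. a sort buffer size
    // in MB and a reducer count; minimum is at (200, 8) by construction.
    static double runningTime(int sortMb, int reducers) {
        return Math.pow(sortMb - 200, 2) / 100.0 + Math.pow(reducers - 8, 2) + 30;
    }

    public static void main(String[] args) {
        int[] sortMbValues = {100, 150, 200, 250};
        int[] reducerValues = {4, 8, 12, 16};
        double best = Double.MAX_VALUE;
        int[] bestCombo = null;
        // Try every combination of values, as exhaustive search does,
        // and keep the one with the least time cost.
        for (int mb : sortMbValues) {
            for (int r : reducerValues) {
                double t = runningTime(mb, r);
                if (t < best) { best = t; bestCombo = new int[] {mb, r}; }
            }
        }
        System.out.println("best=" + Arrays.toString(bestCombo) + " time=" + best);
    }
}
```

Exhaustive search evaluates every combination, so its cost grows multiplicatively with the number of tuned parameters; this is why Catla-HS also offers DFO-based tuning for larger parameter spaces.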
Below are some typical uses of Catla-HS.
Using Catla-HS.jar in a terminal:
java -jar Catla-HS.jar -tool project -dir /your-example-folder/project_wordcount -task pipeline -download true -sequence true
Example 1: Submit a MapReduce job
String[] args=new String[] {
"-tool","task",
"-dir","\\YOUR-FOLDER\\task_wordcount"
};
CatlaRunner.main(args);
Example 2: Submit a composite MapReduce task with multiple jobs
String[] args=new String[] {
"-tool","project",
"-dir","\\YOUR-FOLDER\\project_wordcount",
"-task","pipeline",
"-download","true",
"-sequence","true"
};
CatlaRunner.main(args);
Example 3: Tuning using Exhaustive Search
String[] args = new String[] {
"-tool","tuning",
"-dir", "\\YOUR-FOLDER\\tuning_similarity",
"-clean", "true",
"-group", "wordcount",
"-upload","false",
"-uploadjar","true"
};
CatlaRunner.main(args);
Example 4: Tuning using BOBYQA (a method of derivative-free optimization)
String[] args = new String[] {
"-tool","optimizer",
"-dir", "\\YOUR-FOLDER\\tuning_wordcount",
"-clean", "true",
"-group", "wordcount",
"-upload","true",
"-uploadjar","true",
"-maxinter","1000",
"-optimizer","BOBYQA",
"-BOBYQA-initTRR","20",
"-BOBYQA-stopTRR","1.0e-4"
};
CatlaRunner.main(args);
For advanced usage, see here.
Fig. 3 Three-dimensional surface plot of the running time of a MapReduce job over two Hadoop configuration parameters, using the exhaustive search method on Hadoop
Fig. 4 Two-dimensional plot of the running time of a MapReduce job over one Hadoop configuration parameter, using the exhaustive search method on Spark
Fig. 5 Change in the running time of a MapReduce job over the number of iterations when tuning with the BOBYQA optimizer
Other DFO-based algorithms supported include:
- Powell's method
- CMA-ES
- Simplex methods
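All of the algorithms above share the same idea: minimize job running time using only function evaluations, without gradients. As a rough, self-contained illustration of that family (deliberately simpler than BOBYQA or CMA-ES, and not Catla-HS's implementation), the sketch below runs a compass/pattern search on a synthetic cost surface:

```java
// Minimal compass (pattern) search: a simple derivative-free method in the
// same family as the DFO algorithms listed above (NOT the Catla-HS code).
public class CompassSearchSketch {
    // Stand-in objective: pretend running time over two tuned parameters;
    // minimum is at (3, -1) with value 10 by construction.
    static double cost(double[] x) {
        return Math.pow(x[0] - 3, 2) + Math.pow(x[1] + 1, 2) + 10;
    }

    static double[] minimize(double[] start, double step, double tol, int maxIter) {
        double[] x = start.clone();
        double fx = cost(x);
        for (int it = 0; it < maxIter && step > tol; it++) {
            boolean improved = false;
            // Probe +/- step along each coordinate; accept any improvement.
            for (int d = 0; d < x.length && !improved; d++) {
                for (double s : new double[] {step, -step}) {
                    double[] trial = x.clone();
                    trial[d] += s;
                    double ft = cost(trial);
                    if (ft < fx) { x = trial; fx = ft; improved = true; break; }
                }
            }
            if (!improved) step /= 2;  // shrink the pattern when nothing helps
        }
        return x;
    }

    public static void main(String[] args) {
        double[] xmin = minimize(new double[] {0, 0}, 1.0, 1e-6, 10_000);
        System.out.printf("x=(%.3f, %.3f) f=%.3f%n", xmin[0], xmin[1], cost(xmin));
    }
}
```

Methods like BOBYQA improve on this by building a local quadratic model of the objective inside a trust region (the `-BOBYQA-initTRR` and `-BOBYQA-stopTRR` options in Example 4 set that region's initial and stopping radii), which typically needs far fewer evaluations; that matters when every evaluation is a full MapReduce job run.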
In Catla-HS, an additional component called PredictorRunner facilitates fitting and predicting performance changes. Using multiple fitting analyses, a prediction model can be established to evaluate MapReduce job performance.
The component currently supports:
- linear fitting
- multivariate linear fitting
- logarithmic fitting
- exponential fitting
- polynomial fitting
An example is below:
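As a stand-alone sketch of the simplest supported model, linear fitting, the code below performs an ordinary least-squares fit of y = a + b·x in plain Java. It only illustrates the concept: the class, variable names, and data are hypothetical, not the PredictorRunner API (which the project builds on Apache Commons Math3).

```java
// Ordinary least-squares linear fit y = a + b*x, sketching what "linear
// fitting" means in a performance predictor (NOT the Catla-HS code).
public class LinearFitSketch {
    // Returns {intercept a, slope b} minimizing the sum of squared residuals.
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i]; sy += y[i];
            sxx += x[i] * x[i]; sxy += x[i] * y[i];
        }
        double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double a = (sy - b * sx) / n;
        return new double[] {a, b};
    }

    public static void main(String[] args) {
        // Hypothetical data: x = a tuned parameter value,
        // y = observed job running time in seconds.
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {12.1, 14.0, 16.2, 17.9, 20.1};
        double[] ab = fit(x, y);
        // Use the fitted line to predict the running time at x = 6.
        System.out.printf("a=%.3f b=%.3f predict(6)=%.3f%n",
                ab[0], ab[1], ab[0] + ab[1] * 6);
    }
}
```

The other supported models (logarithmic, exponential, polynomial) follow the same pattern: transform or expand x, solve the least-squares problem, then evaluate the fitted curve at unseen parameter values.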
This project is built upon Apache Hadoop, Apache Commons Math3, and Apache MINA SSHD, all under the Apache License, Version 2.0.
We also use XChart for visualizing the results.
We currently use Java-ML to implement several machine learning algorithms in Catla-HS.
Donghua Chen, "An Open-Source Project for MapReduce Performance Self-Tuning," arXiv:1912.12456 [cs.DC], Dec. 2019.
or
@misc{chen2019opensource,
  title={An Open-Source Project for MapReduce Performance Self-Tuning},
  author={Donghua Chen},
  year={2019},
  eprint={1912.12456},
  archivePrefix={arXiv},
  primaryClass={cs.DC}
}
See the LICENSE file for license rights and limitations (GNU GPLv3).