Catla for Hadoop and Spark (Catla-HS): An open-source system to support tuning MapReduce performance on Hadoop and Spark clusters.

dhchenx/Catla-HS

Catla for Hadoop and Spark

Catla-HS

Catla for Hadoop and Spark (Catla-HS) is a self-tuning system that adjusts Hadoop parameters to improve the performance of MapReduce jobs on both Hadoop and Spark clusters. It comes with additional tools, including machine learning support and a performance visualization tool. Catla-HS is an improved version of Catla, our earlier work, which targeted Hadoop clusters only.

This redesigned project is template-driven, which makes it flexible enough to handle complicated job execution, monitoring and self-tuning of MapReduce performance, and to cover more recent platforms such as Spark. The project also provides easy-to-use prediction and visualization tools for designing jobs and for analyzing, visualizing and predicting the performance of MapReduce jobs.

Architecture

Fig. 1 Architecture of Catla-HS

Components

  1. Task Runner: submits a single MapReduce job to a Hadoop or Spark cluster and retrieves its analysis results and logs after the job completes.
  2. Project Runner: submits a group of MapReduce jobs organized in a project folder and monitors them until completion; all analysis results and logs, which record the running time of every MapReduce phase, are then downloaded to a specified location in the project folder.
  3. Optimizer Runner: creates a series of MapReduce jobs with different combinations of parameter values, as defined in parameter configuration files, and reports the parameter values with the lowest time cost once tuning finishes. Two tuning approaches are supported: direct search and derivative-free optimization (DFO).
  4. Predictor Runner: provides multiple prediction models that fit the tuning results and predict future performance changes of MapReduce jobs. (New)
  5. Performance visualization tool: helps users analyze and visualize the data collected from tuning jobs and make decisions based on it. (New)
  6. Performance analysis tool: aggregates MapReduce job profiles and summarizes the time cost of each phase in a job. (New)
  7. Machine learning mining tool: supports modeling with existing machine learning techniques, using tuning data and metric data from the tuning process. (New)
  8. CatlaUI: a user-friendly GUI for the main functions of Catla-HS.
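
As a rough illustration of the kind of summary the performance analysis tool (component 6) produces, the sketch below sums time costs per MapReduce phase. It is not Catla-HS code: the `Timing` class, the phase names and the numbers are hypothetical, and the real job profile format may look quite different.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PhaseSummary {
    /** One hypothetical timing record: a phase name and the time spent in it. */
    static final class Timing {
        final String phase;
        final long millis;

        Timing(String phase, long millis) {
            this.phase = phase;
            this.millis = millis;
        }
    }

    /** Sums the time per phase, keeping phases in first-seen order. */
    static Map<String, Long> totalByPhase(List<Timing> timings) {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (Timing t : timings) {
            totals.merge(t.phase, t.millis, Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Made-up records standing in for values parsed from job logs.
        List<Timing> timings = Arrays.asList(
                new Timing("map", 1200),
                new Timing("shuffle", 300),
                new Timing("map", 800),
                new Timing("reduce", 500));
        System.out.println(totalByPhase(timings));
    }
}
```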

Flowchart of tuning

Fig. 2 Usage of Catla-HS, which supports both Hadoop and Spark

Usage

Below are some typical uses of Catla-HS.

(1) Shell

Run Catla-HS.jar from a terminal:

java -jar Catla-HS.jar -tool project -dir /your-example-folder/project_wordcount -task pipeline -download true -sequence true

(2) Execute using CatlaRunner

Example 1: Submit a MapReduce job

    // Options are passed as name/value pairs, as on the command line.
    String[] args = new String[] {
        "-tool", "task",
        "-dir", "\\YOUR-FOLDER\\task_wordcount"
    };

    CatlaRunner.main(args);

Example 2: Submit a composite MapReduce task with multiple jobs

    // Same options as the shell example above, in array form.
    String[] args = new String[] {
        "-tool", "project",
        "-dir", "\\YOUR-FOLDER\\project_wordcount",
        "-task", "pipeline",
        "-download", "true",
        "-sequence", "true"
    };

    CatlaRunner.main(args);
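
The shell form and the array form carry the same name/value pairs. The sketch below shows one way to convert between them and to inspect the pairs. `CatlaArgs` is a hypothetical helper, not part of Catla-HS, and it assumes no value contains whitespace (a path with spaces would need proper quoting support).

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CatlaArgs {
    /** Splits a whitespace-separated option string into an argument array. */
    static String[] toArray(String commandLine) {
        return commandLine.trim().split("\\s+");
    }

    /** Pairs each "-name" token with the value that follows it. */
    static Map<String, String> toOptions(String[] args) {
        if (args.length % 2 != 0) {
            throw new IllegalArgumentException("expected -name value pairs");
        }
        Map<String, String> options = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i += 2) {
            if (!args[i].startsWith("-")) {
                throw new IllegalArgumentException("not an option name: " + args[i]);
            }
            options.put(args[i].substring(1), args[i + 1]);
        }
        return options;
    }

    public static void main(String[] ignored) {
        // Option string copied from the shell example in this README.
        String[] args = toArray("-tool project -dir /your-example-folder/project_wordcount "
                + "-task pipeline -download true -sequence true");
        System.out.println(toOptions(args));
        // The same array could then be handed to CatlaRunner.main(args).
    }
}
```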

Example 3: Tuning using Exhaustive Search

    String[] args = new String[] {
        "-tool", "tuning",
        "-dir", "\\YOUR-FOLDER\\tuning_similarity",
        "-clean", "true",
        "-group", "wordcount",
        "-upload", "false",
        "-uploadjar", "true"
    };

    CatlaRunner.main(args);
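
Conceptually, exhaustive search evaluates every combination of candidate parameter values and keeps the cheapest one. The sketch below shows that idea for two parameters. It is not how Catla-HS is implemented: the cost function is a made-up stand-in for "submit the job and measure its running time", and the two Hadoop properties named in the comments (`mapreduce.task.io.sort.mb`, `mapreduce.job.reduces`) are chosen only for illustration.

```java
import java.util.function.ToDoubleBiFunction;

final class GridSearchSketch {
    /** Tries every (x, y) pair and returns {bestX, bestY, bestCost}. */
    static double[] search(int[] xs, int[] ys, ToDoubleBiFunction<Integer, Integer> cost) {
        double[] best = {Double.NaN, Double.NaN, Double.POSITIVE_INFINITY};
        for (int x : xs) {
            for (int y : ys) {
                double c = cost.applyAsDouble(x, y);
                if (c < best[2]) {
                    best[0] = x;
                    best[1] = y;
                    best[2] = c;
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int[] sortMb = {50, 100, 200, 400}; // e.g. mapreduce.task.io.sort.mb
        int[] reduces = {1, 2, 4, 8};       // e.g. mapreduce.job.reduces

        // Hypothetical running time in seconds; a real run would execute the job.
        ToDoubleBiFunction<Integer, Integer> fakeRunTime =
                (mb, r) -> 60 + Math.abs(mb - 200) / 10.0 + 3 * Math.abs(r - 4);

        double[] best = search(sortMb, reduces, fakeRunTime);
        System.out.printf("best sort.mb=%.0f, reduces=%.0f, time=%.1f s%n",
                best[0], best[1], best[2]);
    }
}
```

The number of job runs is the product of the candidate list sizes, which is why the derivative-free optimizers in the next example become attractive as the number of parameters grows.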

Example 4: Tuning using BOBYQA (a method of derivative-free optimization)

    String[] args = new String[] {
        "-tool", "optimizer",
        "-dir", "\\YOUR-FOLDER\\tuning_wordcount",
        "-clean", "true",
        "-group", "wordcount",
        "-upload", "true",
        "-uploadjar", "true",
        "-maxinter", "1000",
        "-optimizer", "BOBYQA",
        "-BOBYQA-initTRR", "20",
        "-BOBYQA-stopTRR", "1.0e-4"
    };

    CatlaRunner.main(args);

For advanced usage, see the project repository (dhchenx/Catla-HS).

Analysis results using Catla-HS

(1) Exhaustive search

Fig. 3 Three-dimensional surface plot of running time of a MapReduce job over two Hadoop configuration parameters using the exhaustive search method on Hadoop

Fig. 4 Two-dimensional plot of running time of a MapReduce job over one Hadoop configuration parameter using the exhaustive search method on Spark

(2) Derivative-free optimization-based search

Fig. 5 Running time of a MapReduce job versus the number of iterations when tuning with the BOBYQA optimizer

Other DFO-based algorithms supported include:

  1. Powell's method
  2. CMA-ES
  3. Simplex methods
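
To give a feel for what "derivative-free" means, here is a minimal compass (pattern) search. It is deliberately simpler than BOBYQA or any of the algorithms listed above and is not what Catla-HS runs. The cost function is a hypothetical stand-in for a measured running time, and the `initStep`, `stopStep` and `maxEvals` parameters are only loosely analogous to the initial radius, stopping radius and iteration limit that the options in Example 4 appear to set.

```java
import java.util.Arrays;
import java.util.function.ToDoubleFunction;

final class CompassSearchSketch {
    /**
     * Minimizes cost without gradients: probes +/- step along each axis,
     * moves to any better point, and halves the step when nothing improves.
     */
    static double[] minimize(ToDoubleFunction<double[]> cost, double[] start,
                             double initStep, double stopStep, int maxEvals) {
        double[] x = start.clone();
        double fx = cost.applyAsDouble(x);
        int evals = 1;
        double step = initStep;
        while (step > stopStep && evals < maxEvals) {
            boolean improved = false;
            for (int i = 0; i < x.length; i++) {
                for (int sign = -1; sign <= 1 && evals < maxEvals; sign += 2) {
                    double[] trial = x.clone();
                    trial[i] += sign * step;
                    double ft = cost.applyAsDouble(trial);
                    evals++;
                    if (ft < fx) {
                        x = trial;
                        fx = ft;
                        improved = true;
                        break; // keep the better point, go to the next axis
                    }
                }
            }
            if (!improved) {
                step /= 2; // no neighbour is better: refine the search
            }
        }
        return x;
    }

    public static void main(String[] args) {
        // Hypothetical smooth cost with its minimum at (3, -1).
        ToDoubleFunction<double[]> cost =
                p -> (p[0] - 3) * (p[0] - 3) + (p[1] + 1) * (p[1] + 1);
        double[] best = minimize(cost, new double[] {0, 0}, 1.0, 1.0e-4, 1000);
        System.out.println(Arrays.toString(best));
    }
}
```

Each cost evaluation corresponds to one job run in a real tuning session, so the evaluation budget matters far more than the arithmetic inside the optimizer.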

Fitting model

Catla-HS includes an additional component, PredictorRunner, for fitting and predicting performance changes. By applying several kinds of fitting analysis, we can build a prediction model for evaluating MapReduce job performance.

The component currently supports:

  1. linear fitting
  2. multivariate linear fitting
  3. logarithmic fitting
  4. exponential fitting
  5. polynomial fitting

An example is below:

Example of fitting model
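
For readers unfamiliar with these fitting types, the sketch below shows ordinary least-squares linear fitting and a logarithmic fit obtained by applying the same routine to ln(x). It is a self-contained illustration with made-up data points, not the PredictorRunner implementation, and the class and method names are hypothetical.

```java
final class FittingSketch {
    /** Least-squares fit of y = a + b * x; returns {a, b}. */
    static double[] fitLinear(double[] x, double[] y) {
        int n = x.length;
        if (n < 2 || n != y.length) {
            throw new IllegalArgumentException("need at least two (x, y) pairs");
        }
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i];
            sy += y[i];
            sxx += x[i] * x[i];
            sxy += x[i] * y[i];
        }
        double denom = n * sxx - sx * sx;
        if (denom == 0) {
            throw new IllegalArgumentException("all x values are identical");
        }
        double b = (n * sxy - sx * sy) / denom;
        double a = (sy - b * sx) / n;
        return new double[] {a, b};
    }

    /** Fit of y = a + b * ln(x), done as a linear fit on ln(x); x must be positive. */
    static double[] fitLogarithmic(double[] x, double[] y) {
        double[] logX = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            logX[i] = Math.log(x[i]);
        }
        return fitLinear(logX, y);
    }

    public static void main(String[] args) {
        // Hypothetical data: parameter value versus running time in seconds.
        double[] param = {1, 2, 3, 4};
        double[] time = {3, 5, 7, 9};
        double[] line = fitLinear(param, time);
        System.out.printf("time = %.2f + %.2f * param%n", line[0], line[1]);
    }
}
```

Exponential fitting can be handled the same way by fitting ln(y) against x, provided all y values are positive.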

Credits

This project is built on Apache Hadoop, Apache Commons Math3 and Apache MINA SSHD, which are released under the Apache License, Version 2.0.

We also use XCharts for visualizing the results.

We currently use Java-ML to implement several machine learning algorithms in Catla-HS.

Citation

Donghua Chen, "An Open-Source Project for MapReduce Performance Self-Tuning," arXiv:1912.12456 [cs.DC], Dec. 2019.

OR

@misc{chen2019opensource,
    title={An Open-Source Project for MapReduce Performance Self-Tuning},
    author={Donghua Chen},
    year={2019},
    eprint={1912.12456},
    archivePrefix={arXiv},
    primaryClass={cs.DC}
}

LICENSE

See the LICENSE file for license rights and limitations (GNU GPLv3).
