ayudovin/Spark-Optimization-Tutorial

Spark-Optimization-Tutorial

Today, enterprises seek solutions that are both cost- and time-efficient while still delivering strong performance and a good user experience. In this regard, there is always room for optimization. This report covers the basic principles and techniques of the Apache Spark optimization process. By applying the methods covered in this report, engineers will be able to save time for development tasks, bringing more value to the product.

Apache Spark is one of the most popular technologies for processing, managing, and analyzing big data. It is a unified analytics engine with built-in modules for streaming, SQL, machine learning, and graph processing.

In this report, we look closely at two popular modules of Apache Spark: Spark Core and Spark SQL.

Spark Core is the basis of Apache Spark and provides distributed task dispatching, scheduling, and basic I/O functionalities.

Spark SQL can be divided into two parts: the DataFrame API and pure Spark SQL. Although the DataFrame API and pure Spark SQL are essentially the same technology under the hood, they still differ in several ways, which are explored in the comparison.

To measure the results of the investigation, we used queries built from the following operations: two JOINs, GROUP BY, COUNT(), ORDER BY, and WHERE. Data samples for this investigation were taken from Stack Overflow. The code and examples are available in this GitHub repository.
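As an illustration of what such a query can look like, here is a minimal PySpark sketch that expresses one query shape both as pure Spark SQL and through the DataFrame API. It is not the query used in the report: the table names (`posts`, `users`, `comments`), the column names, and the `score > 10` filter are assumptions made up for this example, and the tables are assumed to be already registered in the Spark session.

```python
# Hypothetical query shape: two JOINs, WHERE, GROUP BY, COUNT(), ORDER BY.
# Table and column names are illustrative assumptions, not the report's schema.
QUERY_SQL = """
SELECT u.display_name, COUNT(*) AS n_comments
FROM posts p
JOIN users u ON p.owner_user_id = u.id
JOIN comments c ON c.post_id = p.id
WHERE p.score > 10
GROUP BY u.display_name
ORDER BY n_comments DESC
"""


def sql_version(spark):
    """Pure Spark SQL: pass the query text to the session."""
    return spark.sql(QUERY_SQL)


def dataframe_version(spark):
    """DataFrame API: the same query built from method calls."""
    # Imported here so the module can be loaded without PySpark installed.
    from pyspark.sql import functions as F

    posts = spark.table("posts").alias("p")
    users = spark.table("users").alias("u")
    comments = spark.table("comments").alias("c")
    return (
        posts.join(users, F.col("p.owner_user_id") == F.col("u.id"))
        .join(comments, F.col("c.post_id") == F.col("p.id"))
        .where(F.col("p.score") > 10)
        .groupBy("u.display_name")
        .agg(F.count("*").alias("n_comments"))
        .orderBy(F.col("n_comments").desc())
    )
```

Both functions return a lazily evaluated DataFrame; nothing runs until an action such as `collect()` or `count()` is called on the result.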

To establish baselines, the queries were executed on Spark Core, the DataFrame API, and Spark SQL with the default configuration. Afterwards, several optimization techniques were applied to each of them, and the queries were executed again. Query runtimes were measured in minutes.
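A baseline-versus-optimized comparison of this kind can be sketched with a small timing helper. This is not the harness used in the report; the function names `time_minutes` and `compare` are hypothetical, and each run is assumed to be a zero-argument callable that triggers a Spark action.

```python
import time
from typing import Callable, Dict


def time_minutes(action: Callable[[], object]) -> float:
    """Run the action once and return the elapsed wall-clock time in minutes."""
    start = time.perf_counter()
    action()
    return (time.perf_counter() - start) / 60.0


def compare(runs: Dict[str, Callable[[], object]]) -> Dict[str, float]:
    """Time each named run and return a mapping of name to minutes."""
    return {name: time_minutes(action) for name, action in runs.items()}


# Hypothetical usage, assuming `baseline_df` and `optimized_df` are DataFrames:
# results = compare({
#     "baseline": lambda: baseline_df.collect(),
#     "optimized": lambda: optimized_df.collect(),
# })
```

Because Spark evaluates transformations lazily, the timed callable has to end in an action such as `collect()` or `count()`; otherwise only the construction of the query plan is measured.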

Link to the report: https://www.altoros.com/research-papers/essential-optimization-methods-to-make-apache-spark-work-faster/
