Performance Profiling of Distributed Data Processing Frameworks to Inform Suitability for Analytic Jobs

The amount of collected data has increased exponentially since devices have become more connected. On average, there are 500 million tweets sent, 5 billion online searches, 294 billion emails sent, and 4 petabytes of data created on Facebook every day. This has compelled companies to perform analytics on the collected data to improve business decisions that would generate higher profits and keep consumers satisfied. There are numerous tools that have their own way of loading and storing the data to perform analysis. With multiple options available for analytical engines, it is difficult to determine what engine is the most efficient in terms of time, memory usage, and resource management.

How does the choice of analytical engines impact completion times? We will be looking at three popular engines: Hadoop, Spark, and Tensorflow. We will explore the performance effects for regression model fitting algorithms for the aforementioned engines. Different indicators such as number of page faults, throughput, completion times, and network I/O will be used to determine each engine’s impact on analytical performance.

This paper seeks to answer the question: given a dataset with certain types of characteristics, which engine would complete the task faster while effectively using computational resources?

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
data		data
data_gen		data_gen
hadoop		hadoop
manual		manual
spark		spark
.gitignore		.gitignore
README.md		README.md
Thesis Scratch Work.ipynb		Thesis Scratch Work.ipynb
results.csv		results.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Performance Profiling of Distributed Data Processing Frameworks to Inform Suitability for Analytic Jobs

About

Releases

Packages

Languages

cskor/Honors-Thesis

Folders and files

Latest commit

History

Repository files navigation

Performance Profiling of Distributed Data Processing Frameworks to Inform Suitability for Analytic Jobs

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages