Performance Profiling of Distributed Data Processing Frameworks to Inform Suitability for Analytic Jobs

cskor/Honors-Thesis

The amount of collected data has grown exponentially as devices have become more connected. On an average day, 500 million tweets and 294 billion emails are sent, 5 billion online searches are made, and 4 petabytes of data are created on Facebook. This growth has compelled companies to run analytics on the data they collect, both to make business decisions that generate higher profits and to keep consumers satisfied. Numerous tools exist for this purpose, each with its own way of loading and storing data for analysis. With so many analytical engines available, it is difficult to determine which one is the most efficient in terms of time, memory usage, and resource management.

How does the choice of analytical engine impact completion times? We examine three popular engines: Hadoop, Spark, and TensorFlow. For each engine, we explore how it performs when running regression model fitting algorithms. Indicators such as the number of page faults, throughput, completion time, and network I/O are used to determine each engine's impact on analytical performance.
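
As a rough illustration of how some of these indicators can be collected, the sketch below wraps a workload and records its completion time, page faults, and throughput using only the Python standard library. It is a minimal sketch under stated assumptions, not the measurement harness used in this thesis: `profile_job`, `ProfileResult`, and `toy_job` are hypothetical names, the workload is a stand-in for a real regression fit, and the counters cover only the local process. Profiling Hadoop, Spark, or TensorFlow on a cluster would require collecting the same counters on every worker node. Network I/O is omitted because the standard library has no portable counter for it.

```python
import resource  # Unix-only: exposes per-process page fault counters
import time
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class ProfileResult:
    elapsed_s: float              # wall-clock completion time
    major_faults: int             # page faults that required disk I/O
    minor_faults: int             # page faults served without disk I/O
    throughput_rows_per_s: float  # input rows processed per second


def profile_job(job: Callable[[Sequence[float]], object],
                rows: Sequence[float]) -> ProfileResult:
    """Run job(rows) once and report local-process indicators (hypothetical helper)."""
    before = resource.getrusage(resource.RUSAGE_SELF)
    start = time.perf_counter()
    job(rows)
    elapsed = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_SELF)
    return ProfileResult(
        elapsed_s=elapsed,
        major_faults=after.ru_majflt - before.ru_majflt,
        minor_faults=after.ru_minflt - before.ru_minflt,
        throughput_rows_per_s=len(rows) / elapsed if elapsed > 0 else float("inf"),
    )


def toy_job(rows: Sequence[float]) -> float:
    """Stand-in workload (assumption): a mean, not a real regression fit."""
    return sum(rows) / len(rows)


if __name__ == "__main__":
    print(profile_job(toy_job, list(range(1_000_000))))
```

Running the same wrapper around each engine's job submission on the same dataset would give comparable numbers per engine, although a single run is noisy and repeated trials would be needed before drawing conclusions.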

This paper seeks to answer the question: given a dataset with particular characteristics, which engine completes the task fastest while using computational resources effectively?
