Big data analytics performed with Spark and Hadoop on RITA airlines dataset (8.3 GB)
The goal of the project is to infer qualitative data regarding USA flights during the years 1994-2008. The data can be downloaded from stat-computing.org.
Using both Hadoop and Spark, provide the following information:
- The percentage of canceled flights per day, throughout the entire data set.
- Weekly percentages of delays that are due to weather, throughout the entire data set.
- The percentage of flights belonging to a given "distance group" that were able to halve their departure delays by the time they arrived at their destinations. Distance groups sort flights by their total distance in miles. Flights with distances of less than 200 miles belong in group 1, flights with distances between 200 and 399 miles belong in group 2, flights with distances between 400 and 599 miles belong in group 3, and so on. The last group contains flights whose distances are between 2400 and 2599 miles.
- A weekly "penalty" score for each airport that depends on both its incoming and outgoing flights. The score adds 0.5 for each incoming flight that is more than 15 minutes late, and 1 for each outgoing flight that is more than 15 minutes late.
- An additional data analysis defined by your group. Analysis chosen: the yearly number of flights provided by the carriers for each route.
Use charts to present the information that you have extracted from the data sets.
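The distance-group rule in the task list can be illustrated with a minimal sketch in plain Python. It is not the project's Spark or Hadoop code. The function names and the input tuple layout are hypothetical. Two readings of the task are assumed here: a flight "halves" its delay when it departed late and its arrival delay is at most half of its departure delay, and the percentage is taken over the flights in each group that departed late.

```python
from typing import Dict, Iterable, Optional, Tuple

GROUP_WIDTH_MILES = 200
LAST_GROUP = 13  # 2400-2599 miles, the last group named in the task


def distance_group(distance_miles: float) -> Optional[int]:
    """Return the 1-based distance group, or None outside groups 1..13."""
    if distance_miles < 0:
        return None
    group = int(distance_miles // GROUP_WIDTH_MILES) + 1
    return group if group <= LAST_GROUP else None


def halved_delay(dep_delay: float, arr_delay: float) -> bool:
    """Assumed reading: left late, arrived with at most half that delay."""
    return dep_delay > 0 and arr_delay <= dep_delay / 2


def halved_percentage_by_group(
    flights: Iterable[Tuple[float, float, float]]
) -> Dict[int, float]:
    """flights: (distance_miles, dep_delay_min, arr_delay_min) tuples."""
    totals: Dict[int, int] = {}
    halved: Dict[int, int] = {}
    for distance, dep_delay, arr_delay in flights:
        group = distance_group(distance)
        # Assumption: only flights that departed late enter the denominator.
        if group is None or dep_delay <= 0:
            continue
        totals[group] = totals.get(group, 0) + 1
        if halved_delay(dep_delay, arr_delay):
            halved[group] = halved.get(group, 0) + 1
    return {g: 100.0 * halved.get(g, 0) / totals[g] for g in sorted(totals)}
```

With this rule, a 199-mile flight falls in group 1, a 200-mile flight in group 2, and a 2599-mile flight in group 13. Longer flights are left out of the sketch because the task does not assign them a group.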
We address the problem using the Spark framework to handle the large volume of available data. To present our analysis effectively, we use Jupyter Notebooks, which let us show the sequence of data transformations behind each analysis together with the resulting visualizations.
Check out the reports to see the analyses performed.
Check out the documentation to set up and run the analyses.
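As an illustration of one such transformation, the weekly airport penalty score from the task list can be sketched in plain Python. This is a minimal sketch, not the notebooks' Spark code. The function name and record layout are hypothetical. It assumes that "late" means the departure delay for outgoing flights and the arrival delay for incoming flights, and that weeks are ISO weeks.

```python
from collections import defaultdict
from datetime import date

LATE_THRESHOLD_MIN = 15


def weekly_penalties(flights):
    """Sum penalty scores per (ISO week, airport).

    flights: iterable of (flight_date, origin, dest, dep_delay, arr_delay),
    with delays in minutes, or None when the value is missing.
    """
    scores = defaultdict(float)
    for flight_date, origin, dest, dep_delay, arr_delay in flights:
        iso = flight_date.isocalendar()
        week = (iso[0], iso[1])  # (ISO year, ISO week number)
        # Outgoing flight more than 15 minutes late: +1 for the origin.
        if dep_delay is not None and dep_delay > LATE_THRESHOLD_MIN:
            scores[(week, origin)] += 1.0
        # Incoming flight more than 15 minutes late: +0.5 for the destination.
        if arr_delay is not None and arr_delay > LATE_THRESHOLD_MIN:
            scores[(week, dest)] += 0.5
    return dict(scores)


sample = [
    (date(2008, 1, 7), "JFK", "LAX", 20, 30),
    (date(2008, 1, 8), "LAX", "JFK", 5, 16),
]
penalties = weekly_penalties(sample)
```

The same grouping key, a (week, airport) pair, would carry over to a distributed version as the key of a reduce-by-key or group-by aggregation.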
USA airlines statistics presentation - Arcari, Cilloni, Gregori
This project has been developed for the Middleware Technologies for Distributed Systems course (A.Y. 2017/2018) at Politecnico di Milano. Look at the polimi-mt-acg page for other projects.