Access the project via Databricks here
Flight delays disrupt scheduling for airlines and airports, causing passenger inconvenience and large economic losses. As a result, there is growing interest in predicting delays in advance to optimize operations and improve customer satisfaction. The objective of this playground project is to predict flight departure delays two hours before departure, at scale. The project explores a series of data transformation and ML pipelines in Apache Spark (using Databricks) and concludes with the challenges faced along the way and the key lessons learned.
The Databricks notebook is connected to AWS, where it can create and manage compute and VPC resources. The notebook reads its data through a mounted S3 bucket on AWS.
Datasets used in the project include the following:
- flight dataset from the US Department of Transportation containing flight information from 2015 to 2019 (31,746,841 x 109 dataframe)
- weather dataset from the National Oceanic and Atmospheric Administration repository containing weather information from 2015 to 2019 (630,904,436 x 177 dataframe)
- airport dataset from the US Department of Transportation (18,097 x 10 dataframe)
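Because predictions are made two hours before departure, only weather observations recorded at or before that cutoff can be joined to a flight. The following is a minimal sketch of that constraint in plain Python, not the notebook's actual Spark code. The function name, the tuple layout, the `wind_speed` field, and the sample values are all hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta

# Lead time taken from the project objective (predict two hours ahead).
PREDICTION_LEAD = timedelta(hours=2)


def latest_usable_observation(scheduled_departure, observations):
    """Return the most recent observation at or before the prediction cutoff.

    `observations` is an iterable of (timestamp, reading) tuples (a hypothetical
    layout). Returns None when no observation is early enough.
    """
    cutoff = scheduled_departure - PREDICTION_LEAD
    usable = [obs for obs in observations if obs[0] <= cutoff]
    return max(usable, key=lambda obs: obs[0]) if usable else None


# Illustrative, made-up values: a 14:30 departure gives a 12:30 cutoff.
departure = datetime(2019, 6, 1, 14, 30)
weather = [
    (datetime(2019, 6, 1, 11, 0), {"wind_speed": 4.1}),
    (datetime(2019, 6, 1, 12, 0), {"wind_speed": 5.6}),
    (datetime(2019, 6, 1, 13, 0), {"wind_speed": 9.3}),  # after the cutoff, excluded
]

print(latest_usable_observation(departure, weather))
```

Here the 13:00 reading is skipped even though it is the closest to departure, since it would not yet exist when the prediction is made. At the scale of the datasets above, the same rule would be expressed as a join condition rather than a Python loop.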
The project can be directly accessed via Spark Playground - Flight Delay Prediction. This repository also contains the .dbc and .py versions of the Databricks notebook.