Predicting the arrival delay time of commercial flights using Apache Spark MLlib
Getting started • Validation process • Authors • License
The easiest way to run this project is to clone it locally, build a fat JAR with Maven, and execute the shell script found in the project's root directory:
mvn clean package
./run.sh
The explore stage can be activated or deactivated with the --explore flag (add or remove this flag inside the run.sh script).
The output should be similar to the following one:
You can also import the project into your favourite IDE, but keep in mind that the program requires one argument: the dataset to process. Multiple valid datasets are available at this link: Airline On-Time Statistics and Delay Causes.
Be aware that processing a large dataset can take a long time (14 models are trained with 10-fold cross-validation, which amounts to roughly 140 model fits). For this reason, a small tuning.csv file is included in the raw folder. Please consider using this dataset to check that the program works properly.
The general workflow of the program is shown in the image below:
Hyperparameter tuning and model selection are carried out using cross-validation on the training dataset. In this stage, a grid search is performed using two different models: Linear Regression and Random Forest (you can add your own by extending the CVTuningPipeline class). Finally, the test error of the best model is obtained using the test set.
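The validation procedure described above can be illustrated with a minimal sketch. This is not the project's code: it uses plain Python instead of Spark MLlib, synthetic data instead of the flight dataset, and a single one-feature ridge regression instead of the Linear Regression and Random Forest pipelines. All function names (k_fold_indices, fit_ridge, cross_validate, grid_search) are hypothetical and chosen only for illustration. It shows the three steps: score each hyperparameter value with k-fold cross-validation on the training data, refit the best candidate on the full training set, and report its error on the held-out test set.

```python
import random


def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_ridge(xs, ys, lam):
    """Fit y = w*x + b with an L2 penalty lam on w (closed form, one feature)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / (sxx + lam)
    return w, mean_y - w * mean_x


def rmse(model, xs, ys):
    """Root mean squared error of a fitted (w, b) model."""
    w, b = model
    return (sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5


def cross_validate(xs, ys, lam, k):
    """Average validation RMSE of one hyperparameter value over k folds."""
    scores = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit_ridge(train_x, train_y, lam)
        scores.append(rmse(model, [xs[i] for i in fold], [ys[i] for i in fold]))
    return sum(scores) / k


def grid_search(xs, ys, grid, k):
    """Return the grid value with the lowest cross-validated RMSE, plus all scores."""
    cv_scores = {lam: cross_validate(xs, ys, lam, k) for lam in grid}
    return min(cv_scores, key=cv_scores.get), cv_scores


# Synthetic data (an assumption for this sketch): y = 3x + 5 plus Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3.0 * x + 5.0 + rng.gauss(0, 1) for x in xs]

# Hold out a test set that is never touched during tuning.
train_x, train_y = xs[:160], ys[:160]
test_x, test_y = xs[160:], ys[160:]

grid = [0.0, 1.0, 100.0, 10000.0]
best_lam, cv_scores = grid_search(train_x, train_y, grid, k=10)

# Refit the winning candidate on the whole training set, then evaluate once.
final_model = fit_ridge(train_x, train_y, best_lam)
test_rmse = rmse(final_model, test_x, test_y)
print("best lambda:", best_lam, "test RMSE:", test_rmse)
```

With 4 grid values and 10 folds, this sketch fits 40 models during tuning plus one final refit, which is why the cost grows quickly with the grid size and the dataset size.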
- Fernando Díaz
- Giorgio Ruffa
This project is licensed under the MIT License. See the LICENSE.md file for details.