sparkml-flights-delay

Predicting the arrival delay of commercial flights using Apache Spark MLlib

Getting started • Validation process • Authors • License

Getting started

The easiest way to run this project is to clone it locally, build a fat jar with Maven, and execute the shell script found in the project's root directory.

mvn clean package
./run.sh

You can activate or deactivate the explore stage with the --explore flag (add or remove this flag inside the run.sh script).

The output should be similar to the following one:

You can also import the project into your favourite IDE, but keep in mind that the program requires one argument: the dataset to process. You can find multiple valid datasets at this link: Airline On-Time Statistics and Delay Causes.

Be aware that processing a large dataset can take a long time, since 14 models are trained with 10-fold cross-validation. This is why we included a small tuning.csv file in the raw folder. Please consider using this dataset to check that the program works properly.
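To give a rough sense of why the run is slow, the cost can be estimated as the number of candidate models times the number of folds. The helper below is only an illustrative sketch: the function name is hypothetical, and it does not count any extra refit of the winning model on the full training set, which the project may or may not perform.

```python
def cross_validation_fits(num_candidates: int, num_folds: int) -> int:
    """Number of model fits needed to cross-validate every candidate.

    Each candidate is trained once per fold, on the remaining folds.
    Extra refits of the winning model are NOT included (assumption).
    """
    return num_candidates * num_folds


# Figures quoted in this README: 14 models, 10 folds.
fits = cross_validation_fits(num_candidates=14, num_folds=10)
print(fits)  # 140
```

Each of these fits is a full training pass over roughly nine tenths of the training data, which is why a small dataset is the quickest way to verify the setup.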

Validation process

The general workflow of the program is shown in the image below:

Hyperparameter tuning and model selection are carried out using cross-validation on the training dataset. In this stage, a grid search is performed over two different models: Linear Regression and Random Forest (you can add your own by extending the CVTuningPipeline class). Finally, the test error of the best model is computed on the test set.
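The validation scheme described above can be sketched in a few lines of plain Python. This is not the project's code and does not use Spark: it is a minimal illustration that assumes synthetic data, a one-feature ridge regression as a stand-in for the real models, and hypothetical names (k_fold_indices, fit_ridge, cross_validate). In Spark MLlib the same roles are played by classes such as CrossValidator and ParamGridBuilder; whether CVTuningPipeline wraps exactly those is an assumption.

```python
import random
from statistics import mean


def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) for k contiguous folds over range(n)."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size


def fit_ridge(xs, ys, lam):
    """Closed-form 1-D ridge regression y ~ w*x + b (penalty on w only)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    w = sxy / (sxx + lam)
    return w, my - w * mx


def rmse(model, xs, ys):
    """Root mean squared error of a (w, b) model on the given points."""
    w, b = model
    return (sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5


def cross_validate(xs, ys, lam, k):
    """Average validation RMSE of one hyperparameter value over k folds."""
    scores = []
    for train, val in k_fold_indices(len(xs), k):
        model = fit_ridge([xs[i] for i in train], [ys[i] for i in train], lam)
        scores.append(rmse(model, [xs[i] for i in val], [ys[i] for i in val]))
    return mean(scores)


# Synthetic data (assumption): a noisy linear relationship.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 * x + 5.0 + rng.gauss(0, 1) for x in xs]

# 1) Hold out a test set that the tuning stage never sees.
split = int(0.8 * len(xs))
train_x, test_x = xs[:split], xs[split:]
train_y, test_y = ys[:split], ys[split:]

# 2) Grid search: score every candidate with 10-fold CV on the training set.
grid = [0.0, 1.0, 10.0, 100.0]
cv_scores = {lam: cross_validate(train_x, train_y, lam, k=10) for lam in grid}
best_lam = min(cv_scores, key=cv_scores.get)

# 3) Refit the winner on the full training set and report the test error.
best_model = fit_ridge(train_x, train_y, best_lam)
test_rmse = rmse(best_model, test_x, test_y)
print(f"best lambda: {best_lam}, test RMSE: {test_rmse:.3f}")
```

The key design point is that the test set is touched only once, after the best candidate has been chosen, so the reported test error is not biased by the hyperparameter search.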

Authors 🇪🇸 💙 🇮🇹

Fernando Díaz
Giorgio Ruffa

License

This project is licensed under the MIT License. See the LICENSE.md file for details.

