Kaggle’s March Madness prediction competition is an accessible introduction to machine learning. If you happen to like college basketball, you’ll like that in this competition you can’t bust your bracket, since you make a prediction for every game. Plus this year there’s a big prize pool, and luck plays a big enough role that you can be a legit contender fairly easily.
In 2016, my simple process using tidyverse functions in R placed in the top 10%. I refined it a bit for 2017 and finished in the top 25%.
I’m sharing my code and process here for others to use as a starting point. My approach is similar to that of the 2014 winners, Gregory Matthews and Michael Lopez. They published a paper about the role that luck plays in this competition, putting their model in perspective. A takeaway: take my model, tweak it a bit to generate some distance from the field, and you are competitive to win!
In the Kaggle competition, you estimate how likely it is that Team A beats Team B, for each of the 2,278 possible matchups in the tournament. My guide documents a set of scripts for each step of:
- Deciding on possible input parameters
- Scraping the input data with the
rvest
package - Cleaning and joining data sources to get tidy, prediction-ready data
- Training and evaluating machine learning models on the data
- Making and submitting predictions
This code is public, please reuse it. It’s under an MIT license. Please acknowledge its role in any write-up or discussion of work that relies on it. And if you win a cash prize from Kaggle using this, congratulations! I wouldn’t turn down a thank-you gift ;)
Thanks to contributors @MHenderson and @BillPetti.
Let me know what you think, either on twitter @samfirke or compose a friendly e-mail to: