Please don't read through these solutions until you have attempted the labs yourself. After that, feel free to borrow ideas from here or suggest further optimisations.
Big Data and ML labs for BerkeleyX
These are the labs I completed as part of the big data and ML courses offered by Berkeley. The labs include:
- Word count on the complete works of William Shakespeare. Text file: http://www.gutenberg.org/ebooks/100
- Apache server log analysis on a subset of NASA-HTTP web server log. Dataset: http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html
- Text analysis and entity resolution on Google and Amazon datasets. Dataset: https://code.google.com/p/metric-learning/
- Building a movie recommendation system for myself. Dataset: http://grouplens.org/datasets/movielens/
- Working with NumPy and Python lambda expressions
- Predicting the release year of songs based on certain attributes. Dataset: http://labrosa.ee.columbia.edu/millionsong/
- Creating a click-through rate (CTR) prediction pipeline. Dataset used in the following challenge: https://www.kaggle.com/c/criteo-display-ad-challenge
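
To give a flavour of the first lab in the list, here is a minimal word count sketch in plain Python. It is not the lab solution: the function name `word_count`, the tokenising regex, and the sample sentence are assumptions made up for illustration, and the real lab works on the full Shakespeare text rather than a single line.

```python
import re
from collections import Counter


def word_count(text, top_n=5):
    """Return the top_n most common words in text as (word, count) pairs.

    Hypothetical helper for illustration only: lowercases the text and
    treats runs of letters, digits and apostrophes as words.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(words).most_common(top_n)


if __name__ == "__main__":
    sample = "To be, or not to be: that is the question."
    for word, count in word_count(sample, top_n=3):
        print(word, count)
```

To try it on the full text, read the downloaded file into a string and pass it to `word_count` in place of `sample`.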