Here is a small example data set (CSV sample.zip - 40 MB unzipped, 2 MB zipped) containing 66k records/rows and 295 features/columns. The target variable is the last column (the 296th) with values/classes A, B, C, D, E. We challenge you to analyze the data and to build an initial ML model that predicts the classes.
Visualize the main characteristics of the dataset and try to highlight potentially helpful structures in the data.
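As a starting point for the exploration step, the class balance of the target column is usually worth inspecting first. The following is a minimal, dependency-free sketch; the helper names `read_target_column` and `class_distribution` are hypothetical, and the presence of a header row in the CSV is an assumption, not something stated in the task.

```python
import csv
from collections import Counter


def read_target_column(path, has_header=True):
    """Read the last column of a CSV file as a list of labels.

    Assumption: the file is comma-separated; `has_header` says whether
    the first row contains column names (unknown for the real file).
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        if has_header:
            next(reader, None)  # skip the header row
        return [row[-1] for row in reader if row]


def class_distribution(labels):
    """Return {class: (count, share)}, ordered from rarest to most common."""
    counts = Counter(labels)
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda item: item[1])
    return {cls: (n, n / total) for cls, n in ordered}


if __name__ == "__main__":
    # Toy labels only; replace with read_target_column("<path to CSV>").
    demo = ["A", "A", "B", "A", "C", "C"]
    for cls, (n, share) in class_distribution(demo).items():
        print(f"{cls}: {n} rows ({share:.1%})")
```

Listing the rarest classes first puts the underrepresented ones at the top of the output, where they are hardest to overlook.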
Train and evaluate different models, and briefly explain your choice of models and their pros and cons.
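Whichever models are compared, they need to be evaluated on the same held-out data, and with imbalanced classes a stratified split keeps every class present in both parts. The sketch below shows one possible way to do this in plain Python; the function name `stratified_split` and its parameters are hypothetical, and ML libraries generally ship an equivalent utility.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split row indices into train/test parts, class by class.

    Each class contributes roughly `test_fraction` of its rows to the
    test part, and at least one row, so rare classes are never missing
    from the evaluation. Note: a class with a single row ends up in the
    test part only.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed for reproducible splits
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


if __name__ == "__main__":
    # Toy, deliberately imbalanced labels (not the real data).
    toy = ["A"] * 80 + ["B"] * 15 + ["E"] * 5
    train, test = stratified_split(toy, test_fraction=0.2, seed=42)
    print(len(train), len(test))
```

Reusing the same seed for every candidate model keeps the comparison fair, since all models then see identical train and test rows.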
We are of course interested in the overall performance, but much more in the performance per class, especially for the underrepresented classes.
If possible, also add some critical thinking and possible next steps. But mainly explain why your results are good and what insights we can obtain from them.
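Overall accuracy can look good even when a rare class is never predicted correctly, so per-class precision, recall and F1 are the more informative numbers here. The following sketch computes them from true and predicted labels; `per_class_report` is a hypothetical helper written for illustration, and the macro-averaged F1 it returns is one common summary, not a metric prescribed by the task.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return ({class: metrics}, macro_f1) for two equal-length label lists."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    report = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        predicted = tp[cls] + fp[cls]
        actual = tp[cls] + fn[cls]
        precision = tp[cls] / predicted if predicted else 0.0
        recall = tp[cls] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[cls] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": actual,
        }

    # Macro average: every class counts equally, regardless of its size.
    macro_f1 = sum(m["f1"] for m in report.values()) / len(report)
    return report, macro_f1


if __name__ == "__main__":
    # Toy labels only, to show the gap between accuracy and recall.
    truth = ["A", "A", "A", "A", "B", "B"]
    preds = ["A", "A", "A", "A", "A", "B"]
    report, macro_f1 = per_class_report(truth, preds)
    for cls, metrics in report.items():
        print(cls, metrics)
    print("macro F1:", round(macro_f1, 3))
```

In the toy example five of six predictions are correct, yet recall for the minority class B is only 0.5, which is the kind of gap a per-class report is meant to expose.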
- well-commented and easy-to-follow code
- send us plain .py Python files (no IPython notebooks)
- work with classes and functions, show a bit of your programming skills ;-)
- PDF (max 3-4 pages) with brief steps taken, some plots and results
Our goals here are to:
- see how you approach such a problem
- get an idea of your programming skills and ML knowledge
- see how you can summarize and present results