Course designer and main instructor: Szilárd Pafka
Tools used by data scientists in practice for data acquisition, transformation and analysis, data visualization, machine learning, reproducible data analysis, collaboration and model deployment. Advanced R packages, analytical databases, high-performance machine learning libraries, big data tools.
404 – Statistical Computing and Programming
405 – Data Management
Despite the new name and the recent hype, data science is hardly new: it has solid foundations in statistics and computing technology that go back several decades. The 405 – Data Management course lays the foundations necessary for most data science tasks (data acquisition, exploratory data analysis, data cleaning and transformation, databases, data visualization and data mining), while the 404 – Statistical Computing and Programming course prepares students to use R, the most widely used tool for data science. This proposed course will build on that knowledge, discuss more advanced topics and present more of the tools (R packages, scalable machine learning libraries, high-performance analytical databases, big data technologies) that data scientists use in daily practice. The course will also discuss software engineering tools and techniques that are important when working on data science projects or when the results of those projects (models, data visualization dashboards etc.) are deployed in production. The instructor is a strong proponent of balancing theory and practice and, in the context of data science, of balancing statistics and computing technology. Because this course is designed to complement the existing courses in the MAS program, it may nevertheless appear overly tilted towards software systems. It is therefore important to restate that good statistical and theoretical foundations (e.g. cognitive science for data visualization, or a good understanding of machine learning algorithms) are also crucial when conducting data science in practice.
Week 1 [4/5]: Overview of data science. The elements of a data science project. Overview of tools (R/Python, databases, machine learning libraries, big data tools, workflow/reproducibility etc.).
Week 2 [4/12]: The Unix toolbox for manipulating files/text and automating tasks. Cloud computing for scaling up data science.
Week 3 [4/19]: Tools for reproducible research/productive data analysis and collaboration (R Markdown, Jupyter notebooks, Git/GitHub).
Week 4 [4/26]: Tools for data visualization: ggplot2, shiny (interactive web applications with R) / shiny dashboards
Week 5 [5/3]: Foundations for supervised learning (classification/regression): basic algorithms, overfitting, train and test sets, cross-validation, bias-variance tradeoff, regularization, ROC curve for binary classification (various R packages)
Week 6 [5/10]: Tools for supervised learning 1 (GLM, Lasso, random forest, gradient boosted machines) (R packages, Vowpal Wabbit, xgboost, H2O)
Week 7 [5/17]: Analytical databases (columnar/MPP relational databases), SQL. NoSQL databases (key-value stores, document databases). “Big data” technologies (Hadoop, HDFS, MapReduce, Hive, Impala, Spark, EMR etc.)
Week 8 [5/24]: Tools for supervised learning 2 (support vector machines, neural networks, deep learning, ensembles) (R packages, H2O, libraries for deep learning on GPUs)
Week 9 [5/31]: Tools for unsupervised learning (K-means clustering, hierarchical clustering) (R packages)
Week 10 [6/7]: Course discussions/Q&A session. Summary of the course, conclusions, final thoughts etc.
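As a rough illustration of the Week 5 topics (train and test sets, cross-validation, ROC curve for binary classification), the following is a minimal sketch and not course material. It is written in plain Python using only the standard library, although the course itself works mainly in R. The data is synthetic, and every name in it (`k_fold_indices`, `auc`, `make_data`, `fit_direction`, `cross_validated_auc`) is hypothetical, chosen for this example only.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def auc(labels, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs in which the positive example gets the
    higher score; ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def make_data(n, seed=0):
    """Synthetic (made-up) data: x ~ N(1, 1) for class 1, N(0, 1) for class 0."""
    rng = random.Random(seed)
    ys = [rng.randint(0, 1) for _ in range(n)]
    xs = [rng.gauss(float(y), 1.0) for y in ys]
    return xs, ys


def fit_direction(xs, ys):
    """A deliberately trivial 'model': compare the class means on the
    training data and return +1.0 or -1.0 as the scoring direction."""
    n1 = sum(ys)
    n0 = len(ys) - n1
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / max(1, n1)
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / max(1, n0)
    return 1.0 if mean1 >= mean0 else -1.0


def cross_validated_auc(xs, ys, k=5, seed=0):
    """Fit on k-1 folds, score the held-out fold, return the k test AUCs."""
    folds = k_fold_indices(len(xs), k, seed)
    results = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        sign = fit_direction([xs[j] for j in train_idx],
                             [ys[j] for j in train_idx])
        results.append(auc([ys[j] for j in test_idx],
                           [sign * xs[j] for j in test_idx]))
    return results


if __name__ == "__main__":
    xs, ys = make_data(500)
    aucs = cross_validated_auc(xs, ys)
    print("per-fold AUC:", [round(a, 3) for a in aucs])
    print("mean AUC:", round(sum(aucs) / len(aucs), 3))
```

The point of the sketch is the separation of concerns: the model only ever sees the training folds, and the AUC is always computed on data that was held out from fitting. In practice one would use an established library rather than hand-rolled helpers like these.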
Szilárd Pafka
Eduardo Ariño de la Rubia
Yasmin Lucero
TA:
Medha Uppala
Class announcements and student Q&A will be handled via GitHub issues.
Class Participation 10%
Homework (4 assignments) 60%
Final Exam 30%
Sample exam questions here
(Click on link to open/download paper/free book PDF)
David Donoho: 50 Years of Data Science
Sean Kandel, Andreas Paepcke, Joseph M. Hellerstein, and Jeffrey Heer: Enterprise Data Analysis and Visualization: An Interview Study
Rexer Analytics: 2015 Data Science Survey (Summary Report)
Leo Breiman: Statistical Modeling: The Two Cultures
Rich Caruana, Alexandru Niculescu-Mizil: An Empirical Comparison of Supervised Learning Algorithms
John M. Chambers: Software for Data Analysis: Programming with R, Springer, 2008
Hadley Wickham: Advanced R, Chapman & Hall/CRC, 2015
W.N. Venables, B.D. Ripley: Modern Applied Statistics with S, Springer, 4th ed., 2002
William S. Cleveland: The Elements of Graphing Data, Hobart Press, 1994
William S. Cleveland: Visualizing Data, Hobart Press, 1993
Edward R. Tufte: The Visual Display of Quantitative Information, Graphics Press, 2nd ed., 2001
Stephen Few: Show Me the Numbers: Designing Tables and Graphs to Enlighten, Analytics Press, 2nd ed., 2012
Hadley Wickham: ggplot2: Elegant Graphics for Data Analysis, Springer, 2009
Dorian Pyle: Data Preparation for Data Mining, Morgan Kaufmann, 1999
Jiawei Han, Micheline Kamber: Data Mining: Concepts and Techniques, Morgan Kaufmann, 2nd ed., 2005
Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani: An Introduction to Statistical Learning with Applications in R, Springer, 2013
Trevor Hastie, Robert Tibshirani, Jerome Friedman: The Elements of Statistical Learning, Springer, 2nd ed., 2009
Eric Redmond and Jim R. Wilson: Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement, The Pragmatic Bookshelf, 2012