CAP5768 Introduction to Data Science Fall 2019
Credits:
- Professor: Dr. Oge Marques
- Textbooks: Python Data Science Handbook and Think Stats 2
- DataCamp courses: Statistical Thinking in Python, parts 1 and 2
Quick "get started" guide:
- Clone this repository
- `cd` to the repository's directory
- Optional: create a Python virtual environment with `python3 -m venv env`, then activate it with `source env/bin/activate` (Windows: `env\Scripts\activate.bat`)
- Update pip and install the dependencies: `python -m pip install --upgrade pip`, then `pip install -r requirements.txt`
- Start Jupyter with `jupyter lab`
- Navigate to the assignments and open the notebook
If you found this repository useful, you may also want to check out these repositories:
- IEEE ICMLA 2019 data science tutorial: a structured introduction to data science, with smaller notebooks, each for a specific topic.
- How to write well-structured, understandable, reliable, flexible Jupyter notebooks.
Assignment 1 is a warm-up exercise to get used to NumPy and Pandas in a Jupyter environment.
Covered in this assignment:
- Get a Jupyter Lab environment up and running (see this wiki page)
- Learn and apply NumPy and Pandas (see notes on this wiki page)
- Read data from .csv files
- Use Pandas `DataFrame` and `Series` filter and aggregation functions
- Find correlations analytically with the Pearson correlation coefficient and aggregated statistics
- Find correlations with graphs (Seaborn `regplot`)
- Formulate and verify hypotheses with analytical (e.g. aggregation by type) and graphical support (e.g. box plots)
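The correlation steps above can be sketched roughly as follows (a minimal sketch with made-up data; the column names are my own, not from the assignment):

```python
import numpy as np
import pandas as pd

# Made-up data: two correlated columns
rng = np.random.default_rng(42)
x = rng.normal(size=100)
df = pd.DataFrame({"x": x, "y": 2 * x + rng.normal(scale=0.5, size=100)})

# Analytical correlation: Pearson coefficient via Series.corr()
r = df["x"].corr(df["y"])  # Pearson is the default method
print(round(r, 2))

# Graphical correlation (requires seaborn and matplotlib):
# import seaborn as sns
# sns.regplot(x="x", y="y", data=df)
```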
Assignment 2 is about exploring datasets (manipulate, summarize, visualize).
Covered in this assignment:
- Combine multiple datasets into one, using common fields across the datasets
- Summarize, filter and sort data with `pivot_table`, `groupby`, `query`, `eval` and other functions
- Visualize datasets with Matplotlib and `DataFrame.plot`
- More formulation and verification of hypotheses with analytical (e.g. aggregation by type) and graphical support
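A quick sketch of the summarizing functions above, on a toy dataset invented for illustration:

```python
import pandas as pd

# Toy sales data (hypothetical; column names are my own)
df = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "product": ["a", "b", "a", "b"],
    "sales": [10, 20, 30, 40],
})

# groupby: total sales per region
totals = df.groupby("region")["sales"].sum()

# pivot_table: regions as rows, products as columns
table = df.pivot_table(values="sales", index="region", columns="product")

# query: filter rows with an expression string
big = df.query("sales > 15")

print(totals["east"], table.loc["west", "a"], len(big))  # 30 30.0 3
```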
Assignment 3 is where we switch from analytics to statistics, following the first chapters of Think Stats 2.
Covered in this assignment:
- Selecting the number of bins for histograms ("binning bias")
- Beyond histograms: swarm plots and box plots
- Cumulative distribution function (CDF)
- Correlation with Pearson and Spearman functions
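An empirical CDF can be computed in a few lines; this is a sketch in the spirit of the DataCamp course (the helper name is my own):

```python
import numpy as np

def ecdf(data):
    """Empirical CDF: sorted values and their cumulative fractions."""
    x = np.sort(data)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

sample = np.array([3, 1, 4, 1, 5])
x, y = ecdf(sample)
print(x)  # [1 1 3 4 5]
print(y)  # [0.2 0.4 0.6 0.8 1. ]
# Plotting x against y (e.g. with matplotlib) gives the CDF staircase.
```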
Assignment 4 explores modeling: how to identify the type of a distribution and its parameters, validate the distribution against a theoretical model, then use the model to answer questions about the distribution. Part 1 of DataCamp's Statistical Thinking in Python was a helpful resource.
Covered in this assignment:
- PDF and CDF of exponential distributions
- Find the type and parameters of an empirical distribution
- Use the parameters to simulate the distribution and answer questions about it
- Calculate moments and skewness
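The fit-then-simulate idea can be sketched like this, assuming the data is exponential (the data here is synthetic, generated only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "empirical" data, assumed to be exponential
data = rng.exponential(scale=2.0, size=10_000)

# For an exponential distribution, the sample mean estimates
# the scale parameter (1 / lambda)
scale_hat = data.mean()

# Simulate from the fitted model to answer a question about the
# distribution, e.g. P(X > 5); the theoretical value is exp(-5 / scale)
sim = rng.exponential(scale=scale_hat, size=100_000)
p_gt_5 = (sim > 5).mean()
print(round(scale_hat, 1), round(p_gt_5, 2))
```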
Assignment 5 covers hypothesis testing with simulations and p-value calculation. Cassie Kozyrkov's Statistics for people in a hurry was very helpful in understanding what we are attempting here, especially the meaning of the null hypothesis.
Covered in this assignment:
- Permute the empirical data to run experiments (using `numpy.random.permutation()`)
- Decide which pieces of the dataset we need to permute (all of them, or only one?)
- Calculate the p-value from the experiments
- Interpret the p-value
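A minimal permutation-test sketch, with two made-up samples (the null hypothesis being that both come from the same distribution):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up samples; under the null hypothesis the group labels are arbitrary
a = rng.normal(loc=0.0, size=50)
b = rng.normal(loc=0.5, size=50)
observed = b.mean() - a.mean()

# Permute the pooled data many times and recompute the test statistic
pooled = np.concatenate([a, b])
n_experiments = 10_000
count = 0
for _ in range(n_experiments):
    perm = rng.permutation(pooled)
    diff = perm[len(a):].mean() - perm[:len(a)].mean()
    if diff >= observed:
        count += 1

# p-value: fraction of experiments at least as extreme as what we observed
p_value = count / n_experiments
print(p_value)
```

A small p-value means a difference this large rarely arises by relabeling alone, which is evidence against the null hypothesis.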
Assignment 6 covers regression analysis: how to add polynomial features, perform linear regression on the enhanced dataset, evaluate results with the R2 score, and regularize with Ridge and Lasso.
Covered in this assignment:
- Perform linear regression with NumPy `polyfit()`
- Add features to improve fitting with `PolynomialFeatures()`
- Perform linear regression with scikit-learn `LinearRegression()`
- Perform all steps together with a pipeline
- Regularize with Ridge and Lasso to prevent overfitting
- Use `RidgeCV()` and `LassoCV()` for hyperparameter search
- Evaluate the linear regression results with the R2 score
- Choose an optimal polynomial degree by comparing R2 scores
- Why not to trust only summary statistics (Anscombe's quartet)
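The pipeline approach above can be sketched as follows, on synthetic quadratic data (invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Synthetic quadratic data with a little noise
X = rng.uniform(-3, 3, size=(200, 1))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)

# Pipeline: add polynomial features, then fit a linear model
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
r2 = r2_score(y, model.predict(X))
print(round(r2, 2))

# Regularized variant: swap in Ridge to penalize large coefficients
ridge = make_pipeline(PolynomialFeatures(degree=2), Ridge(alpha=1.0))
ridge.fit(X, y)
```

Repeating the fit for several degrees and comparing R2 scores (ideally on held-out data) is how the optimal degree is chosen.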
In the final project we review the concepts we learned early in the course and apply the techniques we learned later.
Covered in the final project:
- `DataFrame.describe()`, to view summary statistics at a glance; all the important values are available with one function call
- How much we get out of the box from the `ydata-profiling` package: a "mini EDA" with one line of code
- The `verbose` parameter, to follow the progress of long-running scikit-learn tasks
- The `random_state` parameter in the scikit-learn APIs, to get consistent results
- How to use Seaborn's `heatmap` for confusion matrix visualization; more specifically, the trick of zeroing out the diagonal with NumPy `fill_diagonal()` to make the classification mistakes stand out in the graph
- Use `GridSearchCV()` to find parameters for a classifier
- The power and simplicity of Naive Bayes classifiers, even for seemingly complex tasks such as digit classification; they can be used as a baseline before attempting more complex solutions
- How surprisingly well random forest classifiers perform, achieving 97% on digit classification without much work: another "try this before pulling out your neural network card" case
- The small number of components we need to explain variability (the PCA section)
- Finally getting a chance to play with OpenCV and see first-hand how easy and feature-rich it is
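The diagonal-zeroing trick for confusion matrices can be sketched like this (labels are made up; the heatmap call is commented out since it only matters visually):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for a 3-class problem
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred)

# Zero out the diagonal (the correct predictions) in place,
# so only the mistakes remain visible in the plot
np.fill_diagonal(cm, 0)
print(cm)

# Plot (requires seaborn):
# import seaborn as sns
# sns.heatmap(cm, annot=True)
```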