CAP5768 Introduction to Data Science Fall 2019
Credits:
- Professor: Dr. Oge Marques
- Textbooks: Python Data Science Handbook and Think Stats 2
- DataCamp courses: Statistical Thinking in Python, parts 1 and 2
Quick "get started" guide:
- Clone this repository
- `cd` to the repository's directory
- Optional: create a Python virtual environment with `python3 -m venv env`, then activate it with `source env/bin/activate` (Windows: `env\Scripts\activate.bat`)
- Update pip and install the dependencies: `python -m pip install --upgrade pip`, then `pip install -r requirements.txt`
- Start Jupyter with `jupyter lab`
- Navigate to the assignments and open the notebook
If you found this repository useful, you may also want to check out these repositories:
- IEEE ICMLA 2019 data science tutorial: a structured introduction to data science, with smaller notebooks, each for a specific topic.
- How to write well-structured, understandable, reliable, flexible Jupyter notebooks.
Assignment 1 is a warm-up exercise to get used to NumPy and Pandas in a Jupyter environment.
Covered in this assignment:
- Get a Jupyter Lab environment up and running (see this wiki page)
- Learn and apply NumPy and Pandas (see notes on this wiki page)
- Read data from .csv files
- Use Pandas `DataFrame` and `Series` filter and aggregation functions
- Find correlations analytically with the Pearson correlation coefficient and aggregated statistics
- Find correlations with graphs (Seaborn `regplot`)
- Formulate and verify hypotheses with analytical (e.g. aggregation by type) and graphical support (e.g. box plots)
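The correlation steps above can be sketched roughly as follows (a minimal sketch with made-up data; the column names are my own, not from the assignment):

```python
import numpy as np
import pandas as pd

# Made-up data: two correlated columns
rng = np.random.default_rng(42)
x = rng.normal(size=100)
df = pd.DataFrame({"x": x, "y": 2 * x + rng.normal(scale=0.5, size=100)})

# Analytical correlation: Pearson coefficient via Series.corr()
r = df["x"].corr(df["y"])  # Pearson is the default method
print(round(r, 2))

# Graphical correlation (requires seaborn and matplotlib):
# import seaborn as sns
# sns.regplot(x="x", y="y", data=df)
```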
Assignment 2 is about exploring datasets (manipulate, summarize, visualize).
Covered in this assignment:
- Combine multiple datasets into one, using common fields across the datasets
- Summarize, filter and sort data with `pivot_table`, `groupby`, `query`, `eval` and other functions
- Visualize datasets with Matplotlib and `DataFrame.plot`
- More formulation and verification of hypotheses with analytical (e.g. aggregation by type) and graphical support
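A quick sketch of the summarizing functions above, on a toy dataset invented for illustration:

```python
import pandas as pd

# Toy sales data (hypothetical; column names are my own)
df = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "product": ["a", "b", "a", "b"],
    "sales": [10, 20, 30, 40],
})

# groupby: total sales per region
totals = df.groupby("region")["sales"].sum()

# pivot_table: regions as rows, products as columns
table = df.pivot_table(values="sales", index="region", columns="product")

# query: filter rows with an expression string
big = df.query("sales > 15")

print(totals["east"], table.loc["west", "a"], len(big))  # 30 30.0 3
```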
Assignment 3 is where we switch from analytics to statistics, following the first chapters of Think Stats 2.
Covered in this assignment:
- Selecting the number of bins for histograms ("binning bias")
- Beyond histograms: swarm plots and box plots
- Cumulative distribution function (CDF)
- Correlation with Pearson and Spearman functions
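An empirical CDF can be computed in a few lines; this is a sketch in the spirit of the DataCamp course (the helper name is my own):

```python
import numpy as np

def ecdf(data):
    """Empirical CDF: sorted values and their cumulative fractions."""
    x = np.sort(data)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

sample = np.array([3, 1, 4, 1, 5])
x, y = ecdf(sample)
print(x)  # [1 1 3 4 5]
print(y)  # [0.2 0.4 0.6 0.8 1. ]
# Plotting x against y (e.g. with matplotlib) gives the CDF staircase.
```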
Assignment 4 explores modeling: how to identify the type of a distribution and its parameters, validate the distribution against a theoretical model, then use the model to answer questions about the distribution. Part 1 of DataCamp's Statistical Thinking in Python was a helpful resource.
Covered in this assignment:
- PDF and CDF of exponential distributions
- Find the type and parameters of an empirical distribution
- Use the parameters to simulate the distribution and answer questions about it
- Calculate moments and skewness
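The fit-then-simulate idea can be sketched like this, assuming the data is exponential (the data here is synthetic, generated only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "empirical" data, assumed to be exponential
data = rng.exponential(scale=2.0, size=10_000)

# For an exponential distribution, the sample mean estimates
# the scale parameter (1 / lambda)
scale_hat = data.mean()

# Simulate from the fitted model to answer a question about the
# distribution, e.g. P(X > 5); the theoretical value is exp(-5 / scale)
sim = rng.exponential(scale=scale_hat, size=100_000)
p_gt_5 = (sim > 5).mean()
print(round(scale_hat, 1), round(p_gt_5, 2))
```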
Assignment 5 covers hypothesis testing with simulations and p-value calculation. Cassie Kozyrkov's Statistics for people in a hurry was very helpful in understanding what we are attempting here, especially the meaning of the null hypothesis.
Covered in this assignment:
- Permute the empirical data to run experiments (using `numpy.random.permutation()`)
- Decide which pieces of the dataset we need to permute (all of them, or only one?)
- Calculate the p-value from the experiments
- Interpret the p-value
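A minimal permutation-test sketch, with two made-up samples (the null hypothesis being that both come from the same distribution):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up samples; under the null hypothesis the group labels are arbitrary
a = rng.normal(loc=0.0, size=50)
b = rng.normal(loc=0.5, size=50)
observed = b.mean() - a.mean()

# Permute the pooled data many times and recompute the test statistic
pooled = np.concatenate([a, b])
n_experiments = 10_000
count = 0
for _ in range(n_experiments):
    perm = rng.permutation(pooled)
    diff = perm[len(a):].mean() - perm[:len(a)].mean()
    if diff >= observed:
        count += 1

# p-value: fraction of experiments at least as extreme as what we observed
p_value = count / n_experiments
print(p_value)
```

A small p-value means a difference this large rarely arises by relabeling alone, which is evidence against the null hypothesis.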
Assignment 6 covers regression analysis: how to add polynomial features, perform linear regression on the enhanced dataset, evaluate results with the R2 score, and regularize with Ridge and Lasso.
Covered in this assignment:
- Perform linear regression with NumPy `polyfit()`
- Add features to improve fitting with `PolynomialFeatures()`
- Perform linear regression with scikit-learn `LinearRegression()`
- Perform all steps together with a pipeline
- Regularize with Ridge and Lasso to prevent overfitting
- Use `RidgeCV()` and `LassoCV()` for hyperparameter search
- Evaluate the linear regression results with the R2 score
- Choose an optimal polynomial degree by comparing R2 scores
- Why not to trust only summary statistics (Anscombe's quartet)
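The pipeline approach above can be sketched as follows, on synthetic quadratic data (invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Synthetic quadratic data with a little noise
X = rng.uniform(-3, 3, size=(200, 1))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)

# Pipeline: add polynomial features, then fit a linear model
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
r2 = r2_score(y, model.predict(X))
print(round(r2, 2))

# Regularized variant: swap in Ridge to penalize large coefficients
ridge = make_pipeline(PolynomialFeatures(degree=2), Ridge(alpha=1.0))
ridge.fit(X, y)
```

Repeating the fit for several degrees and comparing R2 scores (ideally on held-out data) is how the optimal degree is chosen.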
In the final project we review the concepts we learned early in the course and apply the techniques we learned later.
Covered in the final project:
- `DataFrame.describe()`, to view summary statistics at a glance; all the important values are available with one function call
- How much we get out of the box from the `ydata-profiling` package: a "mini EDA" with one line of code
- The `verbose` parameter, to follow the progress of long-running scikit-learn tasks
- The `random_state` parameter in the scikit-learn APIs, to get consistent results
- How to use Seaborn's `heatmap` for confusion matrix visualization; more specifically, the trick of zeroing out the diagonal with NumPy `fill_diagonal()` to make the classification mistakes stand out in the graph
- Use `GridSearchCV()` to find parameters for a classifier
- The power and simplicity of Naive Bayes classifiers, even for seemingly complex tasks such as digit classification; they can be used as a baseline before attempting more complex solutions
- How surprisingly well random forest classifiers perform, achieving 97% on digit classification without much work: another "try this before pulling out your neural network card" case
- The small number of components we need to explain variability (the PCA section)
- Finally getting a chance to play with OpenCV and see first-hand how easy and feature-rich it is
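The diagonal-zeroing trick for confusion matrices can be sketched like this (labels are made up; the heatmap call is commented out since it only matters visually):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for a 3-class problem
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred)

# Zero out the diagonal (the correct predictions) in place,
# so only the mistakes remain visible in the plot
np.fill_diagonal(cm, 0)
print(cm)

# Plot (requires seaborn):
# import seaborn as sns
# sns.heatmap(cm, annot=True)
```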