You've collected some data about the thing you're interested in. What do you do with it? How can you form further research questions? Exploratory data analysis (EDA) is a part of a data science or statistics workflow that typically involves summarizing, visualizing, and doing preliminary modeling. Here we'll work our way up to EDA in Python, starting with a quick introduction to the language, then working on some skills for "Scientific Programming."
Notebooks are available in this repository:
Notebook | Colab Link | View on GitHub |
---|---|---|
Intro Python | 00-intro-python.ipynb | |
Scientific Programming | 01-scientific-programming.ipynb | |
Data Mining | 01-scientific-programming.ipynb | |
We'll work with Python and Jupyter as a literate programming environment.
Instead of thinking about programs as scripts:
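
    import numpy as np

    class LinearModel:
        def __init__(self):
            self.parameters = None

        def fit(self, X, y):
            # Prepend a single column of ones so the model has an intercept.
            # X is assumed to be 2-D with shape (n_samples, n_features).
            X = np.hstack([np.ones((X.shape[0], 1)), X])
            # Ordinary least squares via the normal equations.
            self.parameters = np.linalg.inv(X.T @ X) @ (X.T @ y)

        def predict(self, X):
            X = np.hstack([np.ones((X.shape[0], 1)), X])
            return X @ self.parameters

    if __name__ == "__main__":
        lm = LinearModel()
        print(lm.parameters)  # None until fit() is called
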
We'll build up a program as a series of steps, executed in "cells."
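As a minimal sketch of that cell-by-cell style, the same kind of least-squares fit can be written as a sequence of small steps. The `# %%` markers are a convention some editors use to delimit cells in a plain `.py` file; the toy data and variable names are made up for illustration, and `np.linalg.lstsq` is swapped in for the explicit matrix inverse.

```python
# %% Cell 1: imports
import numpy as np

# %% Cell 2: toy data (made up for illustration), generated as y = 2 + 3x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 + 3.0 * x

# %% Cell 3: build the design matrix with an intercept column
X = np.column_stack([np.ones_like(x), x])

# %% Cell 4: estimate the parameters by least squares
parameters, *_ = np.linalg.lstsq(X, y, rcond=None)

# %% Cell 5: make predictions and inspect the result
predictions = X @ parameters
print(parameters)
```

Each step leaves its variables available to the next, so intermediate results such as `X` can be inspected before moving on.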
Not all of these are covered here, but these are some of the common libraries for exploring data:
numpy
: "The fundamental package for scientific computing with Python"

pandas
: "a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language"

matplotlib
: "a comprehensive library for creating static, animated, and interactive visualizations in Python"

seaborn
: "A Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics"

scikit-learn
: "Machine learning in Python. Simple and efficient tools for predictive data analysis."

statsmodels
: "Classes and functions for fitting statistical models, running tests, and exploration"

- In Jupyter, programs are broken into "cells." Cells may be executed in any order, and the order of execution can change the result. As a best practice: if you work in a notebook, try to keep the cells in a logical order.
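
To see why execution order matters, here is a small sketch that stands in for a notebook. Each hypothetical "cell" is a function acting on a shared namespace, roughly the way notebook cells share one kernel. Running the same cells in a different order leaves a different value behind.

```python
def cell_1(ns):
    # Define a variable.
    ns["total"] = 10


def cell_2(ns):
    # Update the variable in place; depends on cell_1 having run.
    ns["total"] = ns["total"] * 2


def cell_3(ns):
    # Another in-place update.
    ns["total"] = ns["total"] + 1


def run(cells):
    """Run the cells in the given order against a fresh shared namespace."""
    ns = {}
    for cell in cells:
        cell(ns)
    return ns["total"]


in_order = run([cell_1, cell_2, cell_3])      # (10 * 2) + 1
out_of_order = run([cell_1, cell_3, cell_2])  # (10 + 1) * 2
print(in_order, out_of_order)
```

Restarting the kernel and running every cell from top to bottom is one simple way to check that a notebook still works in its written order.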
I pulled a copy of the Titanic dataset (data/titanic.csv) from Chris Piech's CS109 course.
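
A first look at the file might go something like the sketch below. It assumes pandas is installed and that the CSV sits at `data/titanic.csv` relative to the notebook. The `first_look` helper is a hypothetical name, and it deliberately makes no assumptions about the column names.

```python
from pathlib import Path

import pandas as pd


def first_look(df):
    """Collect a few basic facts about a DataFrame (hypothetical helper)."""
    return {
        "shape": df.shape,                      # (rows, columns)
        "dtypes": df.dtypes,                    # type of each column
        "missing": df.isna().sum(),             # missing values per column
        "summary": df.describe(include="all"),  # per-column summary statistics
    }


csv_path = Path("data/titanic.csv")  # path taken from the text above
if csv_path.exists():
    titanic = pd.read_csv(csv_path)
    print(titanic.head())
    for name, value in first_look(titanic).items():
        print(f"\n{name}:\n{value}")
```

The shape, dtypes, and missing-value counts are usually enough to suggest which columns to summarize or plot next.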