Exploratory Data Analysis with Python - ProHealth REU Summer 2022

Abstract

You've collected some data about the thing you're interested in. What do you do with it? How can you form further research questions? Exploratory data analysis (EDA) is a part of a data science or statistics workflow that typically involves summarizing, visualizing, and doing preliminary modeling. Here we'll work our way up to EDA in Python, starting with a quick introduction to the language, then working on some skills for "Scientific Programming."

Introduction

Notebooks are available in this repository:

Notebook	Colab Link	View on GitHub
Intro Python		`00-intro-python.ipynb`
Scientific Programming		`01-scientific-programming.ipynb`
Data Mining		`01-scientific-programming.ipynb`

We'll work with Python and Jupyter as a literate programming environment.

Instead of thinking about programs as scripts:

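import numpy as np

class LinearModel:
    """Ordinary least squares with an intercept term."""

    def __init__(self):
        self.parameters = None

    def fit(self, X, y):
        # Prepend a single column of ones so the model learns an intercept.
        X = np.hstack([np.ones((X.shape[0], 1)), X])
        # Normal equations: (X^T X)^{-1} X^T y
        self.parameters = np.linalg.inv(X.T @ X) @ (X.T @ y)

    def predict(self, X):
        X = np.hstack([np.ones((X.shape[0], 1)), X])
        return X @ self.parameters

if __name__ == "__main__":

    lm = LinearModel()
    print(lm.parameters)  # None until fit() is called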

We'll build up a program as a series of steps, executed in "cells."
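
As a rough illustration of the cell-by-cell style, the same kind of linear fit might be built up like this, with each commented step standing in for one notebook cell. This is only a sketch: the synthetic data and the variable names are made up for the example, and `np.linalg.lstsq` is swapped in for the explicit matrix inverse.

```python
import numpy as np

# Cell 1: make a small synthetic dataset (y = 2 + 3x, no noise).
X = np.arange(5, dtype=float).reshape(-1, 1)
y = 2.0 + 3.0 * X[:, 0]

# Cell 2: add a column of ones so the fit includes an intercept.
X_design = np.hstack([np.ones((X.shape[0], 1)), X])

# Cell 3: solve the least-squares problem.
parameters, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Cell 4: inspect the result; it should be close to [2, 3]
# up to floating-point error.
print(parameters)

# Cell 5: use the fitted parameters to make predictions.
predictions = X_design @ parameters
```

Each step leaves its variables in memory, so you can look at intermediate results (for example `X_design`) before moving on to the next cell.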

Libraries

Not all of these are covered here, but these are some of the common libraries for exploring data:

numpy: "The fundamental package for scientific computing with Python"
pandas: "a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language"
matplotlib: "a comprehensive library for creating static, animated, and interactive visualizations in Python"
seaborn: "A Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics"
scikit-learn: "Machine learning in Python. Simple and efficient tools for predictive data analysis."
statsmodels: "Classes and functions for fitting statistical models, running tests, and exploration"

Some "Gotchas"

In Jupyter, programs are broken into "cells." Cells may be executed in any order, and the order of execution can change the result. As a best practice: if you work in a notebook, try to keep the cells in a logical order.
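
To see why execution order matters, here is a small simulation in plain Python. The three "cells" and the `run_cells` helper are hypothetical stand-ins for notebook cells that share one variable; they are not part of Jupyter itself.

```python
def run_cells(order):
    """Run hypothetical notebook cells in the given order on shared state."""
    state = {}

    def cell_1():
        state["total"] = 10

    def cell_2():
        state["total"] = state["total"] * 2

    def cell_3():
        state["total"] = state["total"] + 1

    cells = {1: cell_1, 2: cell_2, 3: cell_3}
    for number in order:
        cells[number]()
    return state["total"]


top_to_bottom = run_cells([1, 2, 3])   # (10 * 2) + 1 = 21
out_of_order = run_cells([1, 3, 2])    # (10 + 1) * 2 = 22
rerun_cell = run_cells([1, 2, 2, 3])   # (10 * 2 * 2) + 1 = 41

print(top_to_bottom, out_of_order, rerun_cell)
```

The same three cells give three different answers depending on the order and on how many times each one is run, which is why a notebook that only works in some remembered order is hard to reproduce.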

Data

I pulled a copy of the Titanic dataset (data/titanic.csv) from Chris Piech's CS109 course.
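
A first look at a file like this might be sketched as follows. The `first_look` helper is a hypothetical name introduced for this example, and the code makes no assumptions about the column names in the file; it only runs the summary if `data/titanic.csv` exists relative to the working directory.

```python
from pathlib import Path

import pandas as pd


def first_look(csv_path):
    """Load a CSV and collect a few quick summaries (hypothetical helper)."""
    df = pd.read_csv(csv_path)
    return {
        "shape": df.shape,            # (rows, columns)
        "missing": df.isna().sum(),   # count of missing values per column
        "summary": df.describe(),     # basic statistics for numeric columns
    }


csv_path = Path("data/titanic.csv")  # path as given in the text above
if csv_path.exists():
    look = first_look(csv_path)
    print(look["shape"])
    print(look["missing"])
    print(look["summary"])
```

Checking the shape, the missing values, and the basic statistics before plotting or modeling is a common way to start forming questions about a dataset.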
