🎓 Intro to Python, scientific programming, and exploratory data analysis

iuprohealth/reu-python-eda

Exploratory Data Analysis with Python - ProHealth REU Summer 2022

Abstract

You've collected some data about the thing you're interested in. What do you do with it? How can you form further research questions? Exploratory data analysis (EDA) is a part of a data science or statistics workflow that typically involves summarizing, visualizing, and doing preliminary modeling. Here we'll work our way up to EDA in Python, starting with a quick introduction to the language, then working on some skills for "Scientific Programming."

Introduction

Notebooks are available in this repository:

Notebook Colab Link View on GitHub
Intro Python 00-intro-python.ipynb
Scientific Programming 01-scientific-programming.ipynb
Data Mining 01-scientific-programming.ipynb

We'll work with Python and Jupyter as a literate programming environment.

Instead of thinking about programs as scripts:

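import numpy as np


class LinearModel:
    """Ordinary least squares with an intercept term."""

    def __init__(self):
        self.parameters = None

    def fit(self, X, y):
        # X is expected to be 2-D with shape (n_samples, n_features).
        # Prepend a single column of ones for the intercept.
        X = np.hstack([np.ones((X.shape[0], 1)), X])
        # Normal equations: solve (X^T X) b = X^T y for b.
        self.parameters = np.linalg.solve(X.T @ X, X.T @ y)

    def predict(self, X):
        X = np.hstack([np.ones((X.shape[0], 1)), X])
        return X @ self.parameters


if __name__ == "__main__":

    lm = LinearModel()
    print(lm.parameters)  # None until fit() is called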

We'll build up a program as a series of steps, executed in "cells."
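As a rough sketch of what that can look like, here is the same linear fit written as a sequence of small steps. The `# %%` markers are a cell-delimiter convention recognized by several editors and tools; in a notebook each section would simply be its own cell. The synthetic data, the variable names, and the use of `np.linalg.lstsq` are assumptions made for illustration, not taken from the notebooks.

```python
# %% Cell 1: imports and a small synthetic dataset (made up for illustration)
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))                    # one feature, 50 samples
y = 2.0 + 3.0 * X[:, 0] + rng.normal(0, 0.1, size=50)   # intercept 2, slope 3, small noise

# %% Cell 2: add an intercept column to the feature matrix
X_design = np.hstack([np.ones((X.shape[0], 1)), X])

# %% Cell 3: fit by least squares
parameters, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# %% Cell 4: inspect the result
print(parameters)  # [intercept, slope], close to the values used to generate y
```

Each step leaves its result in a variable that can be inspected, plotted, or changed before moving on, which is the main appeal of working this way during exploration.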

Libraries

Not all of these are covered here, but these are some of the common libraries for exploring data:

  • numpy: "The fundamental package for scientific computing with Python"
  • pandas: "a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language"
  • matplotlib: "a comprehensive library for creating static, animated, and interactive visualizations in Python"
  • seaborn: "A Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics"
  • scikit-learn: "Machine learning in Python. Simple and efficient tools for predictive data analysis."
  • statsmodels: "Classes and functions for fitting statistical models, running tests, and exploration"

Some "Gotchas"

  • In Jupyter, programs are broken into "cells." Cells may be executed in any order, and the order of execution can change the result. As a best practice: if you work in a notebook, try to keep the cells in a logical order.
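To make the execution-order point concrete, here is a small simulation in plain Python. Each hypothetical "cell" is modeled as a function that mutates shared state, the way notebook cells share one kernel namespace; the function names and values are invented for illustration.

```python
def cell_a(state):
    # "Cell A": define a variable
    state["total"] = 10

def cell_b(state):
    # "Cell B": double it in place
    state["total"] = state["total"] * 2

def cell_c(state):
    # "Cell C": add one in place
    state["total"] = state["total"] + 1

def run(cells):
    """Run the given cells in order against a fresh namespace."""
    state = {}
    for cell in cells:
        cell(state)
    return state["total"]

in_order = run([cell_a, cell_b, cell_c])          # (10 * 2) + 1
reordered = run([cell_a, cell_c, cell_b])         # (10 + 1) * 2
rerun_b = run([cell_a, cell_b, cell_b, cell_c])   # ((10 * 2) * 2) + 1

print(in_order, reordered, rerun_b)  # 21 22 41
```

The same three cells give three different answers depending on order and on whether one is run twice. Restarting the kernel and running all cells from top to bottom is a simple way to check that a notebook still produces what you think it does.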

Data

I pulled a copy of the Titanic dataset (data/titanic.csv) from Chris Piech's CS109 course.
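A first look at a table like this might be sketched as follows, assuming pandas is installed. The helper `first_look` is hypothetical, and the rows and column names below are placeholders rather than values from the real file; inside the repository you would read `data/titanic.csv` with `pd.read_csv` instead.

```python
import io

import pandas as pd

def first_look(df):
    """Collect a few basic summaries of a DataFrame (hypothetical helper)."""
    return {
        "shape": df.shape,                 # (rows, columns)
        "dtypes": df.dtypes,               # column types as inferred by pandas
        "missing": df.isna().sum(),        # count of missing values per column
        "numeric_summary": df.describe(),  # count, mean, std, quartiles for numeric columns
    }

# Placeholder rows, NOT taken from the Titanic data. Replace this with
# pd.read_csv("data/titanic.csv") when running inside the repository.
placeholder_csv = io.StringIO(
    "group,value,label\n"
    "a,1.0,x\n"
    "b,,y\n"
    "a,3.0,x\n"
)
df = pd.read_csv(placeholder_csv)

summary = first_look(df)
print(summary["shape"])
print(summary["missing"])
```

Checking shape, types, and missing values before plotting or modeling tends to surface problems (unexpected text columns, gaps in a key variable) early.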
