- Practice writing code for analysis that is reproducible
Use of git, Jupyter, testing and role of automation
- Achieve fluency in Python
Write idiomatic Python 3 code. There will be ample programming exercises for you to develop this skill.
- Become familiar with the most useful Python packages for solving analysis problems
Such packages include numpy, scipy, matplotlib, pandas, scikit-learn, statsmodels, pymc3, pystan, pyspark and others
- Develop intuition for concepts (e.g. geometry) underlying statistical algorithms
How do optimization routines, EM and MCMC actually do their magic?
- Learn how to develop statistical algorithms in Python
Final class project requires you to develop, optimize, test and apply a statistical algorithm from the research literature
- Profile and optimize code to make good use of available resources (e.g. parallel environments)
From interpreted to compiled code (Python to C), multi-core parallelism, GPU programming and distributed computing for big data
- Cliburn Chan cliburn.chan@duke.edu
- Janice McCarthy janice.mccarthy@duke.edu
- Christine Chai (TA) christine.chai@duke.edu
- Yuhao Liang (TA) yuhao.liang@stat.duke.edu
- Cliburn Chan: Thursday 6pm to 7pm Old Chem 116
- Christine Chai (TA): Monday 1pm to 3pm Old Chem 211A
- Yuhao Liang (TA): Tuesday 7pm to 9pm Old Chem 211A
Please bring a laptop for each class as you will often be expected to type along. You are expected to provide your own laptop.
- Homework assignments 50%
There will be up to 10 homework assignments, each of equal weight. These will typically require significant programming effort.
- Mid-term exams 25%
This will be an in-class series of programming challenges. If you have been working hard on homework assignments these challenges should be well within your ability.
- Final project 25%
For the final project, you will implement, test, optimize and apply a statistical algorithm from a research paper. You will submit a GitHub repository containing all documentation and code created. There will also be a class presentation of the project.
- A = 90 - 100
- B = 70 - 89
- C = 50 - 69
- D = Below 50
Fractional final scores will be rounded UP.
Homework will be assigned on Thursdays and due the following Wednesday before 4 PM. Late homework will not be accepted (automatic 0) as solutions will generally be discussed during the Wednesday lab session.
Instructions for how to submit homework will be provided shortly.
All notebooks, data sets and homework assignments will be posted to the GitHub repository at https://github.com/cliburn/sta-663-2016
A searchable web-accessible version of the notebooks is at http://people.duke.edu/~ccc14/sta-663-2016/
Please follow the Duke honor code. All work submitted should be from your individual effort unless given explicit instructions otherwise.
- Using Jupyter in Docker
- Setting up account on AWS (Amazon Web Services) cloud computing platform
- Using git: Cloning, pulling, pushing
- Jupyter: Literate and multi-language programming
- Python: Data structures and control flow
- Log into Duke VM
- Create AWS account and VM
- Install packages with conda
- Set up virtual environment with conda
- Learn to use Jupyter notebook for markdown and
- Using R in Jupyter
- Use Python as an interactive calculator
- Python: Functions
- Python: Text
- Python: I/O
string, re, itertools, functools, requests
- Write custom functions in Python
- Functional programming building blocks - map, filter, reduce
- Load and save data from files and URLs
- Basic string data munging
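As a minimal sketch of the functional building blocks named above, the three can be chained on a small list. The data and variable names are illustrative only, not course material.

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5, 6]

# map: apply a function to every element
squares = list(map(lambda x: x * x, numbers))

# filter: keep only the elements that satisfy a predicate
even_squares = list(filter(lambda x: x % 2 == 0, squares))

# reduce: fold a sequence into a single value, starting from 0
total = reduce(lambda acc, x: acc + x, even_squares, 0)

print(squares)       # [1, 4, 9, 16, 25, 36]
print(even_squares)  # [4, 16, 36]
print(total)         # 56
```

In idiomatic Python 3 the first two steps are often written as list comprehensions; the explicit calls are kept here to show each block on its own.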
- Python: Numbers
numpy, scipy
- Create and manipulate vectors and matrices in Python
- Python: Data
- Python: Databases
- Python: Graphics
- SQL, sqlite3, matplotlib, seaborn
- Getting data into DataFrames
- Selection of data in DataFrames
- Data summaries and cleaning
- Split-apply-combine
- Basic use of SQL to query relational databases
- Creating and customizing plots in Python
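A small sketch of querying a relational database with the standard-library sqlite3 module, using an in-memory database. The table, column names and rows are made up for illustration. The GROUP BY query is the SQL counterpart of split-apply-combine.

```python
import sqlite3

# An in-memory database avoids touching the file system
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (student TEXT, homework INTEGER, score REAL)")

rows = [
    ("ann", 1, 90.0),
    ("ann", 2, 80.0),
    ("bob", 1, 70.0),
    ("bob", 2, 100.0),
]
# Parameterized inserts: values are passed separately from the SQL text
conn.executemany("INSERT INTO scores VALUES (?, ?, ?)", rows)

# Split by student, apply AVG, combine into one row per student
result = conn.execute(
    "SELECT student, AVG(score) FROM scores GROUP BY student ORDER BY student"
).fetchall()
conn.close()

print(result)  # [('ann', 85.0), ('bob', 85.0)]
```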
- Preprocessing
- Dimension reduction
- Clustering
- Supervised learning
- Probabilistic (generative) methods
- Analysis pipelines
- Validation
- Exposure to machine learning examples before we dive into the underlying algorithmic ideas
- Ability to run standard analysis on small to moderate data sets
- Computer arithmetic
- Linear algebra 1
numpy.linalg and scipy.linalg
- Appreciate what floating point numbers are
- See catastrophic cancellation
- Understand basic concepts of linear algebra
- Use the linalg library to perform linear algebra routines
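To make the catastrophic cancellation objective concrete, here is one possible illustration (the example function is an assumption, not taken from the course notes). The quantity (1 - cos x) / x^2 tends to 0.5 as x goes to 0, but the direct formula subtracts two nearly equal floating point numbers and loses all significant digits.

```python
import math

def naive(x):
    # Subtracts two nearly equal numbers: the leading digits cancel
    return (1.0 - math.cos(x)) / (x * x)

def stable(x):
    # Same quantity via the identity 1 - cos(x) = 2 * sin(x / 2) ** 2,
    # which avoids the subtraction altogether
    s = math.sin(x / 2.0)
    return 2.0 * s * s / (x * x)

x = 1e-8
naive_value = naive(x)
stable_value = stable(x)

# The naive result is far from the true limit of 0.5; the stable one is not
print(naive_value, stable_value)
```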
- Linear algebra 2
scipy.blas and scipy.lapack
- Understand matrix decomposition algorithms
- Use linalg to solve linear algebra problems
- Theory: PCA, SVD and LSA
scikit-learn and gensim
- Understand PCA and related algorithms
- Apply PCA for dimension reduction in topic modeling
- Theory: Root finding and optimization
numpy and scipy.optimize
- Statistical problems as maximizing log likelihood
- Understand Newton method in 1D
- Understand relationship between root-finding and optimization
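A minimal sketch of Newton's method in 1D, with hypothetical function names and test problems. The same routine finds a root of f and, when applied to a derivative g' (with g'' as its derivative), finds a stationary point of g. That is the link between root-finding and optimization.

```python
import math

def newton(f, fprime, x0, tol=1e-12, max_iter=50):
    """Iterate x <- x - f(x) / f'(x) until the step is smaller than tol."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Root finding: solve x**2 - 2 = 0 starting from x0 = 1
root = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)

# Optimization: minimize g(x) = exp(x) - 2x by finding the root of
# g'(x) = exp(x) - 2, using g''(x) = exp(x) as the derivative
x_min = newton(lambda x: math.exp(x) - 2.0, lambda x: math.exp(x), x0=0.0)

print(root)   # close to sqrt(2)
print(x_min)  # close to log(2)
```

This sketch has no safeguard against a zero derivative or divergence; a production routine such as those in scipy.optimize adds bracketing or line search.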
- Theory: Multivariate optimization 1
scipy.optimize and scikit-learn
- Intuition for the Jacobian
- Understand gradient descent and stochastic gradient descent
- Apply gradient descent to solve a regression problem
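As a sketch of applying gradient descent to a regression problem, the following fits a straight line by minimizing mean squared error. The data, learning rate and iteration count are assumptions chosen so that the example converges. It is written in plain Python to keep the update rule visible.

```python
# Noise-free data from y = 2x + 1, so the exact answer is known
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

def gradient_descent(xs, ys, learning_rate=0.05, n_iter=5000):
    """Minimize mean squared error of y ~ b0 + b1 * x by batch gradient descent."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        residuals = [b0 + b1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error
        grad_b0 = 2.0 / n * sum(residuals)
        grad_b1 = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        # Step against the gradient
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1

intercept, slope = gradient_descent(xs, ys)
print(intercept, slope)  # close to 1 and 2
```

Stochastic gradient descent would replace the sums over all points with the gradient at one randomly chosen point (or a small batch) per step.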
- Theory: Multivariate optimization 2
statsmodels
- Intuition for the Hessian, Fisher and observed information
- Understand Newton and conjugate gradient methods
- Understand how Newton method is used in IRLS
- Theory: Expectation-Maximization 1
numpy
- Understand EM and data augmentations
- Apply EM to simple Bernoulli/Binomial models
- Theory: Expectation-Maximization 2
pymix
- Deeper understanding of EM
- Applying to mixture of Gaussians
- EM for the MAP
- Probability and random number generation
numpy.random and scipy.stats
- Intuition for how random number generators work
- The CDF and inverse transform method for generating random numbers
- Using discrete and continuous distributions
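A sketch of the inverse transform method for the exponential distribution, chosen here as an illustration because its CDF inverts in closed form: F(x) = 1 - exp(-rate * x), so F^{-1}(u) = -log(1 - u) / rate. The function name and seed are hypothetical.

```python
import math
import random

def sample_exponential(rate, n, seed=0):
    """Draw n exponential variates by pushing uniforms through the inverse CDF."""
    rng = random.Random(seed)
    # rng.random() lies in [0, 1), so 1 - u lies in (0, 1] and the log is defined
    return [-math.log(1.0 - rng.random()) / rate for _ in range(n)]

samples = sample_exponential(rate=2.0, n=100_000)
sample_mean = sum(samples) / len(samples)

# The theoretical mean is 1 / rate = 0.5
print(sample_mean)
```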
- Simulation and resampling
- Monte Carlo methods
numpy.random and scipy.stats
- Use of simulations for point and interval estimates
- Code simple machine learning and cross-validation example
- Using Monte Carlo methods for estimating integrals
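A minimal sketch of Monte Carlo integration, using an integrand picked for illustration: the integral of exp(-x^2) over [0, 1]. The integral is estimated as the interval length times the average of the integrand at uniform random points, with a standard error from the sample variance.

```python
import math
import random

def mc_integral(f, a, b, n, seed=0):
    """Estimate the integral of f over [a, b] and its standard error."""
    rng = random.Random(seed)
    values = [f(rng.uniform(a, b)) for _ in range(n)]
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    estimate = (b - a) * mean
    std_error = (b - a) * math.sqrt(var / n)
    return estimate, std_error

estimate, std_error = mc_integral(lambda x: math.exp(-x * x), 0.0, 1.0, 100_000)

# Closed form for comparison: sqrt(pi) / 2 * erf(1)
exact = math.sqrt(math.pi) / 2.0 * math.erf(1.0)
print(estimate, std_error, exact)
```

The standard error shrinks like 1 / sqrt(n) regardless of dimension, which is why the approach matters for high-dimensional integrals.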
- MCMC 1: Gibbs and Metropolis
numpy.random and scipy.stats
- Hand-coding of Metropolis sampler
- Hand-coding of Gibbs sampler
- Inference and posterior predictive checks
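One way a hand-coded Metropolis sampler might look is sketched below: a random-walk sampler with a symmetric Gaussian proposal, targeting a standard normal. The target, proposal scale, burn-in length and names are all assumptions for illustration.

```python
import math
import random

def metropolis(log_target, x0, n_samples, proposal_sd=1.0, seed=0):
    """Random-walk Metropolis: propose x + noise, accept with prob min(1, p_new / p_old)."""
    rng = random.Random(seed)
    x = x0
    log_p = log_target(x)
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, proposal_sd)
        log_p_new = log_target(proposal)
        # Compare on the log scale; 1 - random() lies in (0, 1] so the log is defined
        if math.log(1.0 - rng.random()) < log_p_new - log_p:
            x, log_p = proposal, log_p_new
        # A rejected proposal repeats the current state
        samples.append(x)
    return samples

# Unnormalized log density of a standard normal
chain = metropolis(lambda x: -0.5 * x * x, x0=5.0, n_samples=50_000)
kept = chain[1000:]  # discard burn-in

mean = sum(kept) / len(kept)
var = sum((s - mean) ** 2 for s in kept) / len(kept)
print(mean, var)  # roughly 0 and 1
```

Only the ratio of target densities enters the acceptance step, so the normalizing constant is never needed.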
- MCMC 2: Slice and HMC
pymc3 and pystan
- Intuition for slice and Hamiltonian samplers
- Use of MCMC packages to fit hierarchical models
- Algorithmic complexity
- Benchmarking and profiling
- Code optimization
- Using appropriate data structures
- Classic algorithmic approaches to speed up code
- Intuition for performance of algorithms and data structures
- Use of Big O notation
- Benchmarking and profiling
- Working with cython and numba
- How to go from interpreted to compiled code using cython
- Using numba to compile code - jit, njit, vectorize
- Using numba on the GPU
- Get speed-ups with compiled code
- Introduction to parallel programming
- Benefits of functional approach
- Synchronous and asynchronous programs
- Embarrassingly parallel programs
- Master-worker paradigm
- IPython.parallel, dask, and multiprocessing
- Understand common parallel idioms
- Run embarrassingly parallel programs on multiple cores
- Run shared memory programs on multiple cores
- Working with massive data sets
- Setting up Spark locally
- Setting Spark up on a cluster
- Introduction to Spark
- Working with RDDs (transforms and actions)
- Using pyspark
- Intuition for how distributed computing works
- Working with Spark Resilient Distributed Datasets (RDD)
- Minimizing data shuffles: Accumulators, broadcast values and partitioning
- Efficient storage for numeric data (dense and sparse arrays)
- Efficient storage of strings (tries)
- Distinct value sketches and probabilistic data structures
- Yes/No oracle - the Bloom filter
- Understand hardware latencies and why modern CPUs are starved
- Working with massive data sets without running out of memory
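A toy sketch of the Bloom filter mentioned above, under stated assumptions: the bit-array size, the number of hashes and the SHA-256 based hashing scheme are illustrative choices, and one byte is used per bit for readability instead of a packed bit array. A "no" answer is always correct; a "yes" answer may be a false positive.

```python
import hashlib

class BloomFilter:
    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive n_hashes positions by salting the item with the hash index
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True only if every position is set; a single unset bit proves absence
        return all(self.bits[pos] for pos in self._positions(item))

bloom = BloomFilter()
for word in ["apple", "banana"]:
    bloom.add(word)

print("apple" in bloom)   # True: added items are never reported missing
print("banana" in bloom)  # True
```

Memory use is fixed by n_bits regardless of how many items are added; the cost is a false positive rate that rises as the array fills.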
- Spark Streaming
- Spark SQL, DataFrames and DataSets
- NoSQL and big data formats and databases
- Stateless and stateful processing of streaming data
- Using Spark SQL and its efficient data structures
- Using Spark MLLib with examples
- Using Spark GraphX with examples
- Machine learning with MLLib
- Graph algorithms with GraphX
- Reproducible analysis with
  - version control (git)
  - virtual environments (conda)
  - literate programming (Jupyter)
  - testing (doctest, unittest, hypothesis)
- Review of homework 1
- Data science with Python
- Review of homework 2
- Numerical recipes with Python
- Review of homework 3
- Topic models
- Review of homework 4
- Symbolic algebra with sympy and theano
- Review of homework 5
- Coding challenges
- One-armed bandits
- Review of homework 6
- Optimizing distance matrix calculations
- Review of homework 7
- Dynamic programming
- Review of homework 8
- Introduction to C Part 1
- Review of homework 9
- Introduction to C Part 2
- Review of homework 10
- Final project presentations
- Gaussian processes
- Latent Dirichlet Allocation
- Hierarchical Dirichlet process
- Computer vision with cv2
- Machine learning with scikit-learn
- Packaging and distributing Python applications
- CUDA 1
numba
- GPU hardware concepts
- Understand memory hierarchy
- Grids, blocks and threads
- CUDA kernels
- Using numba for easy CUDA
- First CUDA program
- CUDA 2
numba
- Code matrix multiplication routines without shared memory
- Code matrix multiplication routines with shared memory
- CUDA 3
numba
- Coding a Gaussian mixture model with CUDA kernels