- Course: Data Science Project 1 COMP-4447-1
- Class time: Mon, Wed 07:00 PM - 08:50 PM | Engineering & Computer Science | Room 410
- Instructor: Pooran Singh Negi, pooran.negi@du.edu webpage
- Office: 470
- Office Hours: Tue, Thu, 3:30 p.m. - 5:30 p.m. Email for 1-on-1 help.
- GTA: Mitchell Wright. GTA office hours: ECS 126, Mon 4-6 p.m., Fri 3-5 p.m.
- Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, 2nd Edition, by Wes McKinney. It is available online from the library.
- Mastering Python Regular Expressions by Félix López and Víctor Romero
- Think Stats: Exploratory Data Analysis in Python
- W3Schools for Python, HTML, and SQL
- Python Warrior
- Regular Expressions 101 (regex101)
- Debuggex
It is recommended that you consult this GitHub page often for material related to this course. You should also check your email periodically for messages. Assignments will be uploaded here and on Canvas.
The main objective of Data Science Tools 1 is to learn various tools for performing data analysis. The focus in Tools 1 is data cleanup, summarization, and visualization. It is more of a hacking skill set, but our primary focus will be the scientific Python and Linux ecosystem. We'll use Jupyter notebook/lab in class and for homework, which should make our learning interactive.
For the final project, students will work individually or in teams, applying the coursework to the data lifecycle within a particular domain. The focus will also be on best data science/software engineering practices and reproducible work.
Please select a project by January 20th, as per your preference. Groups of 2 to 3 students are allowed, but the project work must justify the team size. There will be a homework asking for details of your final project, and we'll provide feedback on its feasibility. The final project can be based on initial capstone work; please let us know if this is the case so we can go over the details.
This syllabus is subject to change at the discretion of the instructor.
- Jupyter Notebook for reproducible workflow.
- Data science and EDA.
- Git tools and workflow.
- Data science at the command prompt: Linux command line, bash, basic awk and sed.
- Data collection and ingestion (web scraping and reading datasets) + pandas.
- Data cleanup and imputation + pandas.
- Data summarization and visualization + pandas (groupby, apply, aggregate, etc.).
- Additional topics as per student demand.
- More to come.
The Linux command line and scientific Python (primarily NumPy, matplotlib, requests, seaborn, and basic pandas) will be used throughout the course.
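As a quick taste of that stack, here is a minimal sketch (assuming the packages from requirements.txt are installed; the data is synthetic):

```python
# A minimal sketch of the core stack in action; the data here is synthetic.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)                  # NumPy: vectorized numerics
df = pd.DataFrame({"x": x, "y": np.sin(x)})  # pandas: tabular data
print(df.describe())                         # quick numeric summary

df.plot(x="x", y="y", title="y = sin(x)")    # matplotlib via pandas plotting
plt.show()
```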
There will be coding/analysis homework assignments, a midterm, and a final project. We'll drop your lowest assignment grade.
There will be a final presentation of the final project. You will be required to submit a final project report in Jupyter notebook format.
Component | Weight |
---|---|
Coding homework | 50% |
Midterm, 13 Feb, in class | 15% |
Comprehensive final, 13 March (we'll use the better of your midterm and final marks) | |
Final project presentation, 10 minutes, 18 March, in class | 15% |
Final project report, due 18 March (please refer to the final report format above for submission guidelines) | 20% |
Grade range:

Grade | Score |
---|---|
A | >= 93 |
A- | >= 89 |
B+ | >= 85 |
B | >= 81 |
B- | >= 77 |
C+ | >= 73 |
C | >= 69 |
C- | >= 65 |
D+ | >= 61 |
D | >= 57 |
D- | >= 53 |
F | < 53 |
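For reference, a small Python helper showing how a numeric score maps to the letter grades above (the function is only an illustration, not part of the official grading code):

```python
# Illustrative mapping from a numeric score to a letter grade, following the cutoffs above.
def letter_grade(score):
    cutoffs = [
        (93, "A"), (89, "A-"), (85, "B+"), (81, "B"), (77, "B-"),
        (73, "C+"), (69, "C"), (65, "C-"), (61, "D+"), (57, "D"), (53, "D-"),
    ]
    for cutoff, grade in cutoffs:
        if score >= cutoff:
            return grade
    return "F"

print(letter_grade(91.5))  # A-
```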
All members of the University of Denver community are expected to uphold the values of Integrity, Respect, and Responsibility. These values embody the standards of conduct for students, faculty, staff, and administrators as members of the University community. Our institutional values are defined as:
Integrity: acting in an honest and ethical manner;
Respect: honoring differences in people, ideas, experiences, and opinions;
Responsibility: accepting ownership for one’s own behavior and conduct.
Please respect the DU Honor Code: Honor Yourself, Honor the Code.
Students with recognized disabilities will be provided reasonable accommodations, appropriate to the course, upon documentation of the disability with a Student Accommodation Form from the Disability Services Program. To receive these accommodations, you must request the specific accommodations by submitting them to the instructor in writing by the end of the first week of classes. Visit the CAMPUS LIFE & INCLUSIVE EXCELLENCE webpage for details.
Please see the registrar's calendar for academic deadlines. We'll strictly follow the deadlines.
- You can collect the dataset for your project.
- Web scraping and web APIs (for natural language processing one can use the New York Times, Twitter, etc.); see the sketch after this list.
- I am looking around for noisy datasets for practice.
- See Datasets for data cleaning practice by Rachael Tatman
- Datasets for Data Mining and Data Science
- The EU Open Data Portal
- World Bank Open Data
- The home of the U.S. Government’s open data
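If you go the web API route, a minimal sketch of pulling records into pandas with the requests library might look like this (the endpoint URL is a placeholder, not a real API):

```python
# Sketch of pulling a dataset from a web API into pandas.
# The URL is a placeholder; substitute a real endpoint from one of the portals above.
import requests
import pandas as pd

url = "https://example.com/api/records.json"   # hypothetical endpoint
response = requests.get(url, timeout=30)
response.raise_for_status()                    # fail loudly on HTTP errors

records = response.json()                      # assumes the API returns a JSON list of records
df = pd.DataFrame(records)
print(df.head())
```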
We need to know your project/dataset before we approve it as a final project.
More to come.
We want everybody to have the same experience using computational tools in Data Science Tools 1. Please follow the steps for your operating system.
Please install Windows Subsystem for Linux (WSL) on Windows 10. Follow the instructions in the post Using Windows Subsystem for Linux for Data Science by Hugo Ferreira for installing Linux. **Ignore the "install Anaconda" part.**
You can also watch this video on Windows 10 Bash & Linux Subsystem Setup to see the installation.
You can run echo $0 to check your current shell. Change to the bash shell using chsh -s /bin/bash.
Once you are at a Linux/Mac bash command prompt, please follow the instructions below.
Please follow the instructions here to install Python 3 if it is not already installed on your system. This link also covers Windows Subsystem for Linux (WSL) for Windows 10 (Creators or Anniversary Update). I am using Python 3.5.2; hopefully any version of Python 3 will work.
Run the following commands from the command prompt.
- apt-get install python3-venv (you may need to prefix this with sudo)
- Using the command line (cd command), go to the folder where you want to keep the Python files and notebooks related to this course.
- run **python3 -m venv /path/to/new/virtual/environment**
- e.g. I ran python3 -m venv dst1_env
- To activate your environment run source /path/to/new/virtual/environment/bin/activate
- e.g. from this course directory I run source dst1_env/bin/activate
- run python3 -m pip install --upgrade pip. Note that there are two dashes in the upgrade option, with no space before upgrade.
- run wget https://raw.githubusercontent.com/psnegi/data_science_tools1/master/requirements.txt
- run pip install -r requirements.txt
- run jupyter notebook or jupyter lab.
- In the browser you should see your current files.
- Click on the notebook you want to run.
- Click on the RISE slideshow extension in the notebook if you want to view the notebook as a slideshow.
To deactivate the Python virtual environment, run deactivate.
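Once the environment is activated and the requirements are installed, a quick sanity check from Python might look like this (a sketch; the exact package list in requirements.txt may differ):

```python
# Quick sanity check that the virtual environment is active and the main
# packages are importable (the exact contents of requirements.txt may differ).
import sys

print("Python executable:", sys.executable)
print("Inside a virtual environment:", sys.prefix != sys.base_prefix)

for name in ["numpy", "pandas", "matplotlib", "seaborn", "requests"]:
    try:
        module = __import__(name)
        print(name, getattr(module, "__version__", "?"), "OK")
    except ImportError:
        print(name, "is missing -- check requirements.txt")
```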
You can also go to my python for reproducible research GitHub repository and start by running the pythonBasic.ipynb notebook. I will go over the basics of Python and Jupyter notebooks.
- Try Python notebooks online without installing anything
- Runs and visualizes your Python code
- The Python Tutorial
- more to come
No late homework will be accepted.
Due date | HW no | Description and links | Solution |
---|---|---|---|
Monday 21st Jan, 11:59 p.m. | 1 | Complete the questions in this notebook | |
Friday 25th Jan, 11:59 p.m. | 2 | Complete the questions in this notebook | |
Thursday 31st Jan, 11:59 p.m. | 3 | Complete the questions in this notebook | |
Friday 8th Feb, 11:59 p.m. | 4 | Complete the question in this bash file | |
Friday 15th Feb, 11:59 p.m. | 5 | Complete the questions in this notebook | |
Friday 23rd Feb, 11:59 p.m. | 6 | Complete the questions in this notebook | |
Friday 1st March, 11:59 p.m. | 7 | Complete the question in this notebook | |
Monday 11th March, 11:59 p.m. | 8 | Complete this homework notebook | |
Date | Reading/Coding Assignments | Class activity |
---|---|---|
7 Jan | Install the Jupyter environment. Python virtual environments. | Mitchell covered the Jupyter introduction notebook and also helped with installation. Covered the Jupyter introduction and data science notebooks. |
9 Jan | Resources to learn git. We'll also go over data science. "I don't like notebooks" - Joel Grus (video provided by Laura Atkinson). | Waiting for the notebook to start via Binder every time can be time consuming. Go to the folder for this course on your computer and run git clone https://github.com/psnegi/data_science_tools1.git, then run ls; you should see a data_science_tools1 folder. Activate your virtual environment, navigate to the course directory with cd data_science_tools1, and change to the notebook directory with cd notebooks. Now run jupyter notebook; you should see all the notebooks in a browser window. Click on the notebook you want to run. To run a cell in the notebook press alt+enter or ctrl+enter. Note that whenever new content is posted, you must run git pull origin master from the data_science_tools1 directory to make sure you have the latest content. Don't worry about the git commands above; we'll start git in the next class. Please start with the git notebook. |
14 Jan | If you are using a Mac, you may need to install the Xcode Command Line Tools or install git. If you haven't set up Windows Subsystem for Linux and want to use git on Windows, see How to Install GIT client on Windows. I use Emacs, but use any editor you like for coding Python; Atom is a good choice. | Covered git for managing a local project and the git workflow in a team. |
16 Jan | Will work on git tools part 2. | Covered the workflow in a team: when to push a branch to the remote (you don't have integration set up, other team members want to look at the feature code for review, etc.), merge conflicts, and tagging. Started with "forgot to work on a feature branch". |
23 Jan | Data science at the command prompt. | Finished how to move changes to a feature branch. Note that when cleaning the master branch using a soft or mixed reset, the master branch will still contain your changes; with a hard reset the changes in master are lost (**HEAD detached** will contain the changes if required). Finished the Linux overview, basic commands, redirection, and pipes. |
28 Jan | Practice the posted notebooks (see the notebooks section). | Finished regular expressions. Using basic Linux commands and regular expressions (curl, grep, sort, uniq), found the top k words in a Gutenberg book. Finished basic awk and sed. |
30 Jan | See the notebooks in the notebooks section. | Finished positional parameters and command substitution in bash scripting; note that you need the bc command to do floating point arithmetic. NumPy library for scientific computation. In a Jupyter notebook use ? or ?? to read about a function (like np.array?), and press shift+tab to get a tooltip for function arguments (e.g. type np.ones( and press shift+tab). Started with REST APIs. Please install Chrome so that we all have the same options to click when inspecting HTTP(S) messages. |
4 Feb | Web scraping (in-class version). See the 4th Feb notebooks. | Covered REST APIs. Will cover how to create a REST API in Tools 2 using AWS API Gateway and Lambda functions. Finished scraping the Fry's Electronics website for telescopes. |
6 Feb | Pandas basics; see the notebooks section. | |
11 Feb | Data ingestion and cleaning. | Covered basic data ingestion APIs and cleanup functionality. See pd.qcut for quantile-based discretization too. |
13 Feb | In-class midterm. | |
18 Feb | Python re library and data wrangling. | |
20 Feb | Basics of NLP and normalization of text data. | |
25 Feb | Text cleanup, contractions, using WordNet for synonyms, antonyms, hypernyms, hyponyms, and edit distance. | There will be a comprehensive in-class final exam; we'll use the better of your midterm and final marks (15% weight). |
27 Feb | Extracting text and tables from PDF files. The concept of split-apply-combine. Pandas groupby. | If you had issues installing pdfminer on a Mac, it can be Java related. Install the JDK from https://www.oracle.com/technetwork/java/javase/downloads/jdk11-downloads-5066655.html and also run sudo R CMD javareconf, otherwise other packages that use Java will fail (provided by Chris Haddad). |
4 March | | Covered matplotlib theory and the hierarchical organization (tree structure) of figure components. Started seaborn. |
6 March | | Seaborn for data with categorical variables: scatter and swarm plots (concepts of hue and jitter). For big data, plotting statistical summaries: distplot, jointplot, pairplot, boxplot, and bar plots (uni/bivariate). Linear relationships using regplot. Touched on geo plots (choropleth maps) using folium. |
11 March | | Time series: Timestamp and Period concepts. Feature engineering (shift, rolling, weighted feature summaries) and started time series analysis. |
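As a recap of the split-apply-combine and rolling-window ideas from the last few classes, here is a small self-contained sketch (the data is made up for illustration):

```python
# Toy example of split-apply-combine (groupby) and simple time series features
# (shift, rolling); the data below is made up for illustration.
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2019-01-07", periods=10, freq="D"),
    "city": ["Denver", "Boulder"] * 5,
    "temp": [10.5, 8.2, 12.1, 9.9, 11.3, 7.8, 13.0, 10.1, 12.6, 9.4],
})

# Split-apply-combine: average temperature per city
print(df.groupby("city")["temp"].mean())

# Feature engineering per city: previous observation and a 3-point rolling mean
df = df.sort_values("date")
df["temp_prev"] = df.groupby("city")["temp"].shift(1)
df["temp_roll3"] = df.groupby("city")["temp"].transform(
    lambda s: s.rolling(3, min_periods=1).mean()
)
print(df)
```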