DSE-230

Introduction

Project for UCSD DSE 230 - Scalable Data Analysis

Data description: smartphone and smartwatch sensor data recorded during common household tasks. The data can be used to classify physical activity.

All files can be found on GitHub in the public repository gojandrooo/DSE-230.

Contributors:

Local Environment Setup with Docker

To set up the local environment, install Docker, then run the launch script for your operating system.

Mac

sh launch.sh

Windows

.\launch.ps1

Running the Project

Run the notebook files in the following order:

dse230_01_parquet_generation.ipynb
dse230_02_data_merging.ipynb
dse230_03_eda.ipynb
dse230_04_decision_tree_classification.ipynb

This order is required for everything to work locally, because later notebooks read files produced by earlier ones.
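The run order above can also be scripted. The following is a minimal sketch, not part of the repository: it assumes `jupyter nbconvert` is available inside the Docker container, and the helper names `build_commands` and `run_all` are hypothetical. It only builds the commands unless `run_all()` is called explicitly.

```python
import subprocess

# Notebook order taken from the list above.
NOTEBOOKS = [
    "dse230_01_parquet_generation.ipynb",
    "dse230_02_data_merging.ipynb",
    "dse230_03_eda.ipynb",
    "dse230_04_decision_tree_classification.ipynb",
]


def build_commands(notebooks=NOTEBOOKS):
    """Return one nbconvert command per notebook, in run order."""
    return [
        ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", name]
        for name in notebooks
    ]


def run_all(notebooks=NOTEBOOKS):
    """Execute the notebooks one after another, stopping at the first failure."""
    for command in build_commands(notebooks):
        # check=True raises CalledProcessError, so a later notebook never
        # starts when an earlier one did not finish.
        subprocess.run(command, check=True)


commands = build_commands()
```

Calling `run_all()` from inside the container would execute each notebook in place. Stopping at the first failure matters here because the later notebooks depend on files written by the earlier ones.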

Data Generation

The first notebook to run is dse230_01_parquet_generation.ipynb. It must be run within the Docker container.

Given the size of the dataset (16M+ records), memory must be managed explicitly for the workflow to run. This notebook converts all the raw data files into lighter Parquet files, using the smallest data types that do not lose precision. This may take a few minutes. Do not run the subsequent notebooks until it is complete.
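To illustrate the idea of choosing the smallest data type that keeps the values exact, here is a minimal sketch using pandas and NumPy. It is not the notebook's code: the function `downcast_numeric` and the column names `subject_id`, `x`, and `y` are hypothetical, and the three-row frame is a made-up stand-in for the raw files.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose numeric columns use the smallest lossless dtype."""
    out = df.copy()
    for col in out.columns:
        series = out[col]
        if pd.api.types.is_integer_dtype(series):
            # Picks the smallest signed integer type that holds every value.
            out[col] = pd.to_numeric(series, downcast="integer")
        elif pd.api.types.is_float_dtype(series):
            as_f32 = series.astype(np.float32)
            original = series.to_numpy(dtype=np.float64)
            round_trip = as_f32.to_numpy().astype(np.float64)
            # Keep float32 only if converting back reproduces every value.
            if np.array_equal(round_trip, original, equal_nan=True):
                out[col] = as_f32
    return out


# Hypothetical sample; the real raw files have many more rows and columns.
raw = pd.DataFrame(
    {
        "subject_id": np.array([1600, 1600, 1601], dtype=np.int64),
        "x": np.array([0.5, -1.25, 2.0], dtype=np.float64),
        "y": np.array([0.1, 0.2, 0.3], dtype=np.float64),
    }
)
small = downcast_numeric(raw)

# Writing the result needs a Parquet engine such as pyarrow:
# small.to_parquet("example.parquet")
```

In this sample, `subject_id` shrinks to a 16-bit integer and `x` to a 32-bit float, while `y` stays at 64 bits because 0.1 cannot be stored exactly in 32 bits. A stricter or looser precision rule may be appropriate depending on the sensor resolution.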

Grouping and Merging Data

The second notebook to run is dse230_02_data_merging.ipynb. It must be run within the Docker container.

This notebook creates two CSV files of prepared data that are used by the next two notebooks. The raw data is changed from one record every millisecond to three-second aggregations.
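The three-second aggregation can be sketched as a group-by on a window index. This is an illustration under assumptions rather than the notebook's implementation: the column names `subject_id`, `activity`, `timestamp_ms`, and `x`, the millisecond timestamp unit, and the chosen statistics are all hypothetical.

```python
import pandas as pd

WINDOW_MS = 3_000  # three seconds, expressed in milliseconds


def aggregate_windows(df: pd.DataFrame, window_ms: int = WINDOW_MS) -> pd.DataFrame:
    """Collapse raw readings into one row per subject, activity, and window."""
    # Integer division assigns every reading to a fixed-width window.
    keyed = df.assign(window=df["timestamp_ms"] // window_ms)
    return keyed.groupby(["subject_id", "activity", "window"], as_index=False).agg(
        x_mean=("x", "mean"),
        x_max=("x", "max"),
        n_readings=("x", "count"),
    )


# Hypothetical readings: three fall in the first window, one in the second.
readings = pd.DataFrame(
    {
        "subject_id": [1600, 1600, 1600, 1600],
        "activity": ["A", "A", "A", "A"],
        "timestamp_ms": [0, 1500, 2999, 3000],
        "x": [1.0, 2.0, 3.0, 10.0],
    }
)
windows = aggregate_windows(readings)
```

Here the four readings collapse into two rows, one per window. The result could then be saved with `windows.to_csv(...)` for the later notebooks to read.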

Exploratory Data Analysis

The third notebook to run is dse230_03_eda.ipynb. This notebook does not use Dask, so Docker is not strictly necessary, but it is still recommended.

It includes visualizations of the aggregated data.

Decision Tree Modeling Results

The fourth notebook to run is dse230_04_decision_tree_classification.ipynb. This notebook does not use Dask, so Docker is not strictly necessary, but it is still recommended.

During our exploration, we ran multiple types of classification models and selected the one that performed best (Decision Tree Classifier). For the sake of brevity, only the best model is retained. However, if you wish to check the performance of alternative models, uncomment the import statements and switch out the model.
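Switching models by uncommenting an import might look like the sketch below. It assumes scikit-learn, which the notebook may or may not use, and it runs on synthetic data from `make_classification` rather than the project's CSV files; `RandomForestClassifier` is only an example of an alternative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# from sklearn.ensemble import RandomForestClassifier  # uncomment to try it

# Synthetic stand-in for the aggregated sensor features and activity labels.
X, y = make_classification(
    n_samples=300,
    n_features=6,
    n_informative=4,
    n_classes=3,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = DecisionTreeClassifier(random_state=0)
# model = RandomForestClassifier(random_state=0)  # swap in the alternative

model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)  # fraction of test rows classified correctly
print(f"Test accuracy: {accuracy:.3f}")
```

Because scikit-learn estimators share the same `fit` and `score` interface, only the import and the `model = ...` line need to change to compare models.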

