Project for UCSD DSE 230 - Scalable Data Analysis
Data description: smartphone and smartwatch sensor data for common household tasks. Can be used for classification of physical activity.
All files can be found on GitHub in the public repository.
To set up the app's local environment, install Docker.
Run the notebook files in the following order:
Each notebook depends on output from the earlier ones, so they must be run in this order for the project to work locally.
The first notebook to run is dse230_01_parquet_generation.ipynb. It must be run within a Docker container.
Given the size of the dataset (16M+ records), memory must be managed explicitly to run the workflow. This notebook converts all the raw data files into lighter Parquet files, using the smallest data types possible without losing precision. This may take a few minutes. Do not run the subsequent notebooks until it has finished.
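As a rough illustration of the "smallest data types possible" idea, the sketch below picks the narrowest signed integer type that can hold a column's values. It is not the notebook's code: the helper names (smallest_int_dtype, bytes_saved) and the sample values are hypothetical, and it uses only the Python standard library.

```python
# Signed integer widths, smallest first: (dtype name, min value, max value).
INT_DTYPES = [
    ("int8", -2**7, 2**7 - 1),
    ("int16", -2**15, 2**15 - 1),
    ("int32", -2**31, 2**31 - 1),
    ("int64", -2**63, 2**63 - 1),
]


def smallest_int_dtype(values):
    """Return the name of the narrowest signed integer dtype that holds every value."""
    lo, hi = min(values), max(values)
    for name, dtype_min, dtype_max in INT_DTYPES:
        if dtype_min <= lo and hi <= dtype_max:
            return name
    raise OverflowError("values do not fit in a 64-bit signed integer")


def bytes_saved(values, dtype_name):
    """Bytes saved compared with storing the same values as int64 (8 bytes each)."""
    width_bytes = int(dtype_name[3:]) // 8  # "int16" -> 16 bits -> 2 bytes
    return len(values) * (8 - width_bytes)


# Hypothetical column of small identifiers: int16 is enough.
sample = [1600, 1601, 1650]
dtype = smallest_int_dtype(sample)
saved = bytes_saved(sample, dtype)
```

In pandas, a similar effect is available through pandas.to_numeric with its downcast argument. Whether the notebook uses that call or sets dtypes by hand is not stated here.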
The second notebook to run is dse230_02_data_merging.ipynb. It must be run within a Docker container.
This notebook creates two CSV files of prepared data that are used by the next two notebooks. The raw data is aggregated from one record every millisecond into three-second windows.
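The three-second aggregation can be pictured with the minimal sketch below. It assumes each record is a (timestamp in milliseconds, value) pair and that the aggregate is a mean. Both are assumptions made for illustration, since the notebook's actual columns and statistics are not listed here, and aggregate_windows is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean

WINDOW_MS = 3000  # three-second window, expressed in milliseconds


def aggregate_windows(records, window_ms=WINDOW_MS):
    """Average readings per fixed-size time window.

    records: iterable of (timestamp_ms, value) pairs.
    Returns a dict mapping each window's start time (ms) to the mean value in it.
    """
    buckets = defaultdict(list)
    for timestamp_ms, value in records:
        # Floor the timestamp to the start of its window.
        window_start = (timestamp_ms // window_ms) * window_ms
        buckets[window_start].append(value)
    return {start: mean(vals) for start, vals in sorted(buckets.items())}


# Hypothetical readings: three fall in the first window, two in the second.
readings = [(0, 1.0), (1500, 3.0), (2999, 5.0), (3000, 10.0), (4500, 20.0)]
windows = aggregate_windows(readings)
```

Flooring the timestamp to a window start keeps the grouping to a single pass over the data, which matters at this dataset's size.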
The third notebook to run is dse230_03_eda.ipynb. This file does not use Dask, so Docker is not strictly necessary but is still recommended.
It includes visualizations of the aggregated data.
The fourth notebook to run is dse230_04_decision_tree_classification.ipynb. This file does not use Dask, so Docker is not strictly necessary but is still recommended.
During our exploration, we ran multiple types of classification models and selected the one with the best performance (Decision Tree Classifier). For the sake of brevity, only the best model is retained. However, if you wish to check the performance of alternate models, uncomment the import statements and switch out the model.
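The run order described above could also be scripted instead of followed by hand. The sketch below is one possible approach, not part of the project: it assumes Jupyter's nbconvert is installed and that the script is started inside the Docker container from the directory holding the notebooks. build_command and run_all are hypothetical helper names.

```python
import subprocess

# Notebook names as listed in this README, in the required order.
NOTEBOOKS = [
    "dse230_01_parquet_generation.ipynb",
    "dse230_02_data_merging.ipynb",
    "dse230_03_eda.ipynb",
    "dse230_04_decision_tree_classification.ipynb",
]


def build_command(notebook):
    """Build the nbconvert command that executes one notebook in place."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--inplace",
        notebook,
    ]


def run_all(notebooks=NOTEBOOKS):
    """Run each notebook in order; check=True stops at the first failure."""
    for notebook in notebooks:
        subprocess.run(build_command(notebook), check=True)
```

Calling run_all() executes the four notebooks sequentially, so a later notebook never starts before the earlier one has written its output.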