Code for the Reddit dataisbeautiful DataViz Battle for the Month of August 2018
Analysis and visual exploration of TSA Claims Data.
This repository uses pipenv. If you need to install it, follow the pipenv documentation.
Create a Python 3.6 environment and install all the dependencies:
git clone git@github.com:jackdbd/reddit-dataviz-battle-2018-08.git
cd reddit-dataviz-battle-2018-08
pipenv --python python3.6
pipenv install
The entire TSA dataset is spread across multiple Excel files and PDF files. Download all files from here and put them in the data directory.
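Before building the database, it can help to confirm that the downloaded files actually landed in the data directory. The snippet below is a minimal sketch, not part of the repository: the helper name count_data_files is hypothetical, and it assumes the script is run from the repository root so that data resolves to the right directory.

```python
from collections import Counter
from pathlib import Path

# Extensions that make_db.py is described as reading.
EXPECTED_EXTENSIONS = {".xls", ".xlsx", ".pdf"}


def count_data_files(data_dir):
    """Count files in data_dir by extension, ignoring unexpected file types.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    counts = Counter()
    for path in Path(data_dir).iterdir():
        suffix = path.suffix.lower()
        if path.is_file() and suffix in EXPECTED_EXTENSIONS:
            counts[suffix] += 1
    return counts


if __name__ == "__main__":
    # Assumption: the current working directory is the repository root.
    for extension, count in sorted(count_data_files("data").items()):
        print(f"{extension}: {count} file(s)")
```

If one of the expected extensions is missing from the output, some files were probably not downloaded or were saved elsewhere.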
The script make_db.py gathers data from all the files (.xls, .xlsx, .pdf) and creates a SQLite database. You can run it with sane defaults:
cd src
pipenv run python make_db.py # it takes ~20 minutes
If you want to specify different parameters to read the PDF/Excel files, run:
pipenv run python make_db.py --help
For instance, running the script in debug mode can help you see what's going on with the PDF files. The following command drops the database and reads only 2 pages from each PDF file, skipping all Excel files:
pipenv run python make_db.py -d --no_excel
When your database TSA.db is ready, you can launch a Jupyter notebook and start exploring the data:
cd notebooks
pipenv run jupyter notebook
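A reasonable first step in a notebook is to check which tables the database contains and how many rows each one holds. The following is a minimal sketch using only the standard library sqlite3 module. The helper name table_row_counts is hypothetical, and the path to TSA.db is an assumption: adjust it to wherever make_db.py wrote the file.

```python
import sqlite3


def table_row_counts(db_path):
    """Return a dict mapping each table name in the database to its row count.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    connection = sqlite3.connect(db_path)
    try:
        names = [
            name
            for (name,) in connection.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        counts = {}
        for name in names:
            # Quote the identifier so unusual table names are handled safely.
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = connection.execute(
                f"SELECT COUNT(*) FROM {quoted}"
            ).fetchone()[0]
        return counts
    finally:
        connection.close()


if __name__ == "__main__":
    # Assumption: TSA.db is in the current working directory.
    for table, rows in table_row_counts("TSA.db").items():
        print(f"{table}: {rows} rows")
```

Note that sqlite3.connect creates an empty database file if the path does not exist, so an empty result usually means the path is wrong rather than that the database has no tables.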