This repo contains Python source and Jupyter notebooks for running and testing the songplay analysis data pipeline.
Sparkify analysts want to analyze the data they've been collecting on songs and user activity on their new music streaming app, with a particular focus on understanding what songs users are listening to.
The data pipeline process establishes a database and tables, processes JSON data files, and inserts processed data into those tables.
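The three steps above can be sketched as follows. This is a minimal illustration only: it uses Python's built-in `sqlite3` (in-memory) as a stand-in for the project's actual database, and the table and column names shown are assumptions for demonstration, not the project's confirmed schema.

```python
import sqlite3

# Illustrative sketch only: sqlite3 in-memory stands in for the real
# database so the example is self-contained.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# 1. Establish a table (hypothetical schema for demonstration)
cur.execute("""
    CREATE TABLE songplays (
        songplay_id INTEGER PRIMARY KEY,
        user_id TEXT,
        song TEXT
    )
""")

# 2. A (hypothetical) record extracted from a processed JSON log file
record = {"user_id": "42", "song": "Example Song"}

# 3. Insert the processed data into the table
cur.execute(
    "INSERT INTO songplays (user_id, song) VALUES (?, ?)",
    (record["user_id"], record["song"]),
)
conn.commit()
```

The real pipeline follows the same create-then-load shape, with the queries kept in a separate module so the table definitions and inserts stay in one place.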
The pipeline processes two types of JSON data files:
- Song datasets - each file contains metadata about a song and that song's artist. The files are partitioned by the first three letters of each song's track ID.
- Log datasets - activity logs from the music streaming app, partitioned by year and month.
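To make the two record shapes concrete, the sketch below parses one hypothetical record of each kind with the standard `json` module. The field names shown (e.g. `song_id`, `userId`, `page`) are assumptions for illustration, not the project's confirmed schema.

```python
import json

# Hypothetical single-record song file (field names are assumptions)
song_line = (
    '{"song_id": "SOABC12", "title": "Example Song",'
    ' "artist_name": "Example Artist", "year": 2018}'
)
song = json.loads(song_line)

# Hypothetical one line from a newline-delimited activity log
log_line = '{"userId": "42", "page": "NextSong", "ts": 1541106106796}'
event = json.loads(log_line)

# In this sketch, only "NextSong" events represent actual song plays
is_songplay = event["page"] == "NextSong"
```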
The process is typically run in the following order:
- create_tables.py - to reset and establish the database
- etl.py - to process and load data
- test.ipynb - for any manual data validation
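Because both datasets are partitioned into nested directories, an ETL script typically begins by collecting every `.json` file under a data root. A minimal, self-contained sketch (the directory layout built here is hypothetical, created only so the example runs on its own):

```python
import glob
import os
import tempfile


def get_json_files(root):
    """Recursively collect all .json files under root."""
    return sorted(glob.glob(os.path.join(root, "**", "*.json"), recursive=True))


# Self-contained demo: build a tiny partitioned tree (e.g. <root>/A/B/)
root = tempfile.mkdtemp()
nested = os.path.join(root, "A", "B")
os.makedirs(nested)
with open(os.path.join(nested, "TRABCDE.json"), "w") as f:
    f.write("{}")

files = get_json_files(root)
```

Each collected file can then be read, transformed, and inserted one at a time.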
- sql_queries.py - contains the SQL queries that are applied in both the create_tables.py file and the etl.py file
- create_tables.py - establishes the database connection, drops and recreates the database and the tables within it
- etl.py - processes and loads JSON data into the database
- etl.ipynb - can be used for interactively testing parts of the ETL process
- test.ipynb - contains SQL queries to view data that has been inserted into the tables