Sparkify is a music streaming startup that has grown its user base and song database and wants to move its processes and data onto the cloud. Their data consists of user activity logs from the music streaming app as well as metadata on the songs in the app. These data currently reside in Amazon S3 buckets on Amazon Web Services (AWS).
Sparkify requires an ETL pipeline that extracts their data from S3, stages it in Redshift, and transforms it into a set of dimensional tables for their analytics team to continue finding insights into what songs their users are listening to. The database and ETL pipeline are tested by running queries provided by the Sparkify analytics team and comparing the results with their expected results.
In this project, an ETL pipeline is built for a data warehouse hosted on Amazon Redshift. The pipeline loads data from S3 into staging tables on Redshift and executes SQL statements that create the analytics tables from these staging tables.
There are two datasets:

- Song dataset: Each file is in JSON format and contains metadata about a song and the artist of that song.

  Location: `s3://udacity-dend/song_data`

  Sample:
  ```json
  {"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
  ```

- Log dataset: The second dataset consists of JSON log files generated by an event simulator based on the songs in the dataset above. They simulate activity logs from a music streaming app based on specified configurations.

  Location: `s3://udacity-dend/log_data`

  Log data JSON path: `s3://udacity-dend/log_json_path.json`

  Sample:
  ```json
  {"artist":null, "auth":"Logged In", "firstName":"Walter", "gender":"M", "itemInSession":0, "lastName":"Frye", "length":null, "level":"free", "location":"San Francisco-Oakland-Hayward, CA", "method":"GET", "page":"Home", "registration":1540919166796.0, "sessionId":38, "song":null, "status":200, "ts":1541105830796, "userAgent":"\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36\"", "userId":"39"}
  ```
## Database Schema
The database schema consists of two staging tables used to load the two datasets from the Amazon S3 buckets into the data warehouse, plus a star schema (the analytics database) made up of one fact table that contains the measurable, quantitative data about each song play. The fact table references four dimension tables that describe the attributes related to it.
The schema is built this way because the ETL process requires a staging area where data can be cleaned, transformed and merged before being inserted into the analytics database, where the Sparkify analytics team can execute their queries.
### Staging Events & Songs tables
Below are the two staging tables into which the raw files are loaded before any transformation takes place.
```SQL
CREATE TABLE IF NOT EXISTS staging_events
(
artist VARCHAR
,auth VARCHAR
,firstName VARCHAR
,gender VARCHAR
,itemInSession INTEGER
,lastName VARCHAR
,length FLOAT
,level VARCHAR
,location VARCHAR
,method VARCHAR
,page VARCHAR
,registration FLOAT
,sessionId INTEGER
,song VARCHAR
,status INTEGER
,ts TIMESTAMP
,userAgent VARCHAR
,userId INTEGER
);
CREATE TABLE IF NOT EXISTS staging_songs
(
num_songs INTEGER
,artist_id VARCHAR
,artist_latitude FLOAT
,artist_longitude FLOAT
,artist_location VARCHAR
,artist_name VARCHAR
,song_id VARCHAR
,title VARCHAR
,duration FLOAT
,year INTEGER
);
```
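The staging tables are populated straight from S3 with Redshift `COPY` commands. The snippet below is a minimal sketch of such a load, assuming the IAM role ARN comes from the `[IAM_ROLE]` section of the `dwh.cfg` file described later and that the source bucket is in `us-west-2`; the exact statements used by the pipeline live in `sql_queries.py`.
```SQL
-- Sketch only: the real COPY statements are defined in sql_queries.py.
-- '<IAM_ROLE_ARN>' is a placeholder for the ARN configured in dwh.cfg.
COPY staging_events
FROM 's3://udacity-dend/log_data'
IAM_ROLE '<IAM_ROLE_ARN>'
REGION 'us-west-2'                              -- assumed bucket region
JSON 's3://udacity-dend/log_json_path.json'     -- maps JSON fields to table columns
TIMEFORMAT 'epochmillisecs';                    -- ts arrives as epoch milliseconds

COPY staging_songs
FROM 's3://udacity-dend/song_data'
IAM_ROLE '<IAM_ROLE_ARN>'
REGION 'us-west-2'
JSON 'auto';                                    -- song file keys match the column names
```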
### Fact Table
The fact table contains the measurable, quantitative data about each song play. Below is the single fact table and its SQL definition.

songplays - records in the log data associated with song plays.
```SQL
CREATE TABLE IF NOT EXISTS songplays
(
songplay_id INTEGER IDENTITY(0,1)
,start_time TIMESTAMP NOT NULL
,user_id INTEGER NOT NULL
,level VARCHAR(20)
,song_id VARCHAR(800) NOT NULL
,artist_id VARCHAR(800) NOT NULL
,session_id INTEGER NOT NULL
,location VARCHAR(800)
,user_agent VARCHAR(800)
,PRIMARY KEY(songplay_id)
,FOREIGN KEY(start_time) REFERENCES time(start_time)
,FOREIGN KEY(user_id) REFERENCES users(user_id)
,FOREIGN KEY(song_id) REFERENCES songs(song_id)
,FOREIGN KEY(artist_id) REFERENCES artists(artist_id)
)
DISTKEY(start_time)
SORTKEY(start_time);
```
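The fact table is filled from the staging tables with an `INSERT ... SELECT`. The query below is an illustrative sketch, assuming song plays are identified by `page = 'NextSong'` events and matched to songs by title and artist name; the exact query is defined in `sql_queries.py`.
```SQL
-- Sketch only: songplay_id is an IDENTITY column, so it is omitted from the column list.
INSERT INTO songplays (start_time, user_id, level, song_id, artist_id, session_id, location, user_agent)
SELECT DISTINCT
       e.ts
      ,e.userId
      ,e.level
      ,s.song_id
      ,s.artist_id
      ,e.sessionId
      ,e.location
      ,e.userAgent
FROM staging_events e
JOIN staging_songs  s
  ON  e.song   = s.title
  AND e.artist = s.artist_name
WHERE e.page = 'NextSong';
```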
### Dimension Tables
These tables are referenced by the fact table and contain descriptive attributes related to it. Below are the four dimension tables and their SQL definitions.

users - users in the app
```SQL
CREATE TABLE IF NOT EXISTS users
(
user_id INTEGER NOT NULL
,first_name VARCHAR(256) NOT NULL
,last_name VARCHAR(256) NOT NULL
,gender VARCHAR(256)
,level VARCHAR(20)
,PRIMARY KEY(user_id)
)
SORTKEY(user_id);
```
songs - songs in the music database
```SQL
CREATE TABLE IF NOT EXISTS songs
(
song_id VARCHAR(256)
,title VARCHAR(800) NOT NULL
,artist_id VARCHAR(800) NOT NULL
,year INTEGER
,duration FLOAT NOT NULL
,PRIMARY KEY(song_id)
)
SORTKEY(song_id);
```
artists - artists in the music database
```SQL
CREATE TABLE IF NOT EXISTS artists
(
artist_id VARCHAR(800) NOT NULL
,name VARCHAR(800) NOT NULL
,location VARCHAR(800)
,latitude FLOAT
,longitude FLOAT
,PRIMARY KEY(artist_id)
)
SORTKEY(artist_id);
```
time - timestamps of records in songplays broken down into specific units
```SQL
CREATE TABLE IF NOT EXISTS time
(
start_time TIMESTAMP NOT NULL
,hour INTEGER
,day INTEGER
,week INTEGER
,month INTEGER
,year INTEGER
,weekday VARCHAR(100)
,PRIMARY KEY(start_time)
)
DISTKEY(start_time)
SORTKEY(start_time);
```
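The dimension tables are populated from the staging tables in a similar way. As an example, the `time` table can be derived from the event timestamps; the sketch below assumes the same `page = 'NextSong'` filter and uses Redshift's `EXTRACT` and `TO_CHAR` functions, while the exact queries live in `sql_queries.py`.
```SQL
-- Sketch only: break each event timestamp into the units stored in the time table.
INSERT INTO time (start_time, hour, day, week, month, year, weekday)
SELECT DISTINCT
       ts
      ,EXTRACT(hour  FROM ts)
      ,EXTRACT(day   FROM ts)
      ,EXTRACT(week  FROM ts)
      ,EXTRACT(month FROM ts)
      ,EXTRACT(year  FROM ts)
      ,TO_CHAR(ts, 'Day')      -- weekday name, e.g. 'Monday'
FROM staging_events
WHERE page = 'NextSong';
```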
## Project Setup
The project setup consists of all the scripts required to reproduce this project.

- `analytics.py` - This script contains Python functions that invoke the analytics queries from the `sql_queries.py` script and return the results to the terminal.
- `aws_sdk_create_redshift_cluster.ipynb` - This notebook contains the Infrastructure as Code (IaC) script that creates an Amazon Redshift database and sets up the necessary policies and configurations so the project can be run from the shell/terminal.
- `sql_queries.py` - This script contains all the SQL queries for dropping existing tables, creating all the tables, inserting data into the staging and analytics tables, and running the analytic queries.
- `create_tables.py` - This script contains Python functions that establish a connection to the data warehouse and execute the queries in `sql_queries.py` to drop existing tables and create all the tables.
- `etl.py` - This script implements the code in the `etl.ipynb` notebook for a smooth extract, transform and load of the datasets into the database.
## Getting Started

- Install the AWS Command Line Interface (awscli) from the terminal/Bash shell.

  On Linux:
  ```
  $ sudo apt install awscli
  ```
  On Windows:
  ```
  C:\> msiexec.exe /i https://awscli.amazonaws.com/AWSCLIV2.msi
  ```

- Run pip to install the AWS Python SDK (boto3), pandas and psycopg2-binary.
  ```
  $ pip install boto3 pandas psycopg2-binary
  ```

- Run the command below in your local Git Bash/terminal to clone the GitHub repository.
  ```
  $ git clone https://github.com/donjude/cloud-data-warehouse.git
  ```

- Create an AWS account, then create a configuration file named `dwh.cfg` in the same folder as the notebook `aws_sdk_create_redshift_cluster.ipynb` and fill in the necessary details as shown below. You may read more about IAM users, access keys and secret keys in the following link: Create IAM User.
  ```
  [AWS]
  KEY =
  SECRET =

  [CLUSTER]
  HOST =
  DB_NAME =
  DB_USER =
  DB_PASSWORD =
  DB_PORT = 5439

  [IAM_ROLE]
  ARN =

  [S3]
  LOG_DATA = 's3://udacity-dend/log_data'
  LOG_JSONPATH = 's3://udacity-dend/log_json_path.json'
  SONG_DATA = 's3://udacity-dend/song_data'

  [DWH]
  DWH_CLUSTER_TYPE = multi-node
  DWH_NUM_NODES = 4
  DWH_NODE_TYPE = dc2.large
  DWH_IAM_ROLE_NAME =
  DWH_CLUSTER_IDENTIFIER =
  DWH_DB =
  DWH_DB_USER =
  DWH_DB_PASSWORD =
  DWH_PORT = 5439
  ```
- In the cloned repository on your local computer, open the Jupyter notebook `aws_sdk_create_redshift_cluster.ipynb`, follow the instructions and run the notebook cells to create a Redshift database on AWS.

- Run the Python file `create_tables.py` in the terminal/shell to create the staging and analytics tables.
  ```
  $ python create_tables.py
  ```

- Run the Python file `etl.py` to execute the ETL process of loading the data into the staging tables and transforming it into the analytics tables.
  ```
  $ python etl.py
  ```

- Run the `analytics.py` file to check the record counts in each table and to query the tables for the artist and songs with the highest and lowest streaming durations.
## Results

The results below show all the tables in the data warehouse and their record counts after executing the analytics queries. The last two results show the artist and song with the highest and the lowest streaming durations, respectively.
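For reference, below is a sketch of the kind of query `analytics.py` could run to find the artist and song with the highest total streaming duration; the actual analytics queries are defined in `sql_queries.py`, and the record counts are simple `SELECT COUNT(*)` statements per table.
```SQL
-- Sketch only: total streaming duration per artist/song, highest first
-- (swap DESC for ASC to get the lowest).
SELECT a.name          AS artist
      ,s.title         AS song
      ,SUM(s.duration) AS total_duration
FROM songplays sp
JOIN songs   s ON sp.song_id   = s.song_id
JOIN artists a ON sp.artist_id = a.artist_id
GROUP BY a.name, s.title
ORDER BY total_duration DESC
LIMIT 1;
```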