Cloud Data Warehouse project - Udacity Data Engineer Nanodegree

Project: Data Warehouse (AWS Redshift)

Context

A music streaming startup, Sparkify, has grown its user base and song database and wants to move its processes and data onto the cloud. Its data resides in S3, in a directory of JSON logs on user activity in the app, as well as a directory of JSON metadata on the songs in the app.

As their data engineer, you are tasked with building an ETL pipeline that extracts the data from S3, stages it in Redshift, and transforms it into a set of dimensional tables for the analytics team to continue finding insights into what songs their users are listening to. You'll be able to test your database and ETL pipeline by running queries given to you by Sparkify's analytics team and comparing your results with their expected results.

Project Description

In this project, you'll apply what you've learned about data warehouses and AWS to build an ETL pipeline for a database hosted on Redshift. To complete the project, you will load data from S3 into staging tables on Redshift and execute SQL statements that create the analytics tables from those staging tables.
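
To make the loading step concrete, here is a minimal sketch of the two COPY statements as they might appear in sql_queries.py. The S3 paths, the JSONPaths file, the dwh.cfg keys and the role variable are assumptions based on the standard Udacity setup, not verified contents of this repository.

# Hypothetical excerpt in the style of sql_queries.py; the bucket paths,
# JSONPaths file and config keys are assumptions, not verified contents
# of this repository.
import configparser

config = configparser.ConfigParser()
config.read('dwh.cfg')  # assumed config file holding cluster and role settings
DWH_ROLE_ARN = config.get('IAM_ROLE', 'ARN')

staging_events_copy = """
    COPY staging_events
    FROM 's3://udacity-dend/log_data'
    IAM_ROLE '{}'
    FORMAT AS JSON 's3://udacity-dend/log_json_path.json'
    TIMEFORMAT AS 'epochmillisecs'
    REGION 'us-west-2';
""".format(DWH_ROLE_ARN)

staging_songs_copy = """
    COPY staging_songs
    FROM 's3://udacity-dend/song_data'
    IAM_ROLE '{}'
    FORMAT AS JSON 'auto'
    REGION 'us-west-2';
""".format(DWH_ROLE_ARN)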

Data Model

I have implemented a star schema. It is the typical schema for a data warehouse and, together with the snowflake schema, one of the two most popular data warehouse designs. In our scenario, songplays is the fact table and the remaining tables are dimension tables. Until now the data warehouse was hosted locally on a Postgres database; now it is hosted on AWS Redshift.

Below is a diagram of the implemented Data Warehouse:

[diagram: schema_postgres_database (see the static_files folder)]

Below are descriptions of the implemented Data Warehouse tables:

Staging tables:

  • staging_events - staging table for log data (a DDL sketch follows these lists)

    • artist (TEXT)
    • auth (TEXT)
    • firstName (TEXT)
    • itemInSession (TEXT)
    • lastName (TEXT)
    • length (FLOAT)
    • level (TEXT)
    • location (TEXT)
    • method (TEXT)
    • page (TEXT)
    • registration (TEXT)
    • sessionId (INT)
    • song (TEXT)
    • status (INT)
    • ts (TIMESTAMP)
    • userAgent (TEXT)
    • userId (INT)
  • staging_songs - staging table for song data

    • song_id (TEXT, PRIMARY KEY)
    • artist_id (TEXT)
    • artist_latitude (FLOAT)
    • artist_location (TEXT)
    • artist_longitude (FLOAT)
    • artist_name (TEXT)
    • duration (FLOAT)
    • num_songs (INT)
    • title (TEXT)
    • year (INT)
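
Assuming the column lists above map one-to-one to the DDL in sql_queries.py, staging_events could be created roughly as follows (a sketch, not the repository's exact statement; staging_songs would follow the same pattern):

# Sketch of the staging_events DDL, derived directly from the column
# list above; the repository's actual statement may differ in details.
staging_events_table_create = """
    CREATE TABLE IF NOT EXISTS staging_events (
        artist        TEXT,
        auth          TEXT,
        firstName     TEXT,
        itemInSession TEXT,
        lastName      TEXT,
        length        FLOAT,
        level         TEXT,
        location      TEXT,
        method        TEXT,
        page          TEXT,
        registration  TEXT,
        sessionId     INT,
        song          TEXT,
        status        INT,
        ts            TIMESTAMP,
        userAgent     TEXT,
        userId        INT
    );
"""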

Fact table:

  • songplays - records in the log data associated with song plays, i.e. records with page = NextSong (a DDL sketch follows this list)
    • songplay_id (BIGINT, PRIMARY KEY)
    • start_time (TIMESTAMP, NOT NULL, FK, SORTKEY)
    • user_id (BIGINT, NOT NULL, FK)
    • level (TEXT, NOT NULL)
    • song_id (TEXT, NOT NULL, FK)
    • artist_id (TEXT, NOT NULL, FK, DISTKEY)
    • session_id (INT, NOT NULL)
    • location (TEXT, NOT NULL)
    • user_agent (TEXT, NOT NULL)
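
A sketch of the corresponding songplays DDL, assuming the column list above maps directly to sql_queries.py. IDENTITY is one common way to generate songplay_id on Redshift, though the repository may do this differently.

# Sketch of the fact-table DDL; IDENTITY for the surrogate key is an
# assumption, as are the REFERENCES clauses (informational on Redshift).
songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id BIGINT IDENTITY(0, 1) PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL SORTKEY REFERENCES time (start_time),
        user_id     BIGINT    NOT NULL REFERENCES users (user_id),
        level       TEXT      NOT NULL,
        song_id     TEXT      NOT NULL REFERENCES songs (song_id),
        artist_id   TEXT      NOT NULL DISTKEY REFERENCES artists (artist_id),
        session_id  INT       NOT NULL,
        location    TEXT      NOT NULL,
        user_agent  TEXT      NOT NULL
    );
"""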

Dimension tables:

  • users - users in the app

    • user_id (BIGINT, PRIMARY KEY, SORTKEY)
    • firstname (TEXT, NOT NULL)
    • lastname (TEXT, NOT NULL)
    • gender (TEXT, NOT NULL)
    • level (TEXT, NOT NULL)
  • songs - songs in music database

    • song_id (TEXT, PRIMARY KEY, SORTKEY)
    • title (TEXT, NOT NULL)
    • artist_id (TEXT, NOT NULL)
    • year (INT, NOT NULL)
    • duration (FLOAT, NOT NULL)
  • artists - artists in music database

    • artist_id (TEXT, PRIMARY KEY, DISTKEY)
    • name (TEXT, NOT NULL)
    • location (TEXT, NOT NULL)
    • latitude (FLOAT, NOT NULL)
    • longitude (FLOAT, NOT NULL)
  • time - timestamps of records in songplays broken down into specific units (a sketch of the derivation follows this list)

    • start_time (TIMESTAMP, PRIMARY KEY, SORTKEY)
    • hour (INT, NOT NULL)
    • day (INT, NOT NULL)
    • week (INT, NOT NULL)
    • month (INT, NOT NULL)
    • year (INT, NOT NULL)
    • weekday (TEXT, NOT NULL)
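
A sketch of how the time dimension could be derived from the staged events, assuming the standard EXTRACT-based approach; the actual INSERT in sql_queries.py may differ, for example in how the weekday is rendered as text:

# Sketch of the time-dimension load; TO_CHAR for the weekday name is an
# assumption chosen to match the TEXT type documented above.
time_table_insert = """
    INSERT INTO time (start_time, hour, day, week, month, year, weekday)
    SELECT DISTINCT ts,
           EXTRACT(hour  FROM ts),
           EXTRACT(day   FROM ts),
           EXTRACT(week  FROM ts),
           EXTRACT(month FROM ts),
           EXTRACT(year  FROM ts),
           TO_CHAR(ts, 'Day')
    FROM staging_events
    WHERE page = 'NextSong';
"""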

Project structure:

  1. static_files - folder that contains static files (for example, the picture of the database schema)
  2. create_tables.py - script to drop and create tables
  3. etl.py - ETL script (loading staging tables, inserting data)
  4. README.md - project description
  5. sql_queries.py - Python script with all raw SQL queries

Project steps:

The project has two main steps:
  1. Creating the database structure (all tables).
  2. Loading the data from S3 into AWS Redshift.

How to start project?

First, create the tables with the following command:

python create_tables.py

Then the ETL script can be started with the following command:

python etl.py
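
For orientation, here is a minimal sketch of what etl.py plausibly does, assuming a dwh.cfg file describing the cluster and the query lists exported by sql_queries.py (these names are assumptions, not verified against the repository):

# Minimal sketch of the ETL entry point; config file name, section name
# and query-list names are assumptions based on the standard project setup.
import configparser
import psycopg2

from sql_queries import copy_table_queries, insert_table_queries


def main():
    config = configparser.ConfigParser()
    config.read('dwh.cfg')

    # Connect to the Redshift cluster described in the config file.
    conn = psycopg2.connect(
        "host={} dbname={} user={} password={} port={}".format(
            *config['CLUSTER'].values()))
    cur = conn.cursor()

    # Step 1: COPY the raw JSON from S3 into the staging tables.
    for query in copy_table_queries:
        cur.execute(query)
        conn.commit()

    # Step 2: INSERT from the staging tables into the star-schema tables.
    for query in insert_table_queries:
        cur.execute(query)
        conn.commit()

    conn.close()


if __name__ == "__main__":
    main()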

Queries for songplay analysis

Sample queries for songplay analysis:

  • Getting the five most-listened-to songs:

SELECT songs.title, COUNT(*)
FROM songplays
LEFT JOIN songs ON songplays.song_id = songs.song_id
GROUP BY songs.title
ORDER BY COUNT(*) DESC
LIMIT 5;

  • Getting the five most-listened-to artists:

SELECT artists.name, COUNT(*)
FROM songplays
LEFT JOIN artists ON songplays.artist_id = artists.artist_id
GROUP BY artists.name
ORDER BY COUNT(*) DESC
LIMIT 5;
