
Building an Azure Data Warehouse for Bike Share Data Analytics

Project Overview

Divvy is a bike-sharing program in Chicago, Illinois, USA, that allows riders to purchase a pass at a kiosk or use a mobile application to unlock a bike at stations around the city and ride it for a specified amount of time. Bikes can be returned to the same station or to another station. The City of Chicago makes anonymized bike trip data publicly available for projects like this one, where we can analyze the data.

Business Requirements

  1. Analyze the duration of each ride:
    • Based on date and time factors such as day of the week and time of day
    • Based on the starting and/or ending station
    • Based on the rider's age at the time of the ride
    • Based on whether the rider is a member or casual rider
  2. Analyze the cost:
    • Per month, quarter, and year
    • Per member, based on the rider's age at account start
  3. Analyze the cost per member:
    • Based on the number of rides the rider averages per month
    • Based on the number of minutes the rider spends on a bike per month

Technology Stack

  • Python
  • PostgreSQL
  • Azure Blob Storage (Azure Data Lake Storage Gen2)
  • Azure Synapse Analytics
  • Azure Synapse Serverless SQL Pool

Directory Structure

This section outlines the directory structure of the data within the Azure Data Lake Storage Gen2 account. The data is divided into two layers: raw and gold (curated business-level tables).

bike_share_data/
        raw/
            payment.csv
            rider.csv
            station.csv
            trip.csv
        gold/
            dim_calendar
            dim_rider
            dim_station
            fact_trip
            fact_payment

Solution Architecture

[Solution architecture diagram]

Star Schema Design

The star schema consists of two fact tables (Fact Trip and Fact Payment) and three dimension tables (Dim Calendar, Dim Rider, Dim Station). [Star schema diagram; also provided as a PDF]
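As a rough sketch of the grain and key relationships (the column names below are illustrative assumptions, not taken from the repository; in the gold layer these objects are external tables, so the keys are logical rather than physically enforced):

    -- Conceptual sketch of the star schema; column names are assumptions.
    CREATE TABLE dim_rider    (rider_id INT PRIMARY KEY);           -- name, birthday, membership flag, ...
    CREATE TABLE dim_station  (station_id VARCHAR(50) PRIMARY KEY); -- name, latitude, longitude
    CREATE TABLE dim_calendar (date_key DATE PRIMARY KEY);          -- day of week, quarter, year, ...

    CREATE TABLE fact_trip (
        trip_id          VARCHAR(50) PRIMARY KEY,
        rider_id         INT REFERENCES dim_rider (rider_id),
        start_station_id VARCHAR(50) REFERENCES dim_station (station_id),
        end_station_id   VARCHAR(50) REFERENCES dim_station (station_id),
        date_key         DATE REFERENCES dim_calendar (date_key),
        duration_minutes INT,
        rider_age        INT
    );

    CREATE TABLE fact_payment (
        payment_id INT PRIMARY KEY,
        rider_id   INT REFERENCES dim_rider (rider_id),
        date_key   DATE REFERENCES dim_calendar (date_key),
        amount     NUMERIC(10, 2)
    );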

Extract Step: Create the data in PostgreSQL

To prepare the bike share data for ingestion into Azure Blob Storage, first create tables in PostgreSQL using the Python script ProjectDataToPostgres.py. The script loads the raw CSV data into related tables in a PostgreSQL database.
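As a rough illustration of what the script sets up, two of the tables might be defined and loaded along these lines; the column definitions and file paths below are assumptions, and the authoritative definitions live in ProjectDataToPostgres.py:

    -- Illustrative PostgreSQL DDL of the kind the script issues; columns are assumptions.
    CREATE TABLE rider (
        rider_id      INTEGER PRIMARY KEY,
        first_name    VARCHAR(50),
        last_name     VARCHAR(50),
        birthday      DATE,
        account_start DATE,
        account_end   DATE,
        is_member     BOOLEAN
    );

    CREATE TABLE trip (
        trip_id          VARCHAR(50) PRIMARY KEY,
        rider_id         INTEGER REFERENCES rider (rider_id),
        start_station_id VARCHAR(50),
        end_station_id   VARCHAR(50),
        started_at       TIMESTAMP,
        ended_at         TIMESTAMP
    );

    -- Bulk-load the raw CSV files (paths are placeholders); the script can do the
    -- equivalent from Python, for example via psycopg2's copy_expert.
    \copy rider FROM 'rider.csv' WITH (FORMAT csv)
    \copy trip  FROM 'trip.csv'  WITH (FORMAT csv)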

Next, create a linked service to the PostgreSQL server in Synapse Studio. Then use the ingest wizard (Copy Data tool) in the Azure Synapse workspace to create a one-time pipeline that copies the data from PostgreSQL into Azure Blob Storage.

Note: Because a dedicated SQL pool is not available, the calendar table is also created in PostgreSQL, following a process similar to that used for the other four tables. You can generate the calendar table separately with the provided Python script ProjectDataToPostgres_calendar.py, or generate all tables together with ProjectDataToPostgres_all.py.

Load Step: Create Staging Tables using CREATE EXTERNAL TABLE

Scripts 4 to 7 in the sql_scripts folder define external staging tables over the CSV files in the raw layer. Once the data is exposed with the correct column data types, you can proceed to the transform step, which reshapes the data into a format suitable for analysis.
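The general pattern in the serverless SQL pool looks like the sketch below; the data source name, storage account URL, file format options, and column list are placeholders rather than the exact definitions used in scripts 4 to 7:

    -- One-time setup: point the serverless SQL pool at the storage account
    -- (storage account and container names are placeholders).
    CREATE EXTERNAL DATA SOURCE bike_share_ds
    WITH (LOCATION = 'https://<storage-account>.dfs.core.windows.net/bike_share_data');

    CREATE EXTERNAL FILE FORMAT csv_format
    WITH (
        FORMAT_TYPE = DELIMITEDTEXT,
        FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"', FIRST_ROW = 2)
    );

    -- Staging table over the raw rider CSV; column names and types are assumptions.
    CREATE EXTERNAL TABLE staging_rider (
        rider_id      INT,
        first_name    VARCHAR(50),
        last_name     VARCHAR(50),
        birthday      DATE,
        account_start DATE,
        account_end   DATE,
        is_member     BIT
    )
    WITH (
        LOCATION    = 'raw/rider.csv',
        DATA_SOURCE = bike_share_ds,
        FILE_FORMAT = csv_format
    );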

Transform Step: Create Fact and Dimension Tables using CETAS

Scripts 10 to 15 in the sql_scripts folder create the fact and dimension tables using CETAS (CREATE EXTERNAL TABLE AS SELECT). These tables are built from the staging tables according to the business requirements listed in the project overview.
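The CETAS pattern, reusing the data source and staging table from the load-step sketch above, looks roughly like this; the location, file format, and SELECT list are assumptions, and the actual definitions are in scripts 10 to 15:

    -- CETAS: materialize a curated dimension into the gold layer as Parquet files.
    CREATE EXTERNAL FILE FORMAT parquet_format
    WITH (FORMAT_TYPE = PARQUET);

    CREATE EXTERNAL TABLE dim_rider
    WITH (
        LOCATION    = 'gold/dim_rider/',
        DATA_SOURCE = bike_share_ds,
        FILE_FORMAT = parquet_format
    )
    AS
    SELECT
        rider_id,
        first_name,
        last_name,
        birthday,
        account_start,
        account_end,
        is_member
    FROM staging_rider;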

Analysis Step

Script 15 is used to address the business requirements outlined in the project overview.
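For instance, the first requirement (ride duration by day of week) could be answered with a query of roughly this shape; the column names below are assumptions rather than the definitions used in script 15:

    -- Average ride duration by day of week; column names are assumptions.
    SELECT
        c.day_of_week,
        AVG(CAST(t.duration_minutes AS FLOAT)) AS avg_duration_minutes
    FROM fact_trip AS t
    JOIN dim_calendar AS c
        ON t.date_key = c.date_key
    GROUP BY c.day_of_week
    ORDER BY c.day_of_week;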
