Divvy is a bike-sharing program in Chicago, Illinois, USA. Riders purchase a pass at a kiosk or use a mobile application to unlock a bike at stations around the city and ride it for a specified amount of time. Bikes can be returned to the same station or to a different one. The City of Chicago makes the anonymized trip data publicly available for projects like this one, where we can analyze it.
- Analyze the duration of each ride:
- Based on date and time factors such as day of the week and time of day
- Based on the starting and/or ending station
- Based on the rider's age at the time of the ride
- Based on whether the rider is a member or casual rider
- Analyze the cost:
- Per month, quarter, and year
- Per member, based on the rider's age at account start
- Analyze the cost per member:
- Based on the number of rides the rider averages per month
- Based on the number of minutes the rider spends on a bike per month
- Python
- PostgreSQL
- Azure Blob Storage (Azure Data Lake Storage Gen2)
- Azure Synapse Analytics
- Azure Synapse Serverless SQL Pool
This section outlines the directory structure of the data within Azure Data Lake Storage Gen2. The data is divided into two layers: raw and gold (curated business-level tables).
```
bike_share_data/
├── raw/
│   ├── payment.csv
│   ├── rider.csv
│   ├── station.csv
│   └── trip.csv
└── gold/
    ├── dim_calendar
    ├── dim_rider
    ├── dim_station
    ├── fact_trip
    └── fact_payment
```
The star schema consists of two fact tables (fact_trip and fact_payment) and three dimension tables (dim_calendar, dim_rider, dim_station); the full design is documented in the Star Schema PDF.
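As an illustration of how the tables relate, the join pattern between fact_trip and the three dimensions might look like the sketch below; all column names are assumptions, and the authoritative design is in the Star Schema PDF.

```sql
-- Illustrative join pattern for the star schema.
-- All column names are assumptions; the authoritative design is in the PDF.
SELECT t.trip_id,
       r.is_member,
       ss.name AS start_station,
       es.name AS end_station,
       c.calendar_date
FROM fact_trip    AS t
JOIN dim_rider    AS r  ON t.rider_id         = r.rider_id
JOIN dim_station  AS ss ON t.start_station_id = ss.station_id
JOIN dim_station  AS es ON t.end_station_id   = es.station_id
JOIN dim_calendar AS c  ON t.date_id          = c.date_id;
```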
To store the bike share data in Azure Blob Storage, first create tables in PostgreSQL using the Python script ProjectDataToPostgres.py. The script loads the raw CSV files into tables with relationships in a PostgreSQL database.
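As a rough illustration, the kind of DDL and load the script issues against PostgreSQL looks something like the following; the table and column names and the CSV path are assumptions, and the authoritative definitions are in the script itself.

```sql
-- Sketch of the kind of DDL and load performed against PostgreSQL.
-- Column names and the CSV path are assumptions; the actual definitions
-- are in ProjectDataToPostgres.py.
CREATE TABLE IF NOT EXISTS rider (
    rider_id           INTEGER PRIMARY KEY,
    first              VARCHAR(50),
    last               VARCHAR(50),
    address            VARCHAR(100),
    birthday           DATE,
    account_start_date DATE,
    account_end_date   DATE,
    is_member          BOOLEAN
);

-- Bulk-load the raw CSV into the table.
COPY rider FROM '/path/to/rider.csv' DELIMITER ',' CSV;
```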
Next, create a linked service to the PostgreSQL server in Synapse Studio. Then use the ingest wizard in the Azure Synapse workspace to copy the data from PostgreSQL to Azure Blob Storage as a one-time pipeline.
Note: Because there is no access to a dedicated SQL pool, the calendar table is also created in PostgreSQL, following a process similar to the creation of the other four tables. You can generate the calendar table separately with the provided Python file ProjectDataToPostgres_calendar.py, or generate all tables together with ProjectDataToPostgres_all.py.
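A minimal sketch of the calendar table such a script could build in PostgreSQL, assuming a surrogate date key and a fixed date range; the provided Python scripts contain the actual logic.

```sql
-- Sketch of a calendar dimension generated in PostgreSQL.
-- The date range and column names are assumptions.
CREATE TABLE IF NOT EXISTS calendar AS
SELECT TO_CHAR(d, 'YYYYMMDD')::int   AS date_id,
       d::date                       AS calendar_date,
       EXTRACT(ISODOW FROM d)::int   AS day_of_week,
       EXTRACT(MONTH FROM d)::int    AS month,
       EXTRACT(QUARTER FROM d)::int  AS quarter,
       EXTRACT(YEAR FROM d)::int     AS year
FROM generate_series('2013-01-01'::date,
                     '2025-12-31'::date,
                     INTERVAL '1 day') AS d;
```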
Scripts 4 to 7 in the sql_scripts folder populate the staging tables with data from the CSV files. Once the data is loaded with the correct column data types, you can proceed to the transform step, which reshapes the data into a format suitable for analysis.
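As an illustration of the kind of statement those scripts run, here is a minimal sketch of a staging external table defined over one of the raw CSV files; the external data source, file format, and column names are assumptions, and the real definitions live in scripts 4 to 7.

```sql
-- Sketch of a staging external table over a raw CSV file in the data lake.
CREATE EXTERNAL TABLE staging_payment (
    payment_id   INT,
    payment_date DATE,
    amount       DECIMAL(10, 2),
    rider_id     INT
)
WITH (
    LOCATION    = 'raw/payment.csv',
    DATA_SOURCE = bike_share_data_source,   -- assumed external data source
    FILE_FORMAT = csv_file_format           -- assumed DELIMITEDTEXT file format
);
```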
Scripts 10 to 15 in the sql_scripts folder create the fact and dimension tables using CETAS (CREATE EXTERNAL TABLE AS SELECT). These tables are built from the staging tables to meet the business requirements listed in the project overview.
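The CETAS pattern writes a query result to the gold folder in the data lake and registers it as an external table. A minimal sketch, assuming the external data source, file format, and column names used in the earlier examples:

```sql
-- Sketch of a CETAS statement that materializes a dimension table in the
-- gold layer from a staging table. Object and column names are assumptions.
CREATE EXTERNAL TABLE dim_rider
WITH (
    LOCATION    = 'gold/dim_rider/',
    DATA_SOURCE = bike_share_data_source,
    FILE_FORMAT = parquet_file_format        -- assumed Parquet file format
)
AS
SELECT rider_id,
       first,
       last,
       birthday,
       account_start_date,
       account_end_date,
       is_member
FROM staging_rider;
```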
Script 15 is used to address the business requirements outlined in the project overview.
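For example, the average ride duration by day of the week could be computed along these lines; the column names are assumptions consistent with the sketches above, and the actual queries are in script 15.

```sql
-- Sketch: average ride duration in minutes by day of the week,
-- joining the trip fact to the calendar dimension.
SELECT c.day_of_week,
       AVG(DATEDIFF(MINUTE, t.started_at, t.ended_at) * 1.0) AS avg_duration_minutes
FROM fact_trip    AS t
JOIN dim_calendar AS c ON t.date_id = c.date_id
GROUP BY c.day_of_week
ORDER BY c.day_of_week;
```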