STEDI - Data Lakehouse

Introduction

In this project, a data lake solution will be developed using AWS Glue, AWS S3, Python, and Spark for sensor data that trains machine learning algorithms.

AWS infrastructure will be used to create storage zones (landing, trusted and curated), data catalog, data transformations between zones and queries in semi-structured data.

Datasets

Customer Records - from fulfillment and the STEDI website
Step Trainer Records - data from the motion sensor
Accelerometer Records - from the mobile app

Required Steps

Data Acquisition: AWS S3 directories were created to simulate data coming from various sources. These directories served as landing zones for customer, step trainer, and accelerometer data.
Data Sanitization: AWS Glue Studio jobs were written to sanitize customer and accelerometer data from their respective landing zones. The sanitized data was then stored in a trusted zone.
Data Verification: AWS Athena was used to query and verify the data in the Glue customer_trusted table.
Data Curation: Additional Glue jobs were written to further sanitize the customer data and create a curated zone that only included customers who have accelerometer data and agreed to share their data for research.
Data Streaming: Glue Studio jobs were created to read the Step Trainer IoT data stream and populate a trusted zone Glue table.
Data Aggregation: Lastly, an aggregated table was created that matched Step Trainer readings and the associated accelerometer reading data for the same timestamp.

Implementation

Landing Zone

In the Landing Zone were stored the customer, accelerometer and step trainer raw data.

1. customer_landing.sql

2. accelerometer_landing.sql

Trusted Zone

In the Trusted Zone were stored the tables that contain the records from customers who agreed to share their data for research purposes.

1. customer_landing_to_trusted.py - script used to build the **customer_trusted table, which contains customer records from customers who agreed to share their data for research purposes.

2. accelerometer_landing_to_trusted.py - script used to build the accelerometer_trusted table, which contains accelerometer records from customers who agreed to share their data for research purposes.

The customer_trusted table was queried in Athena to show that it only contains customer records from people who agreed to share their data.

Curated Zone

In the Curated Zone were stored the tables that contain the correct serial numbers.

Glue job scripts

customer_trusted_to_curated.py - script used to build the customer_curated table, which contains customers who have accelerometer data and have agreed to share their data for research.

step_trainer_trusted_to_curated.py: script used to build the machine_learning_curated table, which contains each of the step trainer readings, and the associated accelerometer reading data for the same timestamp, but only for customers who have agreed to share their data.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
DDL		DDL
assets		assets
src		src
.gitattributes		.gitattributes
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

STEDI - Data Lakehouse

Introduction

Implementation

About

Releases

Packages

Languages

ndleah/STEDI

Folders and files

Latest commit

History

Repository files navigation

STEDI - Data Lakehouse

Introduction

Implementation

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages