Music Data ETL

This repository houses an ETL pipeline that processes data from a music application. The pipeline retrieves data from logs and files, transforms it, and loads it into a star schema in a PostgreSQL database.

The main objective of this database is to give the music app a reliable and consistent source of data for answering business questions about the music preferences and habits of its users. By drawing on the data stored in logs and files, the database makes it possible to see which songs and artists are popular among those users.

Data Retrieval

Data is read from JSON files with PySpark.
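
As a minimal sketch of this step, the snippet below reads a directory of JSON files into a single DataFrame. The function name `read_json_dir` and the use of the `recursiveFileLookup` option are assumptions made for illustration; the repository's actual reader may be organized differently.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported for type hints only
    from pyspark.sql import DataFrame, SparkSession


def read_json_dir(spark: SparkSession, path: str) -> DataFrame:
    """Read every JSON file found under `path` into one DataFrame."""
    return (
        spark.read
        # Descend into nested folders (available in Spark 3.0 and later).
        .option("recursiveFileLookup", "true")
        # By default Spark expects one JSON object per line.
        .json(path)
    )
```

With a session created through `SparkSession.builder.getOrCreate()`, a call such as `read_json_dir(spark, "data/log_data")` would return the log records; the path shown here is hypothetical.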

Data Transformation

The data is then cleaned, formatted, and normalized with PySpark.
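
The following sketch shows what a generic cleaning pass could look like: column names are normalized to snake_case, exact duplicate rows are dropped, and rows with no values at all are removed. The helper names `snake_case` and `clean` are hypothetical, and the specific rules are assumptions; the pipeline's real transformations are not described here.

```python
from __future__ import annotations

import re
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported for type hints only
    from pyspark.sql import DataFrame


def snake_case(name: str) -> str:
    """Convert a camelCase column name such as 'firstName' to 'first_name'."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


def clean(df: DataFrame) -> DataFrame:
    """Apply a few generic cleaning rules to a raw DataFrame."""
    # Rename every column in one pass so later steps use consistent names.
    renamed = df.toDF(*[snake_case(c) for c in df.columns])
    # Remove exact duplicate rows, then rows in which every column is null.
    return renamed.dropDuplicates().na.drop(how="all")
```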

Database Design

Finally, the transformed data is loaded into a PostgreSQL database that uses a star schema to improve query performance.
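
One possible way to perform this load is through Spark's JDBC writer, sketched below. This is an assumption rather than the repository's confirmed method: the helpers `jdbc_url`, `build_dimension`, and `write_table` are hypothetical, as are any table and column names passed to them, and the PostgreSQL JDBC driver must be available to Spark for the write to succeed.

```python
from __future__ import annotations

from typing import TYPE_CHECKING, Sequence

if TYPE_CHECKING:  # imported for type hints only
    from pyspark.sql import DataFrame


def jdbc_url(host: str, port: int, db_name: str) -> str:
    """Build a PostgreSQL JDBC connection URL."""
    return f"jdbc:postgresql://{host}:{port}/{db_name}"


def build_dimension(df: DataFrame, key: str, columns: Sequence[str]) -> DataFrame:
    """Project a dimension's columns and keep one row per key value."""
    return df.select(key, *columns).dropDuplicates([key])


def write_table(df: DataFrame, table: str, url: str, user: str, password: str,
                mode: str = "append") -> None:
    """Write a DataFrame to a PostgreSQL table over JDBC."""
    df.write.jdbc(
        url=url,
        table=table,
        mode=mode,
        properties={
            "user": user,
            "password": password,
            "driver": "org.postgresql.Driver",
        },
    )
```

In a star schema, each dimension table would be built with something like `build_dimension` and written first, followed by the fact table that references the dimension keys.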

Execution

To run the ETL pipeline, PySpark, PostgreSQL, and their dependencies must be installed.

Create a .env file with the PostgreSQL credentials:

# postgres
DB_NAME=database
DB_USER=user
DB_PASS=password
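
For illustration, a file in this format can be parsed with a few lines of standard-library Python. The function `parse_env` is a hypothetical helper; the pipeline itself may read these values another way, for example through a dedicated library or the Makefile.

```python
from __future__ import annotations


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '=' at all
            values[key.strip()] = value.strip()
    return values
```

Reading the file with `parse_env(open(".env").read())` would then yield a dictionary holding `DB_NAME`, `DB_USER`, and `DB_PASS`.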

Then run this command:

make run

ghiles10/ETL_STAR_SCHEMA_MUSIC_DATA
