Skip to content

Data centralisation project for a multinational retailer using Python, AWS and PostgreSQL.

Notifications You must be signed in to change notification settings

SaabriinMo/multinational-retail-data-centralisation730

Repository files navigation

Multinational Retail Data Centralisation

Table of Contents

  1. Project Description
  2. Data Cleaning
  3. Entity Relationship Diagram (ERD)
  4. Data Analysis
  5. Installation Instructions
  6. Usage instructions
  7. File Structure of the Project
  8. License Information

Project Description

Currently, the sales data is spread across many different data sources making it not easily accessible or analysable by current members of the team.

This project aims to collect sales data from different sources, clean it and upload the new data into a postgresql database in order to perform some analysis. These data sources include using AWS s3 buckets, AWS RDS, and APIs and etc.

The main motivation behind this project is to understand how data different sources are combined into one source.

This project's main functionalities are in the following files:

  • data_cleaning.py: containing the DataCleaning class to clean the data
  • data_extraction.py: containing the DataExtractor class to extract data from multiple sources
  • database_utils.py: containing the DatabaseConnector to connect with postgres/AWS RDS

The file that simulates the Extract, Clean and Upload the data to the local database is in:

  • main.py: the main script containing the Extract, Clean and Upload logic to the database (i.e. PostgreSQL).

Main technologies used

Python Library:

  1. Data Cleaning: Pandas, Numpy, and re
  2. Data extraction: tabula, requests, boto3(AWS), botocore(AWS)
  3. Database utils: sqlalchemy, yaml

aws postgres python

Data Cleaning

As mentioned above, the following table comes from different data sources:

  1. User table and Order table: uses AWS RDS
  2. cards table: PDF document in an AWS S3 bucket.
  3. store table: using API
  4. products table: CSV format in an S3 bucket on AWS.
  5. dates event table: JSON format stored on S3 bucket on AWS

These tables consisted of many erroneous values, NULL values or errors with formatting. So the DataCleaning class was designed to clean these tables for it to be uploaded to my local Postgresql.

Entity Relationship Diagram (ERD)

The ERD using the star schema using the orders table as the source of truth.

The other 4 table are linked to the orders table through Primary/Foreign key ERD

Data Analysis

Queries were written to analyse and answer key questions about the business.

You'll find the queries under analysis.sql

Installation instructions

Git is used to manage and track the process of the project. If git is not installed check here.

To clone this project:

git clone git@github.com:SaabriinMo/multinational-retail-data-centralisation730.git

If SSH is set up. HIGHLY recommend!

This project uses the conda environment

For a full list of the project's dependencies, check the enviroment.yml file in the project's root directory.

For conda users

To create and activate the new enviroment:

conda env create -f environment.yml

 conda activate mrdc_env

For pip users:

run the following commands to create the mrdc environment

./env_setup_for_pip_users.sh

Usage instructions

To run the custom ETL process, navigate to main.py file in your system and run this:

python main.py

or

python3 main.py

File structure of the project

.
├── README.md
├── creds
│   ├── api_creds.yaml
│   └── db_creds.yaml
├── data_cleaning.py
├── data_extraction.py
├── database_utils.py
├── environment.yml
├── img
│   └── sales_database.png
├── main.py
├── queries
│   ├── alterning_card_details_table.sql
│   ├── alterning_date_time_table.sql
│   ├── alterning_orders_table.sql
│   ├── alterning_products_table.sql
│   ├── alterning_store_table.sql
│   ├── alterning_users_table.sql
│   └── analysis.sql
└──

3 directories, 16 files

License Information

..

About

Data centralisation project for a multinational retailer using Python, AWS and PostgreSQL.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Languages