- Project Description
- Data Cleaning
- Entity Relationship Diagram (ERD)
- Data Analysis
- Installation Instructions
- Usage Instructions
- File Structure of the Project
- License Information
Currently, the sales data is spread across many different data sources, making it difficult for members of the team to access and analyse.
This project aims to collect sales data from those different sources, clean it, and upload the cleaned data into a PostgreSQL database for analysis. The data sources include AWS S3 buckets, AWS RDS databases, and APIs.
The main motivation behind this project is to understand how data from different sources can be combined into a single source.
This project's main functionalities are in the following files:
- data_cleaning.py: contains the DataCleaning class, which cleans the data
- data_extraction.py: contains the DataExtractor class, which extracts data from multiple sources
- database_utils.py: contains the DatabaseConnector class, which connects to PostgreSQL/AWS RDS
The Extract, Clean and Upload steps against the local database are orchestrated in:
- main.py: the main script containing the Extract, Clean and Upload logic for the database (i.e. PostgreSQL).
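The end-to-end flow can be sketched as follows. This is an illustrative, self-contained example rather than the project's actual code: it uses an in-memory SQLite database in place of PostgreSQL, and inlines a small DataFrame where the real pipeline would call the DataExtractor class.

```python
import pandas as pd
from sqlalchemy import create_engine

# Stand-in for DataExtractor: the real pipeline pulls this from RDS/S3/APIs.
raw = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "country": ["UK", "US", "US", None],
})

# Stand-in for DataCleaning: drop duplicate rows and NULL values.
cleaned = raw.drop_duplicates().dropna()

# Stand-in for DatabaseConnector: upload via SQLAlchemy
# (in-memory SQLite here; the project targets a local PostgreSQL).
engine = create_engine("sqlite://")
cleaned.to_sql("dim_users", engine, index=False)

n = pd.read_sql("SELECT COUNT(*) AS n FROM dim_users", engine)["n"].iloc[0]
print(n)  # 2
```

The Extract, Clean and Upload stages map onto the three classes described above, with main.py simply wiring them together in sequence.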
Python libraries used:
- Data cleaning: pandas, NumPy, and re
- Data extraction: tabula, requests, boto3 (AWS), botocore (AWS)
- Database utils: sqlalchemy, yaml
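As an example of how database_utils.py might use sqlalchemy and yaml together, the sketch below parses YAML credentials and builds a SQLAlchemy connection URL. The credential keys shown are assumptions for illustration, not the project's actual creds/db_creds.yaml schema.

```python
import yaml
from sqlalchemy.engine import URL

# Hypothetical shape of creds/db_creds.yaml (keys are assumptions).
creds_yaml = """
RDS_HOST: example.rds.amazonaws.com
RDS_USER: admin
RDS_PASSWORD: secret
RDS_DATABASE: sales_data
RDS_PORT: 5432
"""
creds = yaml.safe_load(creds_yaml)

# Build a PostgreSQL connection URL without string concatenation.
url = URL.create(
    "postgresql+psycopg2",
    username=creds["RDS_USER"],
    password=creds["RDS_PASSWORD"],
    host=creds["RDS_HOST"],
    port=creds["RDS_PORT"],
    database=creds["RDS_DATABASE"],
)
print(url.host)  # example.rds.amazonaws.com
```

Keeping credentials in a YAML file outside version control and building the URL programmatically avoids hard-coding secrets in the scripts.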
As mentioned above, the tables come from different data sources:
- Users table and Orders table: AWS RDS
- Cards table: PDF document in an AWS S3 bucket
- Store table: API
- Products table: CSV file in an AWS S3 bucket
- Date events table: JSON file in an AWS S3 bucket
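To illustrate the S3-backed formats, the sketch below parses CSV and JSON payloads with pandas. In the real project the bytes would come from boto3 (for example, an S3 get_object call); here small sample payloads are inlined so the example is self-contained, and the column names are illustrative.

```python
import io
import pandas as pd

# In the real project these bytes come from an S3 object body via boto3;
# here we inline sample payloads so the example runs on its own.
csv_bytes = b"product_code,price\nA1,9.99\nB2,4.50\n"
json_bytes = b'[{"date_uuid": "x", "year": 2022}]'

products = pd.read_csv(io.BytesIO(csv_bytes))   # products table (CSV on S3)
dates = pd.read_json(io.BytesIO(json_bytes))    # date events table (JSON on S3)

print(len(products), len(dates))  # 2 1
```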
These tables contained many erroneous values, NULL values and formatting errors, so the DataCleaning class was designed to clean each table before it is uploaded to the local PostgreSQL database.
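The kind of cleaning involved can be sketched with pandas and re as below. The column names and validation rules here are illustrative assumptions, not the project's actual cleaning logic.

```python
import numpy as np
import pandas as pd

# Illustrative sample with the three problem types: a formatting issue,
# a literal "NULL" string, and an invalid value.
df = pd.DataFrame({
    "card_number": ["1234 5678", "NULL", "9876-5432"],
    "expiry_date": ["12/26", "13/99", "01/25"],
})

df = df.replace("NULL", np.nan)                                   # literal NULLs
df["card_number"] = df["card_number"].str.replace(r"\D", "", regex=True)
valid = df["expiry_date"].str.match(r"^(0[1-9]|1[0-2])/\d{2}$")   # MM/YY check
df = df[valid].dropna()

print(df["card_number"].tolist())  # ['12345678', '98765432']
```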
The ERD follows a star schema, with the orders table as the source of truth.
The other dimension tables are linked to the orders table through primary/foreign keys.
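The sketch below models the star schema's key relationships using SQLite's DDL (the real constraints are applied to PostgreSQL by the scripts under queries/; the table and column names here are illustrative, and only two dimension tables are shown for brevity).

```python
import sqlite3

# Star-schema sketch: the orders fact table references dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_users (user_uuid TEXT PRIMARY KEY);
CREATE TABLE dim_store_details (store_code TEXT PRIMARY KEY);
CREATE TABLE orders_table (
    order_id   INTEGER PRIMARY KEY,
    user_uuid  TEXT REFERENCES dim_users(user_uuid),
    store_code TEXT REFERENCES dim_store_details(store_code)
);
""")

# One row per foreign-key column on the fact table.
fks = conn.execute("PRAGMA foreign_key_list(orders_table)").fetchall()
print(len(fks))  # 2
```

Each dimension table's primary key appears as a foreign-key column on the orders table, which is what makes the orders table the centre of the star.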
SQL queries were written to analyse the data and answer key questions about the business.
You'll find the queries in queries/analysis.sql.
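As a flavour of the analysis, here is a hypothetical query in the spirit of analysis.sql, answering "which country has the most stores?" against a small in-memory sample (the table and column names are assumptions, not the project's actual schema).

```python
import sqlite3

# Tiny in-memory sample standing in for the stores dimension table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_store_details (store_code TEXT, country_code TEXT)")
conn.executemany(
    "INSERT INTO dim_store_details VALUES (?, ?)",
    [("S1", "GB"), ("S2", "GB"), ("S3", "US")],
)

# Which country has the most stores?
row = conn.execute("""
    SELECT country_code, COUNT(*) AS total_stores
    FROM dim_store_details
    GROUP BY country_code
    ORDER BY total_stores DESC
    LIMIT 1
""").fetchone()
print(row)  # ('GB', 2)
```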
Git is used to manage and track the progress of the project. If Git is not installed, check here.
To clone this project:
git clone git@github.com:SaabriinMo/multinational-retail-data-centralisation730.git
Note: this URL requires SSH to be set up (highly recommended!).
This project uses a conda environment.
For a full list of the project's dependencies, check the environment.yml file in the project's root directory.
To create and activate the new environment:
conda env create -f environment.yml
conda activate mrdc_env
Alternatively, pip users can run the following script to create the mrdc environment:
./env_setup_for_pip_users.sh
To run the custom ETL process, navigate to the directory containing main.py and run:
python main.py
or
python3 main.py
.
├── README.md
├── creds
│   ├── api_creds.yaml
│   └── db_creds.yaml
├── data_cleaning.py
├── data_extraction.py
├── database_utils.py
├── environment.yml
├── img
│   └── sales_database.png
├── main.py
└── queries
    ├── alterning_card_details_table.sql
    ├── alterning_date_time_table.sql
    ├── alterning_orders_table.sql
    ├── alterning_products_table.sql
    ├── alterning_store_table.sql
    ├── alterning_users_table.sql
    └── analysis.sql

3 directories, 16 files