- Project Description
- Data Cleaning
- Entity Relationship Diagram (ERD)
- Data Analysis
- Installation Instructions
- Usage Instructions
- File Structure of the Project
- License Information
Currently, the sales data is spread across many different data sources, making it difficult for members of the team to access and analyse.
This project aims to collect sales data from those different sources, clean it, and upload the cleaned data into a PostgreSQL database for analysis. The data sources include AWS S3 buckets, AWS RDS databases, and APIs.
The main motivation behind this project is to understand how data from different sources can be combined into a single source.
This project's main functionalities are in the following files:
- data_cleaning.py: contains the DataCleaning class, which cleans the data
- data_extraction.py: contains the DataExtractor class, which extracts data from multiple sources
- database_utils.py: contains the DatabaseConnector class, which connects to PostgreSQL/AWS RDS
The Extract, Clean and Upload steps against the local database are orchestrated in:
- main.py: the main script containing the Extract, Clean and Upload logic for the database (i.e. PostgreSQL).
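The end-to-end flow can be sketched as follows. This is an illustrative, self-contained example rather than the project's actual code: it uses an in-memory SQLite database in place of PostgreSQL, and inlines a small DataFrame where the real pipeline would call the DataExtractor class.

```python
import pandas as pd
from sqlalchemy import create_engine

# Stand-in for DataExtractor: the real pipeline pulls this from RDS/S3/APIs.
raw = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "country": ["UK", "US", "US", None],
})

# Stand-in for DataCleaning: drop duplicate rows and NULL values.
cleaned = raw.drop_duplicates().dropna()

# Stand-in for DatabaseConnector: upload via SQLAlchemy
# (in-memory SQLite here; the project targets a local PostgreSQL).
engine = create_engine("sqlite://")
cleaned.to_sql("dim_users", engine, index=False)

n = pd.read_sql("SELECT COUNT(*) AS n FROM dim_users", engine)["n"].iloc[0]
print(n)  # 2
```

The Extract, Clean and Upload stages map onto the three classes described above, with main.py simply wiring them together in sequence.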
Python libraries used:
- Data cleaning: pandas, NumPy, and re
- Data extraction: tabula, requests, boto3 (AWS), botocore (AWS)
- Database utils: sqlalchemy, yaml
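As an example of how database_utils.py might use sqlalchemy and yaml together, the sketch below parses YAML credentials and builds a SQLAlchemy connection URL. The credential keys shown are assumptions for illustration, not the project's actual creds/db_creds.yaml schema.

```python
import yaml
from sqlalchemy.engine import URL

# Hypothetical shape of creds/db_creds.yaml (keys are assumptions).
creds_yaml = """
RDS_HOST: example.rds.amazonaws.com
RDS_USER: admin
RDS_PASSWORD: secret
RDS_DATABASE: sales_data
RDS_PORT: 5432
"""
creds = yaml.safe_load(creds_yaml)

# Build a PostgreSQL connection URL without string concatenation.
url = URL.create(
    "postgresql+psycopg2",
    username=creds["RDS_USER"],
    password=creds["RDS_PASSWORD"],
    host=creds["RDS_HOST"],
    port=creds["RDS_PORT"],
    database=creds["RDS_DATABASE"],
)
print(url.host)  # example.rds.amazonaws.com
```

Keeping credentials in a YAML file outside version control and building the URL programmatically avoids hard-coding secrets in the scripts.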
As mentioned above, the tables come from different data sources:
- Users table and Orders table: AWS RDS
- Cards table: PDF document in an AWS S3 bucket
- Store table: API
- Products table: CSV file in an AWS S3 bucket
- Date events table: JSON file in an AWS S3 bucket
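To illustrate the S3-backed formats, the sketch below parses CSV and JSON payloads with pandas. In the real project the bytes would come from boto3 (for example, an S3 get_object call); here small sample payloads are inlined so the example is self-contained, and the column names are illustrative.

```python
import io
import pandas as pd

# In the real project these bytes come from an S3 object body via boto3;
# here we inline sample payloads so the example runs on its own.
csv_bytes = b"product_code,price\nA1,9.99\nB2,4.50\n"
json_bytes = b'[{"date_uuid": "x", "year": 2022}]'

products = pd.read_csv(io.BytesIO(csv_bytes))   # products table (CSV on S3)
dates = pd.read_json(io.BytesIO(json_bytes))    # date events table (JSON on S3)

print(len(products), len(dates))  # 2 1
```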
These tables contained many erroneous values, NULL values and formatting errors, so the DataCleaning class was designed to clean each table before it is uploaded to the local PostgreSQL database.
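The kind of cleaning involved can be sketched with pandas and re as below. The column names and validation rules here are illustrative assumptions, not the project's actual cleaning logic.

```python
import numpy as np
import pandas as pd

# Illustrative sample with the three problem types: a formatting issue,
# a literal "NULL" string, and an invalid value.
df = pd.DataFrame({
    "card_number": ["1234 5678", "NULL", "9876-5432"],
    "expiry_date": ["12/26", "13/99", "01/25"],
})

df = df.replace("NULL", np.nan)                                   # literal NULLs
df["card_number"] = df["card_number"].str.replace(r"\D", "", regex=True)
valid = df["expiry_date"].str.match(r"^(0[1-9]|1[0-2])/\d{2}$")   # MM/YY check
df = df[valid].dropna()

print(df["card_number"].tolist())  # ['12345678', '98765432']
```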
The ERD follows a star schema, with the orders table as the source of truth.
The other dimension tables are linked to the orders table through primary/foreign keys.
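The sketch below models the star schema's key relationships using SQLite's DDL (the real constraints are applied to PostgreSQL by the scripts under queries/; the table and column names here are illustrative, and only two dimension tables are shown for brevity).

```python
import sqlite3

# Star-schema sketch: the orders fact table references dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_users (user_uuid TEXT PRIMARY KEY);
CREATE TABLE dim_store_details (store_code TEXT PRIMARY KEY);
CREATE TABLE orders_table (
    order_id   INTEGER PRIMARY KEY,
    user_uuid  TEXT REFERENCES dim_users(user_uuid),
    store_code TEXT REFERENCES dim_store_details(store_code)
);
""")

# One row per foreign-key column on the fact table.
fks = conn.execute("PRAGMA foreign_key_list(orders_table)").fetchall()
print(len(fks))  # 2
```

Each dimension table's primary key appears as a foreign-key column on the orders table, which is what makes the orders table the centre of the star.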
SQL queries were written to analyse the data and answer key questions about the business.
You'll find the queries in queries/analysis.sql.
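As a flavour of the analysis, here is a hypothetical query in the spirit of analysis.sql, answering "which country has the most stores?" against a small in-memory sample (the table and column names are assumptions, not the project's actual schema).

```python
import sqlite3

# Tiny in-memory sample standing in for the stores dimension table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_store_details (store_code TEXT, country_code TEXT)")
conn.executemany(
    "INSERT INTO dim_store_details VALUES (?, ?)",
    [("S1", "GB"), ("S2", "GB"), ("S3", "US")],
)

# Which country has the most stores?
row = conn.execute("""
    SELECT country_code, COUNT(*) AS total_stores
    FROM dim_store_details
    GROUP BY country_code
    ORDER BY total_stores DESC
    LIMIT 1
""").fetchone()
print(row)  # ('GB', 2)
```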
Git is used to manage and track the progress of the project. If Git is not installed, check here.
To clone this project:
git clone git@github.com:SaabriinMo/multinational-retail-data-centralisation730.git
Note: this URL requires SSH to be set up (highly recommended!).
This project uses a conda environment.
For a full list of the project's dependencies, check the environment.yml file in the project's root directory.
To create and activate the new environment:
conda env create -f environment.yml
conda activate mrdc_env
Alternatively, pip users can run the following script to create the mrdc environment:
./env_setup_for_pip_users.sh
To run the custom ETL process, navigate to the directory containing main.py and run:
python main.py
or
python3 main.py
.
├── README.md
├── creds
│   ├── api_creds.yaml
│   └── db_creds.yaml
├── data_cleaning.py
├── data_extraction.py
├── database_utils.py
├── environment.yml
├── img
│   └── sales_database.png
├── main.py
└── queries
    ├── alterning_card_details_table.sql
    ├── alterning_date_time_table.sql
    ├── alterning_orders_table.sql
    ├── alterning_products_table.sql
    ├── alterning_store_table.sql
    ├── alterning_users_table.sql
    └── analysis.sql

3 directories, 16 files