
🛒 Case Study - E-commerce Company

A. Data Exploration and Cleansing

Using Python - Google Colab vs Power BI


👩🏼‍💻 PYTHON - GOOGLE COLAB

🔤 IMPORT LIBRARY AND DATASET

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import dates
import datetime
print('Completed import lib')
from google.colab import drive
drive.mount('/content/drive')
#Load the datasets from Google Drive
customers = pd.read_csv('/content/drive/MyDrive/Final/De 1/dataset/customers_dataset.csv')
order_items = pd.read_csv('/content/drive/MyDrive/Final/De 1/dataset/order_items_dataset.csv')
order_payments = pd.read_csv('/content/drive/MyDrive/Final/De 1/dataset/order_payments_dataset.csv')
order_reviews = pd.read_csv('/content/drive/MyDrive/Final/De 1/dataset/order_reviews_dataset.csv')
orders = pd.read_csv('/content/drive/MyDrive/Final/De 1/dataset/orders_dataset.csv')
product_name_translation = pd.read_csv('/content/drive/MyDrive/Final/De 1/dataset/product_category_name_translation.csv')
products = pd.read_csv('/content/drive/MyDrive/Final/De 1/dataset/products_dataset.csv')
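After loading, a quick sanity check (a minimal sketch using the frames defined above) is to print the shape of every table:

#Print row/column counts for each loaded frame
for name, df in [('customers', customers), ('order_items', order_items), ('order_payments', order_payments), ('order_reviews', order_reviews), ('orders', orders), ('product_name_translation', product_name_translation), ('products', products)]:
    print(name, df.shape)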

🔎 EXPLORE, CLEAN & TRANSFORM DATA

1️⃣ Customers Dataset

  • There are 4 things I would check in the customers dataset:
    • The overall info
    • Whether the primary key (customer_id) has duplicate values
    • The unique values of the city and state columns
    • Capitalizing the first letter of each city name
The Overall Information
customers.head()

image

customers.info()

image

Checking duplicated values
#Check for duplicates in the primary key column (customer_id).
#The number of unique customer_id values equals the total number of entries (99441),
#so we can conclude there are no duplicated values.

customers.nunique()

image
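A more direct way to confirm this (a minimal sketch on the same customers frame) is to count duplicated keys explicitly:

#Count duplicated customer_id values directly; 0 confirms there are no duplicates
print(customers['customer_id'].duplicated().sum())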

Checking unique values of State
#Check the state abbreviations for typos or inconsistent values
customers['customer_state'].unique()

image

Capitalize the first letter of city name
#Capitalize the first letter of city name
customers['customer_city'] = customers['customer_city'].str.title()
customers.head()

image


2️⃣ Orders Dataset

  • There are 3 things I would check in the orders dataset:
    • The overall info
    • Transforming some columns from object to datetime
    • Checking null values
The Overall
orders.head() 

image

orders.info()

image

Transform Data Type
#Transforming the data type from object to datetime 
orders['order_purchase_timestamp'] = pd.to_datetime(orders['order_purchase_timestamp'], format = '%Y-%m-%d %H:%M:%S')
orders['order_approved_at'] = pd.to_datetime(orders['order_approved_at'], format = '%Y-%m-%d %H:%M:%S')
orders['order_delivered_carrier_date'] = pd.to_datetime(orders['order_delivered_carrier_date'], format = '%Y-%m-%d %H:%M:%S')
orders['order_delivered_customer_date'] = pd.to_datetime(orders['order_delivered_customer_date'], format = '%Y-%m-%d %H:%M:%S')
orders['order_estimated_delivery_date'] = pd.to_datetime(orders['order_estimated_delivery_date'], format = '%Y-%m-%d %H:%M:%S')

orders.info()

image
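With the timestamps converted, a delivery-duration column can be derived. The Total_time_to_delivery field used later in the Power BI measures is not created in the code shown here, so the following is only an assumed sketch of how it could be computed (in days):

#Assumed derivation: delivery time in days from purchase to customer delivery
orders['Total_time_to_delivery'] = (orders['order_delivered_customer_date'] - orders['order_purchase_timestamp']).dt.days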

Check Null Values
#Check null values
orders.isnull().sum()
#Check the percentage of null values.
#The nulls account for only a small share of the dataset (about 3% at most), so we could drop or ignore them.
#However, a null in these columns also means the order was never delivered to the customer or carrier, so we cannot drop them.
orders.isnull().mean() * 100

image
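To confirm that these nulls really correspond to undelivered orders, a minimal sketch is to look at the order_status of the rows with a missing customer delivery date:

#Statuses of orders with no customer delivery date
orders[orders['order_delivered_customer_date'].isnull()]['order_status'].value_counts()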


3️⃣ Order Items Dataset

  • The order items dataset is clean, so we don't need to adjust it (a quick verification sketch follows the overview below).
The Overall
order_items.head() 

image

order_items.describe() 

image

order_items.info() 

image
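To back up the claim that the dataset is clean, a minimal check for nulls and fully duplicated rows:

#Verify cleanliness: null counts per column and the duplicate count should all be zero
print(order_items.isnull().sum())
print(order_items.duplicated().sum())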


4️⃣ Order Payments Dataset

  • The order payments dataset is clean, so we don't need to adjust it.
The Overall
order_payments.head() 

image

order_payments.info() 

image

order_payments['payment_type'].unique() 

image


5️⃣ Order Reviews Dataset

  • There are 2 things that we are doing with this dataset:
    • Checking the overall info
    • Transforming data types from object to datetime
The Overall
order_reviews.head() 

image

order_reviews.info() 

image

order_reviews['review_score'].value_counts()

image

Transform data type
#Convert the review date columns to datetime
order_reviews['review_creation_date'] = pd.to_datetime(order_reviews['review_creation_date'])
order_reviews['review_answer_timestamp'] = pd.to_datetime(order_reviews['review_answer_timestamp'])

#Reformat as mm/dd/yyyy (note: dt.strftime returns strings, so these columns become object dtype again)
order_reviews['review_creation_date'] = order_reviews.review_creation_date.dt.strftime('%m/%d/%Y')
order_reviews['review_answer_timestamp'] = order_reviews.review_answer_timestamp.dt.strftime('%m/%d/%Y')
order_reviews.head(5)

image


6️⃣ Products Dataset

  • There are 3 things that we are doing with this dataset:
    • Checking the overall info
    • Checking null values
    • Replacing product weights of 0 g with the median
The Overall
products.head() 

image

products.info() 

image

products.describe()

image

#The min of product_weight_g is 0, so we check this column to make sure there are no anomalies
products[products['product_weight_g'] == 0]

image

Check Null Values
#Check null values
products.isnull().sum()

image

#Check null values in the category name column
products[products['product_category_name'].isnull()]

image

#Check null values in the weight column
products[products['product_weight_g'].isnull()]

image

#Drop all 610 rows containing null values; they are not significant (610 rows out of 32951 total entries)
products = products.dropna()
products.isnull().sum()

image

Check product weight column
#Check the product_weight_g distribution
#(sns.distplot is deprecated in recent seaborn versions; histplot is the modern equivalent)
sns.histplot(products['product_weight_g'], kde=True)

image

#Replace "0" values of weight to "median"
products['product_weight_g']= products['product_weight_g'].replace(0, products['product_weight_g'].median())  
products.describe()

image
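Note that the median above is computed over a column that still contains the zero weights. A slightly more careful variant (a sketch, not the approach used above) would exclude the zeros before taking the median:

#Alternative sketch: compute the median over non-zero weights only, then replace
median_w = products.loc[products['product_weight_g'] > 0, 'product_weight_g'].median()
products['product_weight_g'] = products['product_weight_g'].replace(0, median_w)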


7️⃣ Product Name Translation Dataset

  • There are 3 things that we are doing with this dataset:
    • Checking the overall info
    • Merging the product names of the two tables
    • Checking null values in the merged table and replacing them with a new category
The Overall
product_name_translation.head()

image

product_name_translation.info()

image

Merge product names of the two tables
#Compare the number of product categories in the two tables
print(product_name_translation['product_category_name'].nunique())
print(products['product_category_name'].nunique())

image
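Since the counts differ, some categories in products have no translation. A minimal sketch to list exactly which ones:

#Categories present in products but missing from the translation table
missing = set(products['product_category_name']) - set(product_name_translation['product_category_name'])
print(missing)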

product_summarize = products.merge(product_name_translation, how='left', on='product_category_name')
Check null values of the merged table and replace them
#Check null values
product_summarize.isnull().sum()

image

product_summarize[product_summarize['product_category_name_english'].isnull()]

image

#Replace null values with 'Unspecified'
product_summarize['product_category_name_english'] = product_summarize['product_category_name_english'].fillna(value='Unspecified')
product_summarize.isnull().sum()

image


✔ Save File

#File customers
customers.to_csv('/content/drive/MyDrive/Final/De 1/customers_dataset.csv',index=False)

#File orders dataset
orders.to_csv('/content/drive/MyDrive/Final/De 1/orders_dataset.csv',index=False)

#File orders items
order_items.to_csv('/content/drive/MyDrive/Final/De 1/order_items_dataset.csv',index=False)

#File order payments
order_payments.to_csv('/content/drive/MyDrive/Final/De 1/order_payments_dataset.csv',index=False)

#File order review
order_reviews.to_csv('/content/drive/MyDrive/Final/De 1/order_reviews_dataset.csv',index=False)

#Merged file of products & product_name_translation
product_summarize.to_csv('/content/drive/MyDrive/Final/De 1/product_summarize_dataset.csv',index=False)

📊 POWER BI

1. Transform Data

After importing the datasets, we need to promote the column headers and change the data types of some columns.

Customers dataset
  • Source (first 10 rows)

image

  • Transformed

image

Order Items dataset
  • Source (first 10 rows)

image

  • Transformed

image

Order Payments dataset
  • Source (First 10 rows) image

  • Transformed image

Order Reviews dataset
  • Source (First 10 rows) image

  • Transformed image

Orders dataset
  • Source (First 10 rows) image

  • Transformed image

Product Summarize Dataset
  • Source (First 10 rows) image

  • Transformed
    image

2. DAX Measures

To support the analysis charts, we need to create the following DAX measures:

%1star - share of 1-star reviews
%1star = DIVIDE(CALCULATE(COUNT(order_items_dataset[English_name_product]), order_items_dataset[Average_score] = 1), COUNT(order_items_dataset[English_name_product]))

%5star - share of 5-star reviews
%5star = DIVIDE(CALCULATE(COUNT(order_items_dataset[English_name_product]), order_items_dataset[Average_score] = 5), COUNT(order_items_dataset[English_name_product]))

%Comment - share of orders that have a comment
%Comment = DIVIDE(CALCULATE(COUNT(order_reviews_dataset[Comment]), order_reviews_dataset[Comment] = "Comment"), COUNT(order_reviews_dataset[order_id]))

Average_Score - average review score of orders
Average_Score = SUM(order_items_dataset[Average_score]) / COUNT(order_items_dataset[order_id])

Comment - count of orders that have a comment
Comment = CALCULATE(COUNT(order_reviews_dataset[Comment]), order_reviews_dataset[Comment] = "Comment")

Comment_Star - review-score count for orders that have a comment
Comment_Star = CALCULATE(COUNT(order_reviews_dataset[review_score]), order_reviews_dataset[Comment] = "Comment")

Total_time_to_delivery average per customer_city - average delivery time per order
Total_time_to_delivery average per customer_city = DIVIDE(SUM(orders_dataset[Total_time_to_delivery]), COUNT(orders_dataset[order_id]))

Voucher_cat - share of orders paid with a voucher
Voucher_cat = DIVIDE(CALCULATE(COUNT(order_payments_dataset[payment_type]), order_payments_dataset[payment_type] = "voucher"), COUNT(order_items_dataset[product_id]))

Count_Product - count of products sold
Count_Product = COUNT(order_items_dataset[English_name_product])

Rank_Product - rank of products by sales count
Rank_Product = RANKX(ALL(order_items_dataset[English_name_product]), [Count_Product])

3. Create New Table

To match the average score to each order, I have to create a new table:

Average = SUMMARIZECOLUMNS(order_reviews_dataset[order_id], "Average_Score", AVERAGE(order_reviews_dataset[review_score]))
The First Few Rows

image

Final Model


image