
🛒 Case Study - E-commerce Company

A. Data Exploration and Cleansing

Using Python - Google Colab vs Power BI


👩🏼‍💻 PYTHON - GOOGLE COLAB

🔤 IMPORT LIBRARY AND DATASET

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import dates
import datetime
print('Completed import lib')
from google.colab import drive
drive.mount('/content/drive')
#Load the datasets from Google Drive
customers = pd.read_csv('/content/drive/MyDrive/Final/De 1/dataset/customers_dataset.csv')
order_items = pd.read_csv('/content/drive/MyDrive/Final/De 1/dataset/order_items_dataset.csv')
order_payments = pd.read_csv('/content/drive/MyDrive/Final/De 1/dataset/order_payments_dataset.csv')
order_reviews = pd.read_csv('/content/drive/MyDrive/Final/De 1/dataset/order_reviews_dataset.csv')
orders = pd.read_csv('/content/drive/MyDrive/Final/De 1/dataset/orders_dataset.csv')
product_name_translation = pd.read_csv('/content/drive/MyDrive/Final/De 1/dataset/product_category_name_translation.csv')
products = pd.read_csv('/content/drive/MyDrive/Final/De 1/dataset/products_dataset.csv')
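After loading, a quick sanity check (a minimal sketch using the frames defined above) is to print the shape of every table:

#Print row/column counts for each loaded frame
for name, df in [('customers', customers), ('order_items', order_items), ('order_payments', order_payments), ('order_reviews', order_reviews), ('orders', orders), ('product_name_translation', product_name_translation), ('products', products)]:
    print(name, df.shape)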

🔎 EXPLORE, CLEAN & TRANSFORM DATA

1️⃣ Customers Dataset

  • There are 4 things I would check in the customers dataset:
    • The overall info
    • Whether the primary key (customer_id) has duplicate values
    • The unique values of the city and state columns
    • Capitalizing the first letter of each city name
The Overall Information
customers.head()

image

customers.info()

image

Checking duplicated values
#Check for duplicates in the primary key column (customer_id).
#The number of unique customer_id values equals the total number of entries (99441),
#so we can conclude there are no duplicated values.

customers.nunique()

image
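A more direct way to confirm this (a minimal sketch on the same customers frame) is to count duplicated keys explicitly:

#Count duplicated customer_id values directly; 0 confirms there are no duplicates
print(customers['customer_id'].duplicated().sum())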

Checking unique values of State
#Check the state abbreviations for typos or inconsistent values
customers['customer_state'].unique()

image

Capitalize the first letter of city name
#Capitalize the first letter of city name
customers['customer_city'] = customers['customer_city'].str.title()
customers.head()

image


2️⃣ Orders Dataset

  • There are 3 things I would check in the orders dataset:
    • The overall info
    • Transforming some columns from object to datetime
    • Checking null values
The Overall
orders.head() 

image

orders.info()

image

Transform Data Type
#Transforming the data type from object to datetime 
orders['order_purchase_timestamp'] = pd.to_datetime(orders['order_purchase_timestamp'], format = '%Y-%m-%d %H:%M:%S')
orders['order_approved_at'] = pd.to_datetime(orders['order_approved_at'], format = '%Y-%m-%d %H:%M:%S')
orders['order_delivered_carrier_date'] = pd.to_datetime(orders['order_delivered_carrier_date'], format = '%Y-%m-%d %H:%M:%S')
orders['order_delivered_customer_date'] = pd.to_datetime(orders['order_delivered_customer_date'], format = '%Y-%m-%d %H:%M:%S')
orders['order_estimated_delivery_date'] = pd.to_datetime(orders['order_estimated_delivery_date'], format = '%Y-%m-%d %H:%M:%S')

orders.info()

image
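With the timestamps converted, a delivery-duration column can be derived. The Total_time_to_delivery field used later in the Power BI measures is not created in the code shown here, so the following is only an assumed sketch of how it could be computed (in days):

#Assumed derivation: delivery time in days from purchase to customer delivery
orders['Total_time_to_delivery'] = (orders['order_delivered_customer_date'] - orders['order_purchase_timestamp']).dt.days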

Check Null Values
#Check null values
orders.isnull().sum()
#Check the percentage of null values.
#The nulls account for only a small share of the dataset (about 3% at most), so we could drop or ignore them.
#However, a null in these columns also means the order was never delivered to the customer or carrier, so we cannot drop them.
orders.isnull().mean() * 100

image
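To confirm that these nulls really correspond to undelivered orders, a minimal sketch is to look at the order_status of the rows with a missing customer delivery date:

#Statuses of orders with no customer delivery date
orders[orders['order_delivered_customer_date'].isnull()]['order_status'].value_counts()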


3️⃣ Order Items Dataset

  • The order items dataset is clean, so we don't need to adjust it (a quick verification sketch follows the overview below).
The Overall
order_items.head() 

image

order_items.describe() 

image

order_items.info() 

image
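To back up the claim that the dataset is clean, a minimal check for nulls and fully duplicated rows:

#Verify cleanliness: null counts per column and the duplicate count should all be zero
print(order_items.isnull().sum())
print(order_items.duplicated().sum())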


4️⃣ Order Payments Dataset

  • The order payments dataset is clean, so we don't need to adjust it.
The Overall
order_payments.head() 

image

order_payments.info() 

image

order_payments['payment_type'].unique() 

image


5️⃣ Order Reviews Dataset

  • There are 2 things that we are doing with this dataset:
    • Checking the overall info
    • Transforming data types from object to datetime
The Overall
order_reviews.head() 

image

order_reviews.info() 

image

order_reviews['review_score'].value_counts()

image

Transform data type
#Convert the review date columns to datetime
order_reviews['review_creation_date'] = pd.to_datetime(order_reviews['review_creation_date'])
order_reviews['review_answer_timestamp'] = pd.to_datetime(order_reviews['review_answer_timestamp'])

#Reformat as mm/dd/yyyy (note: dt.strftime returns strings, so these columns become object dtype again)
order_reviews['review_creation_date'] = order_reviews.review_creation_date.dt.strftime('%m/%d/%Y')
order_reviews['review_answer_timestamp'] = order_reviews.review_answer_timestamp.dt.strftime('%m/%d/%Y')
order_reviews.head(5)

image


6️⃣ Products Dataset

  • There are 3 things that we are doing with this dataset:
    • Checking the overall info
    • Checking null values
    • Replacing product weights of 0 g with the median
The Overall
products.head() 

image

products.info() 

image

products.describe()

image

#The min of product_weight_g is 0, so we check this column to make sure there are no anomalies
products[products['product_weight_g'] == 0]

image

Check Null Values
#Check null values
products.isnull().sum()

image

#Check null values in the category name column
products[products['product_category_name'].isnull()]

image

#Check null values in the weight column
products[products['product_weight_g'].isnull()]

image

#Drop all 610 rows containing null values; they are not significant (610 rows out of 32951 total entries)
products = products.dropna()
products.isnull().sum()

image

Check product weight column
#Check the product_weight_g distribution
#(sns.distplot is deprecated in recent seaborn versions; histplot is the modern equivalent)
sns.histplot(products['product_weight_g'], kde=True)

image

#Replace "0" values of weight to "median"
products['product_weight_g']= products['product_weight_g'].replace(0, products['product_weight_g'].median())  
products.describe()

image
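Note that the median above is computed over a column that still contains the zero weights. A slightly more careful variant (a sketch, not the approach used above) would exclude the zeros before taking the median:

#Alternative sketch: compute the median over non-zero weights only, then replace
median_w = products.loc[products['product_weight_g'] > 0, 'product_weight_g'].median()
products['product_weight_g'] = products['product_weight_g'].replace(0, median_w)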


7️⃣ Product Name Translation Dataset

  • There are 3 things that we are doing with this dataset:
    • Checking the overall info
    • Merging the product names of the two tables
    • Checking null values in the merged table and replacing them with a new category
The Overall
product_name_translation.head()

image

product_name_translation.info()

image

Merge product names of the two tables
#Compare the number of product categories in the two tables
print(product_name_translation['product_category_name'].nunique())
print(products['product_category_name'].nunique())

image
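Since the counts differ, some categories in products have no translation. A minimal sketch to list exactly which ones:

#Categories present in products but missing from the translation table
missing = set(products['product_category_name']) - set(product_name_translation['product_category_name'])
print(missing)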

product_summarize = products.merge(product_name_translation, how='left', on='product_category_name')
Check null values of the merged table and replace them
#Check null values
product_summarize.isnull().sum()

image

product_summarize[product_summarize['product_category_name_english'].isnull()]

image

#Replace null values with 'Unspecified'
product_summarize['product_category_name_english'] = product_summarize['product_category_name_english'].fillna(value='Unspecified')
product_summarize.isnull().sum()

image


✔ Save File

#File customers
customers.to_csv('/content/drive/MyDrive/Final/De 1/customers_dataset.csv',index=False)

#File orders dataset
orders.to_csv('/content/drive/MyDrive/Final/De 1/orders_dataset.csv',index=False)

#File orders items
order_items.to_csv('/content/drive/MyDrive/Final/De 1/order_items_dataset.csv',index=False)

#File order payments
order_payments.to_csv('/content/drive/MyDrive/Final/De 1/order_payments_dataset.csv',index=False)

#File order review
order_reviews.to_csv('/content/drive/MyDrive/Final/De 1/order_reviews_dataset.csv',index=False)

#Merged file of products & product_name_translation
product_summarize.to_csv('/content/drive/MyDrive/Final/De 1/product_summarize_dataset.csv',index=False)

📊 POWER BI

1. Transform Data

After importing the datasets, we need to promote the column headers and change the data types of some columns.

Customers dataset
  • Source (first 10 rows)

image

  • Transformed

image

Order Items dataset
  • Source (first 10 rows)

image

  • Transformed

image

Order Payments dataset
  • Source (First 10 rows) image

  • Transformed image

Order Reviews dataset
  • Source (First 10 rows) image

  • Transformed image

Orders dataset
  • Source (First 10 rows) image

  • Transformed image

Product Summarize Dataset
  • Source (First 10 rows) image

  • Transformed
    image

2. DAX Measures

To support the analysis charts, we need to create the following DAX measures:

%1star - share of 1-star reviews
%1star = DIVIDE(CALCULATE(COUNT(order_items_dataset[English_name_product]), order_items_dataset[Average_score] = 1), COUNT(order_items_dataset[English_name_product]))

%5star - share of 5-star reviews
%5star = DIVIDE(CALCULATE(COUNT(order_items_dataset[English_name_product]), order_items_dataset[Average_score] = 5), COUNT(order_items_dataset[English_name_product]))

%Comment - share of orders that have a comment
%Comment = DIVIDE(CALCULATE(COUNT(order_reviews_dataset[Comment]), order_reviews_dataset[Comment] = "Comment"), COUNT(order_reviews_dataset[order_id]))

Average_Score - average review score of orders
Average_Score = SUM(order_items_dataset[Average_score]) / COUNT(order_items_dataset[order_id])

Comment - count of orders that have a comment
Comment = CALCULATE(COUNT(order_reviews_dataset[Comment]), order_reviews_dataset[Comment] = "Comment")

Comment_Star - review-score count for orders that have a comment
Comment_Star = CALCULATE(COUNT(order_reviews_dataset[review_score]), order_reviews_dataset[Comment] = "Comment")

Total_time_to_delivery average per customer_city - average delivery time per order
Total_time_to_delivery average per customer_city = DIVIDE(SUM(orders_dataset[Total_time_to_delivery]), COUNT(orders_dataset[order_id]))

Voucher_cat - share of orders paid with a voucher
Voucher_cat = DIVIDE(CALCULATE(COUNT(order_payments_dataset[payment_type]), order_payments_dataset[payment_type] = "voucher"), COUNT(order_items_dataset[product_id]))

Count_Product - count of products sold
Count_Product = COUNT(order_items_dataset[English_name_product])

Rank_Product - rank of products by sales count
Rank_Product = RANKX(ALL(order_items_dataset[English_name_product]), [Count_Product])

3. Create New Table

To match the average score to each order, I have to create a new table:

Average = SUMMARIZECOLUMNS(order_reviews_dataset[order_id], "Average_Score", AVERAGE(order_reviews_dataset[review_score]))
The First Few Rows

image

Final Model


image